Crate baracuda_kernels_sys

§baracuda-kernels-sys

Raw extern "C" entry points for compiled bespoke kernels. You almost certainly want baracuda-kernels instead — that crate wraps these unsafe calls with typed plans, lifetime-checked device buffers, and a proper Rust API.

Functions in this crate take raw void* pointers, integer dimensions, and a cudaStream_t cast as *mut c_void. They are unsafe because:

  • They dereference the pointer arguments without bounds-checking.
  • They assume the pointers are valid device addresses.
  • They assume the workspace pointer (when non-null) points to at least workspace_bytes of writable device memory.
  • They assume the stream is a valid CUDA stream owned by the calling thread’s current context.

§Status codes

All *_run and *_can_implement functions return an i32 status:

  • 0: success.
  • 1: misaligned operand.
  • 2: invalid problem (e.g. M, N, or K is non-positive).
  • 3: not supported (this kernel doesn’t implement the requested shape).
  • 4: workspace too small or null when required.
  • 5: internal kernel error (typically a launch failure).
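As an illustration of how a caller might consume these codes, the sketch below maps the raw i32 onto a Rust enum. The names KernelStatus, from_raw, and into_result are hypothetical and not part of this crate; the safe baracuda-kernels crate may expose a different error type.

```rust
/// Hypothetical typed view of the i32 status codes listed above.
/// Not an item from baracuda-kernels-sys.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelStatus {
    Success,
    MisalignedOperand,
    InvalidProblem,
    NotSupported,
    WorkspaceTooSmall,
    InternalError,
    /// Any value outside 0..=5, kept verbatim for diagnostics.
    Unknown(i32),
}

impl KernelStatus {
    /// Decode a raw status returned by a `*_run` or `*_can_implement` call.
    pub fn from_raw(code: i32) -> Self {
        match code {
            0 => Self::Success,
            1 => Self::MisalignedOperand,
            2 => Self::InvalidProblem,
            3 => Self::NotSupported,
            4 => Self::WorkspaceTooSmall,
            5 => Self::InternalError,
            other => Self::Unknown(other),
        }
    }

    /// Turn the status into a `Result` so callers can use `?`.
    pub fn into_result(self) -> Result<(), Self> {
        if self == Self::Success {
            Ok(())
        } else {
            Err(self)
        }
    }
}
```

Keeping an Unknown variant means an unexpected value is reported rather than silently treated as one of the six documented codes.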

Structs§

cuComplex
ABI-compatible single-precision complex struct, matching cuComplex from <cuComplex.h> (interleaved real/imag f32). Identical layout to crate::cufftComplex and to the safe-side Complex32 from baracuda-kernels-types — a DeviceBuffer<Complex32> can be cast to a *mut cuComplex for the cuSOLVER complex APIs without a copy.
cuDoubleComplex
ABI-compatible double-precision complex struct, matching cuDoubleComplex from <cuComplex.h>. Sibling to cuComplex.
cufftComplex
Single-precision complex element layout. Interleaved real/imag pairs — #[repr(C)] matches NVIDIA’s cufftComplex struct exactly (which is itself an alias for float2 in <vector_types.h>). The plan layer pairs this with the crate-level Complex32 newtype.
cufftDoubleComplex
Double-precision complex element layout. ABI-compatible with cuFFT’s cufftDoubleComplex (alias for double2).
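The layout claims above (interleaved real/imag pairs under #[repr(C)]) can be checked with a local stand-in type. ComplexF32 and as_interleaved below are illustrative names, not items from this crate, and the field names x / y are assumed from CUDA's float2.

```rust
use std::mem::{align_of, size_of};

/// Local stand-in for an interleaved single-precision complex element.
/// Field names follow CUDA's `float2`; this is NOT the crate's own type.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ComplexF32 {
    pub x: f32, // real part
    pub y: f32, // imaginary part
}

// Compile-time check: two f32 fields, no padding, f32 alignment.
const _: () = assert!(size_of::<ComplexF32>() == 8 && align_of::<ComplexF32>() == 4);

/// View a slice of complex elements as interleaved f32 values without copying.
pub fn as_interleaved(values: &[ComplexF32]) -> &[f32] {
    // SAFETY: ComplexF32 is #[repr(C)] with exactly two f32 fields and no
    // padding (checked above), so `values.len()` elements occupy
    // `2 * values.len()` contiguous, correctly aligned f32 values.
    unsafe { std::slice::from_raw_parts(values.as_ptr().cast::<f32>(), values.len() * 2) }
}
```

The same reasoning is what makes a pointer cast between layout-identical complex types possible without a copy, provided both sides really are #[repr(C)] pairs of the same scalar.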

Constants§

CUBLAS_COMPUTE_32F
CUBLAS_COMPUTE_32F — fp32 accumulator.
CUBLAS_COMPUTE_64F
CUBLAS_COMPUTE_64F — fp64 accumulator.
CUBLAS_DIAG_NON_UNIT
CUBLAS_DIAG_NON_UNIT: trsm reads the actual diagonal of A. Used by the LstSq QR-fallback path for the back-substitution R · X = Q^T · B, where R’s diagonal is the meaningful pivots.
CUBLAS_DIAG_UNIT
CUBLAS_DIAG_UNIT: trsm treats the diagonal as all-1s (unit-triangular). Not used by the current plan layer; surfaced for completeness.
CUBLAS_FILL_MODE_LOWER
CUBLAS_FILL_MODE_LOWER — pass to potrf to request the lower-triangular Cholesky factor.
CUBLAS_FILL_MODE_UPPER
CUBLAS_FILL_MODE_UPPER — pass to potrf to request the upper-triangular Cholesky factor.
CUBLAS_GEMM_DEFAULT
CUBLAS_GEMM_DEFAULT — let cuBLAS pick the algorithm.
CUBLAS_OP_C
CUBLAS_OP_C — conjugate transpose (only meaningful for complex dtypes). Used by cusolverDn{C,Z}unmqr to apply Q^H.
CUBLAS_OP_N
CUBLAS_OP_N — no transpose. Used by ormqr to control whether to apply Q or Q^T.
CUBLAS_OP_T
CUBLAS_OP_T — transpose.
CUBLAS_SIDE_LEFT
CUBLAS_SIDE_LEFT: Q is applied from the left in ormqr (C := Q · C or C := Q^T · C).
CUBLAS_SIDE_RIGHT
CUBLAS_SIDE_RIGHT: Q is applied from the right.
CUDA_C_32F
CUDA_C_32F — complex f32 (interleaved real/imag).
CUDA_C_64F
CUDA_C_64F — complex f64 (interleaved real/imag).
CUDA_R_16BF
CUDA_R_16BF — bfloat16 (real). Storage tag for __nv_bfloat16.
CUDA_R_16F
CUDA_R_16F — real f16.
CUDA_R_32F
CUDA_R_32F — real f32.
CUDA_R_64F
CUDA_R_64F — real f64.
CUFFT_C2C
cuFFT plan type: complex-to-complex (single precision). Direction is supplied to cufftExecC2C.
CUFFT_C2R
cuFFT plan type: complex-to-real (single precision). Input is N/2 + 1 complex cells (Hermitian-half), output is N real cells.
CUFFT_D2Z
cuFFT plan type: double-precision real-to-complex.
CUFFT_FORWARD
Forward FFT direction tag for cufftExecC2C / cufftExecZ2Z. cuFFT’s forward transform is unnormalized.
CUFFT_INVERSE
Inverse FFT direction tag for cufftExecC2C / cufftExecZ2Z. cuFFT’s inverse transform is also unnormalized — the safe-plan layer multiplies the output by 1/N after exec to match PyTorch’s norm="backward" (forward unnormalized, inverse normalized by N) convention.
CUFFT_R2C
cuFFT plan type: real-to-complex (single precision). Output buffer size is N/2 + 1 complex cells for an N-long real input (Hermitian symmetry).
CUFFT_SUCCESS
CUFFT_SUCCESS — the only success code.
CUFFT_Z2D
cuFFT plan type: double-precision complex-to-real.
CUFFT_Z2Z
cuFFT plan type: double-precision complex-to-complex.
CURAND_RNG_PSEUDO_DEFAULT
CURAND_RNG_PSEUDO_DEFAULT — XORWOW pseudo-random generator. Adequate for the dropout / sampling use cases this milestone targets; future QRNG / Philox / MT19937 work can extend the descriptor surface.
CURAND_STATUS_SUCCESS
CURAND_STATUS_SUCCESS — only success code. Any non-zero return from the cuRAND host API is mapped to status 5 (“internal kernel error”) at the safe-plan layer.
CUSOLVER_EIG_MODE_NOVECTOR
CUSOLVER_EIG_MODE_NOVECTOR: gesvdjBatched jobz value for computing singular values only (skip U / V).
CUSOLVER_EIG_MODE_VECTOR
CUSOLVER_EIG_MODE_VECTOR: gesvdjBatched jobz value for computing both singular values and singular vectors.
CUSOLVER_STATUS_SUCCESS
CUSOLVER_STATUS_SUCCESS — the only success code. Any non-zero return from a cuSOLVER routine is mapped to a negative status at the safe-plan layer for distinct error reporting.
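
The CUFFT_R2C and CUFFT_INVERSE entries above describe two conventions: a Hermitian half-spectrum of N/2 + 1 complex cells, and an unnormalized inverse that needs a 1/N scale to round-trip. The CPU-only sketch below illustrates both with a naive DFT instead of cuFFT. All function names are hypothetical and nothing here calls into this crate.

```rust
use std::f64::consts::PI;

/// Complex cells in the Hermitian half-spectrum of an N-long real signal.
pub fn r2c_output_len(n: usize) -> usize {
    n / 2 + 1
}

/// Naive O(N^2) unnormalized DFT over (re, im) pairs.
/// `sign` is -1.0 for the forward transform and +1.0 for the inverse.
pub fn dft(input: &[(f64, f64)], sign: f64) -> Vec<(f64, f64)> {
    let n = input.len();
    (0..n)
        .map(|k| {
            let mut acc = (0.0, 0.0);
            for (j, &(re, im)) in input.iter().enumerate() {
                let angle = sign * 2.0 * PI * (k * j) as f64 / n as f64;
                let (s, c) = angle.sin_cos();
                // (re + i*im) * (c + i*s)
                acc.0 += re * c - im * s;
                acc.1 += re * s + im * c;
            }
            acc
        })
        .collect()
}

/// Forward then inverse, applying the 1/N scale after the inverse pass,
/// which is the "backward" normalization described above.
pub fn round_trip(input: &[(f64, f64)]) -> Vec<(f64, f64)> {
    let n = input.len() as f64;
    let spectrum = dft(input, -1.0);
    dft(&spectrum, 1.0)
        .into_iter()
        .map(|(re, im)| (re / n, im / n))
        .collect()
}
```

Without the final division, a forward pass followed by an inverse pass returns the input multiplied by N, which is why the scale has to be applied somewhere outside the raw transform.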

Functions§

baracuda_kernels_adaptive_avg_pool_bf16_bw_can_implement
Implementability check for adaptive_avg_pool_bf16_bw.
baracuda_kernels_adaptive_avg_pool_bf16_bw_run
Adaptive AvgPool BW, bf16.
baracuda_kernels_adaptive_avg_pool_bf16_fw_can_implement
Implementability check for adaptive_avg_pool_bf16_fw.
baracuda_kernels_adaptive_avg_pool_bf16_fw_run
Adaptive AvgPool FW, bf16.
baracuda_kernels_adaptive_avg_pool_f16_bw_can_implement
Implementability check for adaptive_avg_pool_f16_bw.
baracuda_kernels_adaptive_avg_pool_f16_bw_run
Adaptive AvgPool BW, f16. Zeros dx internally, then atomic-scatters.
baracuda_kernels_adaptive_avg_pool_f16_fw_can_implement
Implementability check for adaptive_avg_pool_f16_fw.
baracuda_kernels_adaptive_avg_pool_f16_fw_run
Adaptive AvgPool FW, f16. Rank-agnostic (spatial_rank ∈ {1,2,3}).
baracuda_kernels_adaptive_avg_pool_f32_bw_can_implement
Implementability check for adaptive_avg_pool_f32_bw.
baracuda_kernels_adaptive_avg_pool_f32_bw_run
Adaptive AvgPool BW, f32.
baracuda_kernels_adaptive_avg_pool_f32_fw_can_implement
Implementability check for adaptive_avg_pool_f32_fw.
baracuda_kernels_adaptive_avg_pool_f32_fw_run
Adaptive AvgPool FW, f32.
baracuda_kernels_adaptive_avg_pool_f64_bw_can_implement
Implementability check for adaptive_avg_pool_f64_bw.
baracuda_kernels_adaptive_avg_pool_f64_bw_run
Adaptive AvgPool BW, f64.
baracuda_kernels_adaptive_avg_pool_f64_fw_can_implement
Implementability check for adaptive_avg_pool_f64_fw.
baracuda_kernels_adaptive_avg_pool_f64_fw_run
Adaptive AvgPool FW, f64.
baracuda_kernels_adaptive_max_pool_bf16_bw_can_implement
Implementability check for adaptive_max_pool_bf16_bw.
baracuda_kernels_adaptive_max_pool_bf16_bw_run
Adaptive MaxPool BW, bf16.
baracuda_kernels_adaptive_max_pool_bf16_fw_can_implement
Implementability check for adaptive_max_pool_bf16_fw.
baracuda_kernels_adaptive_max_pool_bf16_fw_run
Adaptive MaxPool FW, bf16.
baracuda_kernels_adaptive_max_pool_f16_bw_can_implement
Implementability check for adaptive_max_pool_f16_bw.
baracuda_kernels_adaptive_max_pool_f16_bw_run
Adaptive MaxPool BW, f16. Recomputes the per-window argmax from the saved x, zeros dx internally, then atomic-scatters dy into the argmax positions.
baracuda_kernels_adaptive_max_pool_f16_fw_can_implement
Implementability check for adaptive_max_pool_f16_fw.
baracuda_kernels_adaptive_max_pool_f16_fw_run
Adaptive MaxPool FW, f16. Writes y only — the matching BW recomputes the argmax internally from the saved x (keeps the Phase 11.8 args shape; no separate indices tensor).
baracuda_kernels_adaptive_max_pool_f32_bw_can_implement
Implementability check for adaptive_max_pool_f32_bw.
baracuda_kernels_adaptive_max_pool_f32_bw_run
Adaptive MaxPool BW, f32.
baracuda_kernels_adaptive_max_pool_f32_fw_can_implement
Implementability check for adaptive_max_pool_f32_fw.
baracuda_kernels_adaptive_max_pool_f32_fw_run
Adaptive MaxPool FW, f32.
baracuda_kernels_adaptive_max_pool_f64_bw_can_implement
Implementability check for adaptive_max_pool_f64_bw.
baracuda_kernels_adaptive_max_pool_f64_bw_run
Adaptive MaxPool BW, f64.
baracuda_kernels_adaptive_max_pool_f64_fw_can_implement
Implementability check for adaptive_max_pool_f64_fw.
baracuda_kernels_adaptive_max_pool_f64_fw_run
Adaptive MaxPool FW, f64.
baracuda_kernels_affine_bf16_can_implement
Implementability check for affine_bf16. Host-side only.
baracuda_kernels_affine_bf16_run
Affine y = a*x + b, bf16 storage / f32 compute. a / b arrive as f32.
baracuda_kernels_affine_bf16_strided_can_implement
Implementability check for affine_bf16_strided.
baracuda_kernels_affine_bf16_strided_run
Strided affine y = a*x + b, bf16 storage / f32 compute. a / b arrive as f32.
baracuda_kernels_affine_f16_can_implement
Implementability check for affine_f16. Host-side only.
baracuda_kernels_affine_f16_run
Affine y = a*x + b, f16 storage / f32 compute. a / b arrive as f32.
baracuda_kernels_affine_f16_strided_can_implement
Implementability check for affine_f16_strided.
baracuda_kernels_affine_f16_strided_run
Strided affine y = a*x + b, f16 storage / f32 compute. a / b arrive as f32.
baracuda_kernels_affine_f32_can_implement
Implementability check for affine_f32. Host-side only.
baracuda_kernels_affine_f32_run
Affine y = a*x + b, f32 dtype.
baracuda_kernels_affine_f32_strided_can_implement
Implementability check for affine_f32_strided.
baracuda_kernels_affine_f32_strided_run
Strided affine y = a*x + b, f32 dtype.
baracuda_kernels_affine_f64_can_implement
Implementability check for affine_f64. Host-side only.
baracuda_kernels_affine_f64_run
Affine y = a*x + b, f64 dtype.
baracuda_kernels_affine_f64_strided_can_implement
Implementability check for affine_f64_strided.
baracuda_kernels_affine_f64_strided_run
Strided affine y = a*x + b, f64 dtype.
baracuda_kernels_affine_grid_2d_f32_can_implement
Implementability check for affine_grid_2d_f32.
baracuda_kernels_affine_grid_2d_f32_run
affine_grid(theta, size) — produces an [N, OH, OW, 2] grid from theta: [N, 2, 3]. f32. Safety: the crate-level pointer and stream requirements apply.
baracuda_kernels_affine_grid_2d_f64_can_implement
Implementability check for affine_grid_2d_f64.
baracuda_kernels_affine_grid_2d_f64_run
affine_grid_2d, f64. Safety: same requirements as the f32 variant.
baracuda_kernels_affine_i8_can_implement
Implementability check for affine_i8. Host-side only.
baracuda_kernels_affine_i8_run
Affine y = a*x + b, i8 dtype.
baracuda_kernels_affine_i32_can_implement
Implementability check for affine_i32. Host-side only.
baracuda_kernels_affine_i32_run
Affine y = a*x + b, i32 dtype.
baracuda_kernels_affine_i32_strided_can_implement
Implementability check for affine_i32_strided.
baracuda_kernels_affine_i32_strided_run
Strided affine y = a*x + b, i32 dtype.
baracuda_kernels_affine_i64_can_implement
Implementability check for affine_i64. Host-side only.
baracuda_kernels_affine_i64_run
Affine y = a*x + b, i64 dtype.
baracuda_kernels_affine_i64_strided_can_implement
Implementability check for affine_i64_strided.
baracuda_kernels_affine_i64_strided_run
Strided affine y = a*x + b, i64 dtype.
baracuda_kernels_affine_inplace_bf16_can_implement
Implementability check for baracuda_kernels_affine_inplace_bf16. Host-side only.
baracuda_kernels_affine_inplace_bf16_run
In-place affine y = scale * y + offset (bf16). Phase 61 — added for Fuel’s INPLACE_AFFINE op family completion (bf16/f16 weight-decay scaling, Op::AddScalar / Op::MulScalar on bf16 model weights).
baracuda_kernels_affine_inplace_bf16_strided_can_implement
Implementability check for baracuda_kernels_affine_inplace_bf16_strided. Host-side only.
baracuda_kernels_affine_inplace_bf16_strided_run
Strided in-place affine (bf16; f32 scalars). Phase 62.
baracuda_kernels_affine_inplace_f16_can_implement
Implementability check for baracuda_kernels_affine_inplace_f16. Host-side only.
baracuda_kernels_affine_inplace_f16_run
In-place affine y = scale * y + offset (f16). Phase 61 — added for Fuel’s INPLACE_AFFINE op family completion.
baracuda_kernels_affine_inplace_f16_strided_can_implement
Implementability check for baracuda_kernels_affine_inplace_f16_strided. Host-side only.
baracuda_kernels_affine_inplace_f16_strided_run
Strided in-place affine (f16; f32 scalars). Phase 62.
baracuda_kernels_affine_inplace_f32_can_implement
Implementability check for baracuda_kernels_affine_inplace_f32. Host-side only.
baracuda_kernels_affine_inplace_f32_run
In-place affine y = scale * y + offset (f32). Used by the safe-plan layer to remap a cuRAND uniform-(0, 1] buffer into Uniform(low, high].
baracuda_kernels_affine_inplace_f32_strided_can_implement
Implementability check for baracuda_kernels_affine_inplace_f32_strided. Host-side only.
baracuda_kernels_affine_inplace_f32_strided_run
In-place affine y[off] = scale * y[off] + offset over a strided view (f32). Phase 62.
baracuda_kernels_affine_inplace_f64_can_implement
Implementability check for baracuda_kernels_affine_inplace_f64. Host-side only.
baracuda_kernels_affine_inplace_f64_run
In-place affine y = scale * y + offset (f64).
baracuda_kernels_affine_inplace_f64_strided_can_implement
Implementability check for baracuda_kernels_affine_inplace_f64_strided. Host-side only.
baracuda_kernels_affine_inplace_f64_strided_run
Strided in-place affine (f64). Phase 62.
baracuda_kernels_affine_inplace_i8_can_implement
Implementability check for baracuda_kernels_affine_inplace_i8. Host-side only.
baracuda_kernels_affine_inplace_i8_run
In-place affine y = scale * y + offset (i8). Phase 62.
baracuda_kernels_affine_inplace_i32_can_implement
Implementability check for baracuda_kernels_affine_inplace_i32. Host-side only.
baracuda_kernels_affine_inplace_i32_run
In-place affine y = scale * y + offset (i32). Phase 62.
baracuda_kernels_affine_inplace_i32_strided_can_implement
Implementability check for baracuda_kernels_affine_inplace_i32_strided. Host-side only.
baracuda_kernels_affine_inplace_i32_strided_run
Strided in-place affine (i32). Phase 62.
baracuda_kernels_affine_inplace_i64_can_implement
Implementability check for baracuda_kernels_affine_inplace_i64. Host-side only.
baracuda_kernels_affine_inplace_i64_run
In-place affine y = scale * y + offset (i64). Phase 62.
baracuda_kernels_affine_inplace_i64_strided_can_implement
Implementability check for baracuda_kernels_affine_inplace_i64_strided. Host-side only.
baracuda_kernels_affine_inplace_i64_strided_run
Strided in-place affine (i64). Phase 62.
baracuda_kernels_affine_inplace_u8_can_implement
Implementability check for baracuda_kernels_affine_inplace_u8. Host-side only.
baracuda_kernels_affine_inplace_u8_run
In-place affine y = scale * y + offset (u8). Phase 62.
baracuda_kernels_affine_inplace_u8_strided_can_implement
Implementability check for baracuda_kernels_affine_inplace_u8_strided. Host-side only.
baracuda_kernels_affine_inplace_u8_strided_run
Strided in-place affine (u8). Phase 62.
baracuda_kernels_affine_u8_can_implement
Implementability check for affine_u8. Host-side only.
baracuda_kernels_affine_u8_run
Affine y = a*x + b, u8 dtype.
baracuda_kernels_affine_u8_strided_can_implement
Implementability check for affine_u8_strided.
baracuda_kernels_affine_u8_strided_run
Strided affine y = a*x + b, u8 dtype.
baracuda_kernels_alibi_backward_bf16_can_implement
Implementability check for alibi_backward_bf16. Host-side only.
baracuda_kernels_alibi_backward_bf16_run
ALiBi BW, bf16.
baracuda_kernels_alibi_backward_f16_can_implement
Implementability check for alibi_backward_f16. Host-side only.
baracuda_kernels_alibi_backward_f16_run
ALiBi BW, f16.
baracuda_kernels_alibi_backward_f32_can_implement
Implementability check for alibi_backward_f32. Host-side only.
baracuda_kernels_alibi_backward_f32_run
ALiBi BW, f32. da[b, h, i, j] = dy[b, h, i, j] (pass-through); dslope[h] = Σ_{b, i, j} dy[b, h, i, j] · (j - i). Either da or dslope may be null to skip; both null is rejected.
baracuda_kernels_alibi_backward_f64_can_implement
Implementability check for alibi_backward_f64. Host-side only.
baracuda_kernels_alibi_backward_f64_run
ALiBi BW, f64.
baracuda_kernels_alibi_bf16_can_implement
Implementability check for alibi_bf16. Host-side only.
baracuda_kernels_alibi_bf16_run
ALiBi FW, bf16.
baracuda_kernels_alibi_f16_can_implement
Implementability check for alibi_f16. Host-side only.
baracuda_kernels_alibi_f16_run
ALiBi FW, f16.
baracuda_kernels_alibi_f32_can_implement
Implementability check for alibi_f32. Host-side only.
baracuda_kernels_alibi_f32_run
ALiBi FW, f32. y[b, h, i, j] = scores[b, h, i, j] + slopes[h] · (j - i).
baracuda_kernels_alibi_f64_can_implement
Implementability check for alibi_f64. Host-side only.
baracuda_kernels_alibi_f64_run
ALiBi FW, f64.
baracuda_kernels_apply_token_penalty_f32_can_implement
Implementability check for apply_token_penalty_f32.
baracuda_kernels_apply_token_penalty_f32_run
Apply token penalty, f32.
baracuda_kernels_arg_reduce_argmax_bf16_can_implement
Pre-launch implementability check for arg_reduce_argmax_bf16.
baracuda_kernels_arg_reduce_argmax_bf16_i32_can_implement
Pre-launch implementability check for arg_reduce_argmax_bf16_i32.
baracuda_kernels_arg_reduce_argmax_bf16_i32_run
argmax(x, axis=k), bf16 input, i32 output.
baracuda_kernels_arg_reduce_argmax_bf16_run
argmax(x, axis=k), bf16 input, i64 output.
baracuda_kernels_arg_reduce_argmax_bf16_u32_can_implement
Pre-launch implementability check for arg_reduce_argmax_bf16_u32.
baracuda_kernels_arg_reduce_argmax_bf16_u32_run
argmax(x, axis=k), bf16 input, u32 output.
baracuda_kernels_arg_reduce_argmax_f16_can_implement
Pre-launch implementability check for arg_reduce_argmax_f16.
baracuda_kernels_arg_reduce_argmax_f16_i32_can_implement
Pre-launch implementability check for arg_reduce_argmax_f16_i32.
baracuda_kernels_arg_reduce_argmax_f16_i32_run
argmax(x, axis=k), f16 input, i32 output.
baracuda_kernels_arg_reduce_argmax_f16_run
argmax(x, axis=k), f16 input, i64 output.
baracuda_kernels_arg_reduce_argmax_f16_u32_can_implement
Pre-launch implementability check for arg_reduce_argmax_f16_u32.
baracuda_kernels_arg_reduce_argmax_f16_u32_run
argmax(x, axis=k), f16 input, u32 output.
baracuda_kernels_arg_reduce_argmax_f32_can_implement
Pre-launch implementability check for arg_reduce_argmax_f32.
baracuda_kernels_arg_reduce_argmax_f32_i32_can_implement
Pre-launch implementability check for arg_reduce_argmax_f32_i32.
baracuda_kernels_arg_reduce_argmax_f32_i32_run
argmax(x, axis=k), f32 input, i32 output.
baracuda_kernels_arg_reduce_argmax_f32_run
argmax(x, axis=k), f32 input, i64 output. Ties broken by first occurrence (smallest index wins).
baracuda_kernels_arg_reduce_argmax_f32_u32_can_implement
Pre-launch implementability check for arg_reduce_argmax_f32_u32.
baracuda_kernels_arg_reduce_argmax_f32_u32_run
argmax(x, axis=k), f32 input, u32 output.
baracuda_kernels_arg_reduce_argmax_f64_can_implement
Pre-launch implementability check for arg_reduce_argmax_f64.
baracuda_kernels_arg_reduce_argmax_f64_i32_can_implement
Pre-launch implementability check for arg_reduce_argmax_f64_i32.
baracuda_kernels_arg_reduce_argmax_f64_i32_run
argmax(x, axis=k), f64 input, i32 output.
baracuda_kernels_arg_reduce_argmax_f64_run
argmax(x, axis=k), f64 input, i64 output.
baracuda_kernels_arg_reduce_argmax_f64_u32_can_implement
Pre-launch implementability check for arg_reduce_argmax_f64_u32.
baracuda_kernels_arg_reduce_argmax_f64_u32_run
argmax(x, axis=k), f64 input, u32 output.
baracuda_kernels_arg_reduce_argmax_i8_i32_can_implement
Pre-launch implementability check for arg_reduce_argmax_i8_i32.
baracuda_kernels_arg_reduce_argmax_i8_i32_run
argmax(x, axis=k) i8 input, i32 idx output.
baracuda_kernels_arg_reduce_argmax_i8_i64_can_implement
Pre-launch implementability check for arg_reduce_argmax_i8_i64.
baracuda_kernels_arg_reduce_argmax_i8_i64_run
argmax(x, axis=k) i8 input, i64 idx output.
baracuda_kernels_arg_reduce_argmax_i16_i32_can_implement
Pre-launch implementability check for arg_reduce_argmax_i16_i32.
baracuda_kernels_arg_reduce_argmax_i16_i32_run
argmax(x, axis=k) i16 input, i32 idx output.
baracuda_kernels_arg_reduce_argmax_i16_i64_can_implement
Pre-launch implementability check for arg_reduce_argmax_i16_i64.
baracuda_kernels_arg_reduce_argmax_i16_i64_run
argmax(x, axis=k) i16 input, i64 idx output.
baracuda_kernels_arg_reduce_argmax_i32_i32_can_implement
Pre-launch implementability check for arg_reduce_argmax_i32_i32.
baracuda_kernels_arg_reduce_argmax_i32_i32_run
argmax(x, axis=k) i32 input, i32 idx output.
baracuda_kernels_arg_reduce_argmax_i32_i64_can_implement
Pre-launch implementability check for arg_reduce_argmax_i32_i64.
baracuda_kernels_arg_reduce_argmax_i32_i64_run
argmax(x, axis=k) i32 input, i64 idx output.
baracuda_kernels_arg_reduce_argmax_i64_i32_can_implement
Pre-launch implementability check for arg_reduce_argmax_i64_i32.
baracuda_kernels_arg_reduce_argmax_i64_i32_run
argmax(x, axis=k) i64 input, i32 idx output.
baracuda_kernels_arg_reduce_argmax_i64_i64_can_implement
Pre-launch implementability check for arg_reduce_argmax_i64_i64.
baracuda_kernels_arg_reduce_argmax_i64_i64_run
argmax(x, axis=k) i64 input, i64 idx output.
baracuda_kernels_arg_reduce_argmax_u8_i32_can_implement
Pre-launch implementability check for arg_reduce_argmax_u8_i32.
baracuda_kernels_arg_reduce_argmax_u8_i32_run
argmax(x, axis=k) u8 input, i32 idx output.
baracuda_kernels_arg_reduce_argmax_u8_i64_can_implement
Pre-launch implementability check for arg_reduce_argmax_u8_i64.
baracuda_kernels_arg_reduce_argmax_u8_i64_run
argmax(x, axis=k) u8 input, i64 idx output.
baracuda_kernels_arg_reduce_argmax_u32_i32_can_implement
Pre-launch implementability check for arg_reduce_argmax_u32_i32.
baracuda_kernels_arg_reduce_argmax_u32_i32_run
argmax(x, axis=k) u32 input, i32 idx output.
baracuda_kernels_arg_reduce_argmax_u32_i64_can_implement
Pre-launch implementability check for arg_reduce_argmax_u32_i64.
baracuda_kernels_arg_reduce_argmax_u32_i64_run
argmax(x, axis=k) u32 input, i64 idx output.
baracuda_kernels_arg_reduce_argmin_bf16_can_implement
Pre-launch implementability check for arg_reduce_argmin_bf16.
baracuda_kernels_arg_reduce_argmin_bf16_i32_can_implement
Pre-launch implementability check for arg_reduce_argmin_bf16_i32.
baracuda_kernels_arg_reduce_argmin_bf16_i32_run
argmin(x, axis=k), bf16 input, i32 output.
baracuda_kernels_arg_reduce_argmin_bf16_run
argmin(x, axis=k), bf16 input, i64 output.
baracuda_kernels_arg_reduce_argmin_bf16_u32_can_implement
Pre-launch implementability check for arg_reduce_argmin_bf16_u32.
baracuda_kernels_arg_reduce_argmin_bf16_u32_run
argmin(x, axis=k), bf16 input, u32 output.
baracuda_kernels_arg_reduce_argmin_f16_can_implement
Pre-launch implementability check for arg_reduce_argmin_f16.
baracuda_kernels_arg_reduce_argmin_f16_i32_can_implement
Pre-launch implementability check for arg_reduce_argmin_f16_i32.
baracuda_kernels_arg_reduce_argmin_f16_i32_run
argmin(x, axis=k), f16 input, i32 output.
baracuda_kernels_arg_reduce_argmin_f16_run
argmin(x, axis=k), f16 input, i64 output.
baracuda_kernels_arg_reduce_argmin_f16_u32_can_implement
Pre-launch implementability check for arg_reduce_argmin_f16_u32.
baracuda_kernels_arg_reduce_argmin_f16_u32_run
argmin(x, axis=k), f16 input, u32 output.
baracuda_kernels_arg_reduce_argmin_f32_can_implement
Pre-launch implementability check for arg_reduce_argmin_f32.
baracuda_kernels_arg_reduce_argmin_f32_i32_can_implement
Pre-launch implementability check for arg_reduce_argmin_f32_i32.
baracuda_kernels_arg_reduce_argmin_f32_i32_run
argmin(x, axis=k), f32 input, i32 output.
baracuda_kernels_arg_reduce_argmin_f32_run
argmin(x, axis=k), f32 input, i64 output.
baracuda_kernels_arg_reduce_argmin_f32_u32_can_implement
Pre-launch implementability check for arg_reduce_argmin_f32_u32.
baracuda_kernels_arg_reduce_argmin_f32_u32_run
argmin(x, axis=k), f32 input, u32 output.
baracuda_kernels_arg_reduce_argmin_f64_can_implement
Pre-launch implementability check for arg_reduce_argmin_f64.
baracuda_kernels_arg_reduce_argmin_f64_i32_can_implement
Pre-launch implementability check for arg_reduce_argmin_f64_i32.
baracuda_kernels_arg_reduce_argmin_f64_i32_run
argmin(x, axis=k), f64 input, i32 output.
baracuda_kernels_arg_reduce_argmin_f64_run
argmin(x, axis=k), f64 input, i64 output.
baracuda_kernels_arg_reduce_argmin_f64_u32_can_implement
Pre-launch implementability check for arg_reduce_argmin_f64_u32.
baracuda_kernels_arg_reduce_argmin_f64_u32_run
argmin(x, axis=k), f64 input, u32 output.
baracuda_kernels_arg_reduce_argmin_i8_i32_can_implement
Pre-launch implementability check for arg_reduce_argmin_i8_i32.
baracuda_kernels_arg_reduce_argmin_i8_i32_run
argmin(x, axis=k) i8 input, i32 idx output.
baracuda_kernels_arg_reduce_argmin_i8_i64_can_implement
Pre-launch implementability check for arg_reduce_argmin_i8_i64.
baracuda_kernels_arg_reduce_argmin_i8_i64_run
argmin(x, axis=k) i8 input, i64 idx output.
baracuda_kernels_arg_reduce_argmin_i16_i32_can_implement
Pre-launch implementability check for arg_reduce_argmin_i16_i32.
baracuda_kernels_arg_reduce_argmin_i16_i32_run
argmin(x, axis=k) i16 input, i32 idx output.
baracuda_kernels_arg_reduce_argmin_i16_i64_can_implement
Pre-launch implementability check for arg_reduce_argmin_i16_i64.
baracuda_kernels_arg_reduce_argmin_i16_i64_run
argmin(x, axis=k) i16 input, i64 idx output.
baracuda_kernels_arg_reduce_argmin_i32_i32_can_implement
Pre-launch implementability check for arg_reduce_argmin_i32_i32.
baracuda_kernels_arg_reduce_argmin_i32_i32_run
argmin(x, axis=k) i32 input, i32 idx output.
baracuda_kernels_arg_reduce_argmin_i32_i64_can_implement
Pre-launch implementability check for arg_reduce_argmin_i32_i64.
baracuda_kernels_arg_reduce_argmin_i32_i64_run
argmin(x, axis=k) i32 input, i64 idx output.
baracuda_kernels_arg_reduce_argmin_i64_i32_can_implement
Pre-launch implementability check for arg_reduce_argmin_i64_i32.
baracuda_kernels_arg_reduce_argmin_i64_i32_run
argmin(x, axis=k) i64 input, i32 idx output.
baracuda_kernels_arg_reduce_argmin_i64_i64_can_implement
Pre-launch implementability check for arg_reduce_argmin_i64_i64.
baracuda_kernels_arg_reduce_argmin_i64_i64_run
argmin(x, axis=k) i64 input, i64 idx output.
baracuda_kernels_arg_reduce_argmin_u8_i32_can_implement
Pre-launch implementability check for arg_reduce_argmin_u8_i32.
baracuda_kernels_arg_reduce_argmin_u8_i32_run
argmin(x, axis=k) u8 input, i32 idx output.
baracuda_kernels_arg_reduce_argmin_u8_i64_can_implement
Pre-launch implementability check for arg_reduce_argmin_u8_i64.
baracuda_kernels_arg_reduce_argmin_u8_i64_run
argmin(x, axis=k) u8 input, i64 idx output.
baracuda_kernels_arg_reduce_argmin_u32_i32_can_implement
Pre-launch implementability check for arg_reduce_argmin_u32_i32.
baracuda_kernels_arg_reduce_argmin_u32_i32_run
argmin(x, axis=k) u32 input, i32 idx output.
baracuda_kernels_arg_reduce_argmin_u32_i64_can_implement
Pre-launch implementability check for arg_reduce_argmin_u32_i64.
baracuda_kernels_arg_reduce_argmin_u32_i64_run
argmin(x, axis=k) u32 input, i64 idx output.
baracuda_kernels_argsort_bf16_can_implement
Implementability check for argsort_bf16.
baracuda_kernels_argsort_bf16_run
Block-bitonic argsort, bf16. Comparator uses native __nv_bfloat16 operator< (CUDA device-side intrinsics).
baracuda_kernels_argsort_f16_can_implement
Implementability check for argsort_f16.
baracuda_kernels_argsort_f16_run
Block-bitonic argsort, f16. Comparator uses native __half operator<.
baracuda_kernels_argsort_f32_big_can_implement
Implementability check for argsort_f32_big.
baracuda_kernels_argsort_f32_big_run
Multi-block radix argsort, f32, for row_len > 1024.
baracuda_kernels_argsort_f32_big_workspace_size
Workspace size query for argsort_f32_big.
baracuda_kernels_argsort_f32_can_implement
Implementability check for argsort_f32.
baracuda_kernels_argsort_f32_run
Block-bitonic argsort, f32. Returns indices; values not written.
baracuda_kernels_argsort_f64_big_can_implement
Implementability check for argsort_f64_big.
baracuda_kernels_argsort_f64_big_run
Multi-block radix argsort, f64.
baracuda_kernels_argsort_f64_big_workspace_size
Workspace size query for argsort_f64_big.
baracuda_kernels_argsort_f64_can_implement
Implementability check for argsort_f64.
baracuda_kernels_argsort_f64_run
Block-bitonic argsort, f64.
baracuda_kernels_argsort_fp8e4m3_can_implement
Implementability check for argsort_fp8e4m3.
baracuda_kernels_argsort_fp8e4m3_run
Block-bitonic argsort, FP8 E4M3. Storage is byte-identical to raw u8; the kernel wraps it in an Fp8E4M3Sort struct that decodes to float in the comparator. Raw-byte buffer in, i32 index buffer out.
baracuda_kernels_argsort_i8_can_implement
Implementability check for argsort_i8.
baracuda_kernels_argsort_i8_run
Block-bitonic argsort, i8.
baracuda_kernels_argsort_i16_can_implement
Implementability check for argsort_i16.
baracuda_kernels_argsort_i16_run
Block-bitonic argsort, i16.
baracuda_kernels_argsort_i32_big_can_implement
Implementability check for argsort_i32_big.
baracuda_kernels_argsort_i32_big_run
Multi-block radix argsort, i32.
baracuda_kernels_argsort_i32_big_workspace_size
Workspace size query for argsort_i32_big.
baracuda_kernels_argsort_i32_can_implement
Implementability check for argsort_i32.
baracuda_kernels_argsort_i32_run
Block-bitonic argsort, i32.
baracuda_kernels_argsort_i64_big_can_implement
Pre-launch implementability check for argsort_i64_big.
baracuda_kernels_argsort_i64_big_run
Multi-block radix argsort, i64.
baracuda_kernels_argsort_i64_big_workspace_size
Workspace-size query for argsort_i64_big.
baracuda_kernels_argsort_i64_can_implement
Pre-launch implementability check for argsort_i64.
baracuda_kernels_argsort_i64_run
Block-bitonic argsort, i64.
baracuda_kernels_argsort_u8_can_implement
Pre-launch implementability check for argsort_u8.
baracuda_kernels_argsort_u8_run
Block-bitonic argsort, u8.
baracuda_kernels_argsort_u32_can_implement
Pre-launch implementability check for argsort_u32.
baracuda_kernels_argsort_u32_run
Block-bitonic argsort, u32.
baracuda_kernels_batch_norm_backward_bf16_can_implement
Pre-launch implementability check for batch_norm_backward_bf16.
baracuda_kernels_batch_norm_backward_bf16_run
BatchNorm BW, bf16.
baracuda_kernels_batch_norm_backward_f16_can_implement
Pre-launch implementability check for batch_norm_backward_f16.
baracuda_kernels_batch_norm_backward_f16_run
BatchNorm BW, f16.
baracuda_kernels_batch_norm_backward_f32_can_implement
Pre-launch implementability check for batch_norm_backward_f32.
baracuda_kernels_batch_norm_backward_f32_run
BatchNorm BW, f32. Three-stage: per-group sum_dxh / sum_dxhxh, per-cell dx, per-channel dgamma / dbeta. Requires workspace of 2 * group_count * sizeof(float) bytes for the stage-1 partial sums (group_count = c_extent for BN).
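The workspace requirement above is plain arithmetic, sketched here for callers sizing the buffer by hand. The helper name bn_backward_f32_workspace_bytes is hypothetical and not part of this crate; it only restates 2 * group_count * sizeof(float) with group_count = c_extent, using checked multiplication so an oversized channel count yields None instead of wrapping.

```rust
/// Bytes of stage-1 scratch for BatchNorm backward, f32:
/// two f32 partial sums (sum_dxh and sum_dxhxh) per group.
/// Hypothetical helper; returns None if the product overflows usize.
fn bn_backward_f32_workspace_bytes(c_extent: usize) -> Option<usize> {
    // For BatchNorm, the docs state group_count = c_extent.
    let group_count = c_extent;
    2usize
        .checked_mul(group_count)?
        .checked_mul(std::mem::size_of::<f32>())
}
```

For 64 channels this gives 2 * 64 * 4 = 512 bytes.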
baracuda_kernels_batch_norm_backward_f64_can_implement
Pre-launch implementability check for batch_norm_backward_f64.
baracuda_kernels_batch_norm_backward_f64_run
BatchNorm BW, f64.
baracuda_kernels_batch_norm_bf16_can_implement
Pre-launch implementability check for batch_norm_bf16.
baracuda_kernels_batch_norm_bf16_run
BatchNorm FW, bf16.
baracuda_kernels_batch_norm_f16_can_implement
Pre-launch implementability check for batch_norm_f16.
baracuda_kernels_batch_norm_f16_run
BatchNorm FW, f16.
baracuda_kernels_batch_norm_f32_can_implement
Pre-launch implementability check for batch_norm_f32.
baracuda_kernels_batch_norm_f32_run
BatchNorm FW, f32. Training mode: computes per-channel (mean, inv_std) from the batch + spatial cells, writes them to saved_mean / saved_rstd for BW. gamma / beta optional (both supplied together per PyTorch convention).
baracuda_kernels_batch_norm_f64_can_implement
Pre-launch implementability check for batch_norm_f64.
baracuda_kernels_batch_norm_f64_run
BatchNorm FW, f64.
baracuda_kernels_batched_ormqr_complex32_can_implement
Pre-launch implementability check for batched_ormqr_complex32.
baracuda_kernels_batched_ormqr_complex32_run
Batched-unmqr, Complex32. Same shape/contract as the f32 variant but with cuFloatComplex storage. op = 2 (C — conjugate transpose) is supported; op = 1 (T — plain transpose) is rejected by the Rust safe layer for complex (mathematically unusual for Householder).
baracuda_kernels_batched_ormqr_complex64_can_implement
Pre-launch implementability check for batched_ormqr_complex64.
baracuda_kernels_batched_ormqr_complex64_run
Batched-unmqr, Complex64. Same as the complex32 variant with cuDoubleComplex storage.
baracuda_kernels_batched_ormqr_f32_can_implement
Pre-launch implementability check for batched_ormqr_f32.
baracuda_kernels_batched_ormqr_f32_run
Batched-ormqr, f32. Applies the implicit Q (or Q^T) from a BatchedQrPlan packed output (A_packed [B, M, K] column-major
baracuda_kernels_batched_ormqr_f64_can_implement
Pre-launch implementability check for batched_ormqr_f64.
baracuda_kernels_batched_ormqr_f64_run
Batched-ormqr, f64. Same contract as the f32 variant.
baracuda_kernels_batched_ormqr_wy_build_t_complex32_can_implement
Pre-launch implementability check for batched_ormqr_wy_build_t_complex32.
baracuda_kernels_batched_ormqr_wy_build_t_complex32_run
WY block T-build, Complex32. f32-complex analogue of the f32 variant. Storage is cuFloatComplex (== Complex32, ABI-compatible).
baracuda_kernels_batched_ormqr_wy_build_t_complex64_can_implement
Pre-launch implementability check for batched_ormqr_wy_build_t_complex64.
baracuda_kernels_batched_ormqr_wy_build_t_complex64_run
WY block T-build, Complex64. f64-complex analogue.
baracuda_kernels_batched_ormqr_wy_build_t_f32_can_implement
Pre-launch implementability check for batched_ormqr_wy_build_t_f32.
baracuda_kernels_batched_ormqr_wy_build_t_f32_run
WY block T-build, f32. For each (batch_slot, block_index), builds the [nb, nb] upper-triangular block-reflector matrix T such that H_0 · ... · H_{nb-1} = I - V·T·V^T. One CUDA block per (batch, num_blocks) cell. Status codes: 0 success, 2 invalid problem, 5 launch failure.
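The identity H_0 · ... · H_{nb-1} = I - V·T·V^T can be checked on the CPU with a small reference. This sketch assumes each reflector has the usual form H_j = I - tau_j · v_j · v_j^T and builds T column by column with the standard recurrence T[0..j, j] = -tau_j · T[0..j, 0..j] · V[:, 0..j]^T · v_j, T[j, j] = tau_j. The tau input, the f64 precision, and all function names are illustrative assumptions; the kernel's actual signature and algorithm are not documented on this page.

```rust
/// Column-major m x m identity.
fn identity(m: usize) -> Vec<f64> {
    let mut a = vec![0.0; m * m];
    for i in 0..m {
        a[i * m + i] = 1.0;
    }
    a
}

/// Build the nb x nb upper-triangular T (column-major) from V (m x nb,
/// column-major) and tau, one column at a time.
fn build_t_ref(v: &[f64], tau: &[f64], m: usize, nb: usize) -> Vec<f64> {
    assert_eq!(v.len(), m * nb);
    assert_eq!(tau.len(), nb);
    let mut t = vec![0.0f64; nb * nb];
    for j in 0..nb {
        // w = V[:, 0..j]^T * v_j
        let mut w = vec![0.0f64; j];
        for p in 0..j {
            for i in 0..m {
                w[p] += v[p * m + i] * v[j * m + i];
            }
        }
        // T[0..j, j] = -tau_j * T[0..j, 0..j] * w (T is upper triangular)
        for r in 0..j {
            let mut acc = 0.0;
            for p in r..j {
                acc += t[p * nb + r] * w[p];
            }
            t[j * nb + r] = -tau[j] * acc;
        }
        t[j * nb + j] = tau[j];
    }
    t
}

/// Dense product H_0 * ... * H_{nb-1}, applied reflector by reflector.
fn reflector_product(v: &[f64], tau: &[f64], m: usize, nb: usize) -> Vec<f64> {
    let mut q = identity(m);
    for j in 0..nb {
        let vj = &v[j * m..(j + 1) * m];
        // q <- q * (I - tau_j v_j v_j^T) = q - tau_j * (q v_j) * v_j^T
        let mut qv = vec![0.0f64; m];
        for c in 0..m {
            for r in 0..m {
                qv[r] += q[c * m + r] * vj[c];
            }
        }
        for c in 0..m {
            for r in 0..m {
                q[c * m + r] -= tau[j] * qv[r] * vj[c];
            }
        }
    }
    q
}

/// Dense I - V * T * V^T.
fn wy_product(v: &[f64], t: &[f64], m: usize, nb: usize) -> Vec<f64> {
    let mut out = identity(m);
    for c in 0..m {
        for r in 0..m {
            let mut acc = 0.0;
            for p in 0..nb {
                for q in 0..nb {
                    acc += v[p * m + r] * t[q * nb + p] * v[q * m + c];
                }
            }
            out[c * m + r] -= acc;
        }
    }
    out
}

/// Largest elementwise absolute difference between two equal-length slices.
fn max_abs_diff(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f64::max)
}
```

Comparing reflector_product against wy_product with max_abs_diff on a small panel is a quick way to sanity-check a T buffer copied back from the device.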
baracuda_kernels_batched_ormqr_wy_build_t_f64_can_implement
Pre-launch implementability check for batched_ormqr_wy_build_t_f64.
baracuda_kernels_batched_ormqr_wy_build_t_f64_run
WY block T-build, f64 analogue.
baracuda_kernels_batched_ormqr_wy_extract_v_complex32_can_implement
Pre-launch implementability check for batched_ormqr_wy_extract_v_complex32.
baracuda_kernels_batched_ormqr_wy_extract_v_complex32_run
WY V-extraction, Complex32. f32-complex analogue. Pure copy kernel — sets the implicit-1 (as (1, 0)), zeroes above the diagonal (as (0, 0)), copies the strict lower below.
baracuda_kernels_batched_ormqr_wy_extract_v_complex64_can_implement
Pre-launch implementability check for batched_ormqr_wy_extract_v_complex64.
baracuda_kernels_batched_ormqr_wy_extract_v_complex64_run
WY V-extraction, Complex64. f64-complex analogue.
baracuda_kernels_batched_ormqr_wy_extract_v_f32_can_implement
Pre-launch implementability check for batched_ormqr_wy_extract_v_f32.
baracuda_kernels_batched_ormqr_wy_extract_v_f32_run
WY V-extraction, f32. Materializes the dense V [B, M, nb] panel for one block of reflectors (starting at column block_start, with block_k = min(nb, K - block_start) active columns) into a contiguous workspace buffer. Sets the implicit-1 at each reflector’s diagonal, copies the packed-A strict lower below, zeros above the diagonal, and zeros entire columns past block_k (handles the partial-last-block case).
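A CPU reference for the V-extraction rules might look like the sketch below. It assumes the packed input is [B, M, K] with the batch index outermost and each slot stored column-major with leading dimension M, and that V uses the same convention with nb columns; the function name extract_v_ref and its parameter order are hypothetical.

```rust
/// Reference V-extraction for one block of reflectors.
/// a_packed: [batch, m, k], column-major per batch slot (assumed layout).
/// Returns V: [batch, m, nb], column-major per batch slot.
fn extract_v_ref(
    a_packed: &[f32],
    batch: usize,
    m: usize,
    k: usize,
    nb: usize,
    block_start: usize,
) -> Vec<f32> {
    assert_eq!(a_packed.len(), batch * m * k);
    assert!(k <= m && block_start < k);
    // Number of real reflectors in this block; the last block may be partial.
    let block_k = nb.min(k - block_start);
    // Start from zeros: covers entries above the diagonal and columns past block_k.
    let mut v = vec![0.0f32; batch * m * nb];
    for b in 0..batch {
        for c in 0..block_k {
            let g = block_start + c; // global reflector (column) index
            for i in g..m {
                v[b * m * nb + c * m + i] = if i == g {
                    1.0 // implicit unit diagonal
                } else {
                    a_packed[b * m * k + g * m + i] // strict lower part
                };
            }
        }
    }
    v
}
```

With m = 3, k = 2, nb = 2 and block_start = 1, only one reflector remains, so the second column of V stays entirely zero.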
baracuda_kernels_batched_ormqr_wy_extract_v_f64_can_implement
Pre-launch implementability check for batched_ormqr_wy_extract_v_f64.
baracuda_kernels_batched_ormqr_wy_extract_v_f64_run
WY V-extraction, f64 analogue.
baracuda_kernels_batched_qr_materialize_identity_f32_can_implement
Pre-launch implementability check for batched_qr_materialize_identity_f32.
baracuda_kernels_batched_qr_materialize_identity_f32_run
Stage a column-major identity Q [B, M, M] (one identity per batch slot) into a freshly allocated buffer. Caller then chains baracuda_kernels_batched_ormqr_*_run with op = 0 (N) to overwrite Q in place with the dense Q matrix from the geqrf-packed input. f32.
baracuda_kernels_batched_qr_materialize_identity_f64_can_implement
Pre-launch implementability check for batched_qr_materialize_identity_f64.
baracuda_kernels_batched_qr_materialize_identity_f64_run
Stage identity, f64 analogue.
baracuda_kernels_batched_qr_materialize_r_f32_can_implement
Pre-launch implementability check for batched_qr_materialize_r_f32.
baracuda_kernels_batched_qr_materialize_r_f32_run
Materialize dense R [B, K, N] from a geqrf-packed A [B, M, N] (column-major). K = min(M, N). Cell R[b, i, j] = A[b, i, j] if i ≤ j, else 0. One CUDA block per (batch_slot, column). f32.
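The cell rule R[b, i, j] = A[b, i, j] if i ≤ j, else 0, can be written as a short CPU reference. This sketch assumes the batch index is outermost and each slot is column-major (leading dimension M for A, K for R); the name materialize_r_ref is hypothetical.

```rust
/// Reference for dense-R materialization from a geqrf-packed A.
/// a: [batch, m, n], column-major per batch slot (assumed layout).
/// Returns R: [batch, k, n] with k = min(m, n), column-major per batch slot.
fn materialize_r_ref(a: &[f32], batch: usize, m: usize, n: usize) -> Vec<f32> {
    assert_eq!(a.len(), batch * m * n);
    let k = m.min(n);
    // Zero-initialized, so cells below the diagonal need no explicit write.
    let mut r = vec![0.0f32; batch * k * n];
    for b in 0..batch {
        for j in 0..n {
            for i in 0..k {
                if i <= j {
                    r[b * k * n + j * k + i] = a[b * m * n + j * m + i];
                }
            }
        }
    }
    r
}
```

For a single 3 x 2 slot with columns [1, 2, 3] and [4, 5, 6], the result is the 2 x 2 matrix with columns [1, 0] and [4, 5].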
baracuda_kernels_batched_qr_materialize_r_f64_can_implement
Pre-launch implementability check for batched_qr_materialize_r_f64.
baracuda_kernels_batched_qr_materialize_r_f64_run
Materialize dense R, f64 analogue.
baracuda_kernels_bernoulli_can_implement
Pre-launch implementability check for bernoulli.
baracuda_kernels_bernoulli_run
Bernoulli sampling over a float uniform-random buffer.
baracuda_kernels_binary_add_backward_bf16_can_implement
Pre-launch implementability check for binary_add_backward_bf16.
baracuda_kernels_binary_add_backward_bf16_run
Add backward, bf16.
baracuda_kernels_binary_add_backward_f16_can_implement
Pre-launch implementability check for binary_add_backward_f16.
baracuda_kernels_binary_add_backward_f16_run
Add backward, f16.
baracuda_kernels_binary_add_backward_f32_can_implement
Pre-launch implementability check for binary_add_backward_f32.
baracuda_kernels_binary_add_backward_f32_run
Add backward, f32. Writes da = dy and db = dy.
baracuda_kernels_binary_add_backward_f64_can_implement
Pre-launch implementability check for binary_add_backward_f64.
baracuda_kernels_binary_add_backward_f64_run
Add backward, f64.
baracuda_kernels_binary_add_bf16_can_implement
Pre-launch implementability check for binary_add_bf16.
baracuda_kernels_binary_add_bf16_run
Binary elementwise add, bf16 dtype, contiguous fast path.
baracuda_kernels_binary_add_bf16_strided_can_implement
Pre-launch implementability check for binary_add_bf16_strided.
baracuda_kernels_binary_add_bf16_strided_run
Binary elementwise add, bf16 dtype, strided / broadcast path.
baracuda_kernels_binary_add_f16_can_implement
Pre-launch implementability check for binary_add_f16.
baracuda_kernels_binary_add_f16_run
Binary elementwise add, f16 dtype, contiguous fast path.
baracuda_kernels_binary_add_f16_strided_can_implement
Pre-launch implementability check for binary_add_f16_strided.
baracuda_kernels_binary_add_f16_strided_run
Binary elementwise add, f16 dtype, strided / broadcast path.
baracuda_kernels_binary_add_f32_can_implement
Pre-launch implementability check for binary_add_f32. Validates the problem size without launching a kernel. Returns the standard status code mapping.
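The standard status code mapping is the crate-level table (0 success through 5 internal kernel error). A caller working directly against this sys crate might wrap the raw i32 in a small enum, as sketched below. KernelStatus and its methods are hypothetical names for illustration; they are not part of this crate or of baracuda-kernels.

```rust
/// Hypothetical typed view of the i32 status codes from the crate docs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum KernelStatus {
    Success,           // 0
    MisalignedOperand, // 1
    InvalidProblem,    // 2
    NotSupported,      // 3
    WorkspaceTooSmall, // 4: too small, or null when required
    InternalError,     // 5: typically a launch failure
    /// Any code outside 0..=5, kept so nothing is silently dropped.
    Unknown(i32),
}

impl KernelStatus {
    /// Map a raw return value from a *_run or *_can_implement call.
    fn from_raw(code: i32) -> Self {
        match code {
            0 => Self::Success,
            1 => Self::MisalignedOperand,
            2 => Self::InvalidProblem,
            3 => Self::NotSupported,
            4 => Self::WorkspaceTooSmall,
            5 => Self::InternalError,
            other => Self::Unknown(other),
        }
    }

    /// Convert to a Result so callers can use the ? operator.
    fn into_result(self) -> Result<(), KernelStatus> {
        if self == Self::Success {
            Ok(())
        } else {
            Err(self)
        }
    }
}
```

A typical pattern would be to call the can_implement function first, convert its return value with from_raw(...).into_result()?, and only then issue the matching run call.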
baracuda_kernels_binary_add_f32_run
Binary elementwise add, f32 dtype, contiguous fast path. This is the binary-pointwise trailblazer — its safety contract carries over to every other binary contig launcher (add, sub, mul, div, min, max, pow, comparison ops, etc.) across all dtypes.
baracuda_kernels_binary_add_f32_strided_can_implement
Pre-launch implementability check for binary_add_f32_strided.
baracuda_kernels_binary_add_f32_strided_run
Binary elementwise add, f32 dtype, strided / broadcast path. This is the binary-strided trailblazer — its safety contract (including aliasing) carries over to every other binary strided launcher across all dtypes.
baracuda_kernels_binary_add_f64_can_implement
Pre-launch implementability check for binary_add_f64.
baracuda_kernels_binary_add_f64_run
Binary elementwise add, f64 dtype, contiguous fast path.
baracuda_kernels_binary_add_f64_strided_can_implement
Pre-launch implementability check for binary_add_f64_strided.
baracuda_kernels_binary_add_f64_strided_run
Binary elementwise add, f64 dtype, strided / broadcast path.
baracuda_kernels_binary_atan2_backward_bf16_can_implement
Pre-launch implementability check for binary_atan2_backward_bf16.
baracuda_kernels_binary_atan2_backward_bf16_run
Atan2 backward, bf16.
baracuda_kernels_binary_atan2_backward_f16_can_implement
Pre-launch implementability check for binary_atan2_backward_f16.
baracuda_kernels_binary_atan2_backward_f16_run
Atan2 backward, f16.
baracuda_kernels_binary_atan2_backward_f32_can_implement
Pre-launch implementability check for binary_atan2_backward_f32.
baracuda_kernels_binary_atan2_backward_f32_run
Atan2 backward, f32. denom = a²+b², da = dy*b/denom, db = -dy*a/denom. Caller responsible for guarding against a == 0 && b == 0 (denom == 0).
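The gradient formulas above translate directly into a host-side reference, useful for checking device output. The function name atan2_backward_ref and the slice-based signature are hypothetical; the sketch evaluates the formulas in f32 in the order written, which may differ from the kernel's rounding.

```rust
/// Reference gradients for y = atan2(a, b):
/// denom = a^2 + b^2, da = dy * b / denom, db = -dy * a / denom.
/// Returns (da, db). No guard for denom == 0, mirroring the stated contract.
fn atan2_backward_ref(a: &[f32], b: &[f32], dy: &[f32]) -> (Vec<f32>, Vec<f32>) {
    assert!(a.len() == b.len() && b.len() == dy.len());
    let mut da = Vec::with_capacity(a.len());
    let mut db = Vec::with_capacity(a.len());
    for i in 0..a.len() {
        let denom = a[i] * a[i] + b[i] * b[i];
        da.push(dy[i] * b[i] / denom);
        db.push(-dy[i] * a[i] / denom);
    }
    (da, db)
}
```

At a == 0 && b == 0 both outputs evaluate to 0/0 and come out as NaN in this reference, which is the case the caller is asked to guard against.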
baracuda_kernels_binary_atan2_backward_f64_can_implement
Pre-launch implementability check for binary_atan2_backward_f64.
baracuda_kernels_binary_atan2_backward_f64_run
Atan2 backward, f64.
baracuda_kernels_binary_atan2_bf16_can_implement
Binary atan2, bf16, can-implement.
baracuda_kernels_binary_atan2_bf16_run
Binary atan2, bf16, contig.
baracuda_kernels_binary_atan2_bf16_strided_can_implement
Pre-launch implementability check for binary_atan2_bf16_strided.
baracuda_kernels_binary_atan2_bf16_strided_run
Binary atan2, bf16, strided.
baracuda_kernels_binary_atan2_f16_can_implement
Binary atan2, f16, can-implement.
baracuda_kernels_binary_atan2_f16_run
Binary atan2, f16, contig.
baracuda_kernels_binary_atan2_f16_strided_can_implement
Pre-launch implementability check for binary_atan2_f16_strided.
baracuda_kernels_binary_atan2_f16_strided_run
Binary atan2, f16, strided.
baracuda_kernels_binary_atan2_f32_can_implement
Binary atan2, f32, can-implement.
baracuda_kernels_binary_atan2_f32_run
Binary atan2, f32, contig.
baracuda_kernels_binary_atan2_f32_strided_can_implement
Pre-launch implementability check for binary_atan2_f32_strided.
baracuda_kernels_binary_atan2_f32_strided_run
Binary atan2, f32, strided.
baracuda_kernels_binary_atan2_f64_can_implement
Binary atan2, f64, can-implement.
baracuda_kernels_binary_atan2_f64_run
Binary atan2, f64, contig.
baracuda_kernels_binary_atan2_f64_strided_can_implement
Pre-launch implementability check for binary_atan2_f64_strided.
baracuda_kernels_binary_atan2_f64_strided_run
Binary atan2, f64, strided.
baracuda_kernels_binary_bitwise_and_i32_can_implement
Binary bitwise and, i32 dtype, can-implement.
baracuda_kernels_binary_bitwise_and_i32_run
Binary bitwise and, i32 dtype, contig.
baracuda_kernels_binary_bitwise_and_i64_can_implement
Binary bitwise and, i64 dtype, can-implement.
baracuda_kernels_binary_bitwise_and_i64_run
Binary bitwise and, i64 dtype, contig.
baracuda_kernels_binary_bitwise_left_shift_i32_can_implement
Binary bitwise left_shift, i32 dtype, can-implement.
baracuda_kernels_binary_bitwise_left_shift_i32_run
Binary bitwise left_shift, i32 dtype, contig.
baracuda_kernels_binary_bitwise_left_shift_i64_can_implement
Binary bitwise left_shift, i64 dtype, can-implement.
baracuda_kernels_binary_bitwise_left_shift_i64_run
Binary bitwise left_shift, i64 dtype, contig.
baracuda_kernels_binary_bitwise_or_i32_can_implement
Binary bitwise or, i32 dtype, can-implement.
baracuda_kernels_binary_bitwise_or_i32_run
Binary bitwise or, i32 dtype, contig.
baracuda_kernels_binary_bitwise_or_i64_can_implement
Binary bitwise or, i64 dtype, can-implement.
baracuda_kernels_binary_bitwise_or_i64_run
Binary bitwise or, i64 dtype, contig.
baracuda_kernels_binary_bitwise_right_shift_i32_can_implement
Binary bitwise right_shift, i32 dtype, can-implement.
baracuda_kernels_binary_bitwise_right_shift_i32_run
Binary bitwise right_shift, i32 dtype, contig. Arithmetic shift (sign-extending), matching PyTorch.
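The difference between an arithmetic (sign-extending) and a logical right shift only shows up for negative inputs. The sketch below contrasts the two on the host; the function names are hypothetical, and shift amounts are assumed to lie in 0..32, since this page does not say how the kernel treats out-of-range shifts.

```rust
/// Arithmetic right shift: Rust's >> on a signed integer sign-extends.
/// Assumes s < 32.
fn shr_arithmetic(a: i32, s: u32) -> i32 {
    a >> s
}

/// Logical right shift for comparison: reinterpret as u32 so zeros shift in.
/// Assumes s < 32.
fn shr_logical(a: i32, s: u32) -> i32 {
    ((a as u32) >> s) as i32
}
```

For a = -8 and s = 1, the arithmetic shift gives -4, while the logical shift gives 2147483644 (0x7FFFFFFC). For non-negative inputs the two agree.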
baracuda_kernels_binary_bitwise_right_shift_i64_can_implement
Binary bitwise right_shift, i64 dtype, can-implement.
baracuda_kernels_binary_bitwise_right_shift_i64_run
Binary bitwise right_shift, i64 dtype, contig. Arithmetic shift (sign-extending), matching PyTorch.
baracuda_kernels_binary_bitwise_xor_i32_can_implement
Binary bitwise xor, i32 dtype, can-implement.
baracuda_kernels_binary_bitwise_xor_i32_run
Binary bitwise xor, i32 dtype, contig.
baracuda_kernels_binary_bitwise_xor_i64_can_implement
Binary bitwise xor, i64 dtype, can-implement.
baracuda_kernels_binary_bitwise_xor_i64_run
Binary bitwise xor, i64 dtype, contig.
baracuda_kernels_binary_cmp_eq_bf16_can_implement
Pre-launch implementability check for binary_cmp_eq_bf16.
baracuda_kernels_binary_cmp_eq_bf16_run
Binary elementwise eq, bf16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_eq_bf16_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_eq_bf16_strided_run
Binary elementwise eq, bf16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_eq_f16_can_implement
Pre-launch implementability check for binary_cmp_eq_f16.
baracuda_kernels_binary_cmp_eq_f16_run
Binary elementwise eq, f16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_eq_f16_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_eq_f16_strided_run
Binary elementwise eq, f16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_eq_f32_can_implement
Pre-launch implementability check for binary_cmp_eq_f32.
baracuda_kernels_binary_cmp_eq_f32_run
Binary elementwise eq, f32 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_eq_f32_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_eq_f32_strided_run
Binary elementwise eq, f32 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_eq_f64_can_implement
Pre-launch implementability check for binary_cmp_eq_f64.
baracuda_kernels_binary_cmp_eq_f64_run
Binary elementwise eq, f64 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_eq_f64_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_eq_f64_strided_run
Binary elementwise eq, f64 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_ge_bf16_can_implement
Pre-launch implementability check for binary_cmp_ge_bf16.
baracuda_kernels_binary_cmp_ge_bf16_run
Binary elementwise ge, bf16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_ge_bf16_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_ge_bf16_strided_run
Binary elementwise ge, bf16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_ge_f16_can_implement
Pre-launch implementability check for binary_cmp_ge_f16.
baracuda_kernels_binary_cmp_ge_f16_run
Binary elementwise ge, f16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_ge_f16_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_ge_f16_strided_run
Binary elementwise ge, f16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_ge_f32_can_implement
Pre-launch implementability check for binary_cmp_ge_f32.
baracuda_kernels_binary_cmp_ge_f32_run
Binary elementwise ge (a >= b), f32 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_ge_f32_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_ge_f32_strided_run
Binary elementwise ge, f32 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_ge_f64_can_implement
Pre-launch implementability check for binary_cmp_ge_f64.
baracuda_kernels_binary_cmp_ge_f64_run
Binary elementwise ge, f64 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_ge_f64_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_ge_f64_strided_run
Binary elementwise ge, f64 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_gt_bf16_can_implement
Pre-launch implementability check for binary_cmp_gt_bf16.
baracuda_kernels_binary_cmp_gt_bf16_run
Binary elementwise gt, bf16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_gt_bf16_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_gt_bf16_strided_run
Binary elementwise gt, bf16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_gt_f16_can_implement
Pre-launch implementability check for binary_cmp_gt_f16.
baracuda_kernels_binary_cmp_gt_f16_run
Binary elementwise gt, f16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_gt_f16_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_gt_f16_strided_run
Binary elementwise gt, f16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_gt_f32_can_implement
Pre-launch implementability check for binary_cmp_gt_f32.
baracuda_kernels_binary_cmp_gt_f32_run
Binary elementwise gt (a > b), f32 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_gt_f32_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_gt_f32_strided_run
Binary elementwise gt, f32 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_gt_f64_can_implement
Pre-launch implementability check for binary_cmp_gt_f64.
baracuda_kernels_binary_cmp_gt_f64_run
Binary elementwise gt, f64 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_gt_f64_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_gt_f64_strided_run
Binary elementwise gt, f64 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_le_bf16_can_implement
Pre-launch implementability check for binary_cmp_le_bf16.
baracuda_kernels_binary_cmp_le_bf16_run
Binary elementwise le, bf16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_le_bf16_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_le_bf16_strided_run
Binary elementwise le, bf16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_le_f16_can_implement
Pre-launch implementability check for binary_cmp_le_f16.
baracuda_kernels_binary_cmp_le_f16_run
Binary elementwise le, f16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_le_f16_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_le_f16_strided_run
Binary elementwise le, f16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_le_f32_can_implement
Pre-launch implementability check for binary_cmp_le_f32.
baracuda_kernels_binary_cmp_le_f32_run
Binary elementwise le (a <= b), f32 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_le_f32_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_le_f32_strided_run
Binary elementwise le, f32 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_le_f64_can_implement
Pre-launch implementability check for binary_cmp_le_f64.
baracuda_kernels_binary_cmp_le_f64_run
Binary elementwise le, f64 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_le_f64_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_le_f64_strided_run
Binary elementwise le, f64 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_lt_bf16_can_implement
Pre-launch implementability check for binary_cmp_lt_bf16.
baracuda_kernels_binary_cmp_lt_bf16_run
Binary elementwise lt, bf16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_lt_bf16_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_lt_bf16_strided_run
Binary elementwise lt, bf16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_lt_f16_can_implement
Pre-launch implementability check for binary_cmp_lt_f16.
baracuda_kernels_binary_cmp_lt_f16_run
Binary elementwise lt, f16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_lt_f16_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_lt_f16_strided_run
Binary elementwise lt, f16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_lt_f32_can_implement
Pre-launch implementability check for binary_cmp_lt_f32.
baracuda_kernels_binary_cmp_lt_f32_run
Binary elementwise lt (a < b), f32 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_lt_f32_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_lt_f32_strided_run
Binary elementwise lt, f32 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_lt_f64_can_implement
Pre-launch implementability check for binary_cmp_lt_f64.
baracuda_kernels_binary_cmp_lt_f64_run
Binary elementwise lt, f64 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_lt_f64_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_lt_f64_strided_run
Binary elementwise lt, f64 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_ne_bf16_can_implement
Pre-launch implementability check for binary_cmp_ne_bf16.
baracuda_kernels_binary_cmp_ne_bf16_run
Binary elementwise ne, bf16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_ne_bf16_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_ne_bf16_strided_run
Binary elementwise ne, bf16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_ne_f16_can_implement
Pre-launch implementability check for binary_cmp_ne_f16.
baracuda_kernels_binary_cmp_ne_f16_run
Binary elementwise ne, f16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_ne_f16_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_ne_f16_strided_run
Binary elementwise ne, f16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_ne_f32_can_implement
Pre-launch implementability check for binary_cmp_ne_f32.
baracuda_kernels_binary_cmp_ne_f32_run
Binary elementwise ne, f32 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_ne_f32_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_ne_f32_strided_run
Binary elementwise ne, f32 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_ne_f64_can_implement
Pre-launch implementability check for binary_cmp_ne_f64.
baracuda_kernels_binary_cmp_ne_f64_run
Binary elementwise ne, f64 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_ne_f64_strided_can_implement
Pre-launch check companion.
baracuda_kernels_binary_cmp_ne_f64_strided_run
Binary elementwise ne, f64 inputs, u8 output, strided path.
baracuda_kernels_binary_copysign_bf16_can_implement
Binary copysign, bf16, can-implement.
baracuda_kernels_binary_copysign_bf16_run
Binary copysign, bf16, contig.
baracuda_kernels_binary_copysign_bf16_strided_can_implement
Pre-launch implementability check for binary_copysign_bf16_strided.
baracuda_kernels_binary_copysign_bf16_strided_run
Binary copysign, bf16, strided.
baracuda_kernels_binary_copysign_f16_can_implement
Binary copysign, f16, can-implement.
baracuda_kernels_binary_copysign_f16_run
Binary copysign, f16, contig.
baracuda_kernels_binary_copysign_f16_strided_can_implement
Pre-launch implementability check for binary_copysign_f16_strided.
baracuda_kernels_binary_copysign_f16_strided_run
Binary copysign, f16, strided.
baracuda_kernels_binary_copysign_f32_can_implement
Binary copysign, f32, can-implement.
baracuda_kernels_binary_copysign_f32_run
Binary copysign, f32, contig.
baracuda_kernels_binary_copysign_f32_strided_can_implement
Pre-launch implementability check for binary_copysign_f32_strided.
baracuda_kernels_binary_copysign_f32_strided_run
Binary copysign, f32, strided.
baracuda_kernels_binary_copysign_f64_can_implement
Binary copysign, f64, can-implement.
baracuda_kernels_binary_copysign_f64_run
Binary copysign, f64, contig.
baracuda_kernels_binary_copysign_f64_strided_can_implement
Pre-launch implementability check for binary_copysign_f64_strided.
baracuda_kernels_binary_copysign_f64_strided_run
Binary copysign, f64, strided.
baracuda_kernels_binary_div_backward_bf16_can_implement
Pre-launch implementability check for binary_div_backward_bf16.
baracuda_kernels_binary_div_backward_bf16_run
Div backward, bf16.
baracuda_kernels_binary_div_backward_f16_can_implement
Pre-launch implementability check for binary_div_backward_f16.
baracuda_kernels_binary_div_backward_f16_run
Div backward, f16.
baracuda_kernels_binary_div_backward_f32_can_implement
Pre-launch implementability check for binary_div_backward_f32.
baracuda_kernels_binary_div_backward_f32_run
Div backward, f32. Writes da = dy / b and db = -dy * a / b². Both saved tensors a and b must be non-null; callers must also ensure b[i] != 0 for every cell.
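A host-side reference for these two gradients is sketched below, mainly as a test oracle. The name div_backward_ref and the slice-based signature are hypothetical; the code evaluates da = dy / b and db = -dy * a / b² in f32 exactly as written, with no protection against b[i] == 0.

```rust
/// Reference gradients for y = a / b:
/// da = dy / b, db = -dy * a / (b * b). Returns (da, db).
/// The caller must ensure b[i] != 0, as the kernel's contract requires.
fn div_backward_ref(a: &[f32], b: &[f32], dy: &[f32]) -> (Vec<f32>, Vec<f32>) {
    assert!(a.len() == b.len() && b.len() == dy.len());
    let mut da = Vec::with_capacity(a.len());
    let mut db = Vec::with_capacity(a.len());
    for i in 0..a.len() {
        da.push(dy[i] / b[i]);
        db.push(-dy[i] * a[i] / (b[i] * b[i]));
    }
    (da, db)
}
```

With a = 6, b = 2, and dy = 1 this yields da = 0.5 and db = -1.5.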
baracuda_kernels_binary_div_backward_f64_can_implement
Pre-launch implementability check for binary_div_backward_f64.
baracuda_kernels_binary_div_backward_f64_run
Div backward, f64.
baracuda_kernels_binary_div_bf16_can_implement
Pre-launch implementability check for binary_div_bf16.
baracuda_kernels_binary_div_bf16_run
Binary elementwise div, bf16 dtype, contiguous fast path.
baracuda_kernels_binary_div_bf16_strided_can_implement
Pre-launch implementability check for binary_div_bf16_strided.
baracuda_kernels_binary_div_bf16_strided_run
Binary elementwise div, bf16 dtype, strided / broadcast path.
baracuda_kernels_binary_div_f16_can_implement
Pre-launch implementability check for binary_div_f16.
baracuda_kernels_binary_div_f16_run
Binary elementwise div, f16 dtype, contiguous fast path.
baracuda_kernels_binary_div_f16_strided_can_implement
Pre-launch implementability check for binary_div_f16_strided.
baracuda_kernels_binary_div_f16_strided_run
Binary elementwise div, f16 dtype, strided / broadcast path.
baracuda_kernels_binary_div_f32_can_implement
Pre-launch implementability check for binary_div_f32. Validates the problem size without launching a kernel. Returns the standard status code mapping.
baracuda_kernels_binary_div_f32_run
Binary elementwise div, f32 dtype, contiguous fast path.
baracuda_kernels_binary_div_f32_strided_can_implement
Pre-launch implementability check for binary_div_f32_strided.
baracuda_kernels_binary_div_f32_strided_run
Binary elementwise div, f32 dtype, strided / broadcast path.
baracuda_kernels_binary_div_f64_can_implement
Pre-launch implementability check for binary_div_f64.
baracuda_kernels_binary_div_f64_run
Binary elementwise div, f64 dtype, contiguous fast path.
baracuda_kernels_binary_div_f64_strided_can_implement
Pre-launch implementability check for binary_div_f64_strided.
baracuda_kernels_binary_div_f64_strided_run
Binary elementwise div, f64 dtype, strided / broadcast path.
baracuda_kernels_binary_floor_divide_bf16_can_implement
Binary floor_divide, bf16, can-implement.
baracuda_kernels_binary_floor_divide_bf16_run
Binary floor_divide, bf16, contig.
baracuda_kernels_binary_floor_divide_bf16_strided_can_implement
Pre-launch implementability check for binary_floor_divide_bf16_strided.
baracuda_kernels_binary_floor_divide_bf16_strided_run
Binary floor_divide, bf16, strided.
baracuda_kernels_binary_floor_divide_f16_can_implement
Binary floor_divide, f16, can-implement.
baracuda_kernels_binary_floor_divide_f16_run
Binary floor_divide, f16, contig.
baracuda_kernels_binary_floor_divide_f16_strided_can_implement
Pre-launch implementability check for binary_floor_divide_f16_strided.
baracuda_kernels_binary_floor_divide_f16_strided_run
Binary floor_divide, f16, strided.
baracuda_kernels_binary_floor_divide_f32_can_implement
Binary floor_divide, f32, can-implement.
baracuda_kernels_binary_floor_divide_f32_run
Binary floor_divide, f32, contig.
baracuda_kernels_binary_floor_divide_f32_strided_can_implement
Pre-launch implementability check for binary_floor_divide_f32_strided.
baracuda_kernels_binary_floor_divide_f32_strided_run
Binary floor_divide, f32, strided.
baracuda_kernels_binary_floor_divide_f64_can_implement
Binary floor_divide, f64, can-implement.
baracuda_kernels_binary_floor_divide_f64_run
Binary floor_divide, f64, contig.
baracuda_kernels_binary_floor_divide_f64_strided_can_implement
Pre-launch implementability check for binary_floor_divide_f64_strided.
baracuda_kernels_binary_floor_divide_f64_strided_run
Binary floor_divide, f64, strided.
baracuda_kernels_binary_fmax_bf16_can_implement
Binary fmax, bf16, can-implement.
baracuda_kernels_binary_fmax_bf16_run
Binary fmax, bf16, contig.
baracuda_kernels_binary_fmax_bf16_strided_can_implement
Pre-launch implementability check for binary_fmax_bf16_strided.
baracuda_kernels_binary_fmax_bf16_strided_run
Binary fmax, bf16, strided.
baracuda_kernels_binary_fmax_f16_can_implement
Binary fmax, f16, can-implement.
baracuda_kernels_binary_fmax_f16_run
Binary fmax, f16, contig.
baracuda_kernels_binary_fmax_f16_strided_can_implement
Pre-launch implementability check for binary_fmax_f16_strided.
baracuda_kernels_binary_fmax_f16_strided_run
Binary fmax, f16, strided.
baracuda_kernels_binary_fmax_f32_can_implement
Binary fmax, f32, can-implement.
baracuda_kernels_binary_fmax_f32_run
Binary fmax, f32, contig.
baracuda_kernels_binary_fmax_f32_strided_can_implement
Pre-launch implementability check for binary_fmax_f32_strided.
baracuda_kernels_binary_fmax_f32_strided_run
Binary fmax, f32, strided.
baracuda_kernels_binary_fmax_f64_can_implement
Binary fmax, f64, can-implement.
baracuda_kernels_binary_fmax_f64_run
Binary fmax, f64, contig.
baracuda_kernels_binary_fmax_f64_strided_can_implement
Pre-launch implementability check for binary_fmax_f64_strided.
baracuda_kernels_binary_fmax_f64_strided_run
Binary fmax, f64, strided.
baracuda_kernels_binary_fmin_bf16_can_implement
Binary fmin, bf16, can-implement.
baracuda_kernels_binary_fmin_bf16_run
Binary fmin, bf16, contig.
baracuda_kernels_binary_fmin_bf16_strided_can_implement
Pre-launch implementability check for binary_fmin_bf16_strided.
baracuda_kernels_binary_fmin_bf16_strided_run
Binary fmin, bf16, strided.
baracuda_kernels_binary_fmin_f16_can_implement
Binary fmin, f16, can-implement.
baracuda_kernels_binary_fmin_f16_run
Binary fmin, f16, contig.
baracuda_kernels_binary_fmin_f16_strided_can_implement
Pre-launch implementability check for binary_fmin_f16_strided.
baracuda_kernels_binary_fmin_f16_strided_run
Binary fmin, f16, strided.
baracuda_kernels_binary_fmin_f32_can_implement
Binary fmin, f32, can-implement.
baracuda_kernels_binary_fmin_f32_run
Binary fmin, f32, contig.
baracuda_kernels_binary_fmin_f32_strided_can_implement
Pre-launch implementability check for binary_fmin_f32_strided.
baracuda_kernels_binary_fmin_f32_strided_run
Binary fmin, f32, strided.
baracuda_kernels_binary_fmin_f64_can_implement
Binary fmin, f64, can-implement.
baracuda_kernels_binary_fmin_f64_run
Binary fmin, f64, contig.
baracuda_kernels_binary_fmin_f64_strided_can_implement
Pre-launch implementability check for binary_fmin_f64_strided.
baracuda_kernels_binary_fmin_f64_strided_run
Binary fmin, f64, strided.
baracuda_kernels_binary_hypot_backward_bf16_can_implement
Pre-launch implementability check for binary_hypot_backward_bf16.
baracuda_kernels_binary_hypot_backward_bf16_run
Hypot backward, bf16.
baracuda_kernels_binary_hypot_backward_f16_can_implement
Pre-launch implementability check for binary_hypot_backward_f16.
baracuda_kernels_binary_hypot_backward_f16_run
Hypot backward, f16.
baracuda_kernels_binary_hypot_backward_f32_can_implement
Pre-launch implementability check for binary_hypot_backward_f32.
baracuda_kernels_binary_hypot_backward_f32_run
Hypot backward, f32. y = sqrt(a²+b²) is reconstructed inside the kernel from the saved a and b (BinaryBackwardArgs has no saved-y slot); da = dy*a/y, db = dy*b/y. The caller is responsible for guarding against a == 0 && b == 0 (y == 0).
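The formulas above can be checked against a host-side reference. The following is a minimal CPU sketch, not the kernel itself: `hypot_backward_ref` is a hypothetical helper name, and returning a zero gradient when y == 0 is only one assumed convention for the guard the caller has to supply.

```rust
/// CPU reference for the hypot gradient. Returns (da, db).
/// Hypothetical helper for testing; not part of this crate.
fn hypot_backward_ref(a: f32, b: f32, dy: f32) -> (f32, f32) {
    // Reconstruct y from the saved inputs, as the kernel is documented to do.
    let y = a.hypot(b);
    if y == 0.0 {
        // Assumed convention: report a zero gradient at the origin
        // instead of evaluating 0/0.
        return (0.0, 0.0);
    }
    (dy * a / y, dy * b / y)
}
```

For example, `hypot_backward_ref(3.0, 4.0, 1.0)` yields approximately (0.6, 0.8), since y = 5.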
baracuda_kernels_binary_hypot_backward_f64_can_implement
Pre-launch implementability check for binary_hypot_backward_f64.
baracuda_kernels_binary_hypot_backward_f64_run
Hypot backward, f64.
baracuda_kernels_binary_hypot_bf16_can_implement
Binary hypot, bf16, can-implement.
baracuda_kernels_binary_hypot_bf16_run
Binary hypot, bf16, contig.
baracuda_kernels_binary_hypot_bf16_strided_can_implement
Pre-launch implementability check for binary_hypot_bf16_strided.
baracuda_kernels_binary_hypot_bf16_strided_run
Binary hypot, bf16, strided.
baracuda_kernels_binary_hypot_f16_can_implement
Binary hypot, f16, can-implement.
baracuda_kernels_binary_hypot_f16_run
Binary hypot, f16, contig.
baracuda_kernels_binary_hypot_f16_strided_can_implement
Pre-launch implementability check for binary_hypot_f16_strided.
baracuda_kernels_binary_hypot_f16_strided_run
Binary hypot, f16, strided.
baracuda_kernels_binary_hypot_f32_can_implement
Binary hypot, f32, can-implement.
baracuda_kernels_binary_hypot_f32_run
Binary hypot, f32, contig.
baracuda_kernels_binary_hypot_f32_strided_can_implement
Pre-launch implementability check for binary_hypot_f32_strided.
baracuda_kernels_binary_hypot_f32_strided_run
Binary hypot, f32, strided.
baracuda_kernels_binary_hypot_f64_can_implement
Binary hypot, f64, can-implement.
baracuda_kernels_binary_hypot_f64_run
Binary hypot, f64, contig.
baracuda_kernels_binary_hypot_f64_strided_can_implement
Pre-launch implementability check for binary_hypot_f64_strided.
baracuda_kernels_binary_hypot_f64_strided_run
Binary hypot, f64, strided.
baracuda_kernels_binary_lerp_backward_bf16_can_implement
Pre-launch implementability check for binary_lerp_backward_bf16.
baracuda_kernels_binary_lerp_backward_bf16_run
Lerp backward, bf16.
baracuda_kernels_binary_lerp_backward_f16_can_implement
Pre-launch implementability check for binary_lerp_backward_f16.
baracuda_kernels_binary_lerp_backward_f16_run
Lerp backward, f16.
baracuda_kernels_binary_lerp_backward_f32_can_implement
Pre-launch implementability check for binary_lerp_backward_f32.
baracuda_kernels_binary_lerp_backward_f32_run
Lerp backward, f32. da = (1 - weight)·dy, db = weight·dy. No saved tensors are needed.
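Because the gradients depend only on dy and weight, a per-element reference is very small. This is a hedged CPU sketch; `lerp_backward_ref` is a hypothetical name, and treating weight as a single scalar is an assumption based on the summary above.

```rust
/// CPU reference for the lerp gradient of one element. Returns (da, db).
/// Hypothetical helper; `weight` is assumed to be a scalar.
fn lerp_backward_ref(dy: f32, weight: f32) -> (f32, f32) {
    // Neither a nor b appears, which is why nothing needs to be saved.
    ((1.0 - weight) * dy, weight * dy)
}
```

The two results always sum to dy (up to rounding), which makes a convenient sanity check when comparing against device output.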
baracuda_kernels_binary_lerp_backward_f64_can_implement
Pre-launch implementability check for binary_lerp_backward_f64.
baracuda_kernels_binary_lerp_backward_f64_run
Lerp backward, f64.
baracuda_kernels_binary_lerp_bf16_can_implement
Pre-launch implementability check for binary_lerp_bf16.
baracuda_kernels_binary_lerp_bf16_run
Lerp forward, bf16.
baracuda_kernels_binary_lerp_f16_can_implement
Pre-launch implementability check for binary_lerp_f16.
baracuda_kernels_binary_lerp_f16_run
Lerp forward, f16.
baracuda_kernels_binary_lerp_f32_can_implement
Pre-launch implementability check for binary_lerp_f32.
baracuda_kernels_binary_lerp_f32_run
Binary elementwise lerp(a, b; weight) = a + weight·(b - a), f32, contig.
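A host-side reference for the formula above might look like the following. It is only a sketch for validating results on small inputs: `lerp_ref` is a hypothetical name, and the scalar `weight` and equal-length contiguous slices are assumptions.

```rust
/// CPU reference for elementwise lerp over contiguous slices.
/// Hypothetical helper; `weight` is assumed to be one scalar for all elements.
fn lerp_ref(a: &[f32], b: &[f32], weight: f32) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "inputs must have the same length");
    a.iter()
        .zip(b)
        // Same form as the documented formula: a + weight * (b - a).
        .map(|(&x, &y)| x + weight * (y - x))
        .collect()
}
```

With weight = 0 the output equals a exactly; with weight = 1 it equals b only up to rounding, because this form computes a + (b - a).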
baracuda_kernels_binary_lerp_f64_can_implement
Pre-launch implementability check for binary_lerp_f64.
baracuda_kernels_binary_lerp_f64_run
Lerp forward, f64. The f32 weight widens to f64 losslessly.
baracuda_kernels_binary_logical_and_bool_can_implement
Binary logical and, Bool dtype, can-implement.
baracuda_kernels_binary_logical_and_bool_run
Binary logical and, Bool dtype (1-byte storage), contig.
baracuda_kernels_binary_logical_or_bool_can_implement
Binary logical or, Bool dtype, can-implement.
baracuda_kernels_binary_logical_or_bool_run
Binary logical or, Bool dtype, contig.
baracuda_kernels_binary_logical_xor_bool_can_implement
Binary logical xor, Bool dtype, can-implement.
baracuda_kernels_binary_logical_xor_bool_run
Binary logical xor, Bool dtype, contig.
baracuda_kernels_binary_maximum_backward_bf16_can_implement
Pre-launch implementability check for binary_maximum_backward_bf16.
baracuda_kernels_binary_maximum_backward_bf16_run
Maximum backward, bf16.
baracuda_kernels_binary_maximum_backward_f16_can_implement
Pre-launch implementability check for binary_maximum_backward_f16.
baracuda_kernels_binary_maximum_backward_f16_run
Maximum backward, f16.
baracuda_kernels_binary_maximum_backward_f32_can_implement
Pre-launch implementability check for binary_maximum_backward_f32.
baracuda_kernels_binary_maximum_backward_f32_run
Maximum backward, f32. Tie-break: split dy evenly on a == b; NaN inputs propagate dy to both. Saved a and b are used purely as references for the comparison.
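The tie-break and NaN rules described above can be written out as a per-element reference. This is a sketch of one reading of that description, not the kernel source: `maximum_backward_ref` is a hypothetical name, and "propagate dy to both" is interpreted here as da = db = dy.

```rust
/// CPU reference for the maximum gradient of one element. Returns (da, db).
/// Hypothetical helper; a and b are used only for the comparison.
fn maximum_backward_ref(a: f32, b: f32, dy: f32) -> (f32, f32) {
    if a.is_nan() || b.is_nan() {
        // Assumed reading of "NaN inputs propagate dy to both".
        (dy, dy)
    } else if a == b {
        // Tie: split the incoming gradient evenly.
        (0.5 * dy, 0.5 * dy)
    } else if a > b {
        (dy, 0.0)
    } else {
        (0.0, dy)
    }
}
```

Under the same reading, a minimum variant would differ only in the final comparison (a < b instead of a > b).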
baracuda_kernels_binary_maximum_backward_f64_can_implement
Pre-launch implementability check for binary_maximum_backward_f64.
baracuda_kernels_binary_maximum_backward_f64_run
Maximum backward, f64.
baracuda_kernels_binary_maximum_bf16_can_implement
Binary maximum, bf16, can-implement.
baracuda_kernels_binary_maximum_bf16_run
Binary maximum, bf16, contig.
baracuda_kernels_binary_maximum_bf16_strided_can_implement
Pre-launch implementability check for binary_maximum_bf16_strided.
baracuda_kernels_binary_maximum_bf16_strided_run
Binary maximum, bf16, strided.
baracuda_kernels_binary_maximum_f16_can_implement
Binary maximum, f16, can-implement.
baracuda_kernels_binary_maximum_f16_run
Binary maximum, f16, contig.
baracuda_kernels_binary_maximum_f16_strided_can_implement
Pre-launch implementability check for binary_maximum_f16_strided.
baracuda_kernels_binary_maximum_f16_strided_run
Binary maximum, f16, strided.
baracuda_kernels_binary_maximum_f32_can_implement
Binary maximum, f32, can-implement.
baracuda_kernels_binary_maximum_f32_run
Binary maximum, f32, contig.
baracuda_kernels_binary_maximum_f32_strided_can_implement
Pre-launch implementability check for binary_maximum_f32_strided.
baracuda_kernels_binary_maximum_f32_strided_run
Binary maximum, f32, strided.
baracuda_kernels_binary_maximum_f64_can_implement
Binary maximum, f64, can-implement.
baracuda_kernels_binary_maximum_f64_run
Binary maximum, f64, contig.
baracuda_kernels_binary_maximum_f64_strided_can_implement
Pre-launch implementability check for binary_maximum_f64_strided.
baracuda_kernels_binary_maximum_f64_strided_run
Binary maximum, f64, strided.
baracuda_kernels_binary_minimum_backward_bf16_can_implement
Pre-launch implementability check for binary_minimum_backward_bf16.
baracuda_kernels_binary_minimum_backward_bf16_run
Minimum backward, bf16.
baracuda_kernels_binary_minimum_backward_f16_can_implement
Pre-launch implementability check for binary_minimum_backward_f16.
baracuda_kernels_binary_minimum_backward_f16_run
Minimum backward, f16.
baracuda_kernels_binary_minimum_backward_f32_can_implement
Pre-launch implementability check for binary_minimum_backward_f32.
baracuda_kernels_binary_minimum_backward_f32_run
Minimum backward, f32. Tie-break: split dy evenly on a == b; NaN inputs propagate dy to both. Saved a and b are used purely as references for the comparison.
baracuda_kernels_binary_minimum_backward_f64_can_implement
Pre-launch implementability check for binary_minimum_backward_f64.
baracuda_kernels_binary_minimum_backward_f64_run
Minimum backward, f64.
baracuda_kernels_binary_minimum_bf16_can_implement
Binary minimum, bf16, can-implement.
baracuda_kernels_binary_minimum_bf16_run
Binary minimum, bf16, contig.
baracuda_kernels_binary_minimum_bf16_strided_can_implement
Pre-launch implementability check for binary_minimum_bf16_strided.
baracuda_kernels_binary_minimum_bf16_strided_run
Binary minimum, bf16, strided.
baracuda_kernels_binary_minimum_f16_can_implement
Binary minimum, f16, can-implement.
baracuda_kernels_binary_minimum_f16_run
Binary minimum, f16, contig.
baracuda_kernels_binary_minimum_f16_strided_can_implement
Pre-launch implementability check for binary_minimum_f16_strided.
baracuda_kernels_binary_minimum_f16_strided_run
Binary minimum, f16, strided.
baracuda_kernels_binary_minimum_f32_can_implement
Binary minimum, f32, can-implement.
baracuda_kernels_binary_minimum_f32_run
Binary minimum, f32, contig.
baracuda_kernels_binary_minimum_f32_strided_can_implement
Pre-launch implementability check for binary_minimum_f32_strided.
baracuda_kernels_binary_minimum_f32_strided_run
Binary minimum, f32, strided.
baracuda_kernels_binary_minimum_f64_can_implement
Binary minimum, f64, can-implement.
baracuda_kernels_binary_minimum_f64_run
Binary minimum, f64, contig.
baracuda_kernels_binary_minimum_f64_strided_can_implement
Pre-launch implementability check for binary_minimum_f64_strided.
baracuda_kernels_binary_minimum_f64_strided_run
Binary minimum, f64, strided.
baracuda_kernels_binary_mod_bf16_can_implement
Binary mod, bf16, can-implement.
baracuda_kernels_binary_mod_bf16_run
Binary mod, bf16, contig.
baracuda_kernels_binary_mod_bf16_strided_can_implement
Pre-launch implementability check for binary_mod_bf16_strided.
baracuda_kernels_binary_mod_bf16_strided_run
Binary mod, bf16, strided.
baracuda_kernels_binary_mod_f16_can_implement
Binary mod, f16, can-implement.
baracuda_kernels_binary_mod_f16_run
Binary mod, f16, contig.
baracuda_kernels_binary_mod_f16_strided_can_implement
Pre-launch implementability check for binary_mod_f16_strided.
baracuda_kernels_binary_mod_f16_strided_run
Binary mod, f16, strided.
baracuda_kernels_binary_mod_f32_can_implement
Binary mod, f32, can-implement.
baracuda_kernels_binary_mod_f32_run
Binary mod, f32, contig.
baracuda_kernels_binary_mod_f32_strided_can_implement
Pre-launch implementability check for binary_mod_f32_strided.
baracuda_kernels_binary_mod_f32_strided_run
Binary mod, f32, strided.
baracuda_kernels_binary_mod_f64_can_implement
Binary mod, f64, can-implement.
baracuda_kernels_binary_mod_f64_run
Binary mod, f64, contig.
baracuda_kernels_binary_mod_f64_strided_can_implement
Pre-launch implementability check for binary_mod_f64_strided.
baracuda_kernels_binary_mod_f64_strided_run
Binary mod, f64, strided.
baracuda_kernels_binary_mul_backward_bf16_can_implement
Pre-launch implementability check for binary_mul_backward_bf16.
baracuda_kernels_binary_mul_backward_bf16_run
Mul backward, bf16.
baracuda_kernels_binary_mul_backward_f16_can_implement
Pre-launch implementability check for binary_mul_backward_f16.
baracuda_kernels_binary_mul_backward_f16_run
Mul backward, f16.
baracuda_kernels_binary_mul_backward_f32_can_implement
Pre-launch implementability check for binary_mul_backward_f32.
baracuda_kernels_binary_mul_backward_f32_run
Mul backward, f32. Writes da = dy * b and db = dy * a. Both saved tensors a and b must be non-null.
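As a cross-check for the two products above, a contiguous CPU reference could be sketched as follows. `mul_backward_ref` is a hypothetical name and the equal-length slices are an assumption; the real entry point takes raw device pointers.

```rust
/// CPU reference for the mul gradient over contiguous slices.
/// Returns (da, db). Hypothetical helper for testing only.
fn mul_backward_ref(a: &[f32], b: &[f32], dy: &[f32]) -> (Vec<f32>, Vec<f32>) {
    assert!(a.len() == b.len() && b.len() == dy.len(), "length mismatch");
    // da needs the saved b, db needs the saved a: both saves are required.
    let da = dy.iter().zip(b).map(|(&g, &y)| g * y).collect();
    let db = dy.iter().zip(a).map(|(&g, &x)| g * x).collect();
    (da, db)
}
```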
baracuda_kernels_binary_mul_backward_f64_can_implement
Pre-launch implementability check for binary_mul_backward_f64.
baracuda_kernels_binary_mul_backward_f64_run
Mul backward, f64.
baracuda_kernels_binary_mul_bf16_can_implement
Pre-launch implementability check for binary_mul_bf16.
baracuda_kernels_binary_mul_bf16_run
Binary elementwise mul, bf16 dtype, contiguous fast path.
baracuda_kernels_binary_mul_bf16_strided_can_implement
Pre-launch implementability check for binary_mul_bf16_strided.
baracuda_kernels_binary_mul_bf16_strided_run
Binary elementwise mul, bf16 dtype, strided / broadcast path.
baracuda_kernels_binary_mul_f16_can_implement
Pre-launch implementability check for binary_mul_f16.
baracuda_kernels_binary_mul_f16_run
Binary elementwise mul, f16 dtype, contiguous fast path.
baracuda_kernels_binary_mul_f16_strided_can_implement
Pre-launch implementability check for binary_mul_f16_strided.
baracuda_kernels_binary_mul_f16_strided_run
Binary elementwise mul, f16 dtype, strided / broadcast path.
baracuda_kernels_binary_mul_f32_can_implement
Pre-launch implementability check for binary_mul_f32. Validates the problem size without launching a kernel. Returns the standard status code mapping.
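The "standard status code mapping" is the i32 table given in the crate-level Status codes section. A caller working directly with this crate might decode it along these lines; `KernelStatus` and `decode_status` are hypothetical names used only for illustration, and the `Unknown` variant is an added assumption for codes outside the documented range.

```rust
/// Decoded form of the documented i32 status codes.
/// Hypothetical type; this crate itself returns plain i32 values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum KernelStatus {
    Success,
    MisalignedOperand,
    InvalidProblem,
    NotSupported,
    WorkspaceTooSmall,
    InternalError,
    /// Any value outside the documented 0..=5 range (assumed catch-all).
    Unknown(i32),
}

fn decode_status(code: i32) -> KernelStatus {
    match code {
        0 => KernelStatus::Success,
        1 => KernelStatus::MisalignedOperand,
        2 => KernelStatus::InvalidProblem,
        3 => KernelStatus::NotSupported,
        4 => KernelStatus::WorkspaceTooSmall,
        5 => KernelStatus::InternalError,
        other => KernelStatus::Unknown(other),
    }
}
```

A typical pattern would be to decode the result of the `*_can_implement` call first and only invoke the matching `*_run` function when it reports `Success`.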
baracuda_kernels_binary_mul_f32_run
Binary elementwise mul, f32 dtype, contiguous fast path.
baracuda_kernels_binary_mul_f32_strided_can_implement
Pre-launch implementability check for binary_mul_f32_strided.
baracuda_kernels_binary_mul_f32_strided_run
Binary elementwise mul, f32 dtype, strided / broadcast path.
baracuda_kernels_binary_mul_f64_can_implement
Pre-launch implementability check for binary_mul_f64.
baracuda_kernels_binary_mul_f64_run
Binary elementwise mul, f64 dtype, contiguous fast path.
baracuda_kernels_binary_mul_f64_strided_can_implement
Pre-launch implementability check for binary_mul_f64_strided.
baracuda_kernels_binary_mul_f64_strided_run
Binary elementwise mul, f64 dtype, strided / broadcast path.
baracuda_kernels_binary_nextafter_bf16_can_implement
Binary nextafter, bf16, can-implement.
baracuda_kernels_binary_nextafter_bf16_run
Binary nextafter, bf16, contig.
baracuda_kernels_binary_nextafter_bf16_strided_can_implement
Pre-launch implementability check for binary_nextafter_bf16_strided.
baracuda_kernels_binary_nextafter_bf16_strided_run
Binary nextafter, bf16, strided.
baracuda_kernels_binary_nextafter_f16_can_implement
Binary nextafter, f16, can-implement.
baracuda_kernels_binary_nextafter_f16_run
Binary nextafter, f16, contig.
baracuda_kernels_binary_nextafter_f16_strided_can_implement
Pre-launch implementability check for binary_nextafter_f16_strided.
baracuda_kernels_binary_nextafter_f16_strided_run
Binary nextafter, f16, strided.
baracuda_kernels_binary_nextafter_f32_can_implement
Binary nextafter, f32, can-implement.
baracuda_kernels_binary_nextafter_f32_run
Binary nextafter, f32, contig.
baracuda_kernels_binary_nextafter_f32_strided_can_implement
Pre-launch implementability check for binary_nextafter_f32_strided.
baracuda_kernels_binary_nextafter_f32_strided_run
Binary nextafter, f32, strided.
baracuda_kernels_binary_nextafter_f64_can_implement
Binary nextafter, f64, can-implement.
baracuda_kernels_binary_nextafter_f64_run
Binary nextafter, f64, contig.
baracuda_kernels_binary_nextafter_f64_strided_can_implement
Pre-launch implementability check for binary_nextafter_f64_strided.
baracuda_kernels_binary_nextafter_f64_strided_run
Binary nextafter, f64, strided.
baracuda_kernels_binary_pow_backward_bf16_can_implement
Pre-launch implementability check for binary_pow_backward_bf16.
baracuda_kernels_binary_pow_backward_bf16_run
Pow backward, bf16.
baracuda_kernels_binary_pow_backward_f16_can_implement
Pre-launch implementability check for binary_pow_backward_f16.
baracuda_kernels_binary_pow_backward_f16_run
Pow backward, f16.
baracuda_kernels_binary_pow_backward_f32_can_implement
Pre-launch implementability check for binary_pow_backward_f32.
baracuda_kernels_binary_pow_backward_f32_run
Pow backward, f32. da = dy * b * a^(b-1), db = dy * a^b * ln(a). The caller is responsible for guarding against undefined regions (a < 0 with non-integer b, or a == 0 with b < 1).
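One way a caller could apply the guard is to screen each (a, b) pair on the host before trusting the gradients. The sketch below is a CPU reference under that assumption; `pow_backward_ref` is a hypothetical name, and returning `None` for the undefined regions is an illustrative choice rather than the kernel's behaviour.

```rust
/// CPU reference for the pow gradient of one element.
/// Returns None in the regions the documentation calls undefined.
/// Hypothetical helper for testing only.
fn pow_backward_ref(a: f32, b: f32, dy: f32) -> Option<(f32, f32)> {
    // The two documented undefined regions.
    let undefined = (a < 0.0 && b.fract() != 0.0) || (a == 0.0 && b < 1.0);
    if undefined {
        return None;
    }
    let da = dy * b * a.powf(b - 1.0);
    // Note: with the formula as written, db is still NaN whenever a <= 0,
    // because ln(a) is not a finite real number there.
    let db = dy * a.powf(b) * a.ln();
    Some((da, db))
}
```

For a = 2, b = 3, dy = 1 this gives da = 12 and db = 8·ln 2 ≈ 5.545.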
baracuda_kernels_binary_pow_backward_f64_can_implement
Pre-launch implementability check for binary_pow_backward_f64.
baracuda_kernels_binary_pow_backward_f64_run
Pow backward, f64.
baracuda_kernels_binary_pow_bf16_can_implement
Binary pow, bf16, can-implement.
baracuda_kernels_binary_pow_bf16_run
Binary pow, bf16, contig.
baracuda_kernels_binary_pow_bf16_strided_can_implement
Pre-launch implementability check for binary_pow_bf16_strided.
baracuda_kernels_binary_pow_bf16_strided_run
Binary pow, bf16, strided.
baracuda_kernels_binary_pow_f16_can_implement
Binary pow, f16, can-implement.
baracuda_kernels_binary_pow_f16_run
Binary pow, f16, contig.
baracuda_kernels_binary_pow_f16_strided_can_implement
Pre-launch implementability check for binary_pow_f16_strided.
baracuda_kernels_binary_pow_f16_strided_run
Binary pow, f16, strided.
baracuda_kernels_binary_pow_f32_can_implement
Binary pow, f32, can-implement.
baracuda_kernels_binary_pow_f32_run
Binary pow, f32, contig.
baracuda_kernels_binary_pow_f32_strided_can_implement
Pre-launch implementability check for binary_pow_f32_strided.
baracuda_kernels_binary_pow_f32_strided_run
Binary pow, f32, strided.
baracuda_kernels_binary_pow_f64_can_implement
Binary pow, f64, can-implement.
baracuda_kernels_binary_pow_f64_run
Binary pow, f64, contig.
baracuda_kernels_binary_pow_f64_strided_can_implement
Pre-launch implementability check for binary_pow_f64_strided.
baracuda_kernels_binary_pow_f64_strided_run
Binary pow, f64, strided.
baracuda_kernels_binary_remainder_bf16_can_implement
Binary remainder, bf16, can-implement.
baracuda_kernels_binary_remainder_bf16_run
Binary remainder, bf16, contig.
baracuda_kernels_binary_remainder_bf16_strided_can_implement
Pre-launch implementability check for binary_remainder_bf16_strided.
baracuda_kernels_binary_remainder_bf16_strided_run
Binary remainder, bf16, strided.
baracuda_kernels_binary_remainder_f16_can_implement
Binary remainder, f16, can-implement.
baracuda_kernels_binary_remainder_f16_run
Binary remainder, f16, contig.
baracuda_kernels_binary_remainder_f16_strided_can_implement
Pre-launch implementability check for binary_remainder_f16_strided.
baracuda_kernels_binary_remainder_f16_strided_run
Binary remainder, f16, strided.
baracuda_kernels_binary_remainder_f32_can_implement
Binary remainder, f32, can-implement.
baracuda_kernels_binary_remainder_f32_run
Binary remainder, f32, contig.
baracuda_kernels_binary_remainder_f32_strided_can_implement
Pre-launch implementability check for binary_remainder_f32_strided.
baracuda_kernels_binary_remainder_f32_strided_run
Binary remainder, f32, strided.
baracuda_kernels_binary_remainder_f64_can_implement
Binary remainder, f64, can-implement.
baracuda_kernels_binary_remainder_f64_run
Binary remainder, f64, contig.
baracuda_kernels_binary_remainder_f64_strided_can_implement
Pre-launch implementability check for binary_remainder_f64_strided.
baracuda_kernels_binary_remainder_f64_strided_run
Binary remainder, f64, strided.
baracuda_kernels_binary_sub_backward_bf16_can_implement
Pre-launch implementability check for binary_sub_backward_bf16.
baracuda_kernels_binary_sub_backward_bf16_run
Sub backward, bf16.
baracuda_kernels_binary_sub_backward_f16_can_implement
Pre-launch implementability check for binary_sub_backward_f16.
baracuda_kernels_binary_sub_backward_f16_run
Sub backward, f16.
baracuda_kernels_binary_sub_backward_f32_can_implement
Pre-launch implementability check for binary_sub_backward_f32.
baracuda_kernels_binary_sub_backward_f32_run
Sub backward, f32. Writes da = dy and db = -dy.
baracuda_kernels_binary_sub_backward_f64_can_implement
Pre-launch implementability check for binary_sub_backward_f64.
baracuda_kernels_binary_sub_backward_f64_run
Sub backward, f64.
baracuda_kernels_binary_sub_bf16_can_implement
Pre-launch implementability check for binary_sub_bf16.
baracuda_kernels_binary_sub_bf16_run
Binary elementwise sub, bf16 dtype, contiguous fast path.
baracuda_kernels_binary_sub_bf16_strided_can_implement
Pre-launch implementability check for binary_sub_bf16_strided.
baracuda_kernels_binary_sub_bf16_strided_run
Binary elementwise sub, bf16 dtype, strided / broadcast path.
baracuda_kernels_binary_sub_f16_can_implement
Pre-launch implementability check for binary_sub_f16.
baracuda_kernels_binary_sub_f16_run
Binary elementwise sub, f16 dtype, contiguous fast path.
baracuda_kernels_binary_sub_f16_strided_can_implement
Pre-launch implementability check for binary_sub_f16_strided.
baracuda_kernels_binary_sub_f16_strided_run
Binary elementwise sub, f16 dtype, strided / broadcast path.
baracuda_kernels_binary_sub_f32_can_implement
Pre-launch implementability check for binary_sub_f32. Validates the problem size without launching a kernel. Returns the standard status code mapping.
baracuda_kernels_binary_sub_f32_run
Binary elementwise sub, f32 dtype, contiguous fast path.
baracuda_kernels_binary_sub_f32_strided_can_implement
Pre-launch implementability check for binary_sub_f32_strided.
baracuda_kernels_binary_sub_f32_strided_run
Binary elementwise sub, f32 dtype, strided / broadcast path.
baracuda_kernels_binary_sub_f64_can_implement
Pre-launch implementability check for binary_sub_f64.
baracuda_kernels_binary_sub_f64_run
Binary elementwise sub, f64 dtype, contiguous fast path.
baracuda_kernels_binary_sub_f64_strided_can_implement
Pre-launch implementability check for binary_sub_f64_strided.
baracuda_kernels_binary_sub_f64_strided_run
Binary elementwise sub, f64 dtype, strided / broadcast path.
baracuda_kernels_bincount_i32_can_implement
Pre-launch implementability check for bincount_i32.
baracuda_kernels_bincount_i32_run
Bincount, i32 input. Out-of-range values (< 0 or >= num_bins) are silently dropped.
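The drop rule is easy to get wrong when comparing against other bincount implementations that raise on negative input, so a host-side reference can help. This is a minimal sketch: `bincount_ref` is a hypothetical name, and the i64 count type is an assumption, since the output dtype is not stated here.

```rust
/// CPU reference for bincount with silent dropping of out-of-range values.
/// Hypothetical helper; the i64 count type is an assumption.
fn bincount_ref(input: &[i32], num_bins: usize) -> Vec<i64> {
    let mut counts = vec![0i64; num_bins];
    for &v in input {
        // Values < 0 or >= num_bins contribute to no bin.
        if v >= 0 && (v as usize) < num_bins {
            counts[v as usize] += 1;
        }
    }
    counts
}
```

For instance, with num_bins = 4 the input [0, 1, 1, 3, -1, 7] produces [1, 2, 0, 1]; the -1 and the 7 are ignored.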
baracuda_kernels_bincount_i64_can_implement
Pre-launch implementability check for bincount_i64.
baracuda_kernels_bincount_i64_run
bincount, i64 input.
baracuda_kernels_cast_bf16_bf16_can_implement
Implementability check for cast_bf16_bf16.
baracuda_kernels_cast_bf16_bf16_run
Cast bf16 -> bf16.
baracuda_kernels_cast_bf16_bool_can_implement
Implementability check for cast_bf16_bool.
baracuda_kernels_cast_bf16_bool_run
Cast bf16 -> Bool.
baracuda_kernels_cast_bf16_f16_can_implement
Implementability check for cast_bf16_f16.
baracuda_kernels_cast_bf16_f16_run
Cast bf16 -> f16.
baracuda_kernels_cast_bf16_f32_can_implement
Implementability check for cast_bf16_f32.
baracuda_kernels_cast_bf16_f32_run
Cast bf16 -> f32.
baracuda_kernels_cast_bf16_f64_can_implement
Implementability check for cast_bf16_f64.
baracuda_kernels_cast_bf16_f64_run
Cast bf16 -> f64.
baracuda_kernels_cast_bf16_fp8e4m3_can_implement
Implementability check for cast_bf16_fp8e4m3.
baracuda_kernels_cast_bf16_fp8e4m3_run
Cast bf16 -> Fp8E4M3.
baracuda_kernels_cast_bf16_fp8e5m2_can_implement
Implementability check for cast_bf16_fp8e5m2.
baracuda_kernels_cast_bf16_fp8e5m2_run
Cast bf16 -> Fp8E5M2.
baracuda_kernels_cast_bf16_i8_can_implement
Implementability check for cast_bf16_i8.
baracuda_kernels_cast_bf16_i8_run
Cast bf16 -> i8.
baracuda_kernels_cast_bf16_i16_can_implement
Implementability check for cast_bf16_i16.
baracuda_kernels_cast_bf16_i16_run
Cast bf16 -> i16. Phase 31.
baracuda_kernels_cast_bf16_i32_can_implement
Implementability check for cast_bf16_i32.
baracuda_kernels_cast_bf16_i32_run
Cast bf16 -> i32.
baracuda_kernels_cast_bf16_i64_can_implement
Implementability check for cast_bf16_i64.
baracuda_kernels_cast_bf16_i64_run
Cast bf16 -> i64.
baracuda_kernels_cast_bf16_u8_can_implement
Implementability check for cast_bf16_u8.
baracuda_kernels_cast_bf16_u8_run
Cast bf16 -> u8.
baracuda_kernels_cast_bf16_u32_can_implement
Implementability check for cast_bf16_u32.
baracuda_kernels_cast_bf16_u32_run
Cast bf16 -> u32. Phase 31.
baracuda_kernels_cast_bool_bf16_can_implement
Implementability check for cast_bool_bf16.
baracuda_kernels_cast_bool_bf16_run
Cast Bool -> bf16.
baracuda_kernels_cast_bool_f16_can_implement
Implementability check for cast_bool_f16.
baracuda_kernels_cast_bool_f16_run
Cast Bool -> f16.
baracuda_kernels_cast_bool_f32_can_implement
Implementability check for cast_bool_f32.
baracuda_kernels_cast_bool_f32_run
Cast Bool -> f32.
baracuda_kernels_cast_bool_i32_can_implement
Implementability check for cast_bool_i32.
baracuda_kernels_cast_bool_i32_run
Cast Bool -> i32. x != 0 → 1.
baracuda_kernels_cast_bool_i64_can_implement
Implementability check for cast_bool_i64.
baracuda_kernels_cast_bool_i64_run
Cast Bool -> i64.
baracuda_kernels_cast_f16_bf16_can_implement
Implementability check for cast_f16_bf16.
baracuda_kernels_cast_f16_bf16_run
Cast f16 -> bf16.
baracuda_kernels_cast_f16_bool_can_implement
Implementability check for cast_f16_bool.
baracuda_kernels_cast_f16_bool_run
Cast f16 -> Bool.
baracuda_kernels_cast_f16_f16_can_implement
Implementability check for cast_f16_f16.
baracuda_kernels_cast_f16_f16_run
Cast f16 -> f16.
baracuda_kernels_cast_f16_f32_can_implement
Implementability check for cast_f16_f32.
baracuda_kernels_cast_f16_f32_run
Cast f16 -> f32.
baracuda_kernels_cast_f16_f64_can_implement
Implementability check for cast_f16_f64.
baracuda_kernels_cast_f16_f64_run
Cast f16 -> f64.
baracuda_kernels_cast_f16_fp8e4m3_can_implement
Implementability check for cast_f16_fp8e4m3.
baracuda_kernels_cast_f16_fp8e4m3_run
Cast f16 -> Fp8E4M3.
baracuda_kernels_cast_f16_fp8e5m2_can_implement
Implementability check for cast_f16_fp8e5m2.
baracuda_kernels_cast_f16_fp8e5m2_run
Cast f16 -> Fp8E5M2.
baracuda_kernels_cast_f16_i8_can_implement
Implementability check for cast_f16_i8.
baracuda_kernels_cast_f16_i8_run
Cast f16 -> i8.
baracuda_kernels_cast_f16_i16_can_implement
Implementability check for cast_f16_i16.
baracuda_kernels_cast_f16_i16_run
Cast f16 -> i16. Phase 31.
baracuda_kernels_cast_f16_i32_can_implement
Implementability check for cast_f16_i32.
baracuda_kernels_cast_f16_i32_run
Cast f16 -> i32.
baracuda_kernels_cast_f16_i64_can_implement
Implementability check for cast_f16_i64.
baracuda_kernels_cast_f16_i64_run
Cast f16 -> i64.
baracuda_kernels_cast_f16_u8_can_implement
Implementability check for cast_f16_u8.
baracuda_kernels_cast_f16_u8_run
Cast f16 -> u8.
baracuda_kernels_cast_f16_u32_can_implement
Implementability check for cast_f16_u32.
baracuda_kernels_cast_f16_u32_run
Cast f16 -> u32. Phase 31.
baracuda_kernels_cast_f32_bf16_can_implement
Implementability check for cast_f32_bf16.
baracuda_kernels_cast_f32_bf16_run
Cast f32 -> bf16.
baracuda_kernels_cast_f32_bool_can_implement
Implementability check for cast_f32_bool.
baracuda_kernels_cast_f32_bool_run
Cast f32 -> Bool.
baracuda_kernels_cast_f32_f16_can_implement
Implementability check for cast_f32_f16.
baracuda_kernels_cast_f32_f16_run
Cast f32 -> f16.
baracuda_kernels_cast_f32_f32_can_implement
Implementability check for cast_f32_f32.
baracuda_kernels_cast_f32_f32_run
Cast f32 -> f32. See LICENSE-thirdparty.md.
baracuda_kernels_cast_f32_f64_can_implement
Implementability check for cast_f32_f64.
baracuda_kernels_cast_f32_f64_run
Cast f32 -> f64.
baracuda_kernels_cast_f32_fp8e4m3_can_implement
Implementability check for cast_f32_fp8e4m3.
baracuda_kernels_cast_f32_fp8e4m3_run
Cast f32 -> Fp8E4M3 (saturates to ±448).
baracuda_kernels_cast_f32_fp8e5m2_can_implement
Implementability check for cast_f32_fp8e5m2.
baracuda_kernels_cast_f32_fp8e5m2_run
Cast f32 -> Fp8E5M2 (saturates to ±57344).
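The saturation bounds quoted for the two FP8 casts (±448 for Fp8E4M3, ±57344 for Fp8E5M2) can be illustrated with a small host-side clamp. This sketch covers only the saturation step, not rounding onto the FP8 grid or the bit encoding; the constant and function names are hypothetical, and clamping infinities to the finite maximum is an assumption.

```rust
/// Largest finite magnitudes quoted in the cast summaries.
/// Hypothetical constant names.
const FP8_E4M3_MAX: f32 = 448.0;
const FP8_E5M2_MAX: f32 = 57344.0;

/// Saturation step only: clamps x into [-max_finite, max_finite].
/// NaN passes through unchanged (f32::clamp returns NaN for NaN input).
fn saturate_to_range(x: f32, max_finite: f32) -> f32 {
    x.clamp(-max_finite, max_finite)
}
```

For example, `saturate_to_range(1000.0, FP8_E4M3_MAX)` gives 448.0, while the same input fits within the E5M2 range and is returned unchanged.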
baracuda_kernels_cast_f32_i8_can_implement
Implementability check for cast_f32_i8.
baracuda_kernels_cast_f32_i8_run
Cast f32 -> i8.
baracuda_kernels_cast_f32_i16_can_implement
Implementability check for cast_f32_i16.
baracuda_kernels_cast_f32_i16_run
Cast f32 -> i16. Phase 31.
baracuda_kernels_cast_f32_i32_can_implement
Implementability check for cast_f32_i32.
baracuda_kernels_cast_f32_i32_run
Cast f32 -> i32.
baracuda_kernels_cast_f32_i64_can_implement
Implementability check for cast_f32_i64.
baracuda_kernels_cast_f32_i64_run
Cast f32 -> i64.
baracuda_kernels_cast_f32_s4_can_implement
Implementability check for cast_f32_s4.
baracuda_kernels_cast_f32_s4_run
Cast f32 -> S4 (round-to-nearest then saturate).
baracuda_kernels_cast_f32_u4_can_implement
Implementability check for cast_f32_u4.
baracuda_kernels_cast_f32_u4_run
Cast f32 -> U4 (round-to-nearest then saturate).
baracuda_kernels_cast_f32_u8_can_implement
Implementability check for cast_f32_u8.
baracuda_kernels_cast_f32_u8_run
Cast f32 -> u8.
baracuda_kernels_cast_f32_u32_can_implement
Implementability check for cast_f32_u32.
baracuda_kernels_cast_f32_u32_run
Cast f32 -> u32. Negative inputs are undefined per C++ rules (typical NVCC behaviour: saturates toward 0). Phase 31.
baracuda_kernels_cast_f64_bf16_can_implement
Implementability check for cast_f64_bf16.
baracuda_kernels_cast_f64_bf16_run
Cast f64 -> bf16.
baracuda_kernels_cast_f64_f16_can_implement
Implementability check for cast_f64_f16.
baracuda_kernels_cast_f64_f16_run
Cast f64 -> f16.
baracuda_kernels_cast_f64_f32_can_implement
Implementability check for cast_f64_f32.
baracuda_kernels_cast_f64_f32_run
Cast f64 -> f32.
baracuda_kernels_cast_f64_f64_can_implement
Implementability check for cast_f64_f64.
baracuda_kernels_cast_f64_f64_run
Cast f64 -> f64.
baracuda_kernels_cast_f64_i8_can_implement
Implementability check for cast_f64_i8.
baracuda_kernels_cast_f64_i8_run
Cast f64 -> i8.
baracuda_kernels_cast_f64_i16_can_implement
Implementability check for cast_f64_i16.
baracuda_kernels_cast_f64_i16_run
Cast f64 -> i16. Phase 31.
baracuda_kernels_cast_f64_i32_can_implement
Implementability check for cast_f64_i32.
baracuda_kernels_cast_f64_i32_run
Cast f64 -> i32.
baracuda_kernels_cast_f64_i64_can_implement
Implementability check for cast_f64_i64.
baracuda_kernels_cast_f64_i64_run
Cast f64 -> i64.
baracuda_kernels_cast_f64_u8_can_implement
Implementability check for cast_f64_u8.
baracuda_kernels_cast_f64_u8_run
Cast f64 -> u8.
baracuda_kernels_cast_f64_u32_can_implement
Implementability check for cast_f64_u32.
baracuda_kernels_cast_f64_u32_run
Cast f64 -> u32. Phase 31.
baracuda_kernels_cast_fp8e4m3_bf16_can_implement
Implementability check for cast_fp8e4m3_bf16.
baracuda_kernels_cast_fp8e4m3_bf16_run
Cast Fp8E4M3 -> bf16.
baracuda_kernels_cast_fp8e4m3_f16_can_implement
Implementability check for cast_fp8e4m3_f16.
baracuda_kernels_cast_fp8e4m3_f16_run
Cast Fp8E4M3 -> f16.
baracuda_kernels_cast_fp8e4m3_f32_can_implement
Implementability check for cast_fp8e4m3_f32.
baracuda_kernels_cast_fp8e4m3_f32_run
Cast Fp8E4M3 -> f32.
baracuda_kernels_cast_fp8e5m2_bf16_can_implement
Implementability check for cast_fp8e5m2_bf16.
baracuda_kernels_cast_fp8e5m2_bf16_run
Cast Fp8E5M2 -> bf16.
baracuda_kernels_cast_fp8e5m2_f16_can_implement
Implementability check for cast_fp8e5m2_f16.
baracuda_kernels_cast_fp8e5m2_f16_run
Cast Fp8E5M2 -> f16.
baracuda_kernels_cast_fp8e5m2_f32_can_implement
Implementability check for cast_fp8e5m2_f32.
baracuda_kernels_cast_fp8e5m2_f32_run
Cast Fp8E5M2 -> f32.
baracuda_kernels_cast_i8_bf16_can_implement
Implementability check for cast_i8_bf16.
baracuda_kernels_cast_i8_bf16_run
Cast i8 -> bf16.
baracuda_kernels_cast_i8_f16_can_implement
Implementability check for cast_i8_f16.
baracuda_kernels_cast_i8_f16_run
Cast i8 -> f16.
baracuda_kernels_cast_i8_f32_can_implement
Implementability check for cast_i8_f32.
baracuda_kernels_cast_i8_f32_run
Cast i8 -> f32.
baracuda_kernels_cast_i8_f64_can_implement
Implementability check for cast_i8_f64.
baracuda_kernels_cast_i8_f64_run
Cast i8 -> f64.
baracuda_kernels_cast_i8_i8_can_implement
Implementability check for cast_i8_i8.
baracuda_kernels_cast_i8_i8_run
Cast i8 -> i8.
baracuda_kernels_cast_i8_i16_can_implement
Implementability check for cast_i8_i16.
baracuda_kernels_cast_i8_i16_run
Cast i8 -> i16. Sign-extends. Phase 31.
baracuda_kernels_cast_i8_i32_can_implement
Implementability check for cast_i8_i32.
baracuda_kernels_cast_i8_i32_run
Cast i8 -> i32.
baracuda_kernels_cast_i8_i64_can_implement
Implementability check for cast_i8_i64.
baracuda_kernels_cast_i8_i64_run
Cast i8 -> i64.
baracuda_kernels_cast_i8_u8_can_implement
Implementability check for cast_i8_u8.
baracuda_kernels_cast_i8_u8_run
Cast i8 -> u8.
baracuda_kernels_cast_i8_u32_can_implement
Implementability check for cast_i8_u32.
baracuda_kernels_cast_i8_u32_run
Cast i8 -> u32. Sign-extends then reinterprets. Phase 31.
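Rust's `as` casts between integer types follow the same sign-extend, zero-extend, and truncate rules these summaries describe, so they can serve as a host-side reference when checking kernel output. The function names below are illustrative only and are not part of this crate:

```rust
/// i8 -> i16: the sign bit is replicated into the new high byte.
fn i8_to_i16_ref(x: i8) -> i16 {
    x as i16
}

/// i8 -> u32: sign-extend to 32 bits, then reinterpret the bits as unsigned.
fn i8_to_u32_ref(x: i8) -> u32 {
    x as u32
}
```

For example, `i8_to_u32_ref(-1)` yields `0xFFFF_FFFF`, while non-negative inputs keep their value.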
baracuda_kernels_cast_i16_bf16_can_implement
Implementability check for cast_i16_bf16.
baracuda_kernels_cast_i16_bf16_run
Cast i16 -> bf16. Phase 31.
baracuda_kernels_cast_i16_f16_can_implement
Implementability check for cast_i16_f16.
baracuda_kernels_cast_i16_f16_run
Cast i16 -> f16. Phase 31.
baracuda_kernels_cast_i16_f32_can_implement
Implementability check for cast_i16_f32.
baracuda_kernels_cast_i16_f32_run
Cast i16 -> f32. Phase 31.
baracuda_kernels_cast_i16_f64_can_implement
Implementability check for cast_i16_f64.
baracuda_kernels_cast_i16_f64_run
Cast i16 -> f64. Phase 31.
baracuda_kernels_cast_i16_i8_can_implement
Implementability check for cast_i16_i8.
baracuda_kernels_cast_i16_i8_run
Cast i16 -> i8. Truncates to low byte. Phase 31.
baracuda_kernels_cast_i16_i16_can_implement
Implementability check for cast_i16_i16.
baracuda_kernels_cast_i16_i16_run
Cast i16 -> i16 (identity). Phase 31.
baracuda_kernels_cast_i16_i32_can_implement
Implementability check for cast_i16_i32.
baracuda_kernels_cast_i16_i32_run
Cast i16 -> i32. Sign-extends. Phase 31.
baracuda_kernels_cast_i16_i64_can_implement
Implementability check for cast_i16_i64.
baracuda_kernels_cast_i16_i64_run
Cast i16 -> i64. Sign-extends. Phase 31.
baracuda_kernels_cast_i16_u8_can_implement
Implementability check for cast_i16_u8.
baracuda_kernels_cast_i16_u8_run
Cast i16 -> u8. Truncates to low byte then reinterprets. Phase 31.
baracuda_kernels_cast_i16_u32_can_implement
Implementability check for cast_i16_u32.
baracuda_kernels_cast_i16_u32_run
Cast i16 -> u32. Sign-extends to i32 then reinterprets. Phase 31.
baracuda_kernels_cast_i32_bf16_can_implement
Implementability check for cast_i32_bf16.
baracuda_kernels_cast_i32_bf16_run
Cast i32 -> bf16.
baracuda_kernels_cast_i32_bool_can_implement
Implementability check for cast_i32_bool.
baracuda_kernels_cast_i32_bool_run
Cast i32 -> Bool. x != 0 → 1.
baracuda_kernels_cast_i32_f16_can_implement
Implementability check for cast_i32_f16.
baracuda_kernels_cast_i32_f16_run
Cast i32 -> f16.
baracuda_kernels_cast_i32_f32_can_implement
Implementability check for cast_i32_f32.
baracuda_kernels_cast_i32_f32_run
Cast i32 -> f32.
baracuda_kernels_cast_i32_f64_can_implement
Implementability check for cast_i32_f64.
baracuda_kernels_cast_i32_f64_run
Cast i32 -> f64.
baracuda_kernels_cast_i32_i8_can_implement
Implementability check for cast_i32_i8.
baracuda_kernels_cast_i32_i8_run
Cast i32 -> i8.
baracuda_kernels_cast_i32_i16_can_implement
Implementability check for cast_i32_i16.
baracuda_kernels_cast_i32_i16_run
Cast i32 -> i16. Truncates to low 16 bits. Phase 31.
baracuda_kernels_cast_i32_i32_can_implement
Implementability check for cast_i32_i32.
baracuda_kernels_cast_i32_i32_run
Cast i32 -> i32.
baracuda_kernels_cast_i32_i64_can_implement
Implementability check for cast_i32_i64.
baracuda_kernels_cast_i32_i64_run
Cast i32 -> i64.
baracuda_kernels_cast_i32_s4_can_implement
Implementability check for cast_i32_s4.
baracuda_kernels_cast_i32_s4_run
Cast i32 -> S4 (pack: saturate to [-8, +7] then nibble-mask).
baracuda_kernels_cast_i32_u4_can_implement
Implementability check for cast_i32_u4.
baracuda_kernels_cast_i32_u4_run
Cast i32 -> U4 (pack: saturate to [0, 15] then nibble-mask).
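The "saturate then nibble-mask" step can be sketched on the host as follows. `pack_s4` and `pack_u4` are hypothetical names used for illustration; how two nibbles are then combined into one byte is not described in this listing, so the sketch stops at a single nibble:

```rust
/// Saturate to the signed 4-bit range [-8, 7], then keep the low nibble
/// (two's-complement encoding, so -1 becomes 0xF).
fn pack_s4(x: i32) -> u8 {
    (x.clamp(-8, 7) & 0xF) as u8
}

/// Saturate to the unsigned 4-bit range [0, 15], then keep the low nibble.
fn pack_u4(x: i32) -> u8 {
    (x.clamp(0, 15) & 0xF) as u8
}
```

Saturating before masking matters: masking alone would turn 16 into 0 instead of clamping it to 15.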
baracuda_kernels_cast_i32_u8_can_implement
Implementability check for cast_i32_u8.
baracuda_kernels_cast_i32_u8_run
Cast i32 -> u8.
baracuda_kernels_cast_i32_u32_can_implement
Implementability check for cast_i32_u32.
baracuda_kernels_cast_i32_u32_run
Cast i32 -> u32. Bitwise reinterpret: non-negative values are unchanged, and negative values wrap modulo 2^32 (two’s complement). Phase 31.
baracuda_kernels_cast_i64_bf16_can_implement
Implementability check for cast_i64_bf16.
baracuda_kernels_cast_i64_bf16_run
Cast i64 -> bf16.
baracuda_kernels_cast_i64_bool_can_implement
Implementability check for cast_i64_bool.
baracuda_kernels_cast_i64_bool_run
Cast i64 -> Bool.
baracuda_kernels_cast_i64_f16_can_implement
Implementability check for cast_i64_f16.
baracuda_kernels_cast_i64_f16_run
Cast i64 -> f16.
baracuda_kernels_cast_i64_f32_can_implement
Implementability check for cast_i64_f32.
baracuda_kernels_cast_i64_f32_run
Cast i64 -> f32.
baracuda_kernels_cast_i64_f64_can_implement
Implementability check for cast_i64_f64.
baracuda_kernels_cast_i64_f64_run
Cast i64 -> f64.
baracuda_kernels_cast_i64_i8_can_implement
Implementability check for cast_i64_i8.
baracuda_kernels_cast_i64_i8_run
Cast i64 -> i8.
baracuda_kernels_cast_i64_i16_can_implement
Implementability check for cast_i64_i16.
baracuda_kernels_cast_i64_i16_run
Cast i64 -> i16. Truncates to low 16 bits. Phase 31.
baracuda_kernels_cast_i64_i32_can_implement
Implementability check for cast_i64_i32.
baracuda_kernels_cast_i64_i32_run
Cast i64 -> i32.
baracuda_kernels_cast_i64_i64_can_implement
Implementability check for cast_i64_i64.
baracuda_kernels_cast_i64_i64_run
Cast i64 -> i64.
baracuda_kernels_cast_i64_s4_can_implement
Implementability check for cast_i64_s4.
baracuda_kernels_cast_i64_s4_run
Cast i64 -> S4.
baracuda_kernels_cast_i64_u4_can_implement
Implementability check for cast_i64_u4.
baracuda_kernels_cast_i64_u4_run
Cast i64 -> U4.
baracuda_kernels_cast_i64_u8_can_implement
Implementability check for cast_i64_u8.
baracuda_kernels_cast_i64_u8_run
Cast i64 -> u8.
baracuda_kernels_cast_i64_u32_can_implement
Implementability check for cast_i64_u32.
baracuda_kernels_cast_i64_u32_run
Cast i64 -> u32. Discards the top 32 bits and keeps the low 32 bits. Phase 31.
baracuda_kernels_cast_s4_f32_can_implement
Implementability check for cast_s4_f32.
baracuda_kernels_cast_s4_f32_run
Cast S4 -> f32.
baracuda_kernels_cast_s4_i32_can_implement
Implementability check for cast_s4_i32.
baracuda_kernels_cast_s4_i32_run
Cast S4 -> i32 (unpack: sign-extend nibble to int32).
baracuda_kernels_cast_s4_i64_can_implement
Implementability check for cast_s4_i64.
baracuda_kernels_cast_s4_i64_run
Cast S4 -> i64.
baracuda_kernels_cast_u4_f32_can_implement
Implementability check for cast_u4_f32.
baracuda_kernels_cast_u4_f32_run
Cast U4 -> f32.
baracuda_kernels_cast_u4_i32_can_implement
Implementability check for cast_u4_i32.
baracuda_kernels_cast_u4_i32_run
Cast U4 -> i32 (unpack: zero-extend nibble to int32).
baracuda_kernels_cast_u4_i64_can_implement
Implementability check for cast_u4_i64.
baracuda_kernels_cast_u4_i64_run
Cast U4 -> i64.
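The unpack direction for the S4 and U4 casts differs only in how the nibble is widened. A small host-side sketch, with hypothetical function names and a single nibble passed in the low four bits of a byte (which half of a packed byte each element occupies is not stated here):

```rust
/// Sign-extend a 4-bit two's-complement nibble to i32.
/// Shifting the nibble into the top of an i8 and arithmetic-shifting back
/// replicates bit 3 into the upper bits.
fn unpack_s4(nibble: u8) -> i32 {
    ((((nibble & 0xF) << 4) as i8) >> 4) as i32
}

/// Zero-extend a 4-bit nibble to i32.
fn unpack_u4(nibble: u8) -> i32 {
    (nibble & 0xF) as i32
}
```

The same bit pattern `0xF` therefore decodes to -1 as S4 and to 15 as U4.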
baracuda_kernels_cast_u8_bf16_can_implement
Implementability check for cast_u8_bf16.
baracuda_kernels_cast_u8_bf16_run
Cast u8 -> bf16.
baracuda_kernels_cast_u8_f16_can_implement
Implementability check for cast_u8_f16.
baracuda_kernels_cast_u8_f16_run
Cast u8 -> f16.
baracuda_kernels_cast_u8_f32_can_implement
Implementability check for cast_u8_f32.
baracuda_kernels_cast_u8_f32_run
Cast u8 -> f32.
baracuda_kernels_cast_u8_f64_can_implement
Implementability check for cast_u8_f64.
baracuda_kernels_cast_u8_f64_run
Cast u8 -> f64.
baracuda_kernels_cast_u8_i8_can_implement
Implementability check for cast_u8_i8.
baracuda_kernels_cast_u8_i8_run
Cast u8 -> i8.
baracuda_kernels_cast_u8_i16_can_implement
Implementability check for cast_u8_i16.
baracuda_kernels_cast_u8_i16_run
Cast u8 -> i16. Zero-extends. Phase 31.
baracuda_kernels_cast_u8_i32_can_implement
Implementability check for cast_u8_i32.
baracuda_kernels_cast_u8_i32_run
Cast u8 -> i32.
baracuda_kernels_cast_u8_i64_can_implement
Implementability check for cast_u8_i64.
baracuda_kernels_cast_u8_i64_run
Cast u8 -> i64.
baracuda_kernels_cast_u8_u8_can_implement
Implementability check for cast_u8_u8.
baracuda_kernels_cast_u8_u8_run
Cast u8 -> u8.
baracuda_kernels_cast_u8_u32_can_implement
Implementability check for cast_u8_u32.
baracuda_kernels_cast_u8_u32_run
Cast u8 -> u32. Zero-extends. Phase 31.
baracuda_kernels_cast_u32_bf16_can_implement
Implementability check for cast_u32_bf16.
baracuda_kernels_cast_u32_bf16_run
Cast u32 -> bf16. Phase 31.
baracuda_kernels_cast_u32_f16_can_implement
Implementability check for cast_u32_f16.
baracuda_kernels_cast_u32_f16_run
Cast u32 -> f16. Phase 31.
baracuda_kernels_cast_u32_f32_can_implement
Implementability check for cast_u32_f32.
baracuda_kernels_cast_u32_f32_run
Cast u32 -> f32. Phase 31.
baracuda_kernels_cast_u32_f64_can_implement
Implementability check for cast_u32_f64.
baracuda_kernels_cast_u32_f64_run
Cast u32 -> f64. Phase 31.
baracuda_kernels_cast_u32_i8_can_implement
Implementability check for cast_u32_i8.
baracuda_kernels_cast_u32_i8_run
Cast u32 -> i8. Truncates to low byte then reinterprets. Phase 31.
baracuda_kernels_cast_u32_i16_can_implement
Implementability check for cast_u32_i16.
baracuda_kernels_cast_u32_i16_run
Cast u32 -> i16. Truncates to low 16 bits then reinterprets. Phase 31.
baracuda_kernels_cast_u32_i32_can_implement
Implementability check for cast_u32_i32.
baracuda_kernels_cast_u32_i32_run
Cast u32 -> i32. Bitwise reinterpret. Phase 31.
baracuda_kernels_cast_u32_i64_can_implement
Implementability check for cast_u32_i64.
baracuda_kernels_cast_u32_i64_run
Cast u32 -> i64. Zero-extends. Phase 31.
baracuda_kernels_cast_u32_u8_can_implement
Implementability check for cast_u32_u8.
baracuda_kernels_cast_u32_u8_run
Cast u32 -> u8. Truncates to low byte. Phase 31.
baracuda_kernels_cast_u32_u32_can_implement
Implementability check for cast_u32_u32.
baracuda_kernels_cast_u32_u32_run
Cast u32 -> u32 (identity). Phase 31.
baracuda_kernels_cholesky_batched_f32_run
Cholesky factorization (batched). Each a_array[b] is overwritten with the requested triangular factor. cuSOLVER’s potrfBatched is workspace-free internally but needs a device-resident array of device pointers — caller responsibility.
baracuda_kernels_cholesky_batched_f64_run
Cholesky factorization (batched). Each a_array[b] is overwritten with the requested triangular factor. cuSOLVER’s potrfBatched is workspace-free internally but needs a device-resident array of device pointers — caller responsibility.
baracuda_kernels_cholesky_f32_run
Cholesky factorization (non-batched). Overwrites a_inout in place with the requested triangular factor. uplo is 0 (lower, CUBLAS_FILL_MODE_LOWER) or 1 (upper, CUBLAS_FILL_MODE_UPPER).
baracuda_kernels_cholesky_f32_workspace_size
Cholesky factorization workspace size in bytes for the non-batched potrf path. Returns 0 on success and writes the byte count to *out_bytes; non-zero status on cuSOLVER failure (handle allocation / bufferSize query). Batched potrfBatched is workspace-free and has no equivalent query.
baracuda_kernels_cholesky_f64_run
Cholesky factorization (non-batched). Overwrites a_inout in place with the requested triangular factor. uplo is 0 (lower, CUBLAS_FILL_MODE_LOWER) or 1 (upper, CUBLAS_FILL_MODE_UPPER).
baracuda_kernels_cholesky_f64_workspace_size
Cholesky factorization workspace size in bytes for the non-batched potrf path. Returns 0 on success and writes the byte count to *out_bytes; non-zero status on cuSOLVER failure (handle allocation / bufferSize query). Batched potrfBatched is workspace-free and has no equivalent query.
baracuda_kernels_col2im_1d_bf16_can_implement
Implementability check for col2im_1d_bf16.
baracuda_kernels_col2im_1d_bf16_run
col2im 1-D, bf16. Caller must zero output first.
baracuda_kernels_col2im_1d_f16_can_implement
Implementability check for col2im_1d_f16.
baracuda_kernels_col2im_1d_f16_run
col2im 1-D, f16. Caller must zero output first.
baracuda_kernels_col2im_1d_f32_can_implement
Implementability check for col2im_1d_f32.
baracuda_kernels_col2im_1d_f32_run
col2im 1-D, f32. Caller must zero output first.
baracuda_kernels_col2im_1d_f64_can_implement
Implementability check for col2im_1d_f64.
baracuda_kernels_col2im_1d_f64_run
col2im 1-D, f64. Caller must zero output first.
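The "zero output first" requirement is presumably there because col2im accumulates into the output: overlapping windows add their contributions to the same element. A single-channel host-side sketch under assumed conventions (columns laid out as `[kernel][out_len]`, no padding, dilation 1); the function name, argument list, and layout are assumptions and may not match the actual kernel signature:

```rust
/// Hypothetical reference: scatter-add columns back into a 1-D signal.
/// `cols` holds `kernel * out_len` values laid out as [kernel][out_len].
fn col2im_1d_ref(cols: &[f32], output: &mut [f32], kernel: usize, stride: usize) {
    let out_len = cols.len() / kernel;
    for j in 0..kernel {
        for o in 0..out_len {
            // `+=` is why `output` must start at zero:
            // stale values would be added to the result.
            output[o * stride + j] += cols[j * out_len + o];
        }
    }
}
```

With a kernel of 2 and stride 1 over a length-4 signal, the interior elements are covered by two windows each, so all-ones columns accumulate to `[1, 2, 2, 1]` in a zeroed buffer.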
baracuda_kernels_concat2_backward_bf16_can_implement
Implementability check for concat2_backward_bf16.
baracuda_kernels_concat2_backward_bf16_run
Concat2 backward (slice-split), bf16. See f32 variant.
baracuda_kernels_concat2_backward_f16_can_implement
Implementability check for concat2_backward_f16.
baracuda_kernels_concat2_backward_f16_run
Concat2 backward (slice-split), f16. See f32 variant.
baracuda_kernels_concat2_backward_f32_can_implement
Implementability check for concat2_backward_f32.
baracuda_kernels_concat2_backward_f32_run
Concat2 backward (slice-split), f32. Bit-exact, no arithmetic.
baracuda_kernels_concat2_backward_f64_can_implement
Implementability check for concat2_backward_f64.
baracuda_kernels_concat2_backward_f64_run
Concat2 backward (slice-split), f64. See f32 variant.
baracuda_kernels_concat2_bf16_can_implement
Implementability check for concat2_bf16.
baracuda_kernels_concat2_bf16_run
cat(a, b, dim), bf16, contig output. See f32 variant.
baracuda_kernels_concat2_f16_can_implement
Implementability check for concat2_f16.
baracuda_kernels_concat2_f16_run
cat(a, b, dim), f16, contig output. See f32 variant.
baracuda_kernels_concat2_f32_can_implement
Implementability check for concat2_f32.
baracuda_kernels_concat2_f32_run
cat(a, b, dim), f32, contig output.
baracuda_kernels_concat2_f64_can_implement
Implementability check for concat2_f64.
baracuda_kernels_concat2_f64_run
cat(a, b, dim), f64, contig output. See f32 variant.
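For contiguous inputs, `cat(a, b, dim)` reduces to interleaving fixed-size runs from `a` and `b`, one pair per index of the dimensions before `dim`. A host-side sketch under that assumption; `concat2_ref` and its parameters are hypothetical and do not reflect the FFI signature:

```rust
/// Hypothetical reference for cat(a, b, dim) with contiguous inputs.
/// `outer` is the product of dimensions before `dim`;
/// `a_run` / `b_run` are (size along `dim`) * (product of dimensions after `dim`).
fn concat2_ref(a: &[f32], b: &[f32], outer: usize, a_run: usize, b_run: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    for o in 0..outer {
        out.extend_from_slice(&a[o * a_run..(o + 1) * a_run]);
        out.extend_from_slice(&b[o * b_run..(o + 1) * b_run]);
    }
    out
}
```

The backward (slice-split) kernels would be the inverse of this copy, which is consistent with their "bit-exact, no arithmetic" note.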
baracuda_kernels_contiguize_b1_can_implement
Implementability check for contiguize_b1.
baracuda_kernels_contiguize_b1_run
Contiguize, 1-byte element (Bool, S8, U8, Fp8E4M3, Fp8E5M2).
baracuda_kernels_contiguize_b2_can_implement
Implementability check for contiguize_b2.
baracuda_kernels_contiguize_b2_run
Contiguize, 2-byte element (f16, bf16).
baracuda_kernels_contiguize_b4_can_implement
Implementability check for contiguize_b4.
baracuda_kernels_contiguize_b4_run
Contiguize, 4-byte element (f32, F32Strict, i32).
baracuda_kernels_contiguize_b8_can_implement
Implementability check for contiguize_b8.
baracuda_kernels_contiguize_b8_run
Contiguize, 8-byte element (f64, i64, Complex32).
baracuda_kernels_contiguize_b16_can_implement
Implementability check for contiguize_b16.
baracuda_kernels_contiguize_b16_run
Contiguize, 16-byte element (Complex64).
baracuda_kernels_contiguize_nibble_can_implement
Implementability check for contiguize_nibble.
baracuda_kernels_contiguize_nibble_run
Contiguize, nibble-packed (S4 / U4). Returns status 3 (Unsupported) when the source’s innermost stride is not one of {1, -1, 2} — i.e. when the source layout breaks nibble alignment.
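Contiguizing is a strided gather: each output element, visited in row-major order, is read from the source at an offset computed from the strides. A 2-D host-side sketch for a 4-byte element type, with a hypothetical function name and signature (the real kernels are keyed on element width rather than type, and their rank limits are not stated here):

```rust
/// Hypothetical reference: copy a strided 2-D view into a dense row-major Vec.
/// Strides are in elements and may be negative; `offset` is the element
/// index of logical position (0, 0) in `src`.
fn contiguize_ref(src: &[f32], shape: [usize; 2], strides: [isize; 2], offset: isize) -> Vec<f32> {
    let mut out = Vec::with_capacity(shape[0] * shape[1]);
    for i in 0..shape[0] {
        for j in 0..shape[1] {
            let idx = offset + i as isize * strides[0] + j as isize * strides[1];
            out.push(src[idx as usize]);
        }
    }
    out
}
```

Passing shape `[3, 2]` with strides `[1, 3]` over a row-major 2×3 buffer materializes its transpose.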
baracuda_kernels_curand_normal_f32_run
Sample numel f32 cells from Normal(mean, stddev).
baracuda_kernels_curand_normal_f32_workspace_size
Normal-sampler workspace size in bytes for f32 — always 0.
baracuda_kernels_curand_normal_f64_run
Sample numel f64 cells from Normal(mean, stddev).
baracuda_kernels_curand_normal_f64_workspace_size
Normal-sampler workspace size in bytes for f64 — always 0.
baracuda_kernels_curand_uniform_f32_run
Sample numel f32 cells from Uniform(low, high].
baracuda_kernels_curand_uniform_f32_workspace_size
Uniform-sampler workspace size in bytes for f32 — always 0.
baracuda_kernels_curand_uniform_f64_run
Sample numel f64 cells from Uniform(low, high].
baracuda_kernels_curand_uniform_f64_workspace_size
Uniform-sampler workspace size in bytes for f64 — always 0.
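The half-open interval `(low, high]` matches cuRAND's unit-interval convention, where uniform samples exclude 0.0 and include 1.0. A sketch of the affine rescaling this suggests; that the kernel uses exactly this formula is an assumption, and `scale_uniform` is a hypothetical name:

```rust
/// Hypothetical helper: map a unit sample `u` in (0, 1] to (low, high].
/// `u == 1.0` maps to `high`; `u` never reaches 0, so `low` is excluded
/// (up to floating-point rounding).
fn scale_uniform(u: f32, low: f32, high: f32) -> f32 {
    low + (high - low) * u
}
```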
baracuda_kernels_dequantize_per_channel_backward_bf16_can_implement
Implementability check for dequantize_per_channel_backward_bf16.
baracuda_kernels_dequantize_per_channel_backward_bf16_run
dequantize_per_channel_backward — bf16.
baracuda_kernels_dequantize_per_channel_backward_f16_can_implement
Implementability check for dequantize_per_channel_backward_f16.
baracuda_kernels_dequantize_per_channel_backward_f16_run
dequantize_per_channel_backward — f16.
baracuda_kernels_dequantize_per_channel_backward_f32_can_implement
Implementability check for dequantize_per_channel_backward_f32.
baracuda_kernels_dequantize_per_channel_backward_f32_run
dq[i] = dy[i] * scale[c]. f32.
baracuda_kernels_dequantize_per_channel_backward_f64_can_implement
Implementability check for dequantize_per_channel_backward_f64.
baracuda_kernels_dequantize_per_channel_backward_f64_run
dequantize_per_channel_backward — f64.
baracuda_kernels_dequantize_per_channel_bf16_s8_can_implement
Implementability check for dequantize_per_channel_bf16_s8.
baracuda_kernels_dequantize_per_channel_bf16_s8_run
dequantize_per_channel — s8 → bf16.
baracuda_kernels_dequantize_per_channel_bf16_u8_can_implement
Implementability check for dequantize_per_channel_bf16_u8.
baracuda_kernels_dequantize_per_channel_bf16_u8_run
dequantize_per_channel — u8 → bf16.
baracuda_kernels_dequantize_per_channel_f16_s8_can_implement
Implementability check for dequantize_per_channel_f16_s8.
baracuda_kernels_dequantize_per_channel_f16_s8_run
dequantize_per_channel — s8 → f16.
baracuda_kernels_dequantize_per_channel_f16_u8_can_implement
Implementability check for dequantize_per_channel_f16_u8.
baracuda_kernels_dequantize_per_channel_f16_u8_run
dequantize_per_channel — u8 → f16.
baracuda_kernels_dequantize_per_channel_f32_s8_can_implement
Implementability check for dequantize_per_channel_f32_s8.
baracuda_kernels_dequantize_per_channel_f32_s8_run
x[i] = scale[c] * (q[i] - zp[c]). s8 → f32.
baracuda_kernels_dequantize_per_channel_f32_u8_can_implement
Implementability check for dequantize_per_channel_f32_u8.
baracuda_kernels_dequantize_per_channel_f32_u8_run
dequantize_per_channel — u8 → f32.
baracuda_kernels_dequantize_per_channel_f64_s8_can_implement
Implementability check for dequantize_per_channel_f64_s8.
baracuda_kernels_dequantize_per_channel_f64_s8_run
dequantize_per_channel — s8 → f64.
baracuda_kernels_dequantize_per_channel_f64_u8_can_implement
Implementability check for dequantize_per_channel_f64_u8.
baracuda_kernels_dequantize_per_channel_f64_u8_run
dequantize_per_channel — u8 → f64.
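The per-channel formula `x[i] = scale[c] * (q[i] - zp[c])` can be checked against a host-side reference. The sketch below assumes a `[channels, inner]` layout so that `c = i / inner`, and an `i32` zero point; both are assumptions, since the channel axis and argument types are not given in this listing:

```rust
/// Hypothetical reference for s8 -> f32 per-channel dequantization.
/// `inner` is the number of contiguous elements that share one channel.
fn dequantize_per_channel_ref(q: &[i8], scale: &[f32], zp: &[i32], inner: usize) -> Vec<f32> {
    q.iter()
        .enumerate()
        .map(|(i, &qi)| {
            let c = i / inner; // channel index of element i
            scale[c] * (qi as i32 - zp[c]) as f32
        })
        .collect()
}
```

The per-tensor variants use the same formula with a single `scale` and `zp` shared by every element.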
baracuda_kernels_dequantize_per_group_backward_bf16_can_implement
Implementability check for dequantize_per_group_backward_bf16.
baracuda_kernels_dequantize_per_group_backward_bf16_run
Dequant BW — bf16.
baracuda_kernels_dequantize_per_group_backward_f16_can_implement
Implementability check for dequantize_per_group_backward_f16.
baracuda_kernels_dequantize_per_group_backward_f16_run
Dequant BW — f16.
baracuda_kernels_dequantize_per_group_backward_f32_can_implement
Implementability check for dequantize_per_group_backward_f32.
baracuda_kernels_dequantize_per_group_backward_f32_run
Dequant BW — f32.
baracuda_kernels_dequantize_per_group_backward_f64_can_implement
Implementability check for dequantize_per_group_backward_f64.
baracuda_kernels_dequantize_per_group_backward_f64_run
Dequant BW — f64.
baracuda_kernels_dequantize_per_group_bf16_s8_can_implement
Implementability check for dequantize_per_group_bf16_s8.
baracuda_kernels_dequantize_per_group_bf16_s8_run
Dequant — bf16, s8.
baracuda_kernels_dequantize_per_group_bf16_u8_can_implement
Implementability check for dequantize_per_group_bf16_u8.
baracuda_kernels_dequantize_per_group_bf16_u8_run
Dequant — bf16, u8.
baracuda_kernels_dequantize_per_group_f16_s8_can_implement
Implementability check for dequantize_per_group_f16_s8.
baracuda_kernels_dequantize_per_group_f16_s8_run
Dequant — f16, s8.
baracuda_kernels_dequantize_per_group_f16_u8_can_implement
Implementability check for dequantize_per_group_f16_u8.
baracuda_kernels_dequantize_per_group_f16_u8_run
Dequant — f16, u8.
baracuda_kernels_dequantize_per_group_f32_s8_can_implement
Implementability check for dequantize_per_group_f32_s8.
baracuda_kernels_dequantize_per_group_f32_s8_run
Dequant — f32, s8.
baracuda_kernels_dequantize_per_group_f32_u8_can_implement
Implementability check for dequantize_per_group_f32_u8.
baracuda_kernels_dequantize_per_group_f32_u8_run
Dequant — f32, u8.
baracuda_kernels_dequantize_per_group_f64_s8_can_implement
Implementability check for dequantize_per_group_f64_s8.
baracuda_kernels_dequantize_per_group_f64_s8_run
Dequant — f64, s8.
baracuda_kernels_dequantize_per_group_f64_u8_can_implement
Implementability check for dequantize_per_group_f64_u8.
baracuda_kernels_dequantize_per_group_f64_u8_run
Dequant — f64, u8.
baracuda_kernels_dequantize_per_tensor_backward_bf16_can_implement
Implementability check for dequantize_per_tensor_backward_bf16.
baracuda_kernels_dequantize_per_tensor_backward_bf16_run
dequantize_per_tensor_backward — bf16.
baracuda_kernels_dequantize_per_tensor_backward_f16_can_implement
Implementability check for dequantize_per_tensor_backward_f16.
baracuda_kernels_dequantize_per_tensor_backward_f16_run
dequantize_per_tensor_backward — f16.
baracuda_kernels_dequantize_per_tensor_backward_f32_can_implement
Implementability check for dequantize_per_tensor_backward_f32.
baracuda_kernels_dequantize_per_tensor_backward_f32_run
dq = dy * scale. f32.
baracuda_kernels_dequantize_per_tensor_backward_f64_can_implement
Implementability check for dequantize_per_tensor_backward_f64.
baracuda_kernels_dequantize_per_tensor_backward_f64_run
dequantize_per_tensor_backward — f64.
baracuda_kernels_dequantize_per_tensor_bf16_s8_can_implement
Implementability check for dequantize_per_tensor_bf16_s8.
baracuda_kernels_dequantize_per_tensor_bf16_s8_run
dequantize_per_tensor — s8 → bf16.
baracuda_kernels_dequantize_per_tensor_bf16_u8_can_implement
Implementability check for dequantize_per_tensor_bf16_u8.
baracuda_kernels_dequantize_per_tensor_bf16_u8_run
dequantize_per_tensor — u8 → bf16.
baracuda_kernels_dequantize_per_tensor_f16_s8_can_implement
Implementability check for dequantize_per_tensor_f16_s8.
baracuda_kernels_dequantize_per_tensor_f16_s8_run
dequantize_per_tensor — s8 → f16.
baracuda_kernels_dequantize_per_tensor_f16_u8_can_implement
Implementability check for dequantize_per_tensor_f16_u8.
baracuda_kernels_dequantize_per_tensor_f16_u8_run
dequantize_per_tensor — u8 → f16.
baracuda_kernels_dequantize_per_tensor_f32_s8_can_implement
Implementability check for dequantize_per_tensor_f32_s8.
baracuda_kernels_dequantize_per_tensor_f32_s8_run
x = scale * (q - zp). s8 → f32.
baracuda_kernels_dequantize_per_tensor_f32_u8_can_implement
Implementability check for dequantize_per_tensor_f32_u8.
baracuda_kernels_dequantize_per_tensor_f32_u8_run
dequantize_per_tensor — u8 → f32.
baracuda_kernels_dequantize_per_tensor_f64_s8_can_implement
Implementability check for dequantize_per_tensor_f64_s8.
baracuda_kernels_dequantize_per_tensor_f64_s8_run
dequantize_per_tensor — s8 → f64.
baracuda_kernels_dequantize_per_tensor_f64_u8_can_implement
Implementability check for dequantize_per_tensor_f64_u8.
baracuda_kernels_dequantize_per_tensor_f64_u8_run
dequantize_per_tensor — u8 → f64.
baracuda_kernels_dequantize_per_token_backward_bf16_can_implement
Implementability check for dequantize_per_token_backward_bf16.
baracuda_kernels_dequantize_per_token_backward_bf16_run
Dequant BW — bf16.
baracuda_kernels_dequantize_per_token_backward_f16_can_implement
Implementability check for dequantize_per_token_backward_f16.
baracuda_kernels_dequantize_per_token_backward_f16_run
Dequant BW — f16.
baracuda_kernels_dequantize_per_token_backward_f32_can_implement
Implementability check for dequantize_per_token_backward_f32.
baracuda_kernels_dequantize_per_token_backward_f32_run
Dequant BW — f32.
baracuda_kernels_dequantize_per_token_backward_f64_can_implement
Implementability check for dequantize_per_token_backward_f64.
baracuda_kernels_dequantize_per_token_backward_f64_run
Dequant BW — f64.
baracuda_kernels_dequantize_per_token_bf16_s8_can_implement
Implementability check for dequantize_per_token_bf16_s8.
baracuda_kernels_dequantize_per_token_bf16_s8_run
dequantize_per_token — q s8 → y bf16.
baracuda_kernels_dequantize_per_token_bf16_u8_can_implement
Implementability check for dequantize_per_token_bf16_u8.
baracuda_kernels_dequantize_per_token_bf16_u8_run
dequantize_per_token — q u8 → y bf16.
baracuda_kernels_dequantize_per_token_f16_s8_can_implement
Implementability check for dequantize_per_token_f16_s8.
baracuda_kernels_dequantize_per_token_f16_s8_run
dequantize_per_token — q s8 → y f16.
baracuda_kernels_dequantize_per_token_f16_u8_can_implement
Implementability check for dequantize_per_token_f16_u8.
baracuda_kernels_dequantize_per_token_f16_u8_run
dequantize_per_token — q u8 → y f16.
baracuda_kernels_dequantize_per_token_f32_s8_can_implement
Implementability check for dequantize_per_token_f32_s8.
baracuda_kernels_dequantize_per_token_f32_s8_run
dequantize_per_token — q s8 → y f32.
baracuda_kernels_dequantize_per_token_f32_u8_can_implement
Implementability check for dequantize_per_token_f32_u8.
baracuda_kernels_dequantize_per_token_f32_u8_run
dequantize_per_token — q u8 → y f32.
baracuda_kernels_dequantize_per_token_f64_s8_can_implement
Implementability check for dequantize_per_token_f64_s8.
baracuda_kernels_dequantize_per_token_f64_s8_run
dequantize_per_token — q s8 → y f64.
baracuda_kernels_dequantize_per_token_f64_u8_can_implement
Implementability check for dequantize_per_token_f64_u8.
baracuda_kernels_dequantize_per_token_f64_u8_run
dequantize_per_token — q u8 → y f64.
baracuda_kernels_dequantize_q2_K_can_implement
Implementability check for dequantize_q2_K.
baracuda_kernels_dequantize_q2_K_run
GGUF Q2_K dequantize → f32. numel must be a multiple of 256.
baracuda_kernels_dequantize_q3_K_can_implement
Implementability check for dequantize_q3_K.
baracuda_kernels_dequantize_q3_K_run
GGUF Q3_K dequantize → f32. Safety: same requirements as Q2_K.
baracuda_kernels_dequantize_q4_0_can_implement
Implementability check for dequantize_q4_0.
baracuda_kernels_dequantize_q4_0_run
GGUF Q4_0 block-format dequantize → f32. numel must be a multiple of 32. Safety: x and y must be device-resident; the stream must be valid.
baracuda_kernels_dequantize_q4_1_can_implement
Implementability check for dequantize_q4_1.
baracuda_kernels_dequantize_q4_1_run
GGUF Q4_1 dequantize → f32. Safety: same requirements as Q4_0.
baracuda_kernels_dequantize_q4_K_can_implement
Implementability check for dequantize_q4_K.
baracuda_kernels_dequantize_q4_K_run
GGUF Q4_K dequantize → f32. Safety: same requirements as Q2_K.
baracuda_kernels_dequantize_q5_0_can_implement
Implementability check for dequantize_q5_0.
baracuda_kernels_dequantize_q5_0_run
GGUF Q5_0 dequantize → f32. Safety: same requirements as Q4_0.
baracuda_kernels_dequantize_q5_1_can_implement
Implementability check for dequantize_q5_1.
baracuda_kernels_dequantize_q5_1_run
GGUF Q5_1 dequantize → f32. Safety: same requirements as Q4_0.
baracuda_kernels_dequantize_q5_K_can_implement
Implementability check for dequantize_q5_K.
baracuda_kernels_dequantize_q5_K_run
GGUF Q5_K dequantize → f32. Safety: same requirements as Q2_K.
baracuda_kernels_dequantize_q6_K_can_implement
Implementability check for dequantize_q6_K.
baracuda_kernels_dequantize_q6_K_run
GGUF Q6_K dequantize → f32. Safety: same requirements as Q2_K.
baracuda_kernels_dequantize_q8_0_can_implement
Implementability check for dequantize_q8_0.
baracuda_kernels_dequantize_q8_0_run
GGUF Q8_0 dequantize → f32. Safety: same requirements as Q4_0.
baracuda_kernels_dequantize_q8_K_can_implement
Implementability check for dequantize_q8_K.
baracuda_kernels_dequantize_q8_K_run
GGUF Q8_K dequantize → f32. Safety: same requirements as Q2_K.
baracuda_kernels_dropout_backward_f32_can_implement
Implementability check for dropout_backward_f32.
baracuda_kernels_dropout_backward_f32_run
Dropout backward (f32). Writes dx[i] = dy[i] * mask[i] * scale where scale = 1 / (1 - p).
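The backward rule above can be checked on the host with a few lines of Rust. This is only a CPU reference sketch, not the kernel: the function name is hypothetical, and representing the mask as `f32` values of 0.0 (dropped) or 1.0 (kept) is an assumption, since the summary does not state the mask dtype.

```rust
/// CPU reference for dx[i] = dy[i] * mask[i] * scale, with scale = 1 / (1 - p).
/// Assumption: `mask` stores 0.0 for dropped elements and 1.0 for kept ones.
fn dropout_backward_ref(dy: &[f32], mask: &[f32], p: f32) -> Vec<f32> {
    assert_eq!(dy.len(), mask.len(), "dy and mask must have the same length");
    assert!((0.0..1.0).contains(&p), "p must lie in [0, 1)");
    let scale = 1.0 / (1.0 - p);
    dy.iter().zip(mask).map(|(g, m)| g * m * scale).collect()
}
```

With `p = 0.5` the scale is 2, so kept gradients double and dropped ones become zero.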
baracuda_kernels_dropout_backward_f64_can_implement
Implementability check for dropout_backward_f64.
baracuda_kernels_dropout_backward_f64_run
Dropout backward (f64).
baracuda_kernels_dropout_f32_can_implement
Implementability check for dropout_f32.
baracuda_kernels_dropout_f32_run
Dropout forward (f32). Writes:
baracuda_kernels_dropout_f64_can_implement
Implementability check for dropout_f64.
baracuda_kernels_dropout_f64_run
Dropout forward (f64). Same shape as the f32 variant.
baracuda_kernels_dynamic_range_quantize_per_token_sym_f32_s8_can_implement
Implementability check for dynamic_range_quantize_per_token_sym_f32_s8.
baracuda_kernels_dynamic_range_quantize_per_token_sym_f32_s8_run
dynamic_range_quantize_per_token_sym — f32 → s8.
baracuda_kernels_dynamic_range_quantize_per_token_sym_f64_s8_can_implement
Implementability check for dynamic_range_quantize_per_token_sym_f64_s8.
baracuda_kernels_dynamic_range_quantize_per_token_sym_f64_s8_run
dynamic_range_quantize_per_token_sym — f64 → s8.
baracuda_kernels_eig_run
General eigendecomposition via Xgeev. a_inout is destroyed in place. dtype_tag selects between f32 / f64 / Complex32 / Complex64 (matches the input dtype; outputs use the same dtype). For real input, w_out is [2 * n] (packed wr/wi); for complex input, [n]. Workspace is split host + device per cuSOLVER’s 64-bit API convention.
baracuda_kernels_eig_workspace_size
Eig workspace sizes (Xgeev). Writes two byte counts — device + host. Caller must size both.
baracuda_kernels_eigh_c32_run
Hermitian eigendecomposition (Complex32). Eigenvalues are real f32 (the Hermitian eigenvalue spectrum is always real); the eigenvalues_out buffer is f32[n], not Complex32[n].
baracuda_kernels_eigh_c32_workspace_size
Hermitian eigendecomposition workspace size (Complex32).
baracuda_kernels_eigh_c64_run
Hermitian eigendecomposition (Complex64). Eigenvalues are real f64; eigenvalues_out is f64[n], not Complex64[n].
baracuda_kernels_eigh_c64_workspace_size
Hermitian eigendecomposition workspace size (Complex64).
baracuda_kernels_eigh_f32_run
Symmetric eigendecomposition A · v = λ · v. a_inout is overwritten with the eigenvector matrix (column-major); eigenvalues_out receives the n eigenvalues sorted ascending.
baracuda_kernels_eigh_f32_workspace_size
Eigh workspace size in bytes for the real symmetric syevd path.
baracuda_kernels_eigh_f64_run
Symmetric eigendecomposition A · v = λ · v. a_inout is overwritten with the eigenvector matrix (column-major); eigenvalues_out receives the n eigenvalues sorted ascending.
baracuda_kernels_eigh_f64_workspace_size
Eigh workspace size in bytes for the real symmetric syevd path.
baracuda_kernels_embedding_backward_f32_can_implement
Implementability check for embedding_backward_f32.
baracuda_kernels_embedding_backward_f32_run
embedding BW — dweight[indices[n], :] += dout[n, :] (atomicAdd), skipping rows where indices[n] == padding_idx. f32.
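The scatter-add described above can be mirrored on the host as a sequential loop. The sketch below is a CPU reference under stated assumptions: the function name is hypothetical, `dweight` and `dout` are dense row-major buffers with `dim` columns, and the GPU `atomicAdd` is replaced by a plain `+=` because there is no concurrency here.

```rust
/// CPU reference for dweight[indices[n], :] += dout[n, :],
/// skipping rows where indices[n] == padding_idx.
/// Assumption: row-major [rows, dim] layout for both dweight and dout.
fn embedding_backward_ref(
    dweight: &mut [f32],
    dout: &[f32],
    indices: &[i32],
    dim: usize,
    padding_idx: i32,
) {
    assert_eq!(dout.len(), indices.len() * dim);
    for (n, &idx) in indices.iter().enumerate() {
        if idx == padding_idx {
            continue; // padding rows receive no gradient
        }
        assert!(idx >= 0, "indices must be non-negative");
        let row = idx as usize;
        for d in 0..dim {
            dweight[row * dim + d] += dout[n * dim + d];
        }
    }
}
```

Repeated indices accumulate into the same row, which is why the device version needs atomics.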
baracuda_kernels_embedding_backward_f64_can_implement
Implementability check for embedding_backward_f64.
baracuda_kernels_embedding_backward_f64_run
embedding BW — f64.
baracuda_kernels_embedding_backward_i64idx_f32_can_implement
Implementability check for embedding_backward_i64idx_f32.
baracuda_kernels_embedding_backward_i64idx_f32_run
embedding BW — f32, i64 indices.
baracuda_kernels_embedding_backward_i64idx_f64_can_implement
Implementability check for embedding_backward_i64idx_f64.
baracuda_kernels_embedding_backward_i64idx_f64_run
embedding BW — f64, i64 indices.
baracuda_kernels_embedding_bag_backward_f32_can_implement
Implementability check for embedding_bag_backward_f32.
baracuda_kernels_embedding_bag_backward_f32_run
embedding_bag BW — atomicAdd into dweight. f32.
baracuda_kernels_embedding_bag_backward_f64_can_implement
Implementability check for embedding_bag_backward_f64.
baracuda_kernels_embedding_bag_backward_f64_run
embedding_bag BW — f64.
baracuda_kernels_embedding_bag_backward_i64idx_f32_can_implement
Implementability check for embedding_bag_backward_i64idx_f32.
baracuda_kernels_embedding_bag_backward_i64idx_f32_run
embedding_bag BW — f32, i64 indices.
baracuda_kernels_embedding_bag_backward_i64idx_f64_can_implement
Implementability check for embedding_bag_backward_i64idx_f64.
baracuda_kernels_embedding_bag_backward_i64idx_f64_run
embedding_bag BW — f64, i64 indices.
baracuda_kernels_embedding_bag_bf16_can_implement
Implementability check for embedding_bag_bf16.
baracuda_kernels_embedding_bag_bf16_run
embedding_bag FW — bf16.
baracuda_kernels_embedding_bag_f16_can_implement
Implementability check for embedding_bag_f16.
baracuda_kernels_embedding_bag_f16_run
embedding_bag FW — f16.
baracuda_kernels_embedding_bag_f32_can_implement
Implementability check for embedding_bag_f32.
baracuda_kernels_embedding_bag_f32_run
embedding_bag FW — f32.
baracuda_kernels_embedding_bag_f64_can_implement
Implementability check for embedding_bag_f64.
baracuda_kernels_embedding_bag_f64_run
embedding_bag FW — f64.
baracuda_kernels_embedding_bag_i64idx_bf16_can_implement
Implementability check for embedding_bag_i64idx_bf16.
baracuda_kernels_embedding_bag_i64idx_bf16_run
embedding_bag FW — bf16, i64 indices.
baracuda_kernels_embedding_bag_i64idx_f16_can_implement
Implementability check for embedding_bag_i64idx_f16.
baracuda_kernels_embedding_bag_i64idx_f16_run
embedding_bag FW — f16, i64 indices.
baracuda_kernels_embedding_bag_i64idx_f32_can_implement
Implementability check for embedding_bag_i64idx_f32.
baracuda_kernels_embedding_bag_i64idx_f32_run
embedding_bag FW — f32, i64 indices.
baracuda_kernels_embedding_bag_i64idx_f64_can_implement
Implementability check for embedding_bag_i64idx_f64.
baracuda_kernels_embedding_bag_i64idx_f64_run
embedding_bag FW — f64, i64 indices.
baracuda_kernels_embedding_bag_max_backward_f32_can_implement
Implementability check for embedding_bag_max_backward_f32.
baracuda_kernels_embedding_bag_max_backward_f32_run
embedding_bag_max BW — f32. Index dtype is fixed at i32 (set by the FW’s out_index output).
baracuda_kernels_embedding_bag_max_backward_f64_can_implement
Implementability check for embedding_bag_max_backward_f64.
baracuda_kernels_embedding_bag_max_backward_f64_run
embedding_bag_max BW — f64.
baracuda_kernels_embedding_bag_max_bf16_can_implement
Implementability check for embedding_bag_max_bf16.
baracuda_kernels_embedding_bag_max_bf16_run
embedding_bag_max FW — bf16.
baracuda_kernels_embedding_bag_max_f16_can_implement
Implementability check for embedding_bag_max_f16.
baracuda_kernels_embedding_bag_max_f16_run
embedding_bag_max FW — f16.
baracuda_kernels_embedding_bag_max_f32_can_implement
Implementability check for embedding_bag_max_f32.
baracuda_kernels_embedding_bag_max_f32_run
embedding_bag Max-mode FW — f32 (i32 indices).
baracuda_kernels_embedding_bag_max_f64_can_implement
Implementability check for embedding_bag_max_f64.
baracuda_kernels_embedding_bag_max_f64_run
embedding_bag_max FW — f64.
baracuda_kernels_embedding_bag_max_i64idx_bf16_can_implement
Implementability check for embedding_bag_max_i64idx_bf16.
baracuda_kernels_embedding_bag_max_i64idx_bf16_run
embedding_bag_max FW — bf16, i64 indices.
baracuda_kernels_embedding_bag_max_i64idx_f16_can_implement
Implementability check for embedding_bag_max_i64idx_f16.
baracuda_kernels_embedding_bag_max_i64idx_f16_run
embedding_bag_max FW — f16, i64 indices.
baracuda_kernels_embedding_bag_max_i64idx_f32_can_implement
Implementability check for embedding_bag_max_i64idx_f32.
baracuda_kernels_embedding_bag_max_i64idx_f32_run
embedding_bag_max FW — f32, i64 indices.
baracuda_kernels_embedding_bag_max_i64idx_f64_can_implement
Implementability check for embedding_bag_max_i64idx_f64.
baracuda_kernels_embedding_bag_max_i64idx_f64_run
embedding_bag_max FW — f64, i64 indices.
baracuda_kernels_embedding_bf16_can_implement
Implementability check for embedding_bf16.
baracuda_kernels_embedding_bf16_run
embedding FW — bf16.
baracuda_kernels_embedding_f16_can_implement
Implementability check for embedding_f16.
baracuda_kernels_embedding_f16_run
embedding FW — f16.
baracuda_kernels_embedding_f32_can_implement
Implementability check for embedding_f32.
baracuda_kernels_embedding_f32_run
embedding FW — f32 (pure copy).
baracuda_kernels_embedding_f64_can_implement
Implementability check for embedding_f64.
baracuda_kernels_embedding_f64_run
embedding FW — f64.
baracuda_kernels_embedding_i64idx_bf16_can_implement
Implementability check for embedding_i64idx_bf16.
baracuda_kernels_embedding_i64idx_bf16_run
embedding FW — bf16, i64 indices.
baracuda_kernels_embedding_i64idx_f16_can_implement
Implementability check for embedding_i64idx_f16.
baracuda_kernels_embedding_i64idx_f16_run
embedding FW — f16, i64 indices.
baracuda_kernels_embedding_i64idx_f32_can_implement
Implementability check for embedding_i64idx_f32.
baracuda_kernels_embedding_i64idx_f32_run
embedding FW — f32, i64 indices.
baracuda_kernels_embedding_i64idx_f64_can_implement
Implementability check for embedding_i64idx_f64.
baracuda_kernels_embedding_i64idx_f64_run
embedding FW — f64, i64 indices.
baracuda_kernels_fake_quantize_backward_bf16_can_implement
Implementability check for fake_quantize_backward_bf16.
baracuda_kernels_fake_quantize_backward_bf16_run
fake_quantize_backward — bf16.
baracuda_kernels_fake_quantize_backward_f16_can_implement
Implementability check for fake_quantize_backward_f16.
baracuda_kernels_fake_quantize_backward_f16_run
fake_quantize_backward — f16.
baracuda_kernels_fake_quantize_backward_f32_can_implement
Implementability check for fake_quantize_backward_f32.
baracuda_kernels_fake_quantize_backward_f32_run
dx = dy * in_range_mask(x). Straight-through estimator (STE), no 1/scale factor. f32.
baracuda_kernels_fake_quantize_backward_f64_can_implement
Implementability check for fake_quantize_backward_f64.
baracuda_kernels_fake_quantize_backward_f64_run
fake_quantize_backward — f64.
baracuda_kernels_fake_quantize_bf16_can_implement
Implementability check for fake_quantize_bf16.
baracuda_kernels_fake_quantize_bf16_run
fake_quantize — bf16.
baracuda_kernels_fake_quantize_f16_can_implement
Implementability check for fake_quantize_f16.
baracuda_kernels_fake_quantize_f16_run
fake_quantize — f16.
baracuda_kernels_fake_quantize_f32_can_implement
Implementability check for fake_quantize_f32.
baracuda_kernels_fake_quantize_f32_run
y = scale * (clamp(round(x/scale)+zp, qmin, qmax) - zp). f32.
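A scalar host-side sketch of the forward formula above, together with the straight-through backward rule from the fake_quantize_backward entries, may help when validating results. Both function names are hypothetical. Two details are assumptions: `round` here rounds halves away from zero (the kernel's rounding mode is not stated), and `in_range_mask` is taken to mean that the unclamped quantized value lies inside `[qmin, qmax]`.

```rust
/// y = scale * (clamp(round(x / scale) + zp, qmin, qmax) - zp).
/// Assumption: round-half-away-from-zero; the kernel may round differently.
fn fake_quantize_ref(x: f32, scale: f32, zp: i32, qmin: i32, qmax: i32) -> f32 {
    let q = (x / scale).round() as i32 + zp;
    let q = q.clamp(qmin, qmax);
    scale * (q - zp) as f32
}

/// dx = dy * in_range_mask(x): the gradient passes through unchanged where the
/// quantized value was not clamped, and is zero elsewhere (assumed mask definition).
fn fake_quantize_backward_ref(x: f32, dy: f32, scale: f32, zp: i32, qmin: i32, qmax: i32) -> f32 {
    let q = (x / scale).round() as i32 + zp;
    if q >= qmin && q <= qmax { dy } else { 0.0 }
}
```

For example, with `scale = 0.5` and an s8 range, an input of 0.3 snaps to 0.5, while 100.0 saturates at 63.5 and receives zero gradient.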
baracuda_kernels_fake_quantize_f64_can_implement
Implementability check for fake_quantize_f64.
baracuda_kernels_fake_quantize_f64_run
fake_quantize — f64 (f64 scale).
baracuda_kernels_fft_1d_c32_run
1-D C2C FFT (forward + inverse via flag). Wraps cuFFT’s cufftExecC2C (c32) / cufftExecZ2Z (c64). For inverse, applies 1/n normalization in-place after exec.
baracuda_kernels_fft_1d_c32_workspace_size
1-D C2C FFT workspace size in bytes. cuFFT manages its own internal workspace; this entry always writes 0.
baracuda_kernels_fft_1d_c64_run
1-D C2C FFT (forward + inverse via flag). Wraps cuFFT’s cufftExecC2C (c32) / cufftExecZ2Z (c64). For inverse, applies 1/n normalization in-place after exec.
baracuda_kernels_fft_1d_c64_workspace_size
1-D C2C FFT workspace size in bytes. cuFFT manages its own internal workspace; this entry always writes 0.
baracuda_kernels_fft_nd_c32_run
ND C2C FFT (forward + inverse via flag).
baracuda_kernels_fft_nd_c32_workspace_size
ND C2C FFT workspace size in bytes — always 0.
baracuda_kernels_fft_nd_c64_run
ND C2C FFT (forward + inverse via flag).
baracuda_kernels_fft_nd_c64_workspace_size
ND C2C FFT workspace size in bytes — always 0.
baracuda_kernels_fftshift_4_can_implement
Implementability check for fftshift_4.
baracuda_kernels_fftshift_4_run
fftshift along the last axis of a [batch, n] tensor: y[b, i] = x[b, (i + n/2) % n]. Element-width specialization (4 bytes per element) — used for Bool / f32 / packed-Bool shifts; the same kernel re-instantiated at 8 / 16 bytes covers f64 / Complex32 and Complex64.
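The indexing rule y[b, i] = x[b, (i + n/2) % n] can be reproduced on the host for comparison. The sketch below is a CPU reference with a hypothetical name; it assumes a dense row-major `[batch, n]` buffer and is generic over the element type, mirroring the element-width specialization only loosely.

```rust
/// CPU reference for y[b, i] = x[b, (i + n/2) % n] on a dense [batch, n] buffer.
/// Assumption: row-major layout with the shifted axis last.
fn fftshift_last_axis_ref<T: Copy>(x: &[T], batch: usize, n: usize) -> Vec<T> {
    assert_eq!(x.len(), batch * n);
    let mut y = Vec::with_capacity(x.len());
    for b in 0..batch {
        for i in 0..n {
            y.push(x[b * n + (i + n / 2) % n]);
        }
    }
    y
}
```

For even `n` this swaps the two halves of each row. For odd `n`, gather-style and roll-style definitions differ by one position, so it is worth checking a small odd-length case against the kernel.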
baracuda_kernels_fftshift_8_can_implement
Implementability check for fftshift_8.
baracuda_kernels_fftshift_8_run
8-byte-element fftshift (covers f64 and Complex32).
baracuda_kernels_fftshift_16_can_implement
Implementability check for fftshift_16.
baracuda_kernels_fftshift_16_run
16-byte-element fftshift (covers Complex64).
baracuda_kernels_fftshift_nd_4_can_implement
Implementability check for fftshift_nd_4.
baracuda_kernels_fftshift_nd_4_run
N-D fftshift / ifftshift: single-pass general-permutation kernel covering up to rank-8 tensors. The caller passes a per-axis shape, per-axis shift_amt (0 for pass-through axes; n/2 for fftshift or n - n/2 for ifftshift on shifted axes), and per-axis contiguous stride (in elements). The same kernel covers both directions; the direction lives entirely in the shift_amt array.
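Preparing the per-axis arguments for the N-D entry point might look like the sketch below. Both helper names are hypothetical, and the row-major stride convention (last axis fastest) is an assumption; the summary only says the stride is contiguous and measured in elements.

```rust
/// Per-axis shift amounts: 0 for pass-through axes, n/2 for fftshift,
/// n - n/2 for ifftshift.
fn shift_amounts(shape: &[usize], shifted: &[bool], inverse: bool) -> Vec<usize> {
    assert_eq!(shape.len(), shifted.len());
    assert!(shape.len() <= 8, "the kernel is documented for up to rank 8");
    shape
        .iter()
        .zip(shifted)
        .map(|(&n, &s)| {
            if !s {
                0
            } else if inverse {
                n - n / 2
            } else {
                n / 2
            }
        })
        .collect()
}

/// Element strides of a densely packed tensor.
/// Assumption: row-major order, so the last axis has stride 1.
fn contiguous_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1usize; shape.len()];
    for d in (0..shape.len().saturating_sub(1)).rev() {
        strides[d] = strides[d + 1] * shape[d + 1];
    }
    strides
}
```

The two directions only differ on odd-length axes: for `n = 5` the forward shift is 2 and the inverse shift is 3.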
baracuda_kernels_fftshift_nd_8_can_implement
Implementability check for fftshift_nd_8.
baracuda_kernels_fftshift_nd_8_run
8-byte-cell N-D fftshift (covers f64 and Complex32).
baracuda_kernels_fftshift_nd_16_can_implement
Implementability check for fftshift_nd_16.
baracuda_kernels_fftshift_nd_16_run
16-byte-cell N-D fftshift (covers Complex64).
baracuda_kernels_fill_bf16_can_implement
Implementability check for fill_bf16. Host-side only.
baracuda_kernels_fill_bf16_run
Fill y with value, bf16 dtype. value_bits is the raw 16-bit pattern of a bf16 value.
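Callers that start from an `f32` need to produce the raw 16-bit pattern themselves. The helper below is one way to do that; it is a sketch with a hypothetical name, it uses round-to-nearest-even (a common choice, not something this page specifies), and returning `u16` is an assumption about how `value_bits` is passed.

```rust
/// Convert an f32 to the raw bit pattern of the nearest bf16 value.
/// bf16 is the top 16 bits of an f32, so the work is rounding away the low 16 bits.
fn f32_to_bf16_bits(v: f32) -> u16 {
    let bits = v.to_bits();
    if v.is_nan() {
        // Force a mantissa bit so truncation cannot turn a NaN into infinity.
        return ((bits >> 16) as u16) | 0x0040;
    }
    // Round to nearest, ties to even, on the 16 discarded bits.
    let rounding_bias = 0x7FFF + ((bits >> 16) & 1);
    ((bits + rounding_bias) >> 16) as u16
}
```

For instance, 1.0 maps to `0x3F80` and -2.0 maps to `0xC000`.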
baracuda_kernels_fill_bf16_strided_can_implement
Implementability check for fill_bf16_strided.
baracuda_kernels_fill_bf16_strided_run
Strided fill, bf16. value_bits is the raw 16-bit pattern of a bf16 value.
baracuda_kernels_fill_f16_can_implement
Implementability check for fill_f16. Host-side only.
baracuda_kernels_fill_f16_run
Fill y with value, f16 dtype. value_bits is the raw 16-bit pattern of an f16 value (transport convention shared with the Pad-constant family).
baracuda_kernels_fill_f16_strided_can_implement
Implementability check for fill_f16_strided.
baracuda_kernels_fill_f16_strided_run
Strided fill, f16. value_bits is the raw 16-bit pattern of an f16 value.
baracuda_kernels_fill_f32_can_implement
Implementability check for fill_f32. Host-side only.
baracuda_kernels_fill_f32_run
Fill y with value, f32 dtype. This is the fill trailblazer — every fill_<dt>_run (and _strided_run) variant follows the same write-only contract.
baracuda_kernels_fill_f32_strided_can_implement
Implementability check for fill_f32_strided.
baracuda_kernels_fill_f32_strided_run
Strided fill, f32.
baracuda_kernels_fill_f64_can_implement
Implementability check for fill_f64. Host-side only.
baracuda_kernels_fill_f64_run
Fill y with value, f64 dtype.
baracuda_kernels_fill_f64_strided_can_implement
Implementability check for fill_f64_strided.
baracuda_kernels_fill_f64_strided_run
Strided fill, f64.
baracuda_kernels_fill_fp8e4m3_can_implement
Implementability check for fill_fp8e4m3.
baracuda_kernels_fill_fp8e4m3_run
Fill y with value, FP8 E4M3 dtype. value is the raw 8-bit E4M3 encoding (storage is byte-identical to u8); callers compute the encoding via the cast family or __nv_cvt_float_to_fp8.
baracuda_kernels_fill_fp8e4m3_strided_can_implement
Implementability check for fill_fp8e4m3_strided.
baracuda_kernels_fill_fp8e4m3_strided_run
Strided fill, FP8 E4M3.
baracuda_kernels_fill_i8_can_implement
Implementability check for fill_i8. Host-side only.
baracuda_kernels_fill_i8_run
Fill y with value, i8 dtype.
baracuda_kernels_fill_i8_strided_can_implement
Implementability check for fill_i8_strided.
baracuda_kernels_fill_i8_strided_run
Strided fill, i8.
baracuda_kernels_fill_i16_can_implement
Implementability check for fill_i16.
baracuda_kernels_fill_i16_run
Fill y with value, i16 dtype.
baracuda_kernels_fill_i16_strided_can_implement
Implementability check for fill_i16_strided.
baracuda_kernels_fill_i16_strided_run
Strided fill, i16.
baracuda_kernels_fill_i32_can_implement
Implementability check for fill_i32. Host-side only.
baracuda_kernels_fill_i32_run
Fill y with value, i32 dtype.
baracuda_kernels_fill_i32_strided_can_implement
Implementability check for fill_i32_strided.
baracuda_kernels_fill_i32_strided_run
Strided fill, i32.
baracuda_kernels_fill_i64_can_implement
Implementability check for fill_i64. Host-side only.
baracuda_kernels_fill_i64_run
Fill y with value, i64 dtype.
baracuda_kernels_fill_i64_strided_can_implement
Implementability check for fill_i64_strided.
baracuda_kernels_fill_i64_strided_run
Strided fill, i64.
baracuda_kernels_fill_u8_can_implement
Implementability check for fill_u8. Host-side only.
baracuda_kernels_fill_u8_run
Fill y with value, u8 dtype.
baracuda_kernels_fill_u8_strided_can_implement
Implementability check for fill_u8_strided.
baracuda_kernels_fill_u8_strided_run
Strided fill, u8.
baracuda_kernels_fill_u32_can_implement
Implementability check for fill_u32.
baracuda_kernels_fill_u32_run
Fill y with value, u32 dtype.
baracuda_kernels_fill_u32_strided_can_implement
Implementability check for fill_u32_strided.
baracuda_kernels_fill_u32_strided_run
Strided fill, u32.
baracuda_kernels_flash_decoding_bf16_can_implement
Implementability check for flash_decoding_bf16. Host-side only.
baracuda_kernels_flash_decoding_bf16_run
FlashDecoding FW, bf16 (f32 accumulators).
baracuda_kernels_flash_decoding_bf16_workspace_bytes
Workspace requirement for flash_decoding_bf16 in bytes.
baracuda_kernels_flash_decoding_f16_can_implement
Implementability check for flash_decoding_f16. Host-side only.
baracuda_kernels_flash_decoding_f16_run
FlashDecoding FW, f16 (f32 accumulators). seq_q = 1; split-K over chunks of 256 K-rows each, combined via a second kernel.
baracuda_kernels_flash_decoding_f16_workspace_bytes
Workspace requirement for flash_decoding_f16 in bytes.
baracuda_kernels_flash_sdpa_backward_bf16_can_implement
Implementability check for flash_sdpa_backward_bf16. Host-side only.
baracuda_kernels_flash_sdpa_backward_bf16_run
Flash SDPA BW, bf16.
baracuda_kernels_flash_sdpa_backward_f16_can_implement
Implementability check for flash_sdpa_backward_f16. Host-side only.
baracuda_kernels_flash_sdpa_backward_f16_run
Flash SDPA BW, f16.
baracuda_kernels_flash_sdpa_backward_f32_can_implement
Implementability check for flash_sdpa_backward_f32. Host-side only.
baracuda_kernels_flash_sdpa_backward_f32_run
Flash SDPA BW, f32. Given the FW-saved y, lse, plus upstream dy, computes dQ, dK, dV. The d_ws argument is a caller-allocated [B, H, Q] scratch buffer (overwritten with the per-row D = rowsum(y ⊙ dy) intermediate; element type matches T).
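The per-row intermediate D = rowsum(y ⊙ dy) that lands in `d_ws` is simple enough to recompute on the host when debugging. The sketch below uses a hypothetical function name and assumes that the `[B, H, Q]` dimensions are flattened into `rows`, with each row holding `d_v` contiguous elements.

```rust
/// D[r] = sum over j of y[r, j] * dy[r, j], one value per query row.
/// Assumption: y and dy are dense row-major buffers of shape [rows, d_v].
fn rowsum_y_dy(y: &[f32], dy: &[f32], rows: usize, d_v: usize) -> Vec<f32> {
    assert_eq!(y.len(), rows * d_v);
    assert_eq!(dy.len(), rows * d_v);
    (0..rows)
        .map(|r| {
            let lo = r * d_v;
            let hi = lo + d_v;
            y[lo..hi].iter().zip(&dy[lo..hi]).map(|(a, b)| a * b).sum()
        })
        .collect()
}
```

The result has one element per row, which matches the `[B, H, Q]` scratch shape once the leading dimensions are flattened.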
baracuda_kernels_flash_sdpa_backward_f64_can_implement
Implementability check for flash_sdpa_backward_f64. Host-side only.
baracuda_kernels_flash_sdpa_backward_f64_run
Flash SDPA BW, f64.
baracuda_kernels_flash_sdpa_bf16_can_implement
Implementability check for flash_sdpa_bf16. Host-side only.
baracuda_kernels_flash_sdpa_bf16_run
Flash SDPA FW, bf16 (f32 accumulators).
baracuda_kernels_flash_sdpa_f16_can_implement
Implementability check for flash_sdpa_f16. Host-side only.
baracuda_kernels_flash_sdpa_f16_run
Flash SDPA FW, f16 (f32 accumulators).
baracuda_kernels_flash_sdpa_f32_can_implement
Implementability check for flash_sdpa_f32. Host-side only.
baracuda_kernels_flash_sdpa_f32_run
Flash SDPA FW, f32. Computes y = softmax(Q·K^T·scale) · V via tiled fused online softmax. Optional upper-triangular causal mask (is_causal = 1); explicit additive mask is not supported in the trailblazer. Writes y: [B, H, Q, D_v] and the saved lse: [B, H, Q] log-sum-exp tensor that BW consumes.
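The online-softmax idea behind y = softmax(Q·K^T·scale) · V can be illustrated for a single query row on the host. This is a sketch, not the kernel: the function name is hypothetical, keys are visited one at a time instead of in tiles, no causal mask is applied, and row-major `[n_keys, d]` and `[n_keys, d_v]` layouts for K and V are assumed.

```rust
/// One query row of attention with a running max `m`, running sum `l`,
/// and an unnormalized accumulator, returning (y, lse).
fn attention_row_online(
    q: &[f32],
    k: &[f32],
    v: &[f32],
    d: usize,
    d_v: usize,
    scale: f32,
) -> (Vec<f32>, f32) {
    assert_eq!(q.len(), d);
    let n_keys = k.len() / d;
    assert_eq!(v.len(), n_keys * d_v);
    let mut m = f32::NEG_INFINITY; // running maximum score
    let mut l = 0.0f32; // running sum of exp(score - m)
    let mut acc = vec![0.0f32; d_v];
    for j in 0..n_keys {
        let key = &k[j * d..(j + 1) * d];
        let s = q.iter().zip(key).map(|(a, b)| a * b).sum::<f32>() * scale;
        let m_new = m.max(s);
        // Rescale earlier contributions to the new maximum (0 on the first key).
        let correction = (m - m_new).exp();
        let p = (s - m_new).exp();
        l = l * correction + p;
        for (a, vv) in acc.iter_mut().zip(&v[j * d_v..(j + 1) * d_v]) {
            *a = *a * correction + p * vv;
        }
        m = m_new;
    }
    let y = acc.iter().map(|a| a / l).collect();
    (y, m + l.ln())
}
```

The returned `lse` is the log-sum-exp of the scaled scores for that row, which is the quantity the backward pass needs to rebuild the softmax probabilities without storing them.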
baracuda_kernels_flash_sdpa_f64_can_implement
Implementability check for flash_sdpa_f64. Host-side only.
baracuda_kernels_flash_sdpa_f64_run
Flash SDPA FW, f64.
baracuda_kernels_flip_bf16_can_implement
Pre-launch implementability check for flip_bf16.
baracuda_kernels_flip_bf16_run
Flip, bf16. Pure element copy — no math.
baracuda_kernels_flip_bf16_strided_can_implement
flip_bf16_strided_can_implement companion.
baracuda_kernels_flip_bf16_strided_run
Flip strided sibling, bf16.
baracuda_kernels_flip_f16_can_implement
Pre-launch implementability check for flip_f16.
baracuda_kernels_flip_f16_run
Flip, f16. Pure element copy — no math.
baracuda_kernels_flip_f16_strided_can_implement
flip_f16_strided_can_implement companion.
baracuda_kernels_flip_f16_strided_run
Flip strided sibling, f16.
baracuda_kernels_flip_f32_can_implement
Pre-launch implementability check for flip_f32.
baracuda_kernels_flip_f32_run
Flip (reverse along selected axes), f32. flip_axes[d] is 1 = reverse axis d, 0 = no-op.
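The `flip_axes` convention (1 reverses an axis, 0 leaves it alone) can be expressed as a host-side reference. The sketch below uses a hypothetical name and assumes a dense row-major tensor; using `u8` for the flags is also an assumption.

```rust
/// CPU reference: y[coords] = x[coords with flipped axes mirrored].
/// Assumption: dense row-major layout, last axis fastest.
fn flip_ref<T: Copy>(x: &[T], shape: &[usize], flip_axes: &[u8]) -> Vec<T> {
    assert_eq!(shape.len(), flip_axes.len());
    let numel: usize = shape.iter().product();
    assert_eq!(x.len(), numel);
    let mut y = Vec::with_capacity(numel);
    for out_lin in 0..numel {
        let mut rem = out_lin;
        let mut src = 0usize;
        let mut stride = 1usize;
        // Peel off coordinates from the innermost axis outward.
        for d in (0..shape.len()).rev() {
            let c = rem % shape[d];
            rem /= shape[d];
            let sc = if flip_axes[d] != 0 { shape[d] - 1 - c } else { c };
            src += sc * stride;
            stride *= shape[d];
        }
        y.push(x[src]);
    }
    y
}
```

Because the operation is a pure element copy, the same reference works for any `Copy` element type.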
baracuda_kernels_flip_f32_strided_can_implement
flip_f32_strided_can_implement companion.
baracuda_kernels_flip_f32_strided_run
Flip strided sibling, f32.
baracuda_kernels_flip_f64_can_implement
Pre-launch implementability check for flip_f64.
baracuda_kernels_flip_f64_run
Flip, f64. Pure element copy — no math.
baracuda_kernels_flip_f64_strided_can_implement
flip_f64_strided_can_implement companion.
baracuda_kernels_flip_f64_strided_run
Flip strided sibling, f64.
baracuda_kernels_fractional_max_pool_2d_bw_bf16_can_implement
Implementability check for fractional_max_pool_2d_bw_bf16.
baracuda_kernels_fractional_max_pool_2d_bw_bf16_run
FractionalMaxPool2d BW, bf16.
baracuda_kernels_fractional_max_pool_2d_bw_f16_can_implement
Implementability check for fractional_max_pool_2d_bw_f16.
baracuda_kernels_fractional_max_pool_2d_bw_f16_run
FractionalMaxPool2d BW, f16.
baracuda_kernels_fractional_max_pool_2d_bw_f32_can_implement
Implementability check for fractional_max_pool_2d_bw_f32.
baracuda_kernels_fractional_max_pool_2d_bw_f32_run
FractionalMaxPool2d BW, f32.
baracuda_kernels_fractional_max_pool_2d_bw_f64_can_implement
Implementability check for fractional_max_pool_2d_bw_f64.
baracuda_kernels_fractional_max_pool_2d_bw_f64_run
FractionalMaxPool2d BW, f64.
baracuda_kernels_fractional_max_pool_2d_fw_bf16_can_implement
Implementability check for fractional_max_pool_2d_fw_bf16.
baracuda_kernels_fractional_max_pool_2d_fw_bf16_run
FractionalMaxPool2d FW, bf16.
baracuda_kernels_fractional_max_pool_2d_fw_f16_can_implement
Implementability check for fractional_max_pool_2d_fw_f16.
baracuda_kernels_fractional_max_pool_2d_fw_f16_run
FractionalMaxPool2d FW, f16.
baracuda_kernels_fractional_max_pool_2d_fw_f32_can_implement
Implementability check for fractional_max_pool_2d_fw_f32.
baracuda_kernels_fractional_max_pool_2d_fw_f32_run
FractionalMaxPool2d FW, f32.
baracuda_kernels_fractional_max_pool_2d_fw_f64_can_implement
Implementability check for fractional_max_pool_2d_fw_f64.
baracuda_kernels_fractional_max_pool_2d_fw_f64_run
FractionalMaxPool2d FW, f64.
baracuda_kernels_fractional_max_pool_3d_bw_bf16_can_implement
Implementability check for fractional_max_pool_3d_bw_bf16.
baracuda_kernels_fractional_max_pool_3d_bw_bf16_run
FractionalMaxPool3d BW, bf16.
baracuda_kernels_fractional_max_pool_3d_bw_f16_can_implement
Implementability check for fractional_max_pool_3d_bw_f16.
baracuda_kernels_fractional_max_pool_3d_bw_f16_run
FractionalMaxPool3d BW, f16.
baracuda_kernels_fractional_max_pool_3d_bw_f32_can_implement
Implementability check for fractional_max_pool_3d_bw_f32.
baracuda_kernels_fractional_max_pool_3d_bw_f32_run
FractionalMaxPool3d BW, f32.
baracuda_kernels_fractional_max_pool_3d_bw_f64_can_implement
Implementability check for fractional_max_pool_3d_bw_f64.
baracuda_kernels_fractional_max_pool_3d_bw_f64_run
FractionalMaxPool3d BW, f64.
baracuda_kernels_fractional_max_pool_3d_fw_bf16_can_implement
Implementability check for fractional_max_pool_3d_fw_bf16.
baracuda_kernels_fractional_max_pool_3d_fw_bf16_run
FractionalMaxPool3d FW, bf16.
baracuda_kernels_fractional_max_pool_3d_fw_f16_can_implement
Implementability check for fractional_max_pool_3d_fw_f16.
baracuda_kernels_fractional_max_pool_3d_fw_f16_run
FractionalMaxPool3d FW, f16.
baracuda_kernels_fractional_max_pool_3d_fw_f32_can_implement
Implementability check for fractional_max_pool_3d_fw_f32.
baracuda_kernels_fractional_max_pool_3d_fw_f32_run
FractionalMaxPool3d FW, f32.
baracuda_kernels_fractional_max_pool_3d_fw_f64_can_implement
Implementability check for fractional_max_pool_3d_fw_f64.
baracuda_kernels_fractional_max_pool_3d_fw_f64_run
FractionalMaxPool3d FW, f64.
baracuda_kernels_gated_geglu_backward_bf16_can_implement
Implementability check for gated_geglu_backward_bf16.
baracuda_kernels_gated_geglu_backward_bf16_run
GeGLU backward, bf16.
baracuda_kernels_gated_geglu_backward_f16_can_implement
Implementability check for gated_geglu_backward_f16.
baracuda_kernels_gated_geglu_backward_f16_run
GeGLU backward, f16.
baracuda_kernels_gated_geglu_backward_f32_can_implement
Implementability check for gated_geglu_backward_f32.
baracuda_kernels_gated_geglu_backward_f32_run
GeGLU backward, f32. da = dy·gelu(b), db = dy·a·gelu'(b).
baracuda_kernels_gated_geglu_backward_f64_can_implement
Implementability check for gated_geglu_backward_f64.
baracuda_kernels_gated_geglu_backward_f64_run
GeGLU backward, f64.
baracuda_kernels_gated_geglu_bf16_can_implement
Implementability check for gated_geglu_bf16.
baracuda_kernels_gated_geglu_bf16_run
GeGLU forward, bf16.
baracuda_kernels_gated_geglu_f16_can_implement
Implementability check for gated_geglu_f16.
baracuda_kernels_gated_geglu_f16_run
GeGLU forward, f16.
baracuda_kernels_gated_geglu_f32_can_implement
Implementability check for gated_geglu_f32.
baracuda_kernels_gated_geglu_f32_run
GeGLU forward, f32. y = a · gelu(b), exact erf-based.
baracuda_kernels_gated_geglu_f64_can_implement
Implementability check for gated_geglu_f64.
baracuda_kernels_gated_geglu_f64_run
GeGLU forward, f64.
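For reference, the exact erf-based GELU named in the GeGLU entries above, and the derivative that appears as gelu'(b) in the backward formula, can be written out as follows. This is the standard textbook definition, not something taken from the kernel source; Φ and φ denote the standard normal CDF and PDF.

```latex
\begin{aligned}
\operatorname{gelu}(b)  &= b\,\Phi(b) = \tfrac{b}{2}\left(1 + \operatorname{erf}\!\left(\tfrac{b}{\sqrt{2}}\right)\right),\\
\operatorname{gelu}'(b) &= \Phi(b) + b\,\varphi(b), \qquad \varphi(b) = \tfrac{1}{\sqrt{2\pi}}\,e^{-b^{2}/2},\\
y &= a\cdot\operatorname{gelu}(b), \qquad da = dy\cdot\operatorname{gelu}(b), \qquad db = dy\cdot a\cdot\operatorname{gelu}'(b).
\end{aligned}
```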
baracuda_kernels_gated_glu_backward_bf16_can_implement
Implementability check for gated_glu_backward_bf16.
baracuda_kernels_gated_glu_backward_bf16_run
GLU backward, bf16.
baracuda_kernels_gated_glu_backward_f16_can_implement
Implementability check for gated_glu_backward_f16.
baracuda_kernels_gated_glu_backward_f16_run
GLU backward, f16.
baracuda_kernels_gated_glu_backward_f32_can_implement
Implementability check for gated_glu_backward_f32.
baracuda_kernels_gated_glu_backward_f32_run
GLU backward, f32. da = dy·sigmoid(b), db = dy·a·sigmoid(b)·(1-sigmoid(b)).
baracuda_kernels_gated_glu_backward_f64_can_implement
Implementability check for gated_glu_backward_f64.
baracuda_kernels_gated_glu_backward_f64_run
GLU backward, f64.
baracuda_kernels_gated_glu_bf16_can_implement
Implementability check for gated_glu_bf16.
baracuda_kernels_gated_glu_bf16_run
GLU forward, bf16.
baracuda_kernels_gated_glu_f16_can_implement
Implementability check for gated_glu_f16.
baracuda_kernels_gated_glu_f16_run
GLU forward, f16.
baracuda_kernels_gated_glu_f32_can_implement
Implementability check for gated_glu_f32.
baracuda_kernels_gated_glu_f32_run
GLU forward, f32. y = a · sigmoid(b).
baracuda_kernels_gated_glu_f64_can_implement
Implementability check for gated_glu_f64.
baracuda_kernels_gated_glu_f64_run
GLU forward, f64.
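The GLU formulas quoted above (y = a · sigmoid(b) and its gradients) can be checked against a host-side reference. The sketch below is a plain CPU implementation for testing purposes only; the function names `sigmoid`, `glu_forward` and `glu_backward` are hypothetical and do not exist in this crate.

```rust
/// Logistic sigmoid.
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Host reference for GLU forward: y[i] = a[i] * sigmoid(b[i]).
fn glu_forward(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(&a, &b)| a * sigmoid(b)).collect()
}

/// Host reference for GLU backward, returning (da, db):
/// da = dy * sigmoid(b), db = dy * a * sigmoid(b) * (1 - sigmoid(b)).
fn glu_backward(dy: &[f32], a: &[f32], b: &[f32]) -> (Vec<f32>, Vec<f32>) {
    assert!(dy.len() == a.len() && a.len() == b.len());
    let mut da = Vec::with_capacity(dy.len());
    let mut db = Vec::with_capacity(dy.len());
    for i in 0..dy.len() {
        let s = sigmoid(b[i]);
        da.push(dy[i] * s);
        db.push(dy[i] * a[i] * s * (1.0 - s));
    }
    (da, db)
}
```

Comparing a device result against such a reference with a dtype-appropriate tolerance is one way to sanity-check a raw `*_run` call.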
baracuda_kernels_gated_reglu_backward_bf16_can_implement
Implementability check for gated_reglu_backward_bf16.
baracuda_kernels_gated_reglu_backward_bf16_run
ReGLU backward, bf16.
baracuda_kernels_gated_reglu_backward_f16_can_implement
Implementability check for gated_reglu_backward_f16.
baracuda_kernels_gated_reglu_backward_f16_run
ReGLU backward, f16.
baracuda_kernels_gated_reglu_backward_f32_can_implement
Implementability check for gated_reglu_backward_f32.
baracuda_kernels_gated_reglu_backward_f32_run
ReGLU backward, f32. da = (b>0)?dy·b:0, db = (b>0)?dy·a:0.
baracuda_kernels_gated_reglu_backward_f64_can_implement
Implementability check for gated_reglu_backward_f64.
baracuda_kernels_gated_reglu_backward_f64_run
ReGLU backward, f64.
baracuda_kernels_gated_reglu_bf16_can_implement
Implementability check for gated_reglu_bf16.
baracuda_kernels_gated_reglu_bf16_run
ReGLU forward, bf16.
baracuda_kernels_gated_reglu_f16_can_implement
Implementability check for gated_reglu_f16.
baracuda_kernels_gated_reglu_f16_run
ReGLU forward, f16.
baracuda_kernels_gated_reglu_f32_can_implement
Implementability check for gated_reglu_f32.
baracuda_kernels_gated_reglu_f32_run
ReGLU forward, f32. y = a · relu(b) = a · max(b, 0).
baracuda_kernels_gated_reglu_f64_can_implement
Implementability check for gated_reglu_f64.
baracuda_kernels_gated_reglu_f64_run
ReGLU forward, f64.
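The ReGLU entries above give y = a · max(b, 0) with gradients that vanish wherever b ≤ 0. A minimal host-side sketch of the same arithmetic follows; `reglu_forward` and `reglu_backward` are hypothetical helper names, not part of this crate.

```rust
/// Host reference for ReGLU forward: y[i] = a[i] * max(b[i], 0).
fn reglu_forward(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(&a, &b)| a * b.max(0.0)).collect()
}

/// Host reference for ReGLU backward, returning (da, db):
/// da = (b > 0) ? dy * b : 0, db = (b > 0) ? dy * a : 0.
fn reglu_backward(dy: &[f32], a: &[f32], b: &[f32]) -> (Vec<f32>, Vec<f32>) {
    assert!(dy.len() == a.len() && a.len() == b.len());
    let mut da = vec![0.0; dy.len()];
    let mut db = vec![0.0; dy.len()];
    for i in 0..dy.len() {
        if b[i] > 0.0 {
            da[i] = dy[i] * b[i];
            db[i] = dy[i] * a[i];
        }
    }
    (da, db)
}
```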
baracuda_kernels_gated_swiglu_backward_bf16_can_implement
Implementability check for gated_swiglu_backward_bf16.
baracuda_kernels_gated_swiglu_backward_bf16_run
SwiGLU backward, bf16.
baracuda_kernels_gated_swiglu_backward_f16_can_implement
Implementability check for gated_swiglu_backward_f16.
baracuda_kernels_gated_swiglu_backward_f16_run
SwiGLU backward, f16.
baracuda_kernels_gated_swiglu_backward_f32_can_implement
Implementability check for gated_swiglu_backward_f32.
baracuda_kernels_gated_swiglu_backward_f32_run
SwiGLU backward, f32. da = dy·silu(b), db = dy·a·silu'(b).
baracuda_kernels_gated_swiglu_backward_f64_can_implement
Implementability check for gated_swiglu_backward_f64.
baracuda_kernels_gated_swiglu_backward_f64_run
SwiGLU backward, f64.
baracuda_kernels_gated_swiglu_bf16_can_implement
Implementability check for gated_swiglu_bf16.
baracuda_kernels_gated_swiglu_bf16_run
SwiGLU forward, bf16.
baracuda_kernels_gated_swiglu_f16_can_implement
Implementability check for gated_swiglu_f16.
baracuda_kernels_gated_swiglu_f16_run
SwiGLU forward, f16.
baracuda_kernels_gated_swiglu_f32_can_implement
Implementability check for gated_swiglu_f32.
baracuda_kernels_gated_swiglu_f32_run
SwiGLU forward, f32. y = a · b · sigmoid(b).
baracuda_kernels_gated_swiglu_f64_can_implement
Implementability check for gated_swiglu_f64.
baracuda_kernels_gated_swiglu_f64_run
SwiGLU forward, f64.
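In the SwiGLU entries above, silu(b) = b · sigmoid(b), so silu'(b) = sigmoid(b) · (1 + b · (1 − sigmoid(b))). The following host-side sketch spells that out; the helper names are hypothetical and the code is a CPU reference for checking results, not the kernel implementation.

```rust
/// Logistic sigmoid.
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Host reference for SwiGLU forward: y[i] = a[i] * b[i] * sigmoid(b[i]).
fn swiglu_forward(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(&a, &b)| a * b * sigmoid(b)).collect()
}

/// Host reference for SwiGLU backward, returning (da, db):
/// da = dy * silu(b), db = dy * a * silu'(b),
/// with silu'(b) = s * (1 + b * (1 - s)) and s = sigmoid(b).
fn swiglu_backward(dy: &[f32], a: &[f32], b: &[f32]) -> (Vec<f32>, Vec<f32>) {
    assert!(dy.len() == a.len() && a.len() == b.len());
    let mut da = Vec::with_capacity(dy.len());
    let mut db = Vec::with_capacity(dy.len());
    for i in 0..dy.len() {
        let s = sigmoid(b[i]);
        da.push(dy[i] * b[i] * s);
        db.push(dy[i] * a[i] * s * (1.0 + b[i] * (1.0 - s)));
    }
    (da, db)
}
```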
baracuda_kernels_gather_backward_f32_can_implement
Implementability check for gather_backward_f32.
baracuda_kernels_gather_backward_f32_run
dsrc[..., index[..., j, ...], ...] += dout[..., j, ...] along gather_dim. f32 (atomicAdd).
baracuda_kernels_gather_backward_f64_can_implement
Implementability check for gather_backward_f64.
baracuda_kernels_gather_backward_f64_run
gather_backward — f64 (atomicAdd).
baracuda_kernels_gather_backward_i64idx_f32_can_implement
Implementability check for gather_backward_i64idx_f32.
baracuda_kernels_gather_backward_i64idx_f32_run
gather BW — f32, i64 indices (atomicAdd).
baracuda_kernels_gather_backward_i64idx_f64_can_implement
Implementability check for gather_backward_i64idx_f64.
baracuda_kernels_gather_backward_i64idx_f64_run
gather BW — f64, i64 indices (atomicAdd).
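The gather backward rule above is a scatter-add: every position of dout is accumulated into the dsrc slot its index selected, which is why the device kernels use atomicAdd. A sequential host sketch is shown below, assuming the tensors are viewed as contiguous [outer, dim, inner] blocks around gather_dim; that flattening and the name `gather_backward_ref` are assumptions for illustration, and the real entry point's parameter list may differ.

```rust
/// Host reference for gather backward along the middle axis.
/// `dout` and `index` have shape [outer, idx_dim, inner];
/// the returned `dsrc` has shape [outer, src_dim, inner].
fn gather_backward_ref(
    dout: &[f32],
    index: &[i32],
    outer: usize,
    src_dim: usize,
    idx_dim: usize,
    inner: usize,
) -> Vec<f32> {
    assert_eq!(dout.len(), outer * idx_dim * inner);
    assert_eq!(index.len(), dout.len());
    let mut dsrc = vec![0.0f32; outer * src_dim * inner];
    for o in 0..outer {
        for j in 0..idx_dim {
            for i in 0..inner {
                let flat = (o * idx_dim + j) * inner + i;
                let k = usize::try_from(index[flat]).expect("negative index");
                assert!(k < src_dim, "index out of range");
                // Sequential accumulation stands in for the device atomicAdd.
                dsrc[(o * src_dim + k) * inner + i] += dout[flat];
            }
        }
    }
    dsrc
}
```

Because duplicate indices accumulate, the floating-point summation order on the device is not fixed, so device results may differ from this reference in the last bits.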
baracuda_kernels_gather_f32_can_implement
Implementability check for gather_f32.
baracuda_kernels_gather_f32_run
out[..., j, ...] = src[..., index[..., j, ...], ...] along gather_dim. f32.
baracuda_kernels_gather_f64_can_implement
Implementability check for gather_f64.
baracuda_kernels_gather_f64_run
gather along gather_dim. f64.
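The forward gather rule, out[..., j, ...] = src[..., index[..., j, ...], ...], can be sketched on the host as follows. As an assumption for illustration, the tensors are flattened to contiguous [outer, dim, inner] blocks around gather_dim; `gather_ref` is a hypothetical name and does not mirror the raw entry point's signature.

```rust
/// Host reference for gather along the middle axis.
/// `src` has shape [outer, src_dim, inner]; `index` and the result
/// have shape [outer, idx_dim, inner].
fn gather_ref(
    src: &[f32],
    index: &[i32],
    outer: usize,
    src_dim: usize,
    idx_dim: usize,
    inner: usize,
) -> Vec<f32> {
    assert_eq!(src.len(), outer * src_dim * inner);
    assert_eq!(index.len(), outer * idx_dim * inner);
    let mut out = vec![0.0f32; index.len()];
    for o in 0..outer {
        for j in 0..idx_dim {
            for i in 0..inner {
                let flat = (o * idx_dim + j) * inner + i;
                let k = usize::try_from(index[flat]).expect("negative index");
                assert!(k < src_dim, "index out of range");
                out[flat] = src[(o * src_dim + k) * inner + i];
            }
        }
    }
    out
}
```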
baracuda_kernels_gather_i8_can_implement
Implementability check for gather_i8.
baracuda_kernels_gather_i8_run
gather along gather_dim. i8.
baracuda_kernels_gather_i16_can_implement
Implementability check for gather_i16.
baracuda_kernels_gather_i16_run
gather along gather_dim. i16.
baracuda_kernels_gather_i32_can_implement
Implementability check for gather_i32.
baracuda_kernels_gather_i32_run
gather along gather_dim. i32.
baracuda_kernels_gather_i64_can_implement
Implementability check for gather_i64.
baracuda_kernels_gather_i64_run
gather along gather_dim. i64.
baracuda_kernels_gather_i64idx_f32_can_implement
Implementability check for gather_i64idx_f32.
baracuda_kernels_gather_i64idx_f32_run
gather FW — f32, i64 indices.
baracuda_kernels_gather_i64idx_f64_can_implement
Implementability check for gather_i64idx_f64.
baracuda_kernels_gather_i64idx_f64_run
gather FW — f64, i64 indices.
baracuda_kernels_gather_i64idx_i8_can_implement
Implementability check for gather_i64idx_i8.
baracuda_kernels_gather_i64idx_i8_run
gather FW: i8 values, i64 indices.
baracuda_kernels_gather_i64idx_i16_can_implement
Implementability check for gather_i64idx_i16.
baracuda_kernels_gather_i64idx_i16_run
gather FW: i16 values, i64 indices.
baracuda_kernels_gather_i64idx_i32_can_implement
Implementability check for gather_i64idx_i32.
baracuda_kernels_gather_i64idx_i32_run
gather FW — i32 values, i64 indices.
baracuda_kernels_gather_i64idx_i64_can_implement
Implementability check for gather_i64idx_i64.
baracuda_kernels_gather_i64idx_i64_run
gather FW: i64 values, i64 indices.
baracuda_kernels_gather_i64idx_u8_can_implement
Implementability check for gather_i64idx_u8.
baracuda_kernels_gather_i64idx_u8_run
gather FW: u8 values, i64 indices.
baracuda_kernels_gather_i64idx_u16_can_implement
Implementability check for gather_i64idx_u16.
baracuda_kernels_gather_i64idx_u16_run
gather FW: u16 values, i64 indices.
baracuda_kernels_gather_i64idx_u32_can_implement
Implementability check for gather_i64idx_u32.
baracuda_kernels_gather_i64idx_u32_run
gather FW: u32 values, i64 indices.
baracuda_kernels_gather_u8_can_implement
Implementability check for gather_u8.
baracuda_kernels_gather_u8_run
gather along gather_dim. u8.
baracuda_kernels_gather_u8idx_f32_can_implement
Implementability check for gather_u8idx_f32.
baracuda_kernels_gather_u8idx_f32_run
gather FW — f32, u8 idx.
baracuda_kernels_gather_u8idx_f64_can_implement
Implementability check for gather_u8idx_f64.
baracuda_kernels_gather_u8idx_f64_run
gather FW — f64, u8 idx.
baracuda_kernels_gather_u16_can_implement
Implementability check for gather_u16.
baracuda_kernels_gather_u16_run
gather along gather_dim. u16.
baracuda_kernels_gather_u32_can_implement
Implementability check for gather_u32.
baracuda_kernels_gather_u32_run
gather along gather_dim. u32.
baracuda_kernels_gemm_batched_bf16_rcr_sm80_can_implement
Pre-launch implementability check (batched).
baracuda_kernels_gemm_batched_bf16_rcr_sm80_run
Launch strided-batched Cutlass GEMM. Batch i operates on A + i * stride_a, B + i * stride_b, etc. (strides in elements, not bytes).
baracuda_kernels_gemm_batched_bf16_rcr_sm80_workspace_size
Workspace bytes required for a batch_count-deep batched launch.
baracuda_kernels_gemm_batched_f16_rcr_sm80_can_implement
Pre-launch implementability check (batched).
baracuda_kernels_gemm_batched_f16_rcr_sm80_run
Launch strided-batched Cutlass GEMM. Batch i operates on A + i * stride_a, B + i * stride_b, etc. (strides in elements, not bytes).
baracuda_kernels_gemm_batched_f16_rcr_sm80_workspace_size
Workspace bytes required for a batch_count-deep batched launch.
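The batched entries above state that strides are counted in elements, not bytes. The small sketch below shows the difference; `batch_elem_offset` and `batch_byte_offset` are hypothetical helpers, and `u16` stands in for the 2-byte storage of an f16 or bf16 element.

```rust
use std::mem::size_of;

/// Element offset of batch `i` from the base pointer: i * stride (in elements).
fn batch_elem_offset(i: usize, stride_elems: usize) -> usize {
    i * stride_elems
}

/// The same offset expressed in bytes for element type `T`.
fn batch_byte_offset<T>(i: usize, stride_elems: usize) -> usize {
    batch_elem_offset(i, stride_elems) * size_of::<T>()
}
```

For example, with a stride of 128 elements, batch 3 of a 2-byte dtype starts 384 elements, that is 768 bytes, past the base address. Passing a byte count where an element stride is expected would step too far for any element wider than one byte.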
baracuda_kernels_gemm_bf16_rcr_sm80_can_implement
Pre-launch implementability check for this Cutlass GEMM SKU.
baracuda_kernels_gemm_bf16_rcr_sm80_run
Launch this Cutlass GEMM SKU on stream.
baracuda_kernels_gemm_bf16_rcr_sm80_workspace_size
Workspace bytes required by this Cutlass GEMM SKU.
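Each `*_can_implement` and `*_run` entry point returns one of the i32 status codes listed in the crate-level docs. A caller working at this raw level might translate them into a Rust error type along these lines; `KernelError` and `check_status` are hypothetical names for illustration, not items exported by this crate.

```rust
/// Error variants mirroring the documented status codes 1 through 5.
#[derive(Debug, PartialEq, Eq)]
enum KernelError {
    MisalignedOperand, // 1
    InvalidProblem,    // 2
    NotSupported,      // 3
    WorkspaceTooSmall, // 4: too small, or null when required
    Internal,          // 5: typically a launch failure
    Unknown(i32),      // any code not listed in the docs
}

/// Map a raw status code to a Result; 0 means success.
fn check_status(code: i32) -> Result<(), KernelError> {
    match code {
        0 => Ok(()),
        1 => Err(KernelError::MisalignedOperand),
        2 => Err(KernelError::InvalidProblem),
        3 => Err(KernelError::NotSupported),
        4 => Err(KernelError::WorkspaceTooSmall),
        5 => Err(KernelError::Internal),
        other => Err(KernelError::Unknown(other)),
    }
}
```

A typical call sequence would check the `can_implement` status first, size the workspace from `workspace_size`, and only then call `run`, passing each returned code through a helper like this.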
baracuda_kernels_gemm_bf16_rrr_sm80_can_implement
Pre-launch implementability check for this Cutlass GEMM SKU.
baracuda_kernels_gemm_bf16_rrr_sm80_run
Launch this Cutlass GEMM SKU on stream.
baracuda_kernels_gemm_bf16_rrr_sm80_workspace_size
Workspace bytes required by this Cutlass GEMM SKU.
baracuda_kernels_gemm_bias_bf16_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_bf16_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_bf16_rcr_sm80_workspace_size
Workspace bytes required.
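In the bias variants, bias is an [N] vector broadcast across the rows of D, so every row of the M×N output receives the same per-column offset. The sketch below shows only that broadcast on a row-major host buffer; how the kernels combine it with alpha, beta and any activation is not specified here, and `add_bias_rows` is a hypothetical helper.

```rust
/// Add bias[n] to every row of a row-major M x N matrix `d`.
fn add_bias_rows(d: &mut [f32], m: usize, n: usize, bias: &[f32]) {
    assert_eq!(d.len(), m * n);
    assert_eq!(bias.len(), n);
    for row in d.chunks_exact_mut(n) {
        for (x, &b) in row.iter_mut().zip(bias) {
            *x += b;
        }
    }
}
```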
baracuda_kernels_gemm_bias_bf16_rrr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_bf16_rrr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_bf16_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_f16_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_f16_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_f16_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_f16_rrr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_f16_rrr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_f16_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_f32_simt_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_f32_simt_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_f32_simt_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_f32_simt_rrr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_f32_simt_rrr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_f32_simt_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_f32bias_s8_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_f32bias_s8_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_f32bias_s8_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_f32bias_u8_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_f32bias_u8_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_f32bias_u8_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_f64_rcr_sm80_can_implement
Pre-launch implementability check (DGEMM bias variant).
baracuda_kernels_gemm_bias_f64_rcr_sm80_run
Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
baracuda_kernels_gemm_bias_f64_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_f64_rrr_sm80_can_implement
Pre-launch implementability check (DGEMM bias variant).
baracuda_kernels_gemm_bias_f64_rrr_sm80_run
Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
baracuda_kernels_gemm_bias_f64_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_bf16_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_bf16_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_bf16_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_bf16_rrr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_bf16_rrr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_bf16_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_f16_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_f16_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_f16_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_f16_rrr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_f16_rrr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_f16_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_f32_simt_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_f32_simt_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_f32_simt_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_f32_simt_rrr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_f32_simt_rrr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_f32_simt_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_f32bias_s8_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_f32bias_s8_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_f32bias_s8_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_f32bias_u8_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_f32bias_u8_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_f32bias_u8_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_f64_rcr_sm80_can_implement
Pre-launch implementability check (DGEMM bias variant).
baracuda_kernels_gemm_bias_gelu_f64_rcr_sm80_run
Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
baracuda_kernels_gemm_bias_gelu_f64_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_f64_rrr_sm80_can_implement
Pre-launch implementability check (DGEMM bias variant).
baracuda_kernels_gemm_bias_gelu_f64_rrr_sm80_run
Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
baracuda_kernels_gemm_bias_gelu_f64_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_i32bias_s8_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_i32bias_s8_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_i32bias_s8_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_i32bias_u8_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_i32bias_u8_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_i32bias_u8_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_tf32_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_tf32_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_tf32_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_tf32_rrr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_tf32_rrr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_tf32_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_i32bias_s8_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_i32bias_s8_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_i32bias_s8_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_i32bias_u8_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_i32bias_u8_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_i32bias_u8_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_relu_bf16_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_bf16_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_bf16_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_relu_bf16_rrr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_bf16_rrr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_bf16_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_relu_f16_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_f16_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_f16_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_relu_f16_rrr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_f16_rrr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_f16_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_relu_f32_simt_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_f32_simt_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_f32_simt_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_relu_f32_simt_rrr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_f32_simt_rrr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_f32_simt_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_relu_f32bias_s8_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_f32bias_s8_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_f32bias_s8_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_relu_f32bias_u8_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_f32bias_u8_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_f32bias_u8_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_relu_f64_rcr_sm80_can_implement
Pre-launch implementability check (DGEMM bias variant).
baracuda_kernels_gemm_bias_relu_f64_rcr_sm80_run
Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
baracuda_kernels_gemm_bias_relu_f64_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_relu_f64_rrr_sm80_can_implement
Pre-launch implementability check (DGEMM bias variant).
baracuda_kernels_gemm_bias_relu_f64_rrr_sm80_run
Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
baracuda_kernels_gemm_bias_relu_f64_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_relu_i32bias_s8_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_i32bias_s8_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_i32bias_s8_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_relu_i32bias_u8_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_i32bias_u8_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_i32bias_u8_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_relu_tf32_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_tf32_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_tf32_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_relu_tf32_rrr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_tf32_rrr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_tf32_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_silu_bf16_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_bf16_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_bf16_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_silu_bf16_rrr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_bf16_rrr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_bf16_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_silu_f16_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_f16_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_f16_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_silu_f16_rrr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_f16_rrr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_f16_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_silu_f32_simt_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_f32_simt_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_f32_simt_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_silu_f32_simt_rrr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_f32_simt_rrr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_f32_simt_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_silu_f32bias_s8_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_f32bias_s8_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_f32bias_s8_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_silu_f32bias_u8_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_f32bias_u8_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_f32bias_u8_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_silu_f64_rcr_sm80_can_implement
Pre-launch implementability check (DGEMM bias variant).
baracuda_kernels_gemm_bias_silu_f64_rcr_sm80_run
Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
baracuda_kernels_gemm_bias_silu_f64_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_silu_f64_rrr_sm80_can_implement
Pre-launch implementability check (DGEMM bias variant).
baracuda_kernels_gemm_bias_silu_f64_rrr_sm80_run
Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
baracuda_kernels_gemm_bias_silu_f64_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_silu_i32bias_s8_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_i32bias_s8_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_i32bias_s8_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_silu_i32bias_u8_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_i32bias_u8_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_i32bias_u8_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_silu_tf32_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_tf32_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_tf32_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_silu_tf32_rrr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_tf32_rrr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_tf32_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_tf32_rcr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_tf32_rcr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_tf32_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_bias_tf32_rrr_sm80_can_implement
Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_tf32_rrr_sm80_run
Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_tf32_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_dense_bf16_can_implement
Host-side validity check for baracuda_kernels_gemm_dense_bf16_run. Validates extents, the layout tag, leading-dim minimums, i32-fit of leading dims, and stride_d != 0 at batch > 1. stride_a / stride_b are accepted unconditionally (any value, including 0-broadcast).
baracuda_kernels_gemm_dense_bf16_run
Dense bf16 GEMM (cuBLAS-backed): D[g] = α · A[g] · B[g] + β · D[g] for g ∈ [0, batch), accumulating in f32. Row-major problem; see the module docs for the layout tag (0 = RRR, 1 = RCR, 2 = CRR), leading-dim minimums, and the batch-stride contract (element strides; stride_a/stride_b may be 0 to broadcast; strides ignored at batch == 1).
baracuda_kernels_gemm_dense_bf16_workspace_size
Workspace query for baracuda_kernels_gemm_dense_bf16_run. Always 0 — cuBLAS allocates its workspace internally per handle.
baracuda_kernels_gemm_dense_f16_can_implement
Host-side validity check for baracuda_kernels_gemm_dense_f16_run. Validates extents, the layout tag, leading-dim minimums, i32-fit of leading dims, and stride_d != 0 at batch > 1. stride_a / stride_b are accepted unconditionally (any value, including 0-broadcast).
baracuda_kernels_gemm_dense_f16_run
Dense f16 GEMM (cuBLAS-backed): D[g] = α · A[g] · B[g] + β · D[g] for g ∈ [0, batch), accumulating in f32. Row-major problem; see the module docs for the layout tag (0 = RRR, 1 = RCR, 2 = CRR), leading-dim minimums, and the batch-stride contract (element strides; stride_a/stride_b may be 0 to broadcast; strides ignored at batch == 1).
baracuda_kernels_gemm_dense_f16_workspace_size
Workspace query for baracuda_kernels_gemm_dense_f16_run. Always 0 — cuBLAS allocates its workspace internally per handle.
baracuda_kernels_gemm_dense_f32_can_implement
Host-side validity check for baracuda_kernels_gemm_dense_f32_run. Validates extents, the layout tag, leading-dim minimums, i32-fit of leading dims, and stride_d != 0 at batch > 1. stride_a / stride_b are accepted unconditionally (any value, including 0-broadcast).
baracuda_kernels_gemm_dense_f32_run
Dense f32 GEMM (cuBLAS-backed): D[g] = α · A[g] · B[g] + β · D[g] for g ∈ [0, batch), accumulating in IEEE binary32 (default math mode — NOT TF32). Row-major problem; see the module docs for the layout tag (0 = RRR, 1 = RCR, 2 = CRR), leading-dim minimums, and the batch-stride contract (element strides; stride_a/stride_b may be 0 to broadcast; strides ignored at batch == 1).
baracuda_kernels_gemm_dense_f32_workspace_size
Workspace query for baracuda_kernels_gemm_dense_f32_run. Always 0 — cuBLAS allocates its workspace internally per handle.
baracuda_kernels_gemm_dense_f64_can_implement
Host-side validity check for baracuda_kernels_gemm_dense_f64_run. Validates extents, the layout tag, leading-dim minimums, i32-fit of leading dims, and stride_d != 0 at batch > 1. stride_a / stride_b are accepted unconditionally (any value, including 0-broadcast).
baracuda_kernels_gemm_dense_f64_run
Dense f64 GEMM (cuBLAS-backed): D[g] = α · A[g] · B[g] + β · D[g] for g ∈ [0, batch), accumulating in f64. Row-major problem; see the module docs for the layout tag (0 = RRR, 1 = RCR, 2 = CRR), leading-dim minimums, and the batch-stride contract (element strides; stride_a/stride_b may be 0 to broadcast; strides ignored at batch == 1).
baracuda_kernels_gemm_dense_f64_workspace_size
Workspace query for baracuda_kernels_gemm_dense_f64_run. Always 0 — cuBLAS allocates its workspace internally per handle.
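The batch-stride contract of the dense GEMM family can be sketched with a host-side reference. This is a minimal sketch for the RRR layout only, assuming tightly packed row-major matrices (leading dimensions k, n, n); the function name and that packing are assumptions, and the module docs remain the authority on leading-dim minimums.

```rust
/// Host-side reference for D[g] = alpha * A[g] * B[g] + beta * D[g], RRR layout.
/// Assumptions (not from the crate): lda = k, ldb = n, ldd = n, and all
/// strides are counted in elements. A stride of 0 re-reads the same operand
/// for every batch entry (broadcast).
fn gemm_batched_ref(
    m: usize, n: usize, k: usize, batch: usize,
    alpha: f32, a: &[f32], stride_a: usize,
    b: &[f32], stride_b: usize,
    beta: f32, d: &mut [f32], stride_d: usize,
) {
    for g in 0..batch {
        // With batch == 1, g is always 0, so the strides never matter.
        let (a0, b0, d0) = (g * stride_a, g * stride_b, g * stride_d);
        for i in 0..m {
            for j in 0..n {
                let mut acc = 0.0f32;
                for p in 0..k {
                    acc += a[a0 + i * k + p] * b[b0 + p * n + j];
                }
                let out = &mut d[d0 + i * n + j];
                *out = alpha * acc + beta * *out;
            }
        }
    }
}
```

The sketch also shows why stride_d must be non-zero at batch > 1: a zero output stride would make every batch entry overwrite the same D.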
baracuda_kernels_gemm_f16_rcr_sm80_can_implement
Pre-launch implementability check for this Cutlass GEMM SKU.
baracuda_kernels_gemm_f16_rcr_sm80_run
Launch this Cutlass GEMM SKU on stream.
baracuda_kernels_gemm_f16_rcr_sm80_workspace_size
Workspace bytes required by this Cutlass GEMM SKU.
baracuda_kernels_gemm_f16_rrr_sm80_can_implement
Pre-launch implementability check for this Cutlass GEMM SKU.
baracuda_kernels_gemm_f16_rrr_sm80_run
Launch this Cutlass GEMM SKU on stream.
baracuda_kernels_gemm_f16_rrr_sm80_workspace_size
Workspace bytes required by this Cutlass GEMM SKU.
baracuda_kernels_gemm_f32_simt_rcr_sm80_can_implement
Pre-launch implementability check for this Cutlass GEMM SKU.
baracuda_kernels_gemm_f32_simt_rcr_sm80_run
Launch this Cutlass GEMM SKU on stream.
baracuda_kernels_gemm_f32_simt_rcr_sm80_workspace_size
Workspace bytes required by this Cutlass GEMM SKU.
baracuda_kernels_gemm_f32_simt_rrr_sm80_can_implement
Pre-launch implementability check for this Cutlass GEMM SKU.
baracuda_kernels_gemm_f32_simt_rrr_sm80_run
Launch this Cutlass GEMM SKU on stream.
baracuda_kernels_gemm_f32_simt_rrr_sm80_workspace_size
Workspace bytes required by this Cutlass GEMM SKU.
baracuda_kernels_gemm_f64_rcr_sm80_can_implement
Pre-launch implementability check.
baracuda_kernels_gemm_f64_rcr_sm80_run
Launch DGEMM. f64 alpha/beta.
baracuda_kernels_gemm_f64_rcr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_f64_rrr_sm80_can_implement
Pre-launch implementability check.
baracuda_kernels_gemm_f64_rrr_sm80_run
Launch DGEMM. f64 alpha/beta.
baracuda_kernels_gemm_f64_rrr_sm80_workspace_size
Workspace bytes required.
baracuda_kernels_gemm_s8_rcr_sm80_can_implement
Pre-launch implementability check for this Cutlass GEMM SKU.
baracuda_kernels_gemm_s8_rcr_sm80_run
Launch this Cutlass GEMM SKU on stream.
baracuda_kernels_gemm_s8_rcr_sm80_workspace_size
Workspace bytes required by this Cutlass GEMM SKU.
baracuda_kernels_gemm_s8_rrr_sm80_bias_f32_can_implement
Pre-launch implementability check for the S8 RRR sm_80 bias (f32 variant) SKU.
baracuda_kernels_gemm_s8_rrr_sm80_bias_f32_run
S8 GEMM, RRR layout, bias epilogue (f32 variant), sm_80.
baracuda_kernels_gemm_s8_rrr_sm80_bias_gelu_f32_can_implement
Pre-launch implementability check for the S8 RRR sm_80 bias + GELU (f32 variant) SKU.
baracuda_kernels_gemm_s8_rrr_sm80_bias_gelu_f32_run
S8 GEMM, RRR layout, bias + GELU epilogue (f32 variant), sm_80.
baracuda_kernels_gemm_s8_rrr_sm80_bias_gelu_i32_can_implement
Pre-launch implementability check for the S8 RRR sm_80 bias + GELU (i32 variant) SKU.
baracuda_kernels_gemm_s8_rrr_sm80_bias_gelu_i32_run
S8 GEMM, RRR layout, bias + GELU epilogue (i32 variant), sm_80.
baracuda_kernels_gemm_s8_rrr_sm80_bias_i32_can_implement
Pre-launch implementability check for the S8 RRR sm_80 bias (i32 variant) SKU.
baracuda_kernels_gemm_s8_rrr_sm80_bias_i32_run
S8 GEMM, RRR layout, bias epilogue (i32 variant), sm_80.
baracuda_kernels_gemm_s8_rrr_sm80_bias_relu_f32_can_implement
Pre-launch implementability check for the S8 RRR sm_80 bias + ReLU (f32 variant) SKU.
baracuda_kernels_gemm_s8_rrr_sm80_bias_relu_f32_run
S8 GEMM, RRR layout, bias + ReLU epilogue (f32 variant), sm_80.
baracuda_kernels_gemm_s8_rrr_sm80_bias_relu_i32_can_implement
Pre-launch implementability check for the S8 RRR sm_80 bias + ReLU (i32 variant) SKU.
baracuda_kernels_gemm_s8_rrr_sm80_bias_relu_i32_run
S8 GEMM, RRR layout, bias + ReLU epilogue (i32 variant), sm_80.
baracuda_kernels_gemm_s8_rrr_sm80_bias_silu_f32_can_implement
Pre-launch implementability check for the S8 RRR sm_80 bias + SiLU (f32 variant) SKU.
baracuda_kernels_gemm_s8_rrr_sm80_bias_silu_f32_run
S8 GEMM, RRR layout, bias + SiLU epilogue (f32 variant), sm_80.
baracuda_kernels_gemm_s8_rrr_sm80_bias_silu_i32_can_implement
Pre-launch implementability check for the S8 RRR sm_80 bias + SiLU (i32 variant) SKU.
baracuda_kernels_gemm_s8_rrr_sm80_bias_silu_i32_run
S8 GEMM, RRR layout, bias + SiLU epilogue (i32 variant), sm_80.
baracuda_kernels_gemm_s8_rrr_sm80_can_implement
Pre-launch implementability check for the S8 RRR sm_80 Identity SKU.
baracuda_kernels_gemm_s8_rrr_sm80_run
S8 GEMM, RRR layout, Identity epilogue, sm_80.
baracuda_kernels_gemm_s8_rrr_sm80_workspace_size
Workspace size in bytes for the S8 RRR sm_80 Identity SKU at the given problem size. Always returns zero today; reserved for future SKUs that need scratch.
baracuda_kernels_gemm_tf32_rcr_sm80_can_implement
Pre-launch implementability check for this Cutlass GEMM SKU.
baracuda_kernels_gemm_tf32_rcr_sm80_run
Launch this Cutlass GEMM SKU on stream.
baracuda_kernels_gemm_tf32_rcr_sm80_workspace_size
Workspace bytes required by this Cutlass GEMM SKU.
baracuda_kernels_gemm_tf32_rrr_sm80_can_implement
Pre-launch implementability check for this Cutlass GEMM SKU.
baracuda_kernels_gemm_tf32_rrr_sm80_run
Launch this Cutlass GEMM SKU on stream.
baracuda_kernels_gemm_tf32_rrr_sm80_workspace_size
Workspace bytes required by this Cutlass GEMM SKU.
baracuda_kernels_gemm_u8_rcr_sm80_can_implement
Pre-launch implementability check for this Cutlass GEMM SKU.
baracuda_kernels_gemm_u8_rcr_sm80_run
Launch this Cutlass GEMM SKU on stream.
baracuda_kernels_gemm_u8_rcr_sm80_workspace_size
Workspace bytes required by this Cutlass GEMM SKU.
baracuda_kernels_gemm_u8_rrr_sm80_bias_f32_can_implement
Pre-launch implementability check for the U8 RRR sm_80 bias (f32 variant) SKU.
baracuda_kernels_gemm_u8_rrr_sm80_bias_f32_run
U8 GEMM, RRR layout, bias epilogue (f32 variant), sm_80.
baracuda_kernels_gemm_u8_rrr_sm80_bias_gelu_f32_can_implement
Pre-launch implementability check for the U8 RRR sm_80 bias + GELU (f32 variant) SKU.
baracuda_kernels_gemm_u8_rrr_sm80_bias_gelu_f32_run
U8 GEMM, RRR layout, bias + GELU epilogue (f32 variant), sm_80.
baracuda_kernels_gemm_u8_rrr_sm80_bias_gelu_i32_can_implement
Pre-launch implementability check for the U8 RRR sm_80 bias + GELU (i32 variant) SKU.
baracuda_kernels_gemm_u8_rrr_sm80_bias_gelu_i32_run
U8 GEMM, RRR layout, bias + GELU epilogue (i32 variant), sm_80.
baracuda_kernels_gemm_u8_rrr_sm80_bias_i32_can_implement
Pre-launch implementability check for the U8 RRR sm_80 bias (i32 variant) SKU.
baracuda_kernels_gemm_u8_rrr_sm80_bias_i32_run
U8 GEMM, RRR layout, bias epilogue (i32 variant), sm_80.
baracuda_kernels_gemm_u8_rrr_sm80_bias_relu_f32_can_implement
Pre-launch implementability check for the U8 RRR sm_80 bias + ReLU (f32 variant) SKU.
baracuda_kernels_gemm_u8_rrr_sm80_bias_relu_f32_run
U8 GEMM, RRR layout, bias + ReLU epilogue (f32 variant), sm_80.
baracuda_kernels_gemm_u8_rrr_sm80_bias_relu_i32_can_implement
Pre-launch implementability check for the U8 RRR sm_80 bias + ReLU (i32 variant) SKU.
baracuda_kernels_gemm_u8_rrr_sm80_bias_relu_i32_run
U8 GEMM, RRR layout, bias + ReLU epilogue (i32 variant), sm_80.
baracuda_kernels_gemm_u8_rrr_sm80_bias_silu_f32_can_implement
Pre-launch implementability check for the U8 RRR sm_80 bias + SiLU (f32 variant) SKU.
baracuda_kernels_gemm_u8_rrr_sm80_bias_silu_f32_run
U8 GEMM, RRR layout, bias + SiLU epilogue (f32 variant), sm_80.
baracuda_kernels_gemm_u8_rrr_sm80_bias_silu_i32_can_implement
Pre-launch implementability check for the U8 RRR sm_80 bias + SiLU (i32 variant) SKU.
baracuda_kernels_gemm_u8_rrr_sm80_bias_silu_i32_run
U8 GEMM, RRR layout, bias + SiLU epilogue (i32 variant), sm_80.
baracuda_kernels_gemm_u8_rrr_sm80_can_implement
Pre-launch implementability check for the U8 RRR sm_80 Identity SKU.
baracuda_kernels_gemm_u8_rrr_sm80_run
U8 GEMM, RRR layout, Identity epilogue, sm_80.
baracuda_kernels_gemm_u8_rrr_sm80_workspace_size
Workspace size in bytes for the U8 RRR sm_80 Identity SKU at the given problem size.
baracuda_kernels_grid_sample_2d_backward_f32_can_implement
Implementability check for grid_sample_2d_backward_f32.
baracuda_kernels_grid_sample_2d_backward_f32_run
grid_sample_2d BW, f32. Caller pre-zeros dinput and dgrid. dgrid: [N, OH, OW, 2]. Safety: same requirements as FW.
baracuda_kernels_grid_sample_2d_backward_f64_can_implement
Implementability check for grid_sample_2d_backward_f64.
baracuda_kernels_grid_sample_2d_backward_f64_run
grid_sample_2d BW, f64. Safety: same requirements as f32 BW.
baracuda_kernels_grid_sample_2d_f32_can_implement
Implementability check for grid_sample_2d_f32.
baracuda_kernels_grid_sample_2d_f32_run
grid_sample(input, grid) FW, f32. grid: [N, OH, OW, 2] with (x, y) normalized to [-1, 1]. Safety: same requirements as interpolate_*.
baracuda_kernels_grid_sample_2d_f64_can_implement
Implementability check for grid_sample_2d_f64.
baracuda_kernels_grid_sample_2d_f64_run
grid_sample_2d FW, f64. Safety: same requirements as f32.
baracuda_kernels_group_norm_backward_bf16_can_implement
Implementability check for group_norm_backward_bf16.
baracuda_kernels_group_norm_backward_bf16_run
GroupNorm BW, bf16.
baracuda_kernels_group_norm_backward_f16_can_implement
Implementability check for group_norm_backward_f16.
baracuda_kernels_group_norm_backward_f16_run
GroupNorm BW, f16.
baracuda_kernels_group_norm_backward_f32_can_implement
Implementability check for group_norm_backward_f32.
baracuda_kernels_group_norm_backward_f32_run
GroupNorm BW, f32. Workspace size: 2 * (n_extent * num_groups) * sizeof(float) bytes for the stage-1 partial sums.
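The workspace formula quoted above translates directly into a small host-side helper. This is a sketch only: the function name is hypothetical, and it simply evaluates 2 * (n_extent * num_groups) * sizeof(float) with overflow checks.

```rust
/// Bytes needed for the stage-1 partial sums of GroupNorm BW (f32), per the
/// documented formula 2 * (n_extent * num_groups) * sizeof(float).
/// Hypothetical helper, not part of this crate. Returns None on overflow.
fn group_norm_bw_f32_workspace_bytes(n_extent: usize, num_groups: usize) -> Option<usize> {
    n_extent
        .checked_mul(num_groups)?
        .checked_mul(2)?
        .checked_mul(std::mem::size_of::<f32>())
}
```

For example, 4 samples with 8 groups need 2 * 32 * 4 = 256 bytes.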
baracuda_kernels_group_norm_backward_f64_can_implement
Implementability check for group_norm_backward_f64.
baracuda_kernels_group_norm_backward_f64_run
GroupNorm BW, f64.
baracuda_kernels_group_norm_bf16_can_implement
Implementability check for group_norm_bf16.
baracuda_kernels_group_norm_bf16_run
GroupNorm FW, bf16.
baracuda_kernels_group_norm_f16_can_implement
Implementability check for group_norm_f16.
baracuda_kernels_group_norm_f16_run
GroupNorm FW, f16.
baracuda_kernels_group_norm_f32_can_implement
Implementability check for group_norm_f32.
baracuda_kernels_group_norm_f32_run
GroupNorm FW, f32. Per (sample, group) mean / inv_std, per-channel affine. num_groups must divide c_extent. group_kind = 1 selects the GN dispatch (also used by InstanceNorm with num_groups == c_extent).
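The per-(sample, group) statistics and per-channel affine described above can be sketched as a CPU reference. The contiguous [n, c, s] layout, the `eps` parameter, the biased variance, and the function name are all assumptions made for illustration; none of them is confirmed by this page.

```rust
/// CPU reference for GroupNorm forward on a contiguous [n, c, s] tensor.
/// Assumed, not taken from the crate: the layout, `eps`, biased variance.
/// Returns None when num_groups does not divide c.
fn group_norm_ref(
    x: &[f32], n: usize, c: usize, s: usize, num_groups: usize,
    gamma: &[f32], beta: &[f32], eps: f32,
) -> Option<Vec<f32>> {
    if num_groups == 0 || c % num_groups != 0 {
        return None;
    }
    let cpg = c / num_groups; // channels per group
    let group_len = cpg * s;
    let mut y = vec![0.0f32; x.len()];
    for b in 0..n {
        for g in 0..num_groups {
            let start = (b * c + g * cpg) * s;
            let slice = &x[start..start + group_len];
            // One mean / inv_std pair per (sample, group).
            let mean = slice.iter().sum::<f32>() / group_len as f32;
            let var = slice.iter().map(|v| (v - mean) * (v - mean)).sum::<f32>()
                / group_len as f32;
            let inv_std = 1.0 / (var + eps).sqrt();
            for ch in 0..cpg {
                let chan = g * cpg + ch; // affine parameters are per channel
                for i in 0..s {
                    let idx = start + ch * s + i;
                    y[idx] = (x[idx] - mean) * inv_std * gamma[chan] + beta[chan];
                }
            }
        }
    }
    Some(y)
}
```

Passing num_groups == c reproduces the InstanceNorm case mentioned above, since each group then holds exactly one channel.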
baracuda_kernels_group_norm_f64_can_implement
Implementability check for group_norm_f64.
baracuda_kernels_group_norm_f64_run
GroupNorm FW, f64.
baracuda_kernels_gumbel_softmax_bf16_can_implement
Implementability check for gumbel_softmax_bf16.
baracuda_kernels_gumbel_softmax_bf16_run
GumbelSoftmax FW, bf16.
baracuda_kernels_gumbel_softmax_f16_can_implement
Implementability check for gumbel_softmax_f16.
baracuda_kernels_gumbel_softmax_f16_run
GumbelSoftmax FW, f16.
baracuda_kernels_gumbel_softmax_f32_can_implement
Implementability check for gumbel_softmax_f32.
baracuda_kernels_gumbel_softmax_f32_run
GumbelSoftmax FW, f32. y = softmax((x + g) / τ) where g = -log(-log(u)) and u is a caller-supplied cuRAND uniform buffer (one f32 per output cell, dense / contiguous layout). inv_tau = 1/τ. hard != 0 → one-hot at the noisy argmax.
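The formula above can be checked on the host with a one-row reference. This is a minimal sketch: the function name is hypothetical, the max-subtraction is a standard stabilisation that may or may not match the kernel, and `u` is assumed to lie strictly inside (0, 1).

```rust
/// One row of y = softmax((x + g) * inv_tau) with g = -ln(-ln(u)).
/// With `hard`, returns a one-hot vector at the noisy argmax instead.
/// Hypothetical reference; `u` must lie strictly inside (0, 1).
fn gumbel_softmax_row(x: &[f32], u: &[f32], inv_tau: f32, hard: bool) -> Vec<f32> {
    let z: Vec<f32> = x
        .iter()
        .zip(u)
        .map(|(&xi, &ui)| {
            let g = -(-ui.ln()).ln(); // Gumbel(0, 1) noise from a uniform sample
            (xi + g) * inv_tau
        })
        .collect();
    // Track the noisy argmax and its value (first maximum wins on ties).
    let (arg, max) = z.iter().copied().enumerate().fold(
        (0usize, f32::NEG_INFINITY),
        |(bi, bv), (i, v)| if v > bv { (i, v) } else { (bi, bv) },
    );
    if hard {
        return (0..z.len()).map(|i| if i == arg { 1.0 } else { 0.0 }).collect();
    }
    // Subtract the max before exponentiating for numerical stability.
    let exps: Vec<f32> = z.iter().map(|&v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```

With identical logits and identical uniforms the noise cancels and the soft output is uniform, which makes a convenient sanity check.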
baracuda_kernels_gumbel_softmax_f64_can_implement
Implementability check for gumbel_softmax_f64.
baracuda_kernels_gumbel_softmax_f64_run
GumbelSoftmax FW, f64.
baracuda_kernels_histogram_f32_can_implement
Implementability check for histogram_f32.
baracuda_kernels_histogram_f32_run
1-D histogram, f32 input. lo / hi passed as double — kernel casts to T (keeps the FFI shape uniform across dtypes).
baracuda_kernels_histogram_f64_can_implement
Implementability check for histogram_f64.
baracuda_kernels_histogram_f64_run
1-D histogram, f64 input.
baracuda_kernels_ifftshift_4_can_implement
Implementability check for ifftshift_4.
baracuda_kernels_ifftshift_4_run
Inverse fftshift along the last axis of a [batch, n] tensor: y[b, i] = x[b, (i + (n + 1) / 2) % n]. Differs from fftshift only for odd n; for even n the two are identical (each permutation is self-inverse). 4-byte cells.
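The index formula above can be exercised on the host with a direct gather. This sketch implements the formula exactly as written, with a hypothetical function name. For odd n, NumPy's `ifftshift` gathers with offset n / 2 rather than (n + 1) / 2, so the odd-n direction is worth confirming against the kernel.

```rust
/// Gather along the last axis of a [batch, n] buffer using the documented
/// formula y[b, i] = x[b, (i + (n + 1) / 2) % n].
/// Hypothetical host-side reference; generic over the cell type.
fn shift_last_axis<T: Copy>(x: &[T], batch: usize, n: usize) -> Vec<T> {
    let mut y = Vec::with_capacity(batch * n);
    for b in 0..batch {
        for i in 0..n {
            y.push(x[b * n + (i + (n + 1) / 2) % n]);
        }
    }
    y
}
```

For even n, applying the function twice returns the input, which matches the self-inverse remark above.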
baracuda_kernels_ifftshift_8_can_implement
Implementability check for ifftshift_8.
baracuda_kernels_ifftshift_8_run
8-byte-element inverse fftshift.
baracuda_kernels_ifftshift_16_can_implement
Implementability check for ifftshift_16.
baracuda_kernels_ifftshift_16_run
16-byte-element inverse fftshift.
baracuda_kernels_im2col_1d_bf16_can_implement
Implementability check for im2col_1d_bf16.
baracuda_kernels_im2col_1d_bf16_run
im2col 1-D, bf16.
baracuda_kernels_im2col_1d_f16_can_implement
Implementability check for im2col_1d_f16.
baracuda_kernels_im2col_1d_f16_run
im2col 1-D, f16.
baracuda_kernels_im2col_1d_f32_can_implement
Implementability check for im2col_1d_f32.
baracuda_kernels_im2col_1d_f32_run
im2col 1-D, f32.
baracuda_kernels_im2col_1d_f64_can_implement
Implementability check for im2col_1d_f64.
baracuda_kernels_im2col_1d_f64_run
im2col 1-D, f64.
baracuda_kernels_im2col_2d_bf16_can_implement
Implementability check for im2col_2d_bf16.
baracuda_kernels_im2col_2d_bf16_run
im2col 2-D, bf16.
baracuda_kernels_im2col_2d_f16_can_implement
Implementability check for im2col_2d_f16.
baracuda_kernels_im2col_2d_f16_run
im2col 2-D, f16.
baracuda_kernels_im2col_2d_f32_can_implement
Implementability check for im2col_2d_f32.
baracuda_kernels_im2col_2d_f32_run
im2col 2-D, f32.
baracuda_kernels_im2col_2d_f64_can_implement
Implementability check for im2col_2d_f64.
baracuda_kernels_im2col_2d_f64_run
im2col 2-D, f64.
baracuda_kernels_index_add_bf16_can_implement
Implementability check for index_add_bf16.
baracuda_kernels_index_add_bf16_run
index_add, bf16, i32 idx. Accumulates via the atomicCAS-based baracuda::atomic::add<__nv_bfloat16> (same caveats as f16).
baracuda_kernels_index_add_f16_can_implement
Implementability check for index_add_f16.
baracuda_kernels_index_add_f16_run
index_add, f16, i32 idx. Accumulates via the atomicCAS-based baracuda::atomic::add<__half> (deterministic per-thread arithmetic regardless of CUDA toolkit; non-deterministic accumulation order).
baracuda_kernels_index_add_f32_can_implement
Implementability check for index_add_f32.
baracuda_kernels_index_add_f32_run
index_add: dst[idx[i], ...] += src[i, ...], f32, i32 idx.
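The accumulation rule dst[idx[i], ...] += src[i, ...] can be sketched on the host for the 2-D row-major case. The function name and the error on out-of-range indices are assumptions of this sketch; how the device kernel treats bad indices is not documented here.

```rust
/// Host-side reference for dst[idx[i], ..] += src[i, ..] on row-major 2-D
/// buffers with `cols` columns. Hypothetical helper, not part of this crate.
/// Repeated indices accumulate. Returns Err on an out-of-range index.
fn index_add_ref(dst: &mut [f32], src: &[f32], idx: &[i32], cols: usize) -> Result<(), String> {
    let rows = if cols == 0 { 0 } else { dst.len() / cols };
    for (i, &raw) in idx.iter().enumerate() {
        if raw < 0 || raw as usize >= rows {
            return Err(format!("index {} out of range for {} rows", raw, rows));
        }
        let row = raw as usize;
        for c in 0..cols {
            dst[row * cols + c] += src[i * cols + c];
        }
    }
    Ok(())
}
```

The sequential loop gives one fixed summation order, whereas the atomic device kernels do not guarantee an order, so float results may differ in the last bits when indices repeat.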
baracuda_kernels_index_add_f64_can_implement
Implementability check for index_add_f64.
baracuda_kernels_index_add_f64_run
index_add — f64, i32 idx.
baracuda_kernels_index_add_i32_can_implement
Implementability check for index_add_i32.
baracuda_kernels_index_add_i32_run
index_add, i32 values, i32 idx.
baracuda_kernels_index_add_i64_can_implement
Implementability check for index_add_i64.
baracuda_kernels_index_add_i64_run
index_add, i64 values, i32 idx.
baracuda_kernels_index_add_i64idx_bf16_can_implement
Implementability check for index_add_i64idx_bf16.
baracuda_kernels_index_add_i64idx_bf16_run
index_add — bf16, i64 idx.
baracuda_kernels_index_add_i64idx_f16_can_implement
Implementability check for index_add_i64idx_f16.
baracuda_kernels_index_add_i64idx_f16_run
index_add — f16, i64 idx.
baracuda_kernels_index_add_i64idx_f32_can_implement
Implementability check for index_add_i64idx_f32.
baracuda_kernels_index_add_i64idx_f32_run
index_add — f32, i64 idx.
baracuda_kernels_index_add_i64idx_f64_can_implement
Implementability check for index_add_i64idx_f64.
baracuda_kernels_index_add_i64idx_f64_run
index_add — f64, i64 idx.
baracuda_kernels_index_add_i64idx_i32_can_implement
Implementability check for index_add_i64idx_i32.
baracuda_kernels_index_add_i64idx_i32_run
index_add, i32 values, i64 idx.
baracuda_kernels_index_add_i64idx_i64_can_implement
Implementability check for index_add_i64idx_i64.
baracuda_kernels_index_add_i64idx_i64_run
index_add, i64 values, i64 idx.
baracuda_kernels_index_add_i64idx_u32_can_implement
Implementability check for index_add_i64idx_u32.
baracuda_kernels_index_add_i64idx_u32_run
index_add, u32 values, i64 idx.
baracuda_kernels_index_add_u32_can_implement
Implementability check for index_add_u32.
baracuda_kernels_index_add_u32_run
index_add, u32 values, i32 idx.
baracuda_kernels_index_select_backward_f32_can_implement
Implementability check for index_select_backward_f32.
baracuda_kernels_index_select_backward_f32_run
dsrc[..., idx[j], ...] += dout[..., j, ...] along select_dim. f32 (atomicAdd).
baracuda_kernels_index_select_backward_f64_can_implement
Implementability check for index_select_backward_f64.
baracuda_kernels_index_select_backward_f64_run
index_select_backward — f64 (atomicAdd).
baracuda_kernels_index_select_backward_i64idx_f32_can_implement
Implementability check for index_select_backward_i64idx_f32.
baracuda_kernels_index_select_backward_i64idx_f32_run
index_select BW — f32, i64 indices.
baracuda_kernels_index_select_backward_i64idx_f64_can_implement
Implementability check for index_select_backward_i64idx_f64.
baracuda_kernels_index_select_backward_i64idx_f64_run
index_select BW — f64, i64 indices.
baracuda_kernels_index_select_f32_can_implement
Implementability check for index_select_f32.
baracuda_kernels_index_select_f32_run
out[..., j, ...] = src[..., idx[j], ...] along select_dim. idx is 1-D i32. f32.
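The gather out[..., j, ...] = src[..., idx[j], ...] can be sketched on the host by viewing the tensor as [outer, dim, inner], where dim is the extent along select_dim. That flattening, the function name, and returning None on a bad index are assumptions of this sketch.

```rust
/// Host-side reference for index_select with the tensor viewed as
/// [outer, dim, inner]; `dim` is the extent along select_dim.
/// Hypothetical helper. Returns None if any index falls outside [0, dim).
fn index_select_ref(
    src: &[f32], outer: usize, dim: usize, inner: usize, idx: &[i32],
) -> Option<Vec<f32>> {
    let mut out = Vec::with_capacity(outer * idx.len() * inner);
    for o in 0..outer {
        for &raw in idx {
            if raw < 0 || raw as usize >= dim {
                return None;
            }
            // Copy the contiguous `inner` run selected by this index.
            let base = (o * dim + raw as usize) * inner;
            out.extend_from_slice(&src[base..base + inner]);
        }
    }
    Some(out)
}
```

For a 2 × 3 matrix, selecting along the last axis uses (outer, dim, inner) = (2, 3, 1) and selecting rows uses (1, 2, 3).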
baracuda_kernels_index_select_f64_can_implement
Implementability check for index_select_f64.
baracuda_kernels_index_select_f64_run
index_select — f64.
baracuda_kernels_index_select_i8_can_implement
Implementability check for index_select_i8.
baracuda_kernels_index_select_i8_run
index_select, i8 values, i32 indices.
baracuda_kernels_index_select_i16_can_implement
Implementability check for index_select_i16.
baracuda_kernels_index_select_i16_run
index_select, i16 values, i32 indices.
baracuda_kernels_index_select_i32_can_implement
Implementability check for index_select_i32.
baracuda_kernels_index_select_i32_run
index_select — i32.
baracuda_kernels_index_select_i64_can_implement
Implementability check for index_select_i64.
baracuda_kernels_index_select_i64_run
index_select, i64 values, i32 indices.
baracuda_kernels_index_select_i64idx_f32_can_implement
Implementability check for index_select_i64idx_f32.
baracuda_kernels_index_select_i64idx_f32_run
index_select — f32, i64 indices.
baracuda_kernels_index_select_i64idx_f64_can_implement
Implementability check for index_select_i64idx_f64.
baracuda_kernels_index_select_i64idx_f64_run
index_select — f64, i64 indices.
baracuda_kernels_index_select_i64idx_i8_can_implement
Implementability check for index_select_i64idx_i8.
baracuda_kernels_index_select_i64idx_i8_run
index_select, i8 values, i64 indices.
baracuda_kernels_index_select_i64idx_i16_can_implement
Implementability check for index_select_i64idx_i16.
baracuda_kernels_index_select_i64idx_i16_run
index_select, i16 values, i64 indices.
baracuda_kernels_index_select_i64idx_i32_can_implement
Implementability check for index_select_i64idx_i32.
baracuda_kernels_index_select_i64idx_i32_run
index_select — i32 values, i64 indices.
baracuda_kernels_index_select_i64idx_i64_can_implement
Implementability check for index_select_i64idx_i64.
baracuda_kernels_index_select_i64idx_i64_run
index_select, i64 values, i64 indices.
baracuda_kernels_index_select_i64idx_u8_can_implement
Implementability check for index_select_i64idx_u8.
baracuda_kernels_index_select_i64idx_u8_run
index_select, u8 values, i64 indices.
baracuda_kernels_index_select_i64idx_u16_can_implement
Implementability check for index_select_i64idx_u16.
baracuda_kernels_index_select_i64idx_u16_run
index_select, u16 values, i64 indices.
baracuda_kernels_index_select_i64idx_u32_can_implement
Implementability check for index_select_i64idx_u32.
baracuda_kernels_index_select_i64idx_u32_run
index_select, u32 values, i64 indices.
baracuda_kernels_index_select_u8_can_implement
Implementability check for index_select_u8.
baracuda_kernels_index_select_u8_run
index_select, u8 values, i32 indices.
baracuda_kernels_index_select_u16_can_implement
Implementability check for index_select_u16.
baracuda_kernels_index_select_u16_run
index_select, u16 values, i32 indices.
baracuda_kernels_index_select_u32_can_implement
Implementability check for index_select_u32.
baracuda_kernels_index_select_u32_run
index_select, u32 values, i32 indices.
baracuda_kernels_interpolate_bilinear_2d_backward_bf16_can_implement
Implementability check for interpolate_bilinear_2d_backward_bf16.
baracuda_kernels_interpolate_bilinear_2d_backward_bf16_run
interpolate_bilinear_2d BW, bf16. Caller pre-zeros dinput. atomicCAS-based bf16 atomic add. Safety: same requirements as f32 BW.
baracuda_kernels_interpolate_bilinear_2d_backward_f16_can_implement
Implementability check for interpolate_bilinear_2d_backward_f16.
baracuda_kernels_interpolate_bilinear_2d_backward_f16_run
interpolate_bilinear_2d BW, f16. Caller pre-zeros dinput. atomicCAS-based half atomic add. Safety: same requirements as f32 BW.
baracuda_kernels_interpolate_bilinear_2d_backward_f32_can_implement
Implementability check for interpolate_bilinear_2d_backward_f32.
baracuda_kernels_interpolate_bilinear_2d_backward_f32_run
interpolate_bilinear_2d BW, f32. Caller pre-zeros dinput.
baracuda_kernels_interpolate_bilinear_2d_backward_f64_can_implement
Implementability check for interpolate_bilinear_2d_backward_f64.
baracuda_kernels_interpolate_bilinear_2d_backward_f64_run
interpolate_bilinear_2d BW, f64. Safety: same requirements as f32 BW.
baracuda_kernels_interpolate_bilinear_2d_bf16_can_implement
Implementability check for interpolate_bilinear_2d_bf16.
baracuda_kernels_interpolate_bilinear_2d_bf16_run
interpolate_bilinear_2d FW, bf16. Cast-at-read / f32 accumulator / cast-at-write. Safety: same requirements as f32.
baracuda_kernels_interpolate_bilinear_2d_f16_can_implement
Implementability check for interpolate_bilinear_2d_f16.
baracuda_kernels_interpolate_bilinear_2d_f16_run
interpolate_bilinear_2d FW, f16 (half). Cast-at-read / f32 accumulator / cast-at-write. Safety: same requirements as f32.
baracuda_kernels_interpolate_bilinear_2d_f32_can_implement
Implementability check for interpolate_bilinear_2d_f32.
baracuda_kernels_interpolate_bilinear_2d_f32_run
interpolate(x, mode='bilinear') FW, f32. input: [N, C, IH, IW]; output: [N, C, OH, OW]. NCHW. align_corners: 0 = false (PyTorch default), nonzero = true. scale_h_factor / scale_w_factor: 0.0 = derive from sizes; nonzero = use as SCALE override.
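The effect of align_corners can be illustrated along a single axis. The sketch below assumes the kernel follows PyTorch's source-coordinate convention when the scale is derived from sizes; that convention, the function names, and the omission of the scale-override path are all assumptions, not facts taken from this page.

```rust
/// Source coordinate for output index `dst` under PyTorch's convention
/// (assumed here). Scale is derived from sizes; the override path is omitted.
/// Requires in_size >= 1 and out_size >= 1.
fn source_coord(dst: usize, in_size: usize, out_size: usize, align_corners: bool) -> f32 {
    if align_corners {
        if out_size > 1 {
            dst as f32 * (in_size - 1) as f32 / (out_size - 1) as f32
        } else {
            0.0
        }
    } else {
        let scale = in_size as f32 / out_size as f32;
        // Pixel-centre mapping, clamped so it never reads left of sample 0.
        ((dst as f32 + 0.5) * scale - 0.5).max(0.0)
    }
}

/// Linear interpolation along one axis; bilinear applies it along H and W.
/// Requires a non-empty input.
fn interp_1d(x: &[f32], out_size: usize, align_corners: bool) -> Vec<f32> {
    (0..out_size)
        .map(|d| {
            let s = source_coord(d, x.len(), out_size, align_corners);
            let i0 = (s.floor() as usize).min(x.len() - 1);
            let i1 = (i0 + 1).min(x.len() - 1); // clamp at the right edge
            let w = s - i0 as f32;
            x[i0] * (1.0 - w) + x[i1] * w
        })
        .collect()
}
```

With align_corners set, the first and last output samples land exactly on the first and last input samples; without it, samples are placed at pixel centres.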
baracuda_kernels_interpolate_bilinear_2d_f64_can_implement
Implementability check for interpolate_bilinear_2d_f64.
baracuda_kernels_interpolate_bilinear_2d_f64_run
interpolate_bilinear_2d FW, f64. Safety: same requirements as f32.
baracuda_kernels_inverse_f32_run
Matrix inverse via getrf + getrs over caller-staged identity in inv_inout. The caller MUST pre-stage an n × n identity in inv_inout (column-major) before invoking. After the call, inv_inout holds A^{-1} and a_inout holds the packed LU factors.
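Pre-staging the identity is the step callers are most likely to forget, so here is a minimal host-side sketch of it. The function name is hypothetical, and a leading dimension equal to n is assumed; the buffer would then be copied into inv_inout on the device before the call.

```rust
/// Builds an n x n identity in column-major order with leading dimension n.
/// Hypothetical helper; the result is meant to be uploaded into `inv_inout`.
fn identity_col_major(n: usize) -> Vec<f32> {
    let mut m = vec![0.0f32; n * n];
    for i in 0..n {
        // Element (row i, col i) lives at offset row + col * ld, with ld = n.
        m[i + i * n] = 1.0;
    }
    m
}
```

Solving A · X = I with the LU factors is what turns the staged identity into A^{-1}.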
baracuda_kernels_inverse_f32_workspace_size
Inverse workspace size (== getrf workspace).
baracuda_kernels_inverse_f64_run
Matrix inverse via getrf + getrs over caller-staged identity in inv_inout. The caller MUST pre-stage an n × n identity in inv_inout (column-major) before invoking. After the call, inv_inout holds A^{-1} and a_inout holds the packed LU factors.
baracuda_kernels_inverse_f64_workspace_size
Inverse workspace size (== getrf workspace).
baracuda_kernels_irfft_1d_f32_run
1-D C2R FFT (Hermitian-half complex → real). Applies 1/n normalization in-place (PyTorch norm="backward"). n is the real-side output length; complex input shape is [batch, n/2 + 1].
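The shape rule and the 1/n scaling can be checked against a naive host-side transform. This is an O(n²) sketch for small test sizes, using hypothetical function names and (re, im) tuples in place of the crate's complex structs.

```rust
/// Complex bins per batch row for a real output of length n.
fn hermitian_len(n: usize) -> usize {
    n / 2 + 1
}

/// Naive C2R transform with 1/n scaling (the "backward" normalization).
/// `spec` holds the n/2 + 1 non-redundant bins as (re, im) pairs.
/// Hypothetical reference for small n only.
fn irfft_naive(spec: &[(f64, f64)], n: usize) -> Vec<f64> {
    assert_eq!(spec.len(), hermitian_len(n));
    (0..n)
        .map(|t| {
            let mut acc = 0.0;
            for k in 0..n {
                // Bins above n/2 are the conjugates of their mirror images.
                let (re, im) = if k < spec.len() {
                    spec[k]
                } else {
                    let (r, i) = spec[n - k];
                    (r, -i)
                };
                let ang = 2.0 * std::f64::consts::PI * (k * t) as f64 / n as f64;
                acc += re * ang.cos() - im * ang.sin(); // real part of X[k] * e^{i ang}
            }
            acc / n as f64
        })
        .collect()
}
```

For instance, the half-spectrum [10, -2 + 2i, -2] with n = 4 transforms back to the real signal [1, 2, 3, 4].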
baracuda_kernels_irfft_1d_f32_workspace_size
1-D C2R FFT workspace size in bytes — always 0.
baracuda_kernels_irfft_1d_f64_run
1-D C2R FFT (Hermitian-half complex → real). Applies 1/n normalization in-place (PyTorch norm="backward"). n is the real-side output length; complex input shape is [batch, n/2 + 1].
baracuda_kernels_irfft_1d_f64_workspace_size
1-D C2R FFT workspace size in bytes — always 0.
baracuda_kernels_irfft_nd_f32_run
ND C2R FFT (Hermitian-half complex → real). Applies 1/product(dims[..rank]) normalization in-place. dims carries the real-side extents.
baracuda_kernels_irfft_nd_f32_workspace_size
ND C2R FFT workspace size in bytes — always 0.
baracuda_kernels_irfft_nd_f64_run
ND C2R FFT (Hermitian-half complex → real). Applies 1/product(dims[..rank]) normalization in-place. dims carries the real-side extents.
baracuda_kernels_irfft_nd_f64_workspace_size
ND C2R FFT workspace size in bytes — always 0.
baracuda_kernels_kv_cache_append_bf16_can_implement
Implementability check for kv_cache_append_bf16. Host-side only.
baracuda_kernels_kv_cache_append_bf16_run
KV-cache append, bf16.
baracuda_kernels_kv_cache_append_f16_can_implement
Implementability check for kv_cache_append_f16. Host-side only.
baracuda_kernels_kv_cache_append_f16_run
KV-cache append, f16.
baracuda_kernels_kv_cache_append_f32_can_implement
Implementability check for kv_cache_append_f32. Host-side only.
baracuda_kernels_kv_cache_append_f32_run
KV-cache append, f32.
baracuda_kernels_kv_cache_append_f64_can_implement
Implementability check for kv_cache_append_f64. Host-side only.
baracuda_kernels_kv_cache_append_f64_run
KV-cache append, f64.
baracuda_kernels_layer_norm_backward_bf16_can_implement
LayerNorm BW _can_implement, bf16.
baracuda_kernels_layer_norm_backward_bf16_run
LayerNorm BW, bf16.
baracuda_kernels_layer_norm_backward_bf16_strided_can_implement
layer_norm_backward_bf16_strided_can_implement companion.
baracuda_kernels_layer_norm_backward_bf16_strided_run
LayerNorm BW strided sibling, bf16.
baracuda_kernels_layer_norm_backward_f16_can_implement
LayerNorm BW _can_implement, f16.
baracuda_kernels_layer_norm_backward_f16_run
LayerNorm BW, f16.
baracuda_kernels_layer_norm_backward_f16_strided_can_implement
layer_norm_backward_f16_strided_can_implement companion.
baracuda_kernels_layer_norm_backward_f16_strided_run
LayerNorm BW strided sibling, f16.
baracuda_kernels_layer_norm_backward_f32_can_implement
LayerNorm BW _can_implement, f32.
baracuda_kernels_layer_norm_backward_f32_run
LayerNorm BW, f32. Computes dx and (when non-null) dgamma / dbeta reductions. Caller passes saved mean + inv_std from FW.
baracuda_kernels_layer_norm_backward_f32_strided_can_implement
layer_norm_backward_f32_strided_can_implement companion.
baracuda_kernels_layer_norm_backward_f32_strided_run
LayerNorm BW strided sibling, f32. Same contract as baracuda_kernels_layer_norm_backward_f32_run; identical launcher.
baracuda_kernels_layer_norm_backward_f64_can_implement
LayerNorm BW _can_implement, f64.
baracuda_kernels_layer_norm_backward_f64_run
LayerNorm BW, f64.
baracuda_kernels_layer_norm_backward_f64_strided_can_implement
layer_norm_backward_f64_strided_can_implement companion.
baracuda_kernels_layer_norm_backward_f64_strided_run
LayerNorm BW strided sibling, f64.
baracuda_kernels_layer_norm_bf16_can_implement
LayerNorm FW _can_implement, bf16.
baracuda_kernels_layer_norm_bf16_run
LayerNorm FW, bf16. f32 accumulator inside the kernel.
baracuda_kernels_layer_norm_bf16_strided_can_implement
layer_norm_bf16_strided_can_implement companion.
baracuda_kernels_layer_norm_bf16_strided_run
LayerNorm FW strided sibling, bf16.
baracuda_kernels_layer_norm_f16_can_implement
LayerNorm FW _can_implement, f16.
baracuda_kernels_layer_norm_f16_run
LayerNorm FW, f16. f32 accumulator inside the kernel.
baracuda_kernels_layer_norm_f16_strided_can_implement
layer_norm_f16_strided_can_implement companion.
baracuda_kernels_layer_norm_f16_strided_run
LayerNorm FW strided sibling, f16.
baracuda_kernels_layer_norm_f32_can_implement
LayerNorm FW _can_implement, f32.
baracuda_kernels_layer_norm_f32_run
LayerNorm FW, f32. y = (x - mean) / sqrt(var + eps) * gamma + beta. gamma / beta independently optional. Biased (“population”) variance. Save buffers mean_out / inv_std_out share stride_save, each shape == input with norm axes collapsed to 1.
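As a CPU reference for checking device results, the forward formula can be sketched for a single normalized row. `layer_norm_row` is a hypothetical name; it uses `Option` slices to mirror the independently optional gamma / beta and returns the two values that would be written to the save buffers.

```rust
/// Hypothetical CPU reference for one normalized row.
/// Returns (y, mean, inv_std).
fn layer_norm_row(
    x: &[f32],
    gamma: Option<&[f32]>,
    beta: Option<&[f32]>,
    eps: f32,
) -> (Vec<f32>, f32, f32) {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    // Biased ("population") variance: divide by n, not n - 1.
    let var = x.iter().map(|v| (v - mean) * (v - mean)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    let y: Vec<f32> = x
        .iter()
        .enumerate()
        .map(|(i, v)| {
            // A missing gamma acts as 1, a missing beta as 0.
            let g = gamma.map_or(1.0, |g| g[i]);
            let b = beta.map_or(0.0, |b| b[i]);
            (v - mean) * inv_std * g + b
        })
        .collect();
    (y, mean, inv_std)
}
```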
baracuda_kernels_layer_norm_f32_strided_can_implement
layer_norm_f32_strided_can_implement companion.
baracuda_kernels_layer_norm_f32_strided_run
LayerNorm FW strided sibling, f32. Same contract as baracuda_kernels_layer_norm_f32_run; identical underlying launcher.
baracuda_kernels_layer_norm_f64_can_implement
LayerNorm FW _can_implement, f64.
baracuda_kernels_layer_norm_f64_run
LayerNorm FW, f64.
baracuda_kernels_layer_norm_f64_strided_can_implement
layer_norm_f64_strided_can_implement companion.
baracuda_kernels_layer_norm_f64_strided_run
LayerNorm FW strided sibling, f64.
baracuda_kernels_log_softmax_backward_bf16_can_implement
LogSoftmax BW _can_implement, bf16.
baracuda_kernels_log_softmax_backward_bf16_run
LogSoftmax BW, bf16.
baracuda_kernels_log_softmax_backward_bf16_strided_can_implement
log_softmax_backward_bf16_strided_can_implement companion.
baracuda_kernels_log_softmax_backward_bf16_strided_run
LogSoftmax BW strided sibling, bf16.
baracuda_kernels_log_softmax_backward_f16_can_implement
LogSoftmax BW _can_implement, f16.
baracuda_kernels_log_softmax_backward_f16_run
LogSoftmax BW, f16.
baracuda_kernels_log_softmax_backward_f16_strided_can_implement
log_softmax_backward_f16_strided_can_implement companion.
baracuda_kernels_log_softmax_backward_f16_strided_run
LogSoftmax BW strided sibling, f16.
baracuda_kernels_log_softmax_backward_f32_can_implement
LogSoftmax BW _can_implement, f32.
baracuda_kernels_log_softmax_backward_f32_run
LogSoftmax BW, f32. dx[k] = dy[k] - exp(y[k]) * Σ_j dy[j]. Caller passes the saved forward output y (log-softmax values).
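The backward formula can be checked against a short CPU reference over one softmax row. `log_softmax_backward` is a hypothetical name; `y` holds the saved log-softmax outputs and `dy` the incoming gradient.

```rust
/// Hypothetical CPU reference: dx[k] = dy[k] - exp(y[k]) * sum_j dy[j].
fn log_softmax_backward(y: &[f32], dy: &[f32]) -> Vec<f32> {
    let sum_dy: f32 = dy.iter().sum();
    y.iter()
        .zip(dy)
        .map(|(yk, dyk)| dyk - yk.exp() * sum_dy)
        .collect()
}
```

Since exp(y) sums to 1 along the row, the resulting dx always sums to 0.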
baracuda_kernels_log_softmax_backward_f32_strided_can_implement
log_softmax_backward_f32_strided_can_implement companion.
baracuda_kernels_log_softmax_backward_f32_strided_run
LogSoftmax BW strided sibling, f32. ABI identical to softmax BW.
baracuda_kernels_log_softmax_backward_f64_can_implement
LogSoftmax BW _can_implement, f64.
baracuda_kernels_log_softmax_backward_f64_run
LogSoftmax BW, f64.
baracuda_kernels_log_softmax_backward_f64_strided_can_implement
log_softmax_backward_f64_strided_can_implement companion.
baracuda_kernels_log_softmax_backward_f64_strided_run
LogSoftmax BW strided sibling, f64.
baracuda_kernels_log_softmax_bf16_can_implement
LogSoftmax FW _can_implement, bf16.
baracuda_kernels_log_softmax_bf16_run
LogSoftmax FW, bf16.
baracuda_kernels_log_softmax_bf16_strided_can_implement
log_softmax_bf16_strided_can_implement companion.
baracuda_kernels_log_softmax_bf16_strided_run
LogSoftmax FW strided sibling, bf16.
baracuda_kernels_log_softmax_f16_can_implement
LogSoftmax FW _can_implement, f16.
baracuda_kernels_log_softmax_f16_run
LogSoftmax FW, f16. f32 accumulator inside the kernel.
baracuda_kernels_log_softmax_f16_strided_can_implement
log_softmax_f16_strided_can_implement companion.
baracuda_kernels_log_softmax_f16_strided_run
LogSoftmax FW strided sibling, f16.
baracuda_kernels_log_softmax_f32_can_implement
LogSoftmax FW _can_implement, f32.
baracuda_kernels_log_softmax_f32_run
LogSoftmax FW, f32. y[k] = (x[k] - max) - log(Σ exp(x[j] - max)) along softmax_axis. Numerically stable.
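A minimal CPU sketch of the max-subtracted form for one row along the softmax axis. `log_softmax` is a hypothetical reference function, useful mainly to see why large inputs do not overflow.

```rust
/// Hypothetical CPU reference: y[k] = (x[k] - max) - ln(sum_j exp(x[j] - max)).
fn log_softmax(x: &[f32]) -> Vec<f32> {
    // Subtracting the row maximum keeps every exponent <= 0.
    let max = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let log_sum = x.iter().map(|v| (v - max).exp()).sum::<f32>().ln();
    x.iter().map(|v| (v - max) - log_sum).collect()
}
```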
baracuda_kernels_log_softmax_f32_strided_can_implement
log_softmax_f32_strided_can_implement companion.
baracuda_kernels_log_softmax_f32_strided_run
LogSoftmax FW strided sibling, f32. ABI identical to softmax FW.
baracuda_kernels_log_softmax_f64_can_implement
LogSoftmax FW _can_implement, f64.
baracuda_kernels_log_softmax_f64_run
LogSoftmax FW, f64.
baracuda_kernels_log_softmax_f64_strided_can_implement
log_softmax_f64_strided_can_implement companion.
baracuda_kernels_log_softmax_f64_strided_run
LogSoftmax FW strided sibling, f64.
baracuda_kernels_loss_bce_backward_bf16_can_implement
BCE BW _can_implement, bf16.
baracuda_kernels_loss_bce_backward_bf16_run
BCE BW, bf16.
baracuda_kernels_loss_bce_backward_f16_can_implement
BCE BW _can_implement, f16.
baracuda_kernels_loss_bce_backward_f16_run
BCE BW, f16.
baracuda_kernels_loss_bce_backward_f32_can_implement
BCE BW _can_implement, f32.
baracuda_kernels_loss_bce_backward_f32_run
BCE BW, f32. dpred = (pred - target) / (pred·(1-pred)) · scale.
baracuda_kernels_loss_bce_backward_f64_can_implement
BCE BW _can_implement, f64.
baracuda_kernels_loss_bce_backward_f64_run
BCE BW, f64.
baracuda_kernels_loss_bce_bf16_can_implement
BCE FW _can_implement, bf16.
baracuda_kernels_loss_bce_bf16_run
BCE FW, bf16.
baracuda_kernels_loss_bce_f16_can_implement
BCE FW _can_implement, f16.
baracuda_kernels_loss_bce_f16_run
BCE FW, f16.
baracuda_kernels_loss_bce_f32_can_implement
BCE FW _can_implement, f32.
baracuda_kernels_loss_bce_f32_run
BCE FW, f32. -(t·log(p) + (1-t)·log(1-p)) per-cell, then reduce. Caller ensures pred ∈ (0, 1).
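The per-cell forward value and the gradient listed for the f32 BCE backward entry can be cross-checked with a small CPU sketch. Both function names are hypothetical, and f64 is used here only to keep a finite-difference comparison well conditioned.

```rust
/// Hypothetical CPU reference for the per-cell BCE value.
/// Requires pred in (0, 1), as the kernel contract does.
fn bce_cell(pred: f64, target: f64) -> f64 {
    -(target * pred.ln() + (1.0 - target) * (1.0 - pred).ln())
}

/// Hypothetical CPU reference for the per-cell gradient:
/// dpred = (pred - target) / (pred * (1 - pred)) * scale.
fn bce_cell_grad(pred: f64, target: f64, scale: f64) -> f64 {
    (pred - target) / (pred * (1.0 - pred)) * scale
}
```

The division by pred·(1-pred) is why pred must stay strictly inside (0, 1): at either endpoint both the loss and its gradient are unbounded.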
baracuda_kernels_loss_bce_f64_can_implement
BCE FW _can_implement, f64.
baracuda_kernels_loss_bce_f64_run
BCE FW, f64.
baracuda_kernels_loss_bce_with_logits_backward_bf16_can_implement
BCEWithLogits BW _can_implement, bf16.
baracuda_kernels_loss_bce_with_logits_backward_bf16_run
BCEWithLogits BW, bf16.
baracuda_kernels_loss_bce_with_logits_backward_f16_can_implement
BCEWithLogits BW _can_implement, f16.
baracuda_kernels_loss_bce_with_logits_backward_f16_run
BCEWithLogits BW, f16.
baracuda_kernels_loss_bce_with_logits_backward_f32_can_implement
BCEWithLogits BW _can_implement, f32.
baracuda_kernels_loss_bce_with_logits_backward_f32_run
BCEWithLogits BW, f32. dlogits = (sigmoid(x) - target) · scale.
baracuda_kernels_loss_bce_with_logits_backward_f64_can_implement
BCEWithLogits BW _can_implement, f64.
baracuda_kernels_loss_bce_with_logits_backward_f64_run
BCEWithLogits BW, f64.
baracuda_kernels_loss_bce_with_logits_bf16_can_implement
BCEWithLogits FW _can_implement, bf16.
baracuda_kernels_loss_bce_with_logits_bf16_run
BCEWithLogits FW, bf16.
baracuda_kernels_loss_bce_with_logits_f16_can_implement
BCEWithLogits FW _can_implement, f16.
baracuda_kernels_loss_bce_with_logits_f16_run
BCEWithLogits FW, f16.
baracuda_kernels_loss_bce_with_logits_f32_can_implement
BCEWithLogits FW _can_implement, f32.
baracuda_kernels_loss_bce_with_logits_f32_run
BCEWithLogits FW, f32. Stable BCE for raw logits.
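The exact formulation used by the kernel is not spelled out here. One common overflow-safe form, shown as an assumption rather than as the kernel's implementation, is max(x, 0) - x·t + ln(1 + exp(-|x|)). The sketch below pairs it with the documented gradient (sigmoid(x) - target) · scale; all function names are hypothetical.

```rust
/// Logistic sigmoid.
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

/// Hypothetical CPU reference for per-cell BCE on a raw logit `x`.
/// Assumed form: max(x, 0) - x*t + ln(1 + exp(-|x|)); exp never sees a
/// positive argument, so it cannot overflow.
fn bce_with_logits(x: f64, t: f64) -> f64 {
    x.max(0.0) - x * t + (-x.abs()).exp().ln_1p()
}

/// Documented gradient: dlogits = (sigmoid(x) - target) * scale.
fn bce_with_logits_grad(x: f64, t: f64, scale: f64) -> f64 {
    (sigmoid(x) - t) * scale
}
```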
baracuda_kernels_loss_bce_with_logits_f64_can_implement
BCEWithLogits FW _can_implement, f64.
baracuda_kernels_loss_bce_with_logits_f64_run
BCEWithLogits FW, f64.
baracuda_kernels_loss_cosine_embedding_backward_bf16_can_implement
CosineEmbedding BW _can_implement, bf16.
baracuda_kernels_loss_cosine_embedding_backward_bf16_run
CosineEmbedding BW, bf16.
baracuda_kernels_loss_cosine_embedding_backward_f16_can_implement
CosineEmbedding BW _can_implement, f16.
baracuda_kernels_loss_cosine_embedding_backward_f16_run
CosineEmbedding BW, f16.
baracuda_kernels_loss_cosine_embedding_backward_f32_can_implement
CosineEmbedding BW _can_implement, f32.
baracuda_kernels_loss_cosine_embedding_backward_f32_run
CosineEmbedding BW, f32.
baracuda_kernels_loss_cosine_embedding_backward_f64_can_implement
CosineEmbedding BW _can_implement, f64.
baracuda_kernels_loss_cosine_embedding_backward_f64_run
CosineEmbedding BW, f64.
baracuda_kernels_loss_cosine_embedding_bf16_can_implement
CosineEmbedding FW _can_implement, bf16.
baracuda_kernels_loss_cosine_embedding_bf16_run
CosineEmbedding FW, bf16.
baracuda_kernels_loss_cosine_embedding_f16_can_implement
CosineEmbedding FW _can_implement, f16.
baracuda_kernels_loss_cosine_embedding_f16_run
CosineEmbedding FW, f16.
baracuda_kernels_loss_cosine_embedding_f32_can_implement
CosineEmbedding FW _can_implement, f32.
baracuda_kernels_loss_cosine_embedding_f32_run
CosineEmbedding FW, f32 (per-row). ABI: (n_rows, d_extent, row_stride_x, reduction_mode, margin, x1, x2, t, out, workspace, workspace_bytes, stream).
baracuda_kernels_loss_cosine_embedding_f64_can_implement
CosineEmbedding FW _can_implement, f64.
baracuda_kernels_loss_cosine_embedding_f64_run
CosineEmbedding FW, f64.
baracuda_kernels_loss_cross_entropy_backward_bf16_can_implement
CrossEntropy BW _can_implement, bf16.
baracuda_kernels_loss_cross_entropy_backward_bf16_run
CrossEntropy BW, bf16.
baracuda_kernels_loss_cross_entropy_backward_f16_can_implement
CrossEntropy BW _can_implement, f16.
baracuda_kernels_loss_cross_entropy_backward_f16_run
CrossEntropy BW, f16.
baracuda_kernels_loss_cross_entropy_backward_f32_can_implement
CrossEntropy BW _can_implement, f32.
baracuda_kernels_loss_cross_entropy_backward_f32_run
CrossEntropy BW, f32. dinput[i, c] = (softmax(input)[i, c] - 1{c==t[i]}) · scale.
baracuda_kernels_loss_cross_entropy_backward_f64_can_implement
CrossEntropy BW _can_implement, f64.
baracuda_kernels_loss_cross_entropy_backward_f64_run
CrossEntropy BW, f64.
baracuda_kernels_loss_cross_entropy_bf16_can_implement
CrossEntropy FW _can_implement, bf16.
baracuda_kernels_loss_cross_entropy_bf16_run
CrossEntropy FW, bf16.
baracuda_kernels_loss_cross_entropy_f16_can_implement
CrossEntropy FW _can_implement, f16.
baracuda_kernels_loss_cross_entropy_f16_run
CrossEntropy FW, f16.
baracuda_kernels_loss_cross_entropy_f32_can_implement
CrossEntropy FW _can_implement, f32.
baracuda_kernels_loss_cross_entropy_f32_run
CrossEntropy FW, f32. Fused LogSoftmax + NLL. Numerically stable per-row two-pass max subtraction.
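The fused LogSoftmax + NLL computation and the gradient listed for the f32 backward entry can be sketched per row on the CPU. Both functions are hypothetical references, written in f64 and for a single row with one integer target class.

```rust
/// Hypothetical CPU reference: per-row loss -log_softmax(logits)[target],
/// using max subtraction for numerical stability.
fn cross_entropy_row(logits: &[f64], target: usize) -> f64 {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let log_sum = logits.iter().map(|v| (v - max).exp()).sum::<f64>().ln();
    -((logits[target] - max) - log_sum)
}

/// Hypothetical CPU reference for the gradient:
/// dinput[c] = (softmax(logits)[c] - 1{c == target}) * scale.
fn cross_entropy_row_grad(logits: &[f64], target: usize, scale: f64) -> Vec<f64> {
    let max = logits.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let sum: f64 = logits.iter().map(|v| (v - max).exp()).sum();
    logits
        .iter()
        .enumerate()
        .map(|(c, v)| {
            let p = (v - max).exp() / sum;
            let one_hot = if c == target { 1.0 } else { 0.0 };
            (p - one_hot) * scale
        })
        .collect()
}
```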
baracuda_kernels_loss_cross_entropy_f64_can_implement
CrossEntropy FW _can_implement, f64.
baracuda_kernels_loss_cross_entropy_f64_run
CrossEntropy FW, f64.
baracuda_kernels_loss_cross_entropy_soft_backward_bf16_can_implement
Soft-target CrossEntropy BW _can_implement, bf16.
baracuda_kernels_loss_cross_entropy_soft_backward_bf16_run
Soft-target CrossEntropy BW, bf16.
baracuda_kernels_loss_cross_entropy_soft_backward_f16_can_implement
Soft-target CrossEntropy BW _can_implement, f16.
baracuda_kernels_loss_cross_entropy_soft_backward_f16_run
Soft-target CrossEntropy BW, f16.
baracuda_kernels_loss_cross_entropy_soft_backward_f32_can_implement
Soft-target CrossEntropy BW _can_implement, f32.
baracuda_kernels_loss_cross_entropy_soft_backward_f32_run
Soft-target CrossEntropy BW, f32.
baracuda_kernels_loss_cross_entropy_soft_backward_f64_can_implement
Soft-target CrossEntropy BW _can_implement, f64.
baracuda_kernels_loss_cross_entropy_soft_backward_f64_run
Soft-target CrossEntropy BW, f64.
baracuda_kernels_loss_cross_entropy_soft_bf16_can_implement
Soft-target CrossEntropy FW _can_implement, bf16.
baracuda_kernels_loss_cross_entropy_soft_bf16_run
Soft-target CrossEntropy FW, bf16.
baracuda_kernels_loss_cross_entropy_soft_f16_can_implement
Soft-target CrossEntropy FW _can_implement, f16.
baracuda_kernels_loss_cross_entropy_soft_f16_run
Soft-target CrossEntropy FW, f16.
baracuda_kernels_loss_cross_entropy_soft_f32_can_implement
Soft-target CrossEntropy FW _can_implement, f32.
baracuda_kernels_loss_cross_entropy_soft_f32_run
Soft-target CrossEntropy FW, f32. Target is T-typed prob tensor.
baracuda_kernels_loss_cross_entropy_soft_f64_can_implement
Soft-target CrossEntropy FW _can_implement, f64.
baracuda_kernels_loss_cross_entropy_soft_f64_run
Soft-target CrossEntropy FW, f64.
baracuda_kernels_loss_ctc_backward_bf16_can_implement
CTCLoss BW _can_implement, bf16.
baracuda_kernels_loss_ctc_backward_bf16_run
CTCLoss BW, bf16.
baracuda_kernels_loss_ctc_backward_f16_can_implement
CTCLoss BW _can_implement, f16.
baracuda_kernels_loss_ctc_backward_f16_run
CTCLoss BW, f16.
baracuda_kernels_loss_ctc_backward_f32_can_implement
CTCLoss BW _can_implement, f32.
baracuda_kernels_loss_ctc_backward_f32_run
CTCLoss BW, f32.
baracuda_kernels_loss_ctc_backward_f64_can_implement
CTCLoss BW _can_implement, f64 (F64_ACC).
baracuda_kernels_loss_ctc_backward_f64_run
CTCLoss BW, f64.
baracuda_kernels_loss_ctc_bf16_can_implement
CTCLoss FW _can_implement, bf16 (F32_ACC).
baracuda_kernels_loss_ctc_bf16_run
CTCLoss FW, bf16.
baracuda_kernels_loss_ctc_f16_can_implement
CTCLoss FW _can_implement, f16 (F32_ACC).
baracuda_kernels_loss_ctc_f16_run
CTCLoss FW, f16.
baracuda_kernels_loss_ctc_f32_can_implement
CTCLoss FW _can_implement, f32. Validates num_classes <= 32, max_target_len <= 256, blank ∈ [0, num_classes), reduction_mode ∈ [0,2].
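The limits validated above can be mirrored in a host-side pre-check before building a problem description. `ctc_shape_supported` is a hypothetical helper; it returns a plain bool because the status code that the real entry point returns for each violation is not documented in this summary.

```rust
/// Hypothetical pre-check mirroring the documented CTC limits:
/// num_classes <= 32, max_target_len <= 256,
/// blank in [0, num_classes), reduction_mode in [0, 2].
fn ctc_shape_supported(
    num_classes: i32,
    max_target_len: i32,
    blank: i32,
    reduction_mode: i32,
) -> bool {
    num_classes <= 32
        && max_target_len <= 256
        && (0..num_classes).contains(&blank)
        && (0..=2).contains(&reduction_mode)
}
```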
baracuda_kernels_loss_ctc_f32_run
CTCLoss FW, f32.
baracuda_kernels_loss_ctc_f64_can_implement
CTCLoss FW _can_implement, f64 (F64_ACC).
baracuda_kernels_loss_ctc_f64_run
CTCLoss FW, f64.
baracuda_kernels_loss_flce_count_non_ignore_can_implement
Implementability check for baracuda_kernels_loss_flce_count_non_ignore. Host-side only.
baracuda_kernels_loss_flce_count_non_ignore_run
FLCE count-non-ignore. Single-block tree reduction; writes the target[i] != ignore_index count into count_out[0] (i64).
baracuda_kernels_loss_flce_inplace_scale_bf16_can_implement
Implementability check for baracuda_kernels_loss_flce_inplace_scale_bf16. Host-side only.
baracuda_kernels_loss_flce_inplace_scale_bf16_run
FLCE in-place scale, bf16.
baracuda_kernels_loss_flce_inplace_scale_f16_can_implement
Implementability check for baracuda_kernels_loss_flce_inplace_scale_f16. Host-side only.
baracuda_kernels_loss_flce_inplace_scale_f16_run
FLCE in-place scale, f16.
baracuda_kernels_loss_flce_inplace_scale_f32_can_implement
Implementability check for baracuda_kernels_loss_flce_inplace_scale_f32. Host-side only.
baracuda_kernels_loss_flce_inplace_scale_f32_run
FLCE in-place scale, f32. Multiplies buf in place by scalar.
baracuda_kernels_loss_flce_inplace_scale_f64_can_implement
Implementability check for baracuda_kernels_loss_flce_inplace_scale_f64. Host-side only.
baracuda_kernels_loss_flce_inplace_scale_f64_run
FLCE in-place scale, f64.
baracuda_kernels_loss_flce_per_row_bf16_can_implement
Implementability check for baracuda_kernels_loss_flce_per_row_bf16. Host-side only.
baracuda_kernels_loss_flce_per_row_bf16_run
FLCE per-row fused step, bf16.
baracuda_kernels_loss_flce_per_row_cast_bf16_can_implement
Implementability check for baracuda_kernels_loss_flce_per_row_cast_bf16. Host-side only.
baracuda_kernels_loss_flce_per_row_cast_bf16_run
FLCE per-row cast, f32 → bf16.
baracuda_kernels_loss_flce_per_row_cast_f16_can_implement
Implementability check for baracuda_kernels_loss_flce_per_row_cast_f16. Host-side only.
baracuda_kernels_loss_flce_per_row_cast_f16_run
FLCE per-row cast, f32 → f16.
baracuda_kernels_loss_flce_per_row_cast_f32_can_implement
Implementability check for baracuda_kernels_loss_flce_per_row_cast_f32. Host-side only.
baracuda_kernels_loss_flce_per_row_cast_f32_run
FLCE per-row cast (None mode finalizer), f32 → f32.
baracuda_kernels_loss_flce_per_row_cast_f64_can_implement
Implementability check for baracuda_kernels_loss_flce_per_row_cast_f64. Host-side only.
baracuda_kernels_loss_flce_per_row_cast_f64_run
FLCE per-row cast, f32 → f64.
baracuda_kernels_loss_flce_per_row_f16_can_implement
Implementability check for baracuda_kernels_loss_flce_per_row_f16. Host-side only.
baracuda_kernels_loss_flce_per_row_f16_run
FLCE per-row fused step, f16.
baracuda_kernels_loss_flce_per_row_f32_can_implement
Implementability check for baracuda_kernels_loss_flce_per_row_f32. Host-side only.
baracuda_kernels_loss_flce_per_row_f32_run
FLCE per-row fused step, f32. Mutates logits in place to grad_logits = (softmax - one_hot) · scale_per_row; writes per-row -log_softmax[target] into loss_1d (f32 accumulator).
baracuda_kernels_loss_flce_per_row_f64_can_implement
Implementability check for baracuda_kernels_loss_flce_per_row_f64. Host-side only.
baracuda_kernels_loss_flce_per_row_f64_run
FLCE per-row fused step, f64.
baracuda_kernels_loss_flce_scalar_finalize_bf16_can_implement
Implementability check for baracuda_kernels_loss_flce_scalar_finalize_bf16. Host-side only.
baracuda_kernels_loss_flce_scalar_finalize_bf16_run
FLCE scalar finalize, f32 → bf16.
baracuda_kernels_loss_flce_scalar_finalize_f16_can_implement
Implementability check for baracuda_kernels_loss_flce_scalar_finalize_f16. Host-side only.
baracuda_kernels_loss_flce_scalar_finalize_f16_run
FLCE scalar finalize, f32 → f16.
baracuda_kernels_loss_flce_scalar_finalize_f32_can_implement
Implementability check for baracuda_kernels_loss_flce_scalar_finalize_f32. Host-side only.
baracuda_kernels_loss_flce_scalar_finalize_f32_run
FLCE scalar finalize (Mean/Sum), f32 → f32.
baracuda_kernels_loss_flce_scalar_finalize_f64_can_implement
Implementability check for baracuda_kernels_loss_flce_scalar_finalize_f64. Host-side only.
baracuda_kernels_loss_flce_scalar_finalize_f64_run
FLCE scalar finalize, f32 → f64.
baracuda_kernels_loss_gaussian_nll_backward_bf16_can_implement
GaussianNLL BW _can_implement, bf16.
baracuda_kernels_loss_gaussian_nll_backward_bf16_run
GaussianNLL BW, bf16.
baracuda_kernels_loss_gaussian_nll_backward_f16_can_implement
GaussianNLL BW _can_implement, f16.
baracuda_kernels_loss_gaussian_nll_backward_f16_run
GaussianNLL BW, f16.
baracuda_kernels_loss_gaussian_nll_backward_f32_can_implement
GaussianNLL BW _can_implement, f32.
baracuda_kernels_loss_gaussian_nll_backward_f32_run
GaussianNLL BW, f32.
baracuda_kernels_loss_gaussian_nll_backward_f64_can_implement
GaussianNLL BW _can_implement, f64.
baracuda_kernels_loss_gaussian_nll_backward_f64_run
GaussianNLL BW, f64.
baracuda_kernels_loss_gaussian_nll_bf16_can_implement
GaussianNLL FW _can_implement, bf16.
baracuda_kernels_loss_gaussian_nll_bf16_run
GaussianNLL FW, bf16.
baracuda_kernels_loss_gaussian_nll_f16_can_implement
GaussianNLL FW _can_implement, f16.
baracuda_kernels_loss_gaussian_nll_f16_run
GaussianNLL FW, f16.
baracuda_kernels_loss_gaussian_nll_f32_can_implement
GaussianNLL FW _can_implement, f32.
baracuda_kernels_loss_gaussian_nll_f32_run
GaussianNLL FW, f32. 3-tensor input (input, target, var).
baracuda_kernels_loss_gaussian_nll_f64_can_implement
GaussianNLL FW _can_implement, f64.
baracuda_kernels_loss_gaussian_nll_f64_run
GaussianNLL FW, f64.
baracuda_kernels_loss_hinge_embedding_backward_bf16_can_implement
HingeEmbedding BW _can_implement, bf16.
baracuda_kernels_loss_hinge_embedding_backward_bf16_run
HingeEmbedding BW, bf16.
baracuda_kernels_loss_hinge_embedding_backward_f16_can_implement
HingeEmbedding BW _can_implement, f16.
baracuda_kernels_loss_hinge_embedding_backward_f16_run
HingeEmbedding BW, f16.
baracuda_kernels_loss_hinge_embedding_backward_f32_can_implement
HingeEmbedding BW _can_implement, f32.
baracuda_kernels_loss_hinge_embedding_backward_f32_run
HingeEmbedding BW, f32.
baracuda_kernels_loss_hinge_embedding_backward_f64_can_implement
HingeEmbedding BW _can_implement, f64.
baracuda_kernels_loss_hinge_embedding_backward_f64_run
HingeEmbedding BW, f64.
baracuda_kernels_loss_hinge_embedding_bf16_can_implement
HingeEmbedding FW _can_implement, bf16.
baracuda_kernels_loss_hinge_embedding_bf16_run
HingeEmbedding FW, bf16.
baracuda_kernels_loss_hinge_embedding_f16_can_implement
HingeEmbedding FW _can_implement, f16.
baracuda_kernels_loss_hinge_embedding_f16_run
HingeEmbedding FW, f16.
baracuda_kernels_loss_hinge_embedding_f32_can_implement
HingeEmbedding FW _can_implement, f32.
baracuda_kernels_loss_hinge_embedding_f32_run
HingeEmbedding FW, f32. ABI: (numel, reduction_mode, margin, input, target_i64, out, workspace, workspace_bytes, stream).
baracuda_kernels_loss_hinge_embedding_f64_can_implement
HingeEmbedding FW _can_implement, f64.
baracuda_kernels_loss_hinge_embedding_f64_run
HingeEmbedding FW, f64.
baracuda_kernels_loss_huber_backward_bf16_can_implement
Huber BW _can_implement, bf16.
baracuda_kernels_loss_huber_backward_bf16_run
Huber BW, bf16.
baracuda_kernels_loss_huber_backward_f16_can_implement
Huber BW _can_implement, f16.
baracuda_kernels_loss_huber_backward_f16_run
Huber BW, f16.
baracuda_kernels_loss_huber_backward_f32_can_implement
Huber BW _can_implement, f32.
baracuda_kernels_loss_huber_backward_f32_run
Huber BW, f32.
baracuda_kernels_loss_huber_backward_f64_can_implement
Huber BW _can_implement, f64.
baracuda_kernels_loss_huber_backward_f64_run
Huber BW, f64.
baracuda_kernels_loss_huber_bf16_can_implement
Huber FW _can_implement, bf16.
baracuda_kernels_loss_huber_bf16_run
Huber FW, bf16.
baracuda_kernels_loss_huber_f16_can_implement
Huber FW _can_implement, f16.
baracuda_kernels_loss_huber_f16_run
Huber FW, f16.
baracuda_kernels_loss_huber_f32_can_implement
Huber FW _can_implement, f32.
baracuda_kernels_loss_huber_f32_run
Huber FW, f32. param = δ.
baracuda_kernels_loss_huber_f64_can_implement
Huber FW _can_implement, f64.
baracuda_kernels_loss_huber_f64_run
Huber FW, f64.
baracuda_kernels_loss_kl_div_backward_bf16_can_implement
KLDiv BW _can_implement, bf16.
baracuda_kernels_loss_kl_div_backward_bf16_run
KLDiv BW, bf16.
baracuda_kernels_loss_kl_div_backward_f16_can_implement
KLDiv BW _can_implement, f16.
baracuda_kernels_loss_kl_div_backward_f16_run
KLDiv BW, f16.
baracuda_kernels_loss_kl_div_backward_f32_can_implement
KLDiv BW _can_implement, f32.
baracuda_kernels_loss_kl_div_backward_f32_run
KLDiv BW, f32. dinput = -target · scale.
baracuda_kernels_loss_kl_div_backward_f64_can_implement
KLDiv BW _can_implement, f64.
baracuda_kernels_loss_kl_div_backward_f64_run
KLDiv BW, f64.
baracuda_kernels_loss_kl_div_bf16_can_implement
KLDiv FW _can_implement, bf16.
baracuda_kernels_loss_kl_div_bf16_run
KLDiv FW, bf16.
baracuda_kernels_loss_kl_div_f16_can_implement
KLDiv FW _can_implement, f16.
baracuda_kernels_loss_kl_div_f16_run
KLDiv FW, f16.
baracuda_kernels_loss_kl_div_f32_can_implement
KLDiv FW _can_implement, f32.
baracuda_kernels_loss_kl_div_f32_run
KLDiv FW, f32. target·(log(target) - input) per-cell. PyTorch convention: input is already log-prob.
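A per-cell CPU sketch of the forward value, paired with the gradient listed for the f32 KLDiv backward entry. Function names are hypothetical, and the sketch assumes target > 0; how the kernel treats target == 0 is not stated here.

```rust
/// Hypothetical CPU reference: target * (ln(target) - input), where
/// `input_log_prob` is already a log-probability (PyTorch convention).
/// Assumes target > 0.
fn kl_div_cell(input_log_prob: f64, target: f64) -> f64 {
    target * (target.ln() - input_log_prob)
}

/// Documented gradient with respect to the input: dinput = -target * scale.
fn kl_div_cell_grad(target: f64, scale: f64) -> f64 {
    -target * scale
}
```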
baracuda_kernels_loss_kl_div_f64_can_implement
KLDiv FW _can_implement, f64.
baracuda_kernels_loss_kl_div_f64_run
KLDiv FW, f64.
baracuda_kernels_loss_l1_backward_bf16_can_implement
L1 BW _can_implement, bf16.
baracuda_kernels_loss_l1_backward_bf16_run
L1 BW, bf16.
baracuda_kernels_loss_l1_backward_f16_can_implement
L1 BW _can_implement, f16.
baracuda_kernels_loss_l1_backward_f16_run
L1 BW, f16.
baracuda_kernels_loss_l1_backward_f32_can_implement
L1 BW _can_implement, f32.
baracuda_kernels_loss_l1_backward_f32_run
L1 BW, f32. dpred = sign(pred - target) · scale.
baracuda_kernels_loss_l1_backward_f64_can_implement
L1 BW _can_implement, f64.
baracuda_kernels_loss_l1_backward_f64_run
L1 BW, f64.
baracuda_kernels_loss_l1_bf16_can_implement
L1 FW _can_implement, bf16.
baracuda_kernels_loss_l1_bf16_run
L1 FW, bf16.
baracuda_kernels_loss_l1_f16_can_implement
L1 FW _can_implement, f16.
baracuda_kernels_loss_l1_f16_run
L1 FW, f16.
baracuda_kernels_loss_l1_f32_can_implement
L1 FW _can_implement, f32.
baracuda_kernels_loss_l1_f32_run
L1 FW, f32. y = |pred - target| per-cell; mean/sum reduce to scalar.
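The per-cell values and the two scalar reductions can be sketched on the CPU as below. The function names are hypothetical, and the mapping from an integer reduction_mode to "none", "mean", or "sum" is not shown because it is not documented in this summary.

```rust
/// Hypothetical CPU reference: per-cell |pred - target|.
fn l1_cells(pred: &[f32], target: &[f32]) -> Vec<f32> {
    pred.iter().zip(target).map(|(p, t)| (p - t).abs()).collect()
}

/// Sum reduction of the per-cell values to a scalar.
fn l1_sum(pred: &[f32], target: &[f32]) -> f32 {
    l1_cells(pred, target).iter().sum()
}

/// Mean reduction: the sum divided by the number of cells.
fn l1_mean(pred: &[f32], target: &[f32]) -> f32 {
    l1_sum(pred, target) / pred.len() as f32
}
```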
baracuda_kernels_loss_l1_f64_can_implement
L1 FW _can_implement, f64.
baracuda_kernels_loss_l1_f64_run
L1 FW, f64.
baracuda_kernels_loss_margin_ranking_backward_bf16_can_implement
MarginRanking BW _can_implement, bf16.
baracuda_kernels_loss_margin_ranking_backward_bf16_run
MarginRanking BW, bf16.
baracuda_kernels_loss_margin_ranking_backward_f16_can_implement
MarginRanking BW _can_implement, f16.
baracuda_kernels_loss_margin_ranking_backward_f16_run
MarginRanking BW, f16.
baracuda_kernels_loss_margin_ranking_backward_f32_can_implement
MarginRanking BW _can_implement, f32.
baracuda_kernels_loss_margin_ranking_backward_f32_run
MarginRanking BW, f32. ABI: (numel, reduction_mode, scale, margin, x1, x2, t, dy, dx1, dx2, workspace, workspace_bytes, stream).
baracuda_kernels_loss_margin_ranking_backward_f64_can_implement
MarginRanking BW _can_implement, f64.
baracuda_kernels_loss_margin_ranking_backward_f64_run
MarginRanking BW, f64.
baracuda_kernels_loss_margin_ranking_bf16_can_implement
MarginRanking FW _can_implement, bf16.
baracuda_kernels_loss_margin_ranking_bf16_run
MarginRanking FW, bf16.
baracuda_kernels_loss_margin_ranking_f16_can_implement
MarginRanking FW _can_implement, f16.
baracuda_kernels_loss_margin_ranking_f16_run
MarginRanking FW, f16.
baracuda_kernels_loss_margin_ranking_f32_can_implement
MarginRanking FW _can_implement, f32.
baracuda_kernels_loss_margin_ranking_f32_run
MarginRanking FW, f32. ABI: (numel, reduction_mode, margin, x1, x2, t, out, workspace, workspace_bytes, stream).
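The entry above lists only the argument order. As a hedged host-side sketch of what a margin-ranking forward pass computes per cell, the code below uses the usual PyTorch-style definition max(0, -t·(x1 - x2) + margin). That definition is an assumption here, since this page does not spell it out, and `margin_ranking_ref` is a hypothetical name.

```rust
/// CPU reference for a per-cell margin-ranking loss (illustrative only).
/// Assumed definition: max(0, -t * (x1 - x2) + margin), with t in {-1, +1}.
fn margin_ranking_ref(x1: &[f32], x2: &[f32], t: &[f32], margin: f32) -> Vec<f32> {
    assert!(x1.len() == x2.len() && x2.len() == t.len());
    x1.iter()
        .zip(x2)
        .zip(t)
        .map(|((&a, &b), &y)| (-y * (a - b) + margin).max(0.0))
        .collect()
}
```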
baracuda_kernels_loss_margin_ranking_f64_can_implement
MarginRanking FW _can_implement, f64.
baracuda_kernels_loss_margin_ranking_f64_run
MarginRanking FW, f64.
baracuda_kernels_loss_mse_backward_bf16_can_implement
MSE BW _can_implement, bf16.
baracuda_kernels_loss_mse_backward_bf16_run
MSE BW, bf16.
baracuda_kernels_loss_mse_backward_f16_can_implement
MSE BW _can_implement, f16.
baracuda_kernels_loss_mse_backward_f16_run
MSE BW, f16.
baracuda_kernels_loss_mse_backward_f32_can_implement
MSE BW _can_implement, f32.
baracuda_kernels_loss_mse_backward_f32_run
MSE BW, f32. dpred = 2·(pred - target) · scale.
baracuda_kernels_loss_mse_backward_f64_can_implement
MSE BW _can_implement, f64.
baracuda_kernels_loss_mse_backward_f64_run
MSE BW, f64.
baracuda_kernels_loss_mse_bf16_can_implement
MSE FW _can_implement, bf16.
baracuda_kernels_loss_mse_bf16_run
MSE FW, bf16.
baracuda_kernels_loss_mse_f16_can_implement
MSE FW _can_implement, f16.
baracuda_kernels_loss_mse_f16_run
MSE FW, f16.
baracuda_kernels_loss_mse_f32_can_implement
MSE FW _can_implement, f32. Host-side validator (no launch).
baracuda_kernels_loss_mse_f32_run
MSE FW, f32. (pred - target)² per-cell; mean/sum reduce to scalar. Workspace: numel * sizeof(T) bytes for Mean/Sum; unused for None.
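The workspace rule and the reduction behaviour described above can be sketched on the host like this. The `Reduction` enum is hypothetical: this page does not document the integer encoding of `reduction_mode`, so the sketch deliberately avoids assigning numeric values. The helper names are also illustrative.

```rust
/// Hypothetical host-side mirror of the reduction mode.
/// The integer encoding used by the C ABI is not documented on this page.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Reduction {
    None,
    Mean,
    Sum,
}

/// Workspace bytes per the rule above: numel * sizeof(T) for Mean/Sum, 0 for None.
fn mse_workspace_bytes(numel: usize, elem_size: usize, reduction: Reduction) -> usize {
    match reduction {
        Reduction::None => 0,
        Reduction::Mean | Reduction::Sum => numel * elem_size,
    }
}

/// CPU reference: (pred - target)^2 per cell, optionally reduced to one scalar.
fn mse_forward_ref(pred: &[f32], target: &[f32], reduction: Reduction) -> Vec<f32> {
    assert_eq!(pred.len(), target.len());
    let cells: Vec<f32> = pred
        .iter()
        .zip(target)
        .map(|(&p, &t)| (p - t) * (p - t))
        .collect();
    match reduction {
        Reduction::None => cells,
        Reduction::Sum => vec![cells.iter().sum::<f32>()],
        Reduction::Mean => vec![cells.iter().sum::<f32>() / cells.len() as f32],
    }
}
```

Sizing the workspace on the host before the call is one way to avoid a status-4 return (workspace too small or null when required).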
baracuda_kernels_loss_mse_f64_can_implement
MSE FW _can_implement, f64.
baracuda_kernels_loss_mse_f64_run
MSE FW, f64.
baracuda_kernels_loss_multi_margin_backward_bf16_can_implement
MultiMargin BW _can_implement, bf16.
baracuda_kernels_loss_multi_margin_backward_bf16_run
MultiMargin BW, bf16.
baracuda_kernels_loss_multi_margin_backward_f16_can_implement
MultiMargin BW _can_implement, f16.
baracuda_kernels_loss_multi_margin_backward_f16_run
MultiMargin BW, f16.
baracuda_kernels_loss_multi_margin_backward_f32_can_implement
MultiMargin BW _can_implement, f32.
baracuda_kernels_loss_multi_margin_backward_f32_run
MultiMargin BW, f32.
baracuda_kernels_loss_multi_margin_backward_f64_can_implement
MultiMargin BW _can_implement, f64.
baracuda_kernels_loss_multi_margin_backward_f64_run
MultiMargin BW, f64.
baracuda_kernels_loss_multi_margin_bf16_can_implement
MultiMargin FW _can_implement, bf16.
baracuda_kernels_loss_multi_margin_bf16_run
MultiMargin FW, bf16.
baracuda_kernels_loss_multi_margin_f16_can_implement
MultiMargin FW _can_implement, f16.
baracuda_kernels_loss_multi_margin_f16_run
MultiMargin FW, f16.
baracuda_kernels_loss_multi_margin_f32_can_implement
MultiMargin FW _can_implement, f32.
baracuda_kernels_loss_multi_margin_f32_run
MultiMargin FW, f32 (per-row). ABI: (n_rows, class_extent, row_stride, reduction_mode, margin, p_norm, input, target_i64, out, workspace, workspace_bytes, stream).
baracuda_kernels_loss_multi_margin_f64_can_implement
MultiMargin FW _can_implement, f64.
baracuda_kernels_loss_multi_margin_f64_run
MultiMargin FW, f64.
baracuda_kernels_loss_multilabel_margin_backward_bf16_can_implement
MultilabelMargin BW _can_implement, bf16.
baracuda_kernels_loss_multilabel_margin_backward_bf16_run
MultilabelMargin BW, bf16.
baracuda_kernels_loss_multilabel_margin_backward_f16_can_implement
MultilabelMargin BW _can_implement, f16.
baracuda_kernels_loss_multilabel_margin_backward_f16_run
MultilabelMargin BW, f16.
baracuda_kernels_loss_multilabel_margin_backward_f32_can_implement
MultilabelMargin BW _can_implement, f32.
baracuda_kernels_loss_multilabel_margin_backward_f32_run
MultilabelMargin BW, f32.
baracuda_kernels_loss_multilabel_margin_backward_f64_can_implement
MultilabelMargin BW _can_implement, f64.
baracuda_kernels_loss_multilabel_margin_backward_f64_run
MultilabelMargin BW, f64.
baracuda_kernels_loss_multilabel_margin_bf16_can_implement
MultilabelMargin FW _can_implement, bf16.
baracuda_kernels_loss_multilabel_margin_bf16_run
MultilabelMargin FW, bf16.
baracuda_kernels_loss_multilabel_margin_f16_can_implement
MultilabelMargin FW _can_implement, f16.
baracuda_kernels_loss_multilabel_margin_f16_run
MultilabelMargin FW, f16.
baracuda_kernels_loss_multilabel_margin_f32_can_implement
MultilabelMargin FW _can_implement, f32.
baracuda_kernels_loss_multilabel_margin_f32_run
MultilabelMargin FW, f32 (per-row). ABI: (n_rows, class_extent, row_stride_in, row_stride_tgt, reduction_mode, input, target_i64, out, workspace, workspace_bytes, stream).
baracuda_kernels_loss_multilabel_margin_f64_can_implement
MultilabelMargin FW _can_implement, f64.
baracuda_kernels_loss_multilabel_margin_f64_run
MultilabelMargin FW, f64.
baracuda_kernels_loss_multilabel_soft_margin_backward_bf16_can_implement
MultilabelSoftMargin BW _can_implement, bf16.
baracuda_kernels_loss_multilabel_soft_margin_backward_bf16_run
MultilabelSoftMargin BW, bf16.
baracuda_kernels_loss_multilabel_soft_margin_backward_f16_can_implement
MultilabelSoftMargin BW _can_implement, f16.
baracuda_kernels_loss_multilabel_soft_margin_backward_f16_run
MultilabelSoftMargin BW, f16.
baracuda_kernels_loss_multilabel_soft_margin_backward_f32_can_implement
MultilabelSoftMargin BW _can_implement, f32.
baracuda_kernels_loss_multilabel_soft_margin_backward_f32_run
MultilabelSoftMargin BW, f32.
baracuda_kernels_loss_multilabel_soft_margin_backward_f64_can_implement
MultilabelSoftMargin BW _can_implement, f64.
baracuda_kernels_loss_multilabel_soft_margin_backward_f64_run
MultilabelSoftMargin BW, f64.
baracuda_kernels_loss_multilabel_soft_margin_bf16_can_implement
MultilabelSoftMargin FW _can_implement, bf16.
baracuda_kernels_loss_multilabel_soft_margin_bf16_run
MultilabelSoftMargin FW, bf16.
baracuda_kernels_loss_multilabel_soft_margin_f16_can_implement
MultilabelSoftMargin FW _can_implement, f16.
baracuda_kernels_loss_multilabel_soft_margin_f16_run
MultilabelSoftMargin FW, f16.
baracuda_kernels_loss_multilabel_soft_margin_f32_can_implement
MultilabelSoftMargin FW _can_implement, f32.
baracuda_kernels_loss_multilabel_soft_margin_f32_run
MultilabelSoftMargin FW, f32.
baracuda_kernels_loss_multilabel_soft_margin_f64_can_implement
MultilabelSoftMargin FW _can_implement, f64.
baracuda_kernels_loss_multilabel_soft_margin_f64_run
MultilabelSoftMargin FW, f64.
baracuda_kernels_loss_nll_backward_bf16_can_implement
NLL BW _can_implement, bf16.
baracuda_kernels_loss_nll_backward_bf16_run
NLL BW, bf16.
baracuda_kernels_loss_nll_backward_f16_can_implement
NLL BW _can_implement, f16.
baracuda_kernels_loss_nll_backward_f16_run
NLL BW, f16.
baracuda_kernels_loss_nll_backward_f32_can_implement
NLL BW _can_implement, f32.
baracuda_kernels_loss_nll_backward_f32_run
NLL BW, f32. Pre-zeros dinput (size dinput_numel · sizeof(T)), then writes dinput[i, target[i]] = -dy_or_scale.
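The two-step behaviour above (zero the gradient buffer, then scatter one value per row) can be sketched on the host as follows. This sketch assumes a contiguous row-major `dinput` and a single scalar `dy_or_scale` shared by all rows; both are simplifying assumptions, and `nll_backward_ref` is a hypothetical name.

```rust
/// CPU reference for the NLL backward scatter (illustrative only).
/// Assumptions: contiguous row-major layout, one scalar `dy_or_scale` for all rows.
fn nll_backward_ref(
    n_rows: usize,
    class_extent: usize,
    target: &[i64],
    dy_or_scale: f32,
) -> Vec<f32> {
    assert_eq!(target.len(), n_rows);
    // Step 1: pre-zero the whole gradient buffer.
    let mut dinput = vec![0.0f32; n_rows * class_extent];
    // Step 2: write -dy_or_scale at the target class of each row.
    for (i, &c) in target.iter().enumerate() {
        assert!(c >= 0 && (c as usize) < class_extent, "target class out of range");
        dinput[i * class_extent + c as usize] = -dy_or_scale;
    }
    dinput
}
```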
baracuda_kernels_loss_nll_backward_f64_can_implement
NLL BW _can_implement, f64.
baracuda_kernels_loss_nll_backward_f64_run
NLL BW, f64.
baracuda_kernels_loss_nll_bf16_can_implement
NLL FW _can_implement, bf16.
baracuda_kernels_loss_nll_bf16_run
NLL FW, bf16.
baracuda_kernels_loss_nll_f16_can_implement
NLL FW _can_implement, f16.
baracuda_kernels_loss_nll_f16_run
NLL FW, f16.
baracuda_kernels_loss_nll_f32_can_implement
NLL FW _can_implement, f32.
baracuda_kernels_loss_nll_f32_run
NLL FW, f32. -input[i, target[i]] per row. Heterogeneous-dtype: input is T, target is i64. row_stride_input is the i64 stride between adjacent rows of input (must equal class_extent for contiguous input).
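The row-stride indexing described above can be illustrated with a small host-side reference. It is a sketch only: `nll_forward_ref` is a hypothetical name, and the stride is taken as `usize` here for convenience even though the ABI passes it as i64.

```rust
/// CPU reference: out[i] = -input[i * row_stride_input + target[i]] (illustrative only).
/// `input` has dtype T (f32 here); `target` is always i64.
fn nll_forward_ref(
    input: &[f32],
    target: &[i64],
    class_extent: usize,
    row_stride_input: usize,
) -> Vec<f32> {
    // For contiguous input the stride equals class_extent; larger strides skip padding.
    assert!(row_stride_input >= class_extent);
    target
        .iter()
        .enumerate()
        .map(|(i, &c)| {
            assert!(c >= 0 && (c as usize) < class_extent, "target class out of range");
            -input[i * row_stride_input + c as usize]
        })
        .collect()
}
```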
baracuda_kernels_loss_nll_f64_can_implement
NLL FW _can_implement, f64.
baracuda_kernels_loss_nll_f64_run
NLL FW, f64.
baracuda_kernels_loss_poisson_nll_backward_bf16_can_implement
PoissonNLL BW _can_implement, bf16.
baracuda_kernels_loss_poisson_nll_backward_bf16_run
PoissonNLL BW, bf16.
baracuda_kernels_loss_poisson_nll_backward_f16_can_implement
PoissonNLL BW _can_implement, f16.
baracuda_kernels_loss_poisson_nll_backward_f16_run
PoissonNLL BW, f16.
baracuda_kernels_loss_poisson_nll_backward_f32_can_implement
PoissonNLL BW _can_implement, f32.
baracuda_kernels_loss_poisson_nll_backward_f32_run
PoissonNLL BW, f32.
baracuda_kernels_loss_poisson_nll_backward_f64_can_implement
PoissonNLL BW _can_implement, f64.
baracuda_kernels_loss_poisson_nll_backward_f64_run
PoissonNLL BW, f64.
baracuda_kernels_loss_poisson_nll_bf16_can_implement
PoissonNLL FW _can_implement, bf16.
baracuda_kernels_loss_poisson_nll_bf16_run
PoissonNLL FW, bf16.
baracuda_kernels_loss_poisson_nll_f16_can_implement
PoissonNLL FW _can_implement, f16.
baracuda_kernels_loss_poisson_nll_f16_run
PoissonNLL FW, f16.
baracuda_kernels_loss_poisson_nll_f32_can_implement
PoissonNLL FW _can_implement, f32.
baracuda_kernels_loss_poisson_nll_f32_run
PoissonNLL FW, f32. log_input_flag 0/1.
baracuda_kernels_loss_poisson_nll_f64_can_implement
PoissonNLL FW _can_implement, f64.
baracuda_kernels_loss_poisson_nll_f64_run
PoissonNLL FW, f64.
baracuda_kernels_loss_smooth_l1_backward_bf16_can_implement
SmoothL1 BW _can_implement, bf16.
baracuda_kernels_loss_smooth_l1_backward_bf16_run
SmoothL1 BW, bf16.
baracuda_kernels_loss_smooth_l1_backward_f16_can_implement
SmoothL1 BW _can_implement, f16.
baracuda_kernels_loss_smooth_l1_backward_f16_run
SmoothL1 BW, f16.
baracuda_kernels_loss_smooth_l1_backward_f32_can_implement
SmoothL1 BW _can_implement, f32.
baracuda_kernels_loss_smooth_l1_backward_f32_run
SmoothL1 BW, f32.
baracuda_kernels_loss_smooth_l1_backward_f64_can_implement
SmoothL1 BW _can_implement, f64.
baracuda_kernels_loss_smooth_l1_backward_f64_run
SmoothL1 BW, f64.
baracuda_kernels_loss_smooth_l1_bf16_can_implement
SmoothL1 FW _can_implement, bf16.
baracuda_kernels_loss_smooth_l1_bf16_run
SmoothL1 FW, bf16.
baracuda_kernels_loss_smooth_l1_f16_can_implement
SmoothL1 FW _can_implement, f16.
baracuda_kernels_loss_smooth_l1_f16_run
SmoothL1 FW, f16.
baracuda_kernels_loss_smooth_l1_f32_can_implement
SmoothL1 FW _can_implement, f32.
baracuda_kernels_loss_smooth_l1_f32_run
SmoothL1 FW, f32. param = β.
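To show how the β parameter typically enters a smooth-L1 loss, here is a host-side sketch using the PyTorch-style piecewise definition. That definition is an assumption (this page only states that `param` is β), the β = 0 case is excluded for simplicity, and `smooth_l1_ref` is a hypothetical name.

```rust
/// CPU reference for a per-cell smooth-L1 loss (illustrative only).
/// Assumed definition, with d = |pred - target|:
///   d <  beta : 0.5 * d^2 / beta
///   d >= beta : d - 0.5 * beta
fn smooth_l1_ref(pred: &[f32], target: &[f32], beta: f32) -> Vec<f32> {
    assert_eq!(pred.len(), target.len());
    assert!(beta > 0.0, "this sketch requires beta > 0");
    pred.iter()
        .zip(target)
        .map(|(&p, &t)| {
            let d = (p - t).abs();
            if d < beta {
                0.5 * d * d / beta
            } else {
                d - 0.5 * beta
            }
        })
        .collect()
}
```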
baracuda_kernels_loss_smooth_l1_f64_can_implement
SmoothL1 FW _can_implement, f64.
baracuda_kernels_loss_smooth_l1_f64_run
SmoothL1 FW, f64.
baracuda_kernels_loss_triplet_margin_backward_bf16_can_implement
TripletMargin BW _can_implement, bf16.
baracuda_kernels_loss_triplet_margin_backward_bf16_run
TripletMargin BW, bf16.
baracuda_kernels_loss_triplet_margin_backward_f16_can_implement
TripletMargin BW _can_implement, f16.
baracuda_kernels_loss_triplet_margin_backward_f16_run
TripletMargin BW, f16.
baracuda_kernels_loss_triplet_margin_backward_f32_can_implement
TripletMargin BW _can_implement, f32.
baracuda_kernels_loss_triplet_margin_backward_f32_run
TripletMargin BW, f32.
baracuda_kernels_loss_triplet_margin_backward_f64_can_implement
TripletMargin BW _can_implement, f64.
baracuda_kernels_loss_triplet_margin_backward_f64_run
TripletMargin BW, f64.
baracuda_kernels_loss_triplet_margin_bf16_can_implement
TripletMargin FW _can_implement, bf16.
baracuda_kernels_loss_triplet_margin_bf16_run
TripletMargin FW, bf16.
baracuda_kernels_loss_triplet_margin_f16_can_implement
TripletMargin FW _can_implement, f16.
baracuda_kernels_loss_triplet_margin_f16_run
TripletMargin FW, f16.
baracuda_kernels_loss_triplet_margin_f32_can_implement
TripletMargin FW _can_implement, f32.
baracuda_kernels_loss_triplet_margin_f32_run
TripletMargin FW, f32 (per-row). ABI: (n_rows, d_extent, row_stride, reduction_mode, margin, p_norm, a, p, n, out, workspace, workspace_bytes, stream).
baracuda_kernels_loss_triplet_margin_f64_can_implement
TripletMargin FW _can_implement, f64.
baracuda_kernels_loss_triplet_margin_f64_run
TripletMargin FW, f64.
baracuda_kernels_lp_pool_1d_bf16_backward_can_implement
LpPool 1d BW _can_implement, bf16.
baracuda_kernels_lp_pool_1d_bf16_backward_run
LpPool 1d BW, bf16.
baracuda_kernels_lp_pool_1d_bf16_can_implement
LpPool 1d FW _can_implement, bf16.
baracuda_kernels_lp_pool_1d_bf16_run
LpPool 1d FW, bf16.
baracuda_kernels_lp_pool_1d_f16_backward_can_implement
LpPool 1d BW _can_implement, f16.
baracuda_kernels_lp_pool_1d_f16_backward_run
LpPool 1d BW, f16.
baracuda_kernels_lp_pool_1d_f16_can_implement
LpPool 1d FW _can_implement, f16.
baracuda_kernels_lp_pool_1d_f16_run
LpPool 1d FW, f16.
baracuda_kernels_lp_pool_1d_f32_backward_can_implement
LpPool 1d BW _can_implement, f32.
baracuda_kernels_lp_pool_1d_f32_backward_run
LpPool 1d BW, f32. Caller must zero dx first.
baracuda_kernels_lp_pool_1d_f32_can_implement
LpPool 1d FW _can_implement, f32.
baracuda_kernels_lp_pool_1d_f32_run
LpPool 1d FW, f32.
baracuda_kernels_lp_pool_1d_f64_backward_can_implement
LpPool 1d BW _can_implement, f64.
baracuda_kernels_lp_pool_1d_f64_backward_run
LpPool 1d BW, f64.
baracuda_kernels_lp_pool_1d_f64_can_implement
LpPool 1d FW _can_implement, f64.
baracuda_kernels_lp_pool_1d_f64_run
LpPool 1d FW, f64.
baracuda_kernels_lp_pool_2d_bf16_backward_can_implement
LpPool 2d BW _can_implement, bf16.
baracuda_kernels_lp_pool_2d_bf16_backward_run
LpPool 2d BW, bf16.
baracuda_kernels_lp_pool_2d_bf16_can_implement
LpPool 2d FW _can_implement, bf16.
baracuda_kernels_lp_pool_2d_bf16_run
LpPool 2d FW, bf16.
baracuda_kernels_lp_pool_2d_f16_backward_can_implement
LpPool 2d BW _can_implement, f16.
baracuda_kernels_lp_pool_2d_f16_backward_run
LpPool 2d BW, f16.
baracuda_kernels_lp_pool_2d_f16_can_implement
LpPool 2d FW _can_implement, f16.
baracuda_kernels_lp_pool_2d_f16_run
LpPool 2d FW, f16.
baracuda_kernels_lp_pool_2d_f32_backward_can_implement
LpPool 2d BW _can_implement, f32.
baracuda_kernels_lp_pool_2d_f32_backward_run
LpPool 2d BW, f32. Caller must zero dx first.
baracuda_kernels_lp_pool_2d_f32_can_implement
LpPool 2d FW _can_implement, f32.
baracuda_kernels_lp_pool_2d_f32_run
LpPool 2d FW, f32.
baracuda_kernels_lp_pool_2d_f64_backward_can_implement
LpPool 2d BW _can_implement, f64.
baracuda_kernels_lp_pool_2d_f64_backward_run
LpPool 2d BW, f64.
baracuda_kernels_lp_pool_2d_f64_can_implement
LpPool 2d FW _can_implement, f64.
baracuda_kernels_lp_pool_2d_f64_run
LpPool 2d FW, f64.
baracuda_kernels_lstsq_f32_run
Least-squares solve via iterative _gels (no QR fallback). On convergence, niters_out >= 0. On non-convergence, niters_out < 0 and the caller should retry via the Rust plan layer (which holds the QR fallback path).
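The `niters_out` convention above lends itself to a small host-side classifier. The sketch below is illustrative only: the `LstsqOutcome` enum and `classify_niters` function are hypothetical names and not part of this crate.

```rust
/// Hypothetical interpretation of `niters_out` after a lstsq `_run` call.
#[derive(Debug, PartialEq)]
enum LstsqOutcome {
    /// niters_out >= 0: the iterative solver converged.
    Converged { iterations: u32 },
    /// niters_out < 0: no convergence; retry through the plan layer's QR fallback.
    NeedsFallback,
}

fn classify_niters(niters_out: i32) -> LstsqOutcome {
    if niters_out >= 0 {
        LstsqOutcome::Converged {
            iterations: niters_out as u32,
        }
    } else {
        LstsqOutcome::NeedsFallback
    }
}
```

Note that this check is separate from the i32 status code returned by the call itself.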
baracuda_kernels_lstsq_f32_workspace_size
LstSq workspace size in BYTES (not elements — cuSOLVER’s _gels API differs from the others on this point).
baracuda_kernels_lstsq_f64_run
Least-squares solve via iterative _gels (no QR fallback). On convergence, niters_out >= 0. On non-convergence, niters_out < 0 and the caller should retry via the Rust plan layer (which holds the QR fallback path).
baracuda_kernels_lstsq_f64_workspace_size
LstSq workspace size in BYTES (not elements — cuSOLVER’s _gels API differs from the others on this point).
baracuda_kernels_lu_f32_run
LU factorization with partial pivoting (non-batched). a_inout is overwritten with the packed LU factors; pivots_out receives the 1-based row swaps (length min(m, n)); info_out is a single i32.
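Because the pivots are 1-based row swaps, turning them into an explicit row permutation takes a little care. The sketch below assumes the LAPACK `getrf` convention (row i was swapped with row `pivots_out[i] - 1`, applied in order) and i32 pivot entries; both are assumptions for illustration, and `pivots_to_permutation` is a hypothetical helper.

```rust
/// Convert 1-based sequential row swaps into a 0-based permutation (illustrative only).
/// After the call, perm[i] is the original row that ends up at row i.
/// Assumption: LAPACK getrf convention, swaps applied in order i = 0, 1, ...
fn pivots_to_permutation(pivots_out: &[i32], m: usize) -> Vec<usize> {
    let mut perm: Vec<usize> = (0..m).collect();
    for (i, &p) in pivots_out.iter().enumerate() {
        assert!(p >= 1 && (p as usize) <= m, "pivot out of range");
        // Subtract 1 to convert the 1-based pivot to a 0-based row index.
        perm.swap(i, p as usize - 1);
    }
    perm
}
```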
baracuda_kernels_lu_f32_workspace_size
LU factorization workspace size in bytes for getrf.
baracuda_kernels_lu_f64_run
LU factorization with partial pivoting (non-batched). a_inout is overwritten with the packed LU factors; pivots_out receives the 1-based row swaps (length min(m, n)); info_out is a single i32.
baracuda_kernels_lu_f64_workspace_size
LU factorization workspace size in bytes for getrf.
baracuda_kernels_masked_fill_backward_bool_can_implement
Implementability check for masked_fill_backward_bool.
baracuda_kernels_masked_fill_backward_bool_run
masked_fill_backward — bool (u8 storage).
baracuda_kernels_masked_fill_backward_f32_can_implement
Implementability check for masked_fill_backward_f32.
baracuda_kernels_masked_fill_backward_f32_run
dsrc[i] = mask[i] ? 0 : dout[i]. f32.
baracuda_kernels_masked_fill_backward_f64_can_implement
Implementability check for masked_fill_backward_f64.
baracuda_kernels_masked_fill_backward_f64_run
masked_fill_backward — f64.
baracuda_kernels_masked_fill_backward_i32_can_implement
Implementability check for masked_fill_backward_i32.
baracuda_kernels_masked_fill_backward_i32_run
masked_fill_backward — i32.
baracuda_kernels_masked_fill_bool_can_implement
Implementability check for masked_fill_bool.
baracuda_kernels_masked_fill_bool_run
masked_fill — bool (u8 storage).
baracuda_kernels_masked_fill_f32_can_implement
Implementability check for masked_fill_f32.
baracuda_kernels_masked_fill_f32_run
out[i] = mask[i] ? fill_value : src[i]. f32 (caller passes fill_value.to_bits() as i64).
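The bit-packing of `fill_value` mentioned above can be sketched on the host as follows. The encode step follows the description directly; the decode step (reading the low 32 bits back as an f32) and the u8 mask element type are assumptions about the receiving side, and all function names are hypothetical.

```rust
/// Pack an f32 fill value into the i64 argument, as described above.
/// `to_bits()` yields a u32, so the cast zero-extends into the low 32 bits.
fn encode_fill_f32(fill_value: f32) -> i64 {
    fill_value.to_bits() as i64
}

/// Assumed inverse: reinterpret the low 32 bits as an f32.
fn decode_fill_f32(raw: i64) -> f32 {
    f32::from_bits(raw as u32)
}

/// CPU reference: out[i] = mask[i] ? fill_value : src[i] (illustrative only).
/// Assumption: the mask is stored as u8, with nonzero meaning "fill".
fn masked_fill_ref(src: &[f32], mask: &[u8], raw_fill: i64) -> Vec<f32> {
    assert_eq!(src.len(), mask.len());
    let fill = decode_fill_f32(raw_fill);
    src.iter()
        .zip(mask)
        .map(|(&s, &m)| if m != 0 { fill } else { s })
        .collect()
}
```

Passing the raw bits rather than a numeric cast preserves negative values, NaN payloads, and infinities exactly.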
baracuda_kernels_masked_fill_f64_can_implement
Implementability check for masked_fill_f64.
baracuda_kernels_masked_fill_f64_run
masked_fill — f64.
baracuda_kernels_masked_fill_i32_can_implement
Implementability check for masked_fill_i32.
baracuda_kernels_masked_fill_i32_run
masked_fill — i32.
baracuda_kernels_mmvq_batched_bf16_can_implement
Batched MMV (non-quant) _can_implement, bf16.
baracuda_kernels_mmvq_batched_bf16_run
Batched MMV (non-quant), bf16. Safety: same requirements as the f32 variant.
baracuda_kernels_mmvq_batched_f16_can_implement
Batched MMV (non-quant) _can_implement, f16.
baracuda_kernels_mmvq_batched_f16_run
Batched MMV (non-quant), f16. Safety: same requirements as the f32 variant.
baracuda_kernels_mmvq_batched_f32_can_implement
Batched MMV (non-quant) _can_implement, f32.
baracuda_kernels_mmvq_batched_f32_run
Batched MMV (non-quant) — f32 weights + activation + output.
baracuda_kernels_mmvq_multim_q2_K_m1_can_implement
mmvq_multim _can_implement, q2_K, m1.
baracuda_kernels_mmvq_multim_q2_K_m1_run
mmvq_multim, q2_K, m1.
baracuda_kernels_mmvq_multim_q2_K_m2_can_implement
mmvq_multim _can_implement, q2_K, m2.
baracuda_kernels_mmvq_multim_q2_K_m2_run
mmvq_multim, q2_K, m2.
baracuda_kernels_mmvq_multim_q2_K_m4_can_implement
mmvq_multim _can_implement, q2_K, m4.
baracuda_kernels_mmvq_multim_q2_K_m4_run
mmvq_multim, q2_K, m4.
baracuda_kernels_mmvq_multim_q2_K_m8_can_implement
mmvq_multim _can_implement, q2_K, m8.
baracuda_kernels_mmvq_multim_q2_K_m8_run
mmvq_multim, q2_K, m8.
baracuda_kernels_mmvq_multim_q3_K_m1_can_implement
mmvq_multim _can_implement, q3_K, m1.
baracuda_kernels_mmvq_multim_q3_K_m1_run
mmvq_multim, q3_K, m1.
baracuda_kernels_mmvq_multim_q3_K_m2_can_implement
mmvq_multim _can_implement, q3_K, m2.
baracuda_kernels_mmvq_multim_q3_K_m2_run
mmvq_multim, q3_K, m2.
baracuda_kernels_mmvq_multim_q3_K_m4_can_implement
mmvq_multim _can_implement, q3_K, m4.
baracuda_kernels_mmvq_multim_q3_K_m4_run
mmvq_multim, q3_K, m4.
baracuda_kernels_mmvq_multim_q3_K_m8_can_implement
mmvq_multim _can_implement, q3_K, m8.
baracuda_kernels_mmvq_multim_q3_K_m8_run
mmvq_multim, q3_K, m8.
baracuda_kernels_mmvq_multim_q4_0_m1_can_implement
mmvq_multim _can_implement, q4_0, m1.
baracuda_kernels_mmvq_multim_q4_0_m1_run
mmvq_multim, q4_0, m1.
baracuda_kernels_mmvq_multim_q4_0_m2_can_implement
mmvq_multim _can_implement, q4_0, m2.
baracuda_kernels_mmvq_multim_q4_0_m2_run
mmvq_multim, q4_0, m2.
baracuda_kernels_mmvq_multim_q4_0_m4_can_implement
mmvq_multim _can_implement, q4_0, m4.
baracuda_kernels_mmvq_multim_q4_0_m4_run
mmvq_multim, q4_0, m4.
baracuda_kernels_mmvq_multim_q4_0_m8_can_implement
mmvq_multim _can_implement, q4_0, m8.
baracuda_kernels_mmvq_multim_q4_0_m8_run
mmvq_multim, q4_0, m8.
baracuda_kernels_mmvq_multim_q4_1_m1_can_implement
mmvq_multim _can_implement, q4_1, m1.
baracuda_kernels_mmvq_multim_q4_1_m1_run
mmvq_multim, q4_1, m1.
baracuda_kernels_mmvq_multim_q4_1_m2_can_implement
mmvq_multim _can_implement, q4_1, m2.
baracuda_kernels_mmvq_multim_q4_1_m2_run
mmvq_multim, q4_1, m2.
baracuda_kernels_mmvq_multim_q4_1_m4_can_implement
mmvq_multim _can_implement, q4_1, m4.
baracuda_kernels_mmvq_multim_q4_1_m4_run
mmvq_multim, q4_1, m4.
baracuda_kernels_mmvq_multim_q4_1_m8_can_implement
mmvq_multim _can_implement, q4_1, m8.
baracuda_kernels_mmvq_multim_q4_1_m8_run
mmvq_multim, q4_1, m8.
baracuda_kernels_mmvq_multim_q4_K_m1_can_implement
mmvq_multim _can_implement, q4_K, m1.
baracuda_kernels_mmvq_multim_q4_K_m1_run
mmvq_multim, q4_K, m1.
baracuda_kernels_mmvq_multim_q4_K_m2_can_implement
mmvq_multim _can_implement, q4_K, m2.
baracuda_kernels_mmvq_multim_q4_K_m2_run
mmvq_multim, q4_K, m2.
baracuda_kernels_mmvq_multim_q4_K_m4_can_implement
mmvq_multim _can_implement, q4_K, m4.
baracuda_kernels_mmvq_multim_q4_K_m4_run
mmvq_multim, q4_K, m4.
baracuda_kernels_mmvq_multim_q4_K_m8_can_implement
mmvq_multim _can_implement, q4_K, m8.
baracuda_kernels_mmvq_multim_q4_K_m8_run
mmvq_multim, q4_K, m8.
baracuda_kernels_mmvq_multim_q5_0_m1_can_implement
mmvq_multim _can_implement, q5_0, m1.
baracuda_kernels_mmvq_multim_q5_0_m1_run
mmvq_multim, q5_0, m1.
baracuda_kernels_mmvq_multim_q5_0_m2_can_implement
mmvq_multim _can_implement, q5_0, m2.
baracuda_kernels_mmvq_multim_q5_0_m2_run
mmvq_multim, q5_0, m2.
baracuda_kernels_mmvq_multim_q5_0_m4_can_implement
mmvq_multim _can_implement, q5_0, m4.
baracuda_kernels_mmvq_multim_q5_0_m4_run
mmvq_multim, q5_0, m4.
baracuda_kernels_mmvq_multim_q5_0_m8_can_implement
mmvq_multim _can_implement, q5_0, m8.
baracuda_kernels_mmvq_multim_q5_0_m8_run
mmvq_multim, q5_0, m8.
baracuda_kernels_mmvq_multim_q5_1_m1_can_implement
mmvq_multim _can_implement, q5_1, m1.
baracuda_kernels_mmvq_multim_q5_1_m1_run
mmvq_multim, q5_1, m1.
baracuda_kernels_mmvq_multim_q5_1_m2_can_implement
mmvq_multim _can_implement, q5_1, m2.
baracuda_kernels_mmvq_multim_q5_1_m2_run
mmvq_multim, q5_1, m2.
baracuda_kernels_mmvq_multim_q5_1_m4_can_implement
mmvq_multim _can_implement, q5_1, m4.
baracuda_kernels_mmvq_multim_q5_1_m4_run
mmvq_multim, q5_1, m4.
baracuda_kernels_mmvq_multim_q5_1_m8_can_implement
mmvq_multim _can_implement, q5_1, m8.
baracuda_kernels_mmvq_multim_q5_1_m8_run
mmvq_multim, q5_1, m8.
baracuda_kernels_mmvq_multim_q5_K_m1_can_implement
Implementability check for mmvq_multim_q5_K_m1.
baracuda_kernels_mmvq_multim_q5_K_m1_run
Multi-M MMVQ for Q5_K weights, M=1.
baracuda_kernels_mmvq_multim_q5_K_m2_can_implement
Implementability check for mmvq_multim_q5_K_m2.
baracuda_kernels_mmvq_multim_q5_K_m2_run
Multi-M MMVQ for Q5_K weights, M=2.
baracuda_kernels_mmvq_multim_q5_K_m4_can_implement
Implementability check for mmvq_multim_q5_K_m4.
baracuda_kernels_mmvq_multim_q5_K_m4_run
Multi-M MMVQ for Q5_K weights, M=4.
baracuda_kernels_mmvq_multim_q5_K_m8_can_implement
Implementability check for mmvq_multim_q5_K_m8.
baracuda_kernels_mmvq_multim_q5_K_m8_run
Multi-M MMVQ for Q5_K weights, M=8.
baracuda_kernels_mmvq_multim_q6_K_m1_can_implement
Implementability check for mmvq_multim_q6_K_m1.
baracuda_kernels_mmvq_multim_q6_K_m1_run
Multi-M MMVQ for Q6_K weights, M=1.
baracuda_kernels_mmvq_multim_q6_K_m2_can_implement
Implementability check for mmvq_multim_q6_K_m2.
baracuda_kernels_mmvq_multim_q6_K_m2_run
Multi-M MMVQ for Q6_K weights, M=2.
baracuda_kernels_mmvq_multim_q6_K_m4_can_implement
Implementability check for mmvq_multim_q6_K_m4.
baracuda_kernels_mmvq_multim_q6_K_m4_run
Multi-M MMVQ for Q6_K weights, M=4.
baracuda_kernels_mmvq_multim_q6_K_m8_can_implement
Implementability check for mmvq_multim_q6_K_m8.
baracuda_kernels_mmvq_multim_q6_K_m8_run
Multi-M MMVQ for Q6_K weights, M=8.
baracuda_kernels_mmvq_multim_q8_0_m1_can_implement
Implementability check for mmvq_multim_q8_0_m1.
baracuda_kernels_mmvq_multim_q8_0_m1_run
Multi-M MMVQ for Q8_0 weights, M=1 (decode regime). Computes dst[0, r] = Σ_c W[r, c] * y[0, c] for r ∈ [0, nrows_x).
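For orientation, the M=1 formula above is an ordinary matrix-vector product. The following is a minimal host-side sketch of that arithmetic, assuming a dense row-major f32 weight matrix instead of Q8_0-packed blocks. `matvec_reference` is a hypothetical helper for checking results, not part of this crate.

```rust
/// CPU reference for the M=1 case: dst[r] = sum over c of w[r * ncols + c] * y[c].
/// Assumption: `w` is dense row-major f32; the real kernel reads Q8_0-packed weights.
fn matvec_reference(w: &[f32], y: &[f32], nrows_x: usize) -> Vec<f32> {
    let ncols = y.len();
    assert_eq!(w.len(), nrows_x * ncols, "w must hold nrows_x * ncols elements");
    (0..nrows_x)
        .map(|r| {
            // Slice out row r and take its dot product with y.
            let row = &w[r * ncols..(r + 1) * ncols];
            row.iter().zip(y).map(|(a, b)| a * b).sum::<f32>()
        })
        .collect()
}
```

A device result copied back to the host can be compared against this within a tolerance that accounts for quantization error.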
baracuda_kernels_mmvq_multim_q8_0_m2_can_implement
Implementability check for mmvq_multim_q8_0_m2.
baracuda_kernels_mmvq_multim_q8_0_m2_run
Multi-M MMVQ for Q8_0 weights, M=2. # Safety: as M=1.
baracuda_kernels_mmvq_multim_q8_0_m4_can_implement
Implementability check for mmvq_multim_q8_0_m4.
baracuda_kernels_mmvq_multim_q8_0_m4_run
Multi-M MMVQ for Q8_0 weights, M=4. # Safety: as M=1.
baracuda_kernels_mmvq_multim_q8_0_m8_can_implement
Implementability check for mmvq_multim_q8_0_m8.
baracuda_kernels_mmvq_multim_q8_0_m8_run
Multi-M MMVQ for Q8_0 weights, M=8 (prefill regime; targets a 3-7× speedup over the per-token M=1 dispatch). # Safety: as M=1.
baracuda_kernels_mmvq_q2_K_actstrided_bf16_can_implement
Implementability check for mmvq_q2_K_actstrided_bf16.
baracuda_kernels_mmvq_q2_K_actstrided_bf16_run
Strided MMVQ — Q2_K, bf16. # Safety: as Q2_K strided f16.
baracuda_kernels_mmvq_q2_K_actstrided_can_implement
Implementability check for mmvq_q2_K_actstrided.
baracuda_kernels_mmvq_q2_K_actstrided_f16_can_implement
Implementability check for mmvq_q2_K_actstrided_f16.
baracuda_kernels_mmvq_q2_K_actstrided_f16_run
Strided MMVQ — Q2_K, f16. # Safety: as Q4_0 strided f16; ncols must be a multiple of 256.
baracuda_kernels_mmvq_q2_K_actstrided_run
Strided MMVQ — GGUF Q2_K. # Safety: as the contig sibling.
baracuda_kernels_mmvq_q2_K_batched_bf16_can_implement
Implementability check for mmvq_q2_K_batched_bf16.
baracuda_kernels_mmvq_q2_K_batched_bf16_run
Batched MMVQ — Q2_K, bf16. # Safety: as Q2_K f32.
baracuda_kernels_mmvq_q2_K_batched_can_implement
Implementability check for mmvq_q2_K_batched.
baracuda_kernels_mmvq_q2_K_batched_f16_can_implement
Implementability check for mmvq_q2_K_batched_f16.
baracuda_kernels_mmvq_q2_K_batched_f16_run
Batched MMVQ — Q2_K, f16. # Safety: as Q2_K f32.
baracuda_kernels_mmvq_q2_K_batched_run
Batched MMVQ — Q2_K, f32. # Safety: as Q4_0; ncols must be a multiple of 256.
baracuda_kernels_mmvq_q2_K_bf16_can_implement
Implementability check for mmvq_q2_K_bf16.
baracuda_kernels_mmvq_q2_K_bf16_run
MMVQ — Q2_K, bf16. # Safety: as Q4_0 bf16, ncols must be multiple of 256.
baracuda_kernels_mmvq_q2_K_can_implement
Implementability check for mmvq_q2_K.
baracuda_kernels_mmvq_q2_K_f16_can_implement
Implementability check for mmvq_q2_K_f16.
baracuda_kernels_mmvq_q2_K_f16_run
MMVQ — Q2_K, f16. # Safety: as Q4_0 f16, ncols must be multiple of 256.
baracuda_kernels_mmvq_q2_K_run
GGUF Q2_K MMVQ — FP-activation matrix-vector mul. ncols must be a multiple of 256.
baracuda_kernels_mmvq_q3_K_actstrided_bf16_can_implement
Implementability check for mmvq_q3_K_actstrided_bf16.
baracuda_kernels_mmvq_q3_K_actstrided_bf16_run
Strided MMVQ — Q3_K, bf16. # Safety: as Q2_K strided bf16.
baracuda_kernels_mmvq_q3_K_actstrided_can_implement
Implementability check for mmvq_q3_K_actstrided.
baracuda_kernels_mmvq_q3_K_actstrided_f16_can_implement
Implementability check for mmvq_q3_K_actstrided_f16.
baracuda_kernels_mmvq_q3_K_actstrided_f16_run
Strided MMVQ — Q3_K, f16. # Safety: as Q2_K strided f16.
baracuda_kernels_mmvq_q3_K_actstrided_run
Strided MMVQ — GGUF Q3_K. # Safety: as the contig sibling.
baracuda_kernels_mmvq_q3_K_batched_bf16_can_implement
Implementability check for mmvq_q3_K_batched_bf16.
baracuda_kernels_mmvq_q3_K_batched_bf16_run
Batched MMVQ — Q3_K, bf16. # Safety: as Q2_K.
baracuda_kernels_mmvq_q3_K_batched_can_implement
Implementability check for mmvq_q3_K_batched.
baracuda_kernels_mmvq_q3_K_batched_f16_can_implement
Implementability check for mmvq_q3_K_batched_f16.
baracuda_kernels_mmvq_q3_K_batched_f16_run
Batched MMVQ — Q3_K, f16. # Safety: as Q2_K.
baracuda_kernels_mmvq_q3_K_batched_run
Batched MMVQ — Q3_K, f32. # Safety: as Q2_K.
baracuda_kernels_mmvq_q3_K_bf16_can_implement
Implementability check for mmvq_q3_K_bf16.
baracuda_kernels_mmvq_q3_K_bf16_run
MMVQ — Q3_K, bf16. # Safety: as Q2_K bf16.
baracuda_kernels_mmvq_q3_K_can_implement
Implementability check for mmvq_q3_K.
baracuda_kernels_mmvq_q3_K_f16_can_implement
Implementability check for mmvq_q3_K_f16.
baracuda_kernels_mmvq_q3_K_f16_run
MMVQ — Q3_K, f16. # Safety: as Q2_K f16.
baracuda_kernels_mmvq_q3_K_run
GGUF Q3_K MMVQ. # Safety: as Q2_K.
baracuda_kernels_mmvq_q4_0_actstrided_bf16_can_implement
Implementability check for mmvq_q4_0_actstrided_bf16.
baracuda_kernels_mmvq_q4_0_actstrided_bf16_run
Strided MMVQ — Q4_0, bf16. # Safety: as the f32 strided sibling.
baracuda_kernels_mmvq_q4_0_actstrided_can_implement
Implementability check for mmvq_q4_0_actstrided.
baracuda_kernels_mmvq_q4_0_actstrided_f16_can_implement
Implementability check for mmvq_q4_0_actstrided_f16.
baracuda_kernels_mmvq_q4_0_actstrided_f16_run
Strided MMVQ — Q4_0, f16. # Safety: as the f32 strided sibling.
baracuda_kernels_mmvq_q4_0_actstrided_run
Strided MMVQ — GGUF Q4_0. # Safety: as the contiguous Q4_0 variant; in addition, y[k * stride_y] must be a valid f32 read for every k in [0, ncols).
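As a worked instance of the strided-read requirement, the sketch below computes the smallest activation buffer that keeps every `y[k * stride_y]` access in bounds. It assumes `stride_y` is measured in elements (not bytes) and is non-negative, which this listing does not state. `min_strided_len` is a hypothetical helper, not a function of this crate.

```rust
/// Smallest element count the `y` buffer must hold so that
/// y[k * stride_y] is in bounds for every k in 0..ncols.
/// Assumption: stride_y counts elements, not bytes.
fn min_strided_len(ncols: usize, stride_y: usize) -> usize {
    if ncols == 0 {
        0
    } else {
        // The largest index touched is (ncols - 1) * stride_y.
        (ncols - 1) * stride_y + 1
    }
}
```

For example, 32 columns read with a stride of 4 touch indices 0, 4, ..., 124, so the buffer needs at least 125 elements.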
baracuda_kernels_mmvq_q4_0_batched_bf16_can_implement
Implementability check for mmvq_q4_0_batched_bf16.
baracuda_kernels_mmvq_q4_0_batched_bf16_run
Batched MMVQ — Q4_0, bf16. # Safety: as Q4_0 f32.
baracuda_kernels_mmvq_q4_0_batched_can_implement
Implementability check for mmvq_q4_0_batched.
baracuda_kernels_mmvq_q4_0_batched_f16_can_implement
Implementability check for mmvq_q4_0_batched_f16.
baracuda_kernels_mmvq_q4_0_batched_f16_run
Batched MMVQ — Q4_0, f16. # Safety: as Q4_0 f32.
baracuda_kernels_mmvq_q4_0_batched_run
Batched MMVQ — Q4_0, f32 activation + output. # Safety: device-resident pointers; valid stream; workspace of at least m_total * 4 bytes.
baracuda_kernels_mmvq_q4_0_bf16_can_implement
Implementability check for mmvq_q4_0_bf16.
baracuda_kernels_mmvq_q4_0_bf16_run
MMVQ — Q4_0, bf16 activation + bf16 output. # Safety: as the f32 sibling with y / dst typed __nv_bfloat16 device-resident.
baracuda_kernels_mmvq_q4_0_can_implement
Implementability check for mmvq_q4_0.
baracuda_kernels_mmvq_q4_0_f16_can_implement
Implementability check for mmvq_q4_0_f16.
baracuda_kernels_mmvq_q4_0_f16_run
MMVQ — Q4_0, f16 activation + f16 output. # Safety: as the f32 sibling with y / dst typed __half device-resident.
baracuda_kernels_mmvq_q4_0_run
GGUF Q4_0 MMVQ — FP-activation matrix-vector mul. ncols must be a multiple of 32.
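The multiple-of-32 requirement follows from Q4_0 packing weights in blocks of 32. The sketch below dequantizes one block and takes a row dot product on the host, following the ggml Q4_0 convention (one scale per block, 16 bytes of 4-bit values, low nibbles holding the first half of the block, each value offset by 8). The struct and function names are hypothetical, and the scale is widened to f32 here (ggml stores it as f16) so the example needs no external crate.

```rust
const QK4_0: usize = 32; // elements per Q4_0 block

/// One Q4_0 block as modeled in this sketch (scale widened from f16 to f32).
struct BlockQ4_0 {
    d: f32,
    qs: [u8; QK4_0 / 2],
}

/// Expand one block to 32 f32 weights: value = (nibble - 8) * d.
fn dequantize_q4_0(block: &BlockQ4_0) -> [f32; QK4_0] {
    let mut out = [0.0f32; QK4_0];
    for j in 0..QK4_0 / 2 {
        let lo = (block.qs[j] & 0x0F) as i32 - 8; // element j
        let hi = (block.qs[j] >> 4) as i32 - 8; // element j + 16
        out[j] = lo as f32 * block.d;
        out[j + QK4_0 / 2] = hi as f32 * block.d;
    }
    out
}

/// Dot product of one quantized weight row with f32 activations.
fn dot_q4_0_row(blocks: &[BlockQ4_0], y: &[f32]) -> f32 {
    assert_eq!(y.len(), blocks.len() * QK4_0, "ncols must be a multiple of 32");
    blocks
        .iter()
        .zip(y.chunks_exact(QK4_0))
        .map(|(b, ys)| {
            dequantize_q4_0(b).iter().zip(ys).map(|(w, v)| w * v).sum::<f32>()
        })
        .sum()
}
```

A row of `ncols` weights occupies `ncols / 32` blocks, which is why a column count that is not a multiple of 32 cannot be represented.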
baracuda_kernels_mmvq_q4_1_actstrided_bf16_can_implement
Implementability check for mmvq_q4_1_actstrided_bf16.
baracuda_kernels_mmvq_q4_1_actstrided_bf16_run
Strided MMVQ — Q4_1, bf16. # Safety: as Q4_0 strided bf16.
baracuda_kernels_mmvq_q4_1_actstrided_can_implement
Implementability check for mmvq_q4_1_actstrided.
baracuda_kernels_mmvq_q4_1_actstrided_f16_can_implement
Implementability check for mmvq_q4_1_actstrided_f16.
baracuda_kernels_mmvq_q4_1_actstrided_f16_run
Strided MMVQ — Q4_1, f16. # Safety: as Q4_0 strided f16.
baracuda_kernels_mmvq_q4_1_actstrided_run
Strided MMVQ — GGUF Q4_1. # Safety: as the contig sibling.
baracuda_kernels_mmvq_q4_1_batched_bf16_can_implement
Implementability check for mmvq_q4_1_batched_bf16.
baracuda_kernels_mmvq_q4_1_batched_bf16_run
Batched MMVQ — Q4_1, bf16. # Safety: as Q4_0.
baracuda_kernels_mmvq_q4_1_batched_can_implement
Implementability check for mmvq_q4_1_batched.
baracuda_kernels_mmvq_q4_1_batched_f16_can_implement
Implementability check for mmvq_q4_1_batched_f16.
baracuda_kernels_mmvq_q4_1_batched_f16_run
Batched MMVQ — Q4_1, f16. # Safety: as Q4_0.
baracuda_kernels_mmvq_q4_1_batched_run
Batched MMVQ — Q4_1, f32. # Safety: as Q4_0.
baracuda_kernels_mmvq_q4_1_bf16_can_implement
Implementability check for mmvq_q4_1_bf16.
baracuda_kernels_mmvq_q4_1_bf16_run
MMVQ — Q4_1, bf16 activation + bf16 output. # Safety: as Q4_0 bf16.
baracuda_kernels_mmvq_q4_1_can_implement
Implementability check for mmvq_q4_1.
baracuda_kernels_mmvq_q4_1_f16_can_implement
Implementability check for mmvq_q4_1_f16.
baracuda_kernels_mmvq_q4_1_f16_run
MMVQ — Q4_1, f16 activation + f16 output. # Safety: as Q4_0 f16.
baracuda_kernels_mmvq_q4_1_run
GGUF Q4_1 MMVQ. # Safety: as Q4_0.
baracuda_kernels_mmvq_q4_K_actstrided_bf16_can_implement
Implementability check for mmvq_q4_K_actstrided_bf16.
baracuda_kernels_mmvq_q4_K_actstrided_bf16_run
Strided MMVQ — Q4_K, bf16. # Safety: as Q2_K strided bf16.
baracuda_kernels_mmvq_q4_K_actstrided_can_implement
Implementability check for mmvq_q4_K_actstrided.
baracuda_kernels_mmvq_q4_K_actstrided_f16_can_implement
Implementability check for mmvq_q4_K_actstrided_f16.
baracuda_kernels_mmvq_q4_K_actstrided_f16_run
Strided MMVQ — Q4_K, f16. # Safety: as Q2_K strided f16.
baracuda_kernels_mmvq_q4_K_actstrided_run
Strided MMVQ — GGUF Q4_K. # Safety: as the contig sibling.
baracuda_kernels_mmvq_q4_K_batched_bf16_can_implement
Implementability check for mmvq_q4_K_batched_bf16.
baracuda_kernels_mmvq_q4_K_batched_bf16_run
Batched MMVQ — Q4_K, bf16. # Safety: as Q2_K.
baracuda_kernels_mmvq_q4_K_batched_can_implement
Implementability check for mmvq_q4_K_batched.
baracuda_kernels_mmvq_q4_K_batched_f16_can_implement
Implementability check for mmvq_q4_K_batched_f16.
baracuda_kernels_mmvq_q4_K_batched_f16_run
Batched MMVQ — Q4_K, f16. # Safety: as Q2_K.
baracuda_kernels_mmvq_q4_K_batched_run
Batched MMVQ — Q4_K, f32. # Safety: as Q2_K.
baracuda_kernels_mmvq_q4_K_bf16_can_implement
Implementability check for mmvq_q4_K_bf16.
baracuda_kernels_mmvq_q4_K_bf16_run
MMVQ — Q4_K, bf16. # Safety: as Q2_K bf16.
baracuda_kernels_mmvq_q4_K_can_implement
Implementability check for mmvq_q4_K.
baracuda_kernels_mmvq_q4_K_f16_can_implement
Implementability check for mmvq_q4_K_f16.
baracuda_kernels_mmvq_q4_K_f16_run
MMVQ — Q4_K, f16. # Safety: as Q2_K f16.
baracuda_kernels_mmvq_q4_K_run
GGUF Q4_K MMVQ. # Safety: as Q2_K.
baracuda_kernels_mmvq_q5_0_actstrided_bf16_can_implement
Implementability check for mmvq_q5_0_actstrided_bf16.
baracuda_kernels_mmvq_q5_0_actstrided_bf16_run
Strided MMVQ — Q5_0, bf16. # Safety: as Q4_0 strided bf16.
baracuda_kernels_mmvq_q5_0_actstrided_can_implement
Implementability check for mmvq_q5_0_actstrided.
baracuda_kernels_mmvq_q5_0_actstrided_f16_can_implement
Implementability check for mmvq_q5_0_actstrided_f16.
baracuda_kernels_mmvq_q5_0_actstrided_f16_run
Strided MMVQ — Q5_0, f16. # Safety: as Q4_0 strided f16.
baracuda_kernels_mmvq_q5_0_actstrided_run
Strided MMVQ — GGUF Q5_0. # Safety: as the contig sibling.
baracuda_kernels_mmvq_q5_0_batched_bf16_can_implement
Implementability check for mmvq_q5_0_batched_bf16.
baracuda_kernels_mmvq_q5_0_batched_bf16_run
Batched MMVQ — Q5_0, bf16. # Safety: as Q4_0.
baracuda_kernels_mmvq_q5_0_batched_can_implement
Implementability check for mmvq_q5_0_batched.
baracuda_kernels_mmvq_q5_0_batched_f16_can_implement
Implementability check for mmvq_q5_0_batched_f16.
baracuda_kernels_mmvq_q5_0_batched_f16_run
Batched MMVQ — Q5_0, f16. # Safety: as Q4_0.
baracuda_kernels_mmvq_q5_0_batched_run
Batched MMVQ — Q5_0, f32. # Safety: as Q4_0.
baracuda_kernels_mmvq_q5_0_bf16_can_implement
Implementability check for mmvq_q5_0_bf16.
baracuda_kernels_mmvq_q5_0_bf16_run
MMVQ — Q5_0, bf16. # Safety: as Q4_0 bf16.
baracuda_kernels_mmvq_q5_0_can_implement
Implementability check for mmvq_q5_0.
baracuda_kernels_mmvq_q5_0_f16_can_implement
Implementability check for mmvq_q5_0_f16.
baracuda_kernels_mmvq_q5_0_f16_run
MMVQ — Q5_0, f16. # Safety: as Q4_0 f16.
baracuda_kernels_mmvq_q5_0_run
GGUF Q5_0 MMVQ. # Safety: as Q4_0.
baracuda_kernels_mmvq_q5_1_actstrided_bf16_can_implement
Implementability check for mmvq_q5_1_actstrided_bf16.
baracuda_kernels_mmvq_q5_1_actstrided_bf16_run
Strided MMVQ — Q5_1, bf16. # Safety: as Q4_0 strided bf16.
baracuda_kernels_mmvq_q5_1_actstrided_can_implement
Implementability check for mmvq_q5_1_actstrided.
baracuda_kernels_mmvq_q5_1_actstrided_f16_can_implement
Implementability check for mmvq_q5_1_actstrided_f16.
baracuda_kernels_mmvq_q5_1_actstrided_f16_run
Strided MMVQ — Q5_1, f16. # Safety: as Q4_0 strided f16.
baracuda_kernels_mmvq_q5_1_actstrided_run
Strided MMVQ — GGUF Q5_1. # Safety: as the contig sibling.
baracuda_kernels_mmvq_q5_1_batched_bf16_can_implement
Implementability check for mmvq_q5_1_batched_bf16.
baracuda_kernels_mmvq_q5_1_batched_bf16_run
Batched MMVQ — Q5_1, bf16. # Safety: as Q4_0.
baracuda_kernels_mmvq_q5_1_batched_can_implement
Implementability check for mmvq_q5_1_batched.
baracuda_kernels_mmvq_q5_1_batched_f16_can_implement
Implementability check for mmvq_q5_1_batched_f16.
baracuda_kernels_mmvq_q5_1_batched_f16_run
Batched MMVQ — Q5_1, f16. # Safety: as Q4_0.
baracuda_kernels_mmvq_q5_1_batched_run
Batched MMVQ — Q5_1, f32. # Safety: as Q4_0.
baracuda_kernels_mmvq_q5_1_bf16_can_implement
Implementability check for mmvq_q5_1_bf16.
baracuda_kernels_mmvq_q5_1_bf16_run
MMVQ — Q5_1, bf16. # Safety: as Q4_0 bf16.
baracuda_kernels_mmvq_q5_1_can_implement
Implementability check for mmvq_q5_1.
baracuda_kernels_mmvq_q5_1_f16_can_implement
Implementability check for mmvq_q5_1_f16.
baracuda_kernels_mmvq_q5_1_f16_run
MMVQ — Q5_1, f16. # Safety: as Q4_0 f16.
baracuda_kernels_mmvq_q5_1_run
GGUF Q5_1 MMVQ. # Safety: as Q4_0.
baracuda_kernels_mmvq_q5_K_actstrided_bf16_can_implement
Implementability check for mmvq_q5_K_actstrided_bf16.
baracuda_kernels_mmvq_q5_K_actstrided_bf16_run
Strided MMVQ — Q5_K, bf16. # Safety: as Q2_K strided bf16.
baracuda_kernels_mmvq_q5_K_actstrided_can_implement
Implementability check for mmvq_q5_K_actstrided.
baracuda_kernels_mmvq_q5_K_actstrided_f16_can_implement
Implementability check for mmvq_q5_K_actstrided_f16.
baracuda_kernels_mmvq_q5_K_actstrided_f16_run
Strided MMVQ — Q5_K, f16. # Safety: as Q2_K strided f16.
baracuda_kernels_mmvq_q5_K_actstrided_run
Strided MMVQ — GGUF Q5_K. # Safety: as the contig sibling.
baracuda_kernels_mmvq_q5_K_batched_bf16_can_implement
Implementability check for mmvq_q5_K_batched_bf16.
baracuda_kernels_mmvq_q5_K_batched_bf16_run
Batched MMVQ — Q5_K, bf16. # Safety: as Q2_K.
baracuda_kernels_mmvq_q5_K_batched_can_implement
Implementability check for mmvq_q5_K_batched.
baracuda_kernels_mmvq_q5_K_batched_f16_can_implement
Implementability check for mmvq_q5_K_batched_f16.
baracuda_kernels_mmvq_q5_K_batched_f16_run
Batched MMVQ — Q5_K, f16. # Safety: as Q2_K.
baracuda_kernels_mmvq_q5_K_batched_run
Batched MMVQ — Q5_K, f32. # Safety: as Q2_K.
baracuda_kernels_mmvq_q5_K_bf16_can_implement
Implementability check for mmvq_q5_K_bf16.
baracuda_kernels_mmvq_q5_K_bf16_run
MMVQ — Q5_K, bf16. # Safety: as Q2_K bf16.
baracuda_kernels_mmvq_q5_K_can_implement
Implementability check for mmvq_q5_K.
baracuda_kernels_mmvq_q5_K_f16_can_implement
Implementability check for mmvq_q5_K_f16.
baracuda_kernels_mmvq_q5_K_f16_run
MMVQ — Q5_K, f16. # Safety: as Q2_K f16.
baracuda_kernels_mmvq_q5_K_run
GGUF Q5_K MMVQ. # Safety: as Q2_K.
baracuda_kernels_mmvq_q6_K_actstrided_bf16_can_implement
Implementability check for mmvq_q6_K_actstrided_bf16.
baracuda_kernels_mmvq_q6_K_actstrided_bf16_run
Strided MMVQ — Q6_K, bf16. # Safety: as Q2_K strided bf16.
baracuda_kernels_mmvq_q6_K_actstrided_can_implement
Implementability check for mmvq_q6_K_actstrided.
baracuda_kernels_mmvq_q6_K_actstrided_f16_can_implement
Implementability check for mmvq_q6_K_actstrided_f16.
baracuda_kernels_mmvq_q6_K_actstrided_f16_run
Strided MMVQ — Q6_K, f16. # Safety: as Q2_K strided f16.
baracuda_kernels_mmvq_q6_K_actstrided_run
Strided MMVQ — GGUF Q6_K. # Safety: as the contig sibling.
baracuda_kernels_mmvq_q6_K_batched_bf16_can_implement
Implementability check for mmvq_q6_K_batched_bf16.
baracuda_kernels_mmvq_q6_K_batched_bf16_run
Batched MMVQ — Q6_K, bf16. # Safety: as Q2_K.
baracuda_kernels_mmvq_q6_K_batched_can_implement
Implementability check for mmvq_q6_K_batched.
baracuda_kernels_mmvq_q6_K_batched_f16_can_implement
Implementability check for mmvq_q6_K_batched_f16.
baracuda_kernels_mmvq_q6_K_batched_f16_run
Batched MMVQ — Q6_K, f16. # Safety: as Q2_K.
baracuda_kernels_mmvq_q6_K_batched_run
Batched MMVQ — Q6_K, f32. # Safety: as Q2_K.
baracuda_kernels_mmvq_q6_K_bf16_can_implement
Implementability check for mmvq_q6_K_bf16.
baracuda_kernels_mmvq_q6_K_bf16_run
MMVQ — Q6_K, bf16. # Safety: as Q2_K bf16.
baracuda_kernels_mmvq_q6_K_can_implement
Implementability check for mmvq_q6_K.
baracuda_kernels_mmvq_q6_K_f16_can_implement
Implementability check for mmvq_q6_K_f16.
baracuda_kernels_mmvq_q6_K_f16_run
MMVQ — Q6_K, f16. # Safety: as Q2_K f16.
baracuda_kernels_mmvq_q6_K_run
GGUF Q6_K MMVQ. # Safety: as Q2_K.
baracuda_kernels_mmvq_q8_0_actstrided_bf16_can_implement
Implementability check for mmvq_q8_0_actstrided_bf16.
baracuda_kernels_mmvq_q8_0_actstrided_bf16_run
Strided MMVQ — Q8_0, bf16. # Safety: as Q4_0 strided bf16.
baracuda_kernels_mmvq_q8_0_actstrided_can_implement
Implementability check for mmvq_q8_0_actstrided.
baracuda_kernels_mmvq_q8_0_actstrided_f16_can_implement
Implementability check for mmvq_q8_0_actstrided_f16.
baracuda_kernels_mmvq_q8_0_actstrided_f16_run
Strided MMVQ — Q8_0, f16. # Safety: as Q4_0 strided f16.
baracuda_kernels_mmvq_q8_0_actstrided_run
Strided MMVQ — GGUF Q8_0. # Safety: as the contig sibling.
baracuda_kernels_mmvq_q8_0_batched_bf16_can_implement
Implementability check for mmvq_q8_0_batched_bf16.
baracuda_kernels_mmvq_q8_0_batched_bf16_run
Batched MMVQ — Q8_0, bf16. # Safety: as Q4_0.
baracuda_kernels_mmvq_q8_0_batched_can_implement
Implementability check for mmvq_q8_0_batched.
baracuda_kernels_mmvq_q8_0_batched_f16_can_implement
Implementability check for mmvq_q8_0_batched_f16.
baracuda_kernels_mmvq_q8_0_batched_f16_run
Batched MMVQ — Q8_0, f16. # Safety: as Q4_0.
baracuda_kernels_mmvq_q8_0_batched_run
Batched MMVQ — Q8_0, f32. # Safety: as Q4_0.
baracuda_kernels_mmvq_q8_0_bf16_can_implement
Implementability check for mmvq_q8_0_bf16.
baracuda_kernels_mmvq_q8_0_bf16_run
MMVQ — Q8_0, bf16. # Safety: as Q4_0 bf16.
baracuda_kernels_mmvq_q8_0_can_implement
Implementability check for mmvq_q8_0.
baracuda_kernels_mmvq_q8_0_f16_can_implement
Implementability check for mmvq_q8_0_f16.
baracuda_kernels_mmvq_q8_0_f16_run
MMVQ — Q8_0, f16. # Safety: as Q4_0 f16.
baracuda_kernels_mmvq_q8_0_run
GGUF Q8_0 MMVQ. # Safety: as Q4_0.
baracuda_kernels_mmvq_q8_K_actstrided_bf16_can_implement
Implementability check for mmvq_q8_K_actstrided_bf16.
baracuda_kernels_mmvq_q8_K_actstrided_bf16_run
Strided MMVQ — Q8_K, bf16 (bespoke). # Safety: as Q2_K strided bf16.
baracuda_kernels_mmvq_q8_K_actstrided_can_implement
Implementability check for mmvq_q8_K_actstrided.
baracuda_kernels_mmvq_q8_K_actstrided_f16_can_implement
Implementability check for mmvq_q8_K_actstrided_f16.
baracuda_kernels_mmvq_q8_K_actstrided_f16_run
Strided MMVQ — Q8_K, f16 (bespoke). # Safety: as Q2_K strided f16.
baracuda_kernels_mmvq_q8_K_actstrided_run
Strided MMVQ — GGUF Q8_K (bespoke; Phase 11.4 + 14.5).
baracuda_kernels_mmvq_q8_K_batched_bf16_can_implement
Implementability check for mmvq_q8_K_batched_bf16.
baracuda_kernels_mmvq_q8_K_batched_bf16_run
Batched MMVQ — Q8_K (bespoke), bf16. # Safety: as Q2_K.
baracuda_kernels_mmvq_q8_K_batched_can_implement
Implementability check for mmvq_q8_K_batched.
baracuda_kernels_mmvq_q8_K_batched_f16_can_implement
Implementability check for mmvq_q8_K_batched_f16.
baracuda_kernels_mmvq_q8_K_batched_f16_run
Batched MMVQ — Q8_K (bespoke), f16. # Safety: as Q2_K.
baracuda_kernels_mmvq_q8_K_batched_run
Batched MMVQ — Q8_K (bespoke), f32. # Safety: as Q2_K.
baracuda_kernels_mmvq_q8_K_bf16_can_implement
Implementability check for mmvq_q8_K_bf16.
baracuda_kernels_mmvq_q8_K_bf16_run
MMVQ — Q8_K, bf16 (bespoke; Phase 11.4 + 18.1). # Safety: as Q2_K bf16.
baracuda_kernels_mmvq_q8_K_can_implement
Implementability check for mmvq_q8_K.
baracuda_kernels_mmvq_q8_K_f16_can_implement
Implementability check for mmvq_q8_K_f16.
baracuda_kernels_mmvq_q8_K_f16_run
MMVQ — Q8_K, f16 (bespoke; Phase 11.4 + 18.1). # Safety: as Q2_K f16.
baracuda_kernels_mmvq_q8_K_run
GGUF Q8_K MMVQ — Phase 11.4 (bespoke, not vendored from llama.cpp). ncols must be a multiple of 256. # Safety: as Q2_K.
baracuda_kernels_moe_scalar_gguf_can_implement
Implementability check for moe_scalar_gguf.
baracuda_kernels_moe_scalar_gguf_run
MoE forward — scalar dispatch path on GGUF-packed expert weights. f32 activations in, f32 output out.
baracuda_kernels_moe_wmma_bf16_can_implement
Implementability check for moe_wmma_bf16.
baracuda_kernels_moe_wmma_bf16_run
MoE forward — WMMA FP weights, bf16 activations + weights, bf16 output.
baracuda_kernels_moe_wmma_f16_can_implement
Implementability check for moe_wmma_f16.
baracuda_kernels_moe_wmma_f16_run
MoE forward — WMMA FP weights, f16 activations + weights, f16 output. Output buffer must be zero-initialized by the caller when topk_weights == null and topk > 1 (multiple writes per row).
baracuda_kernels_moe_wmma_gguf_bf16_can_implement
Implementability check for moe_wmma_gguf_bf16.
baracuda_kernels_moe_wmma_gguf_bf16_run
MoE forward — WMMA + GGUF combined path, bf16 activations.
baracuda_kernels_moe_wmma_gguf_f16_can_implement
Implementability check for moe_wmma_gguf_f16.
baracuda_kernels_moe_wmma_gguf_f16_run
MoE forward — WMMA + GGUF combined path. f16 activations, GGUF-packed weights, f32 output.
baracuda_kernels_msort_backward_f32_can_implement
Implementability check for msort_backward_f32.
baracuda_kernels_msort_backward_f32_run
Msort backward, f32. Uses the same scatter as the sort backward kernel; a distinct symbol is kept for FFI / telemetry parity.
baracuda_kernels_msort_backward_f64_can_implement
Implementability check for msort_backward_f64.
baracuda_kernels_msort_backward_f64_run
Msort backward, f64.
baracuda_kernels_msort_f32_can_implement
Implementability check for msort_f32.
baracuda_kernels_msort_f32_run
Stable block-bitonic sort, f32. Tie-break on original index so equal keys preserve input order.
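A bitonic network is not stable on its own, so the index tie-break is what makes equal keys keep their input order. The host-side sketch below shows the same idea with a deliberately unstable sort. It assumes ascending order and uses `f32::total_cmp` for key comparison, so how it orders NaN and signed zero may differ from the kernel. `msort_reference` is a hypothetical helper, not part of this crate.

```rust
/// Returns (sorted_keys, original_indices), ascending.
/// Ordering by (key, index) gives a stable result even though
/// the underlying sort is not stable.
fn msort_reference(keys: &[f32]) -> (Vec<f32>, Vec<usize>) {
    let mut idx: Vec<usize> = (0..keys.len()).collect();
    // sort_unstable_by is used on purpose: the index tie-break alone
    // restores stability, as it does for a bitonic network.
    idx.sort_unstable_by(|&a, &b| keys[a].total_cmp(&keys[b]).then(a.cmp(&b)));
    let sorted = idx.iter().map(|&i| keys[i]).collect();
    (sorted, idx)
}
```

Because the (key, index) pairs are all distinct, any correct comparison sort produces the same permutation.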
baracuda_kernels_msort_f64_can_implement
Implementability check for msort_f64.
baracuda_kernels_msort_f64_run
Stable block-bitonic sort, f64.
baracuda_kernels_msort_i32_can_implement
Implementability check for msort_i32.
baracuda_kernels_msort_i32_run
Stable block-bitonic sort, i32.
baracuda_kernels_msort_i64_can_implement
Implementability check for msort_i64.
baracuda_kernels_msort_i64_run
Stable block-bitonic sort, i64.
baracuda_kernels_nms_f32_can_implement
Implementability check for nms_f32.
baracuda_kernels_nms_f32_run
nms(boxes, iou_thresh). Caller supplies boxes pre-sorted by score, descending. boxes: [num_boxes, 4] (x1, y1, x2, y2). keep_mask: [num_boxes] u8 (0 / 1); count_out: single i32. f32. # Safety: as above.
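The contract above (boxes pre-sorted by descending score, a 0/1 keep mask, and a kept-box count) matches greedy non-maximum suppression. The following host-side sketch produces the same two outputs. It assumes a box is suppressed when its IoU with an earlier kept box is strictly greater than `iou_thresh`; whether the kernel uses `>` or `>=` is not stated here. `iou` and `nms_reference` are hypothetical helpers, not part of this crate.

```rust
/// IoU of two boxes in (x1, y1, x2, y2) form.
fn iou(a: &[f32; 4], b: &[f32; 4]) -> f32 {
    let iw = (a[2].min(b[2]) - a[0].max(b[0])).max(0.0);
    let ih = (a[3].min(b[3]) - a[1].max(b[1])).max(0.0);
    let inter = iw * ih;
    let area_a = (a[2] - a[0]) * (a[3] - a[1]);
    let area_b = (b[2] - b[0]) * (b[3] - b[1]);
    let union = area_a + area_b - inter;
    if union > 0.0 { inter / union } else { 0.0 }
}

/// Greedy NMS over boxes already sorted by descending score.
/// Returns (keep_mask, count) in the shapes described above.
fn nms_reference(boxes: &[[f32; 4]], iou_thresh: f32) -> (Vec<u8>, i32) {
    let mut keep = vec![1u8; boxes.len()];
    for i in 0..boxes.len() {
        if keep[i] == 0 {
            continue; // a suppressed box cannot suppress others
        }
        for j in (i + 1)..boxes.len() {
            // Assumption: strict comparison against the threshold.
            if keep[j] == 1 && iou(&boxes[i], &boxes[j]) > iou_thresh {
                keep[j] = 0;
            }
        }
    }
    let count = keep.iter().map(|&k| k as i32).sum::<i32>();
    (keep, count)
}
```

With two heavily overlapping boxes followed by a disjoint one and a threshold of 0.5, the second box is suppressed and the other two are kept.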
baracuda_kernels_nms_f64_can_implement
Implementability check for nms_f64.
baracuda_kernels_nms_f64_run
nms, f64. # Safety: as f32.
baracuda_kernels_nonzero_bool_can_implement
Implementability check for nonzero_bool.
baracuda_kernels_nonzero_bool_run
nonzero — bool (u8) input.
baracuda_kernels_nonzero_f32_can_implement
Implementability check for nonzero_f32.
baracuda_kernels_nonzero_f32_run
Coordinates where x[i] != 0. f32 input.
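As an illustration of what "coordinates" means for a multi-dimensional input, the sketch below scans a row-major tensor and unravels each nonzero flat index into per-axis coordinates. The row-major layout, the `[count, ndim]` output shape, and the explicit `shape` parameter are assumptions made for this example. `nonzero_reference` is a hypothetical helper, not part of this crate.

```rust
/// Host reference: one coordinate vector per element with x[i] != 0,
/// in row-major scan order. Assumption: output is [count, ndim] with i32 coords.
fn nonzero_reference(x: &[f32], shape: &[usize]) -> Vec<Vec<i32>> {
    assert_eq!(x.len(), shape.iter().product::<usize>());
    let mut coords = Vec::new();
    for (flat, &v) in x.iter().enumerate() {
        if v != 0.0 {
            // Unravel the flat index, last axis fastest.
            let mut rem = flat;
            let mut c = vec![0i32; shape.len()];
            for d in (0..shape.len()).rev() {
                c[d] = (rem % shape[d]) as i32;
                rem /= shape[d];
            }
            coords.push(c);
        }
    }
    coords
}
```

Note that `-0.0 != 0.0` is false and `NaN != 0.0` is true under IEEE comparison, so negative zero is skipped and NaN is reported.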
baracuda_kernels_nonzero_f64_can_implement
Implementability check for nonzero_f64.
baracuda_kernels_nonzero_f64_run
nonzero — f64 input.
baracuda_kernels_nonzero_i32_can_implement
Implementability check for nonzero_i32.
baracuda_kernels_nonzero_i32_run
nonzero — i32 input.
baracuda_kernels_nonzero_i64idx_bool_can_implement
Implementability check for nonzero_i64idx_bool.
baracuda_kernels_nonzero_i64idx_bool_run
nonzero — bool input, i64 output coords.
baracuda_kernels_nonzero_i64idx_f32_can_implement
Implementability check for nonzero_i64idx_f32.
baracuda_kernels_nonzero_i64idx_f32_run
nonzero — f32 input, i64 output coords.
baracuda_kernels_nonzero_i64idx_f64_can_implement
Implementability check for nonzero_i64idx_f64.
baracuda_kernels_nonzero_i64idx_f64_run
nonzero — f64 input, i64 output coords.
baracuda_kernels_nonzero_i64idx_i32_can_implement
Implementability check for nonzero_i64idx_i32.
baracuda_kernels_nonzero_i64idx_i32_run
nonzero — i32 input, i64 output coords.
baracuda_kernels_one_hot_bool_can_implement
Implementability check for one_hot_bool.
baracuda_kernels_one_hot_bool_run
one_hot — bool output (u8 storage).
baracuda_kernels_one_hot_f32_can_implement
Implementability check for one_hot_f32.
baracuda_kernels_one_hot_f32_run
out[..., c] = 1 if c == src[...] else 0. Output last axis has extent num_classes. Input dtype is always i32; output is f32.
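The formula above amounts to the following host-side reference. It is a sketch only: `one_hot_reference` is a hypothetical name, and how the kernel treats negative or out-of-range class indices is not stated, so here they simply match no `c` and leave the row at zero, as the formula implies.

```rust
/// Hypothetical reference: expands each i32 class index into a row of
/// `num_classes` f32 values, with 1.0 at the class position and 0.0 elsewhere.
fn one_hot_reference(src: &[i32], num_classes: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; src.len() * num_classes];
    for (row, &class) in src.iter().enumerate() {
        // Negative or too-large indices match no c, so the row stays zero.
        if class >= 0 && (class as usize) < num_classes {
            out[row * num_classes + class as usize] = 1.0;
        }
    }
    out
}
```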
baracuda_kernels_one_hot_f64_can_implement
Implementability check for one_hot_f64.
baracuda_kernels_one_hot_f64_run
one_hot — f64 output.
baracuda_kernels_one_hot_i32_can_implement
Implementability check for one_hot_i32.
baracuda_kernels_one_hot_i32_run
one_hot — i32 output.
baracuda_kernels_one_hot_i64idx_bool_can_implement
Implementability check for one_hot_i64idx_bool.
baracuda_kernels_one_hot_i64idx_bool_run
one_hot — bool output, i64 indices.
baracuda_kernels_one_hot_i64idx_f32_can_implement
Implementability check for one_hot_i64idx_f32.
baracuda_kernels_one_hot_i64idx_f32_run
one_hot — f32 output, i64 input class indices.
baracuda_kernels_one_hot_i64idx_f64_can_implement
Implementability check for one_hot_i64idx_f64.
baracuda_kernels_one_hot_i64idx_f64_run
one_hot — f64 output, i64 indices.
baracuda_kernels_one_hot_i64idx_i32_can_implement
Implementability check for one_hot_i64idx_i32.
baracuda_kernels_one_hot_i64idx_i32_run
one_hot — i32 output, i64 indices.
baracuda_kernels_ormqr_f32_run
Apply Householder-encoded Q (from a prior geqrf) to c_inout. side ∈ {0=Left, 1=Right}; op ∈ {0=N, 1=T, 2=C}. On Left + op=N, computes C := Q · C; pair with a pre-staged identity C to materialize dense Q.
baracuda_kernels_ormqr_f64_run
Apply Householder-encoded Q (from a prior geqrf) to c_inout. side ∈ {0=Left, 1=Right}; op ∈ {0=N, 1=T, 2=C}. On Left + op=N, computes C := Q · C; pair with a pre-staged identity C to materialize dense Q.
baracuda_kernels_pad_circular_bf16_can_implement
Implementability check for pad_circular_bf16.
baracuda_kernels_pad_circular_bf16_run
Pad circular, bf16.
baracuda_kernels_pad_circular_f16_can_implement
Implementability check for pad_circular_f16.
baracuda_kernels_pad_circular_f16_run
Pad circular, f16.
baracuda_kernels_pad_circular_f32_can_implement
Implementability check for pad_circular_f32.
baracuda_kernels_pad_circular_f32_run
Pad circular, f32. Cyclic wrap from the opposite end of each axis.
baracuda_kernels_pad_circular_f64_can_implement
Implementability check for pad_circular_f64.
baracuda_kernels_pad_circular_f64_run
Pad circular, f64.
baracuda_kernels_pad_constant_backward_bf16_can_implement
Implementability check for pad_constant_backward_bf16.
baracuda_kernels_pad_constant_backward_bf16_run
Pad-constant backward (slice), bf16.
baracuda_kernels_pad_constant_backward_f16_can_implement
Implementability check for pad_constant_backward_f16.
baracuda_kernels_pad_constant_backward_f16_run
Pad-constant backward (slice), f16.
baracuda_kernels_pad_constant_backward_f32_can_implement
Implementability check for pad_constant_backward_f32.
baracuda_kernels_pad_constant_backward_f32_run
Pad-constant backward (slice), f32.
baracuda_kernels_pad_constant_backward_f64_can_implement
Implementability check for pad_constant_backward_f64.
baracuda_kernels_pad_constant_backward_f64_run
Pad-constant backward (slice), f64.
baracuda_kernels_pad_constant_bf16_can_implement
Implementability check for pad_constant_bf16.
baracuda_kernels_pad_constant_bf16_run
Pad with a constant value, bf16, contig output. The value argument carries the __nv_bfloat16 bit pattern as u16 — Rust callers can produce it via half::bf16::to_bits().
baracuda_kernels_pad_constant_f16_can_implement
Implementability check for pad_constant_f16.
baracuda_kernels_pad_constant_f16_run
Pad with a constant value, f16, contig output. The value argument carries the __half bit pattern as u16 — Rust callers can produce it via half::f16::to_bits(). ABI-compatible because __half is a 2-byte __CUDA_ALIGN__(2) POD struct passed in the same register slot as unsigned short.
baracuda_kernels_pad_constant_f32_can_implement
Implementability check for pad_constant_f32.
baracuda_kernels_pad_constant_f32_run
Pad with a constant value, f32, contig output.
baracuda_kernels_pad_constant_f64_can_implement
Implementability check for pad_constant_f64.
baracuda_kernels_pad_constant_f64_run
Pad with a constant value, f64, contig output.
baracuda_kernels_pad_reflect_bf16_can_implement
Implementability check for pad_reflect_bf16.
baracuda_kernels_pad_reflect_bf16_run
Pad reflect, bf16.
baracuda_kernels_pad_reflect_f16_can_implement
Implementability check for pad_reflect_f16.
baracuda_kernels_pad_reflect_f16_run
Pad reflect, f16.
baracuda_kernels_pad_reflect_f32_can_implement
Implementability check for pad_reflect_f32.
baracuda_kernels_pad_reflect_f32_run
Pad reflect, f32. Mirror input across the boundary (no edge duplication).
baracuda_kernels_pad_reflect_f64_can_implement
Implementability check for pad_reflect_f64.
baracuda_kernels_pad_reflect_f64_run
Pad reflect, f64.
baracuda_kernels_pad_replicate_bf16_can_implement
Implementability check for baracuda_kernels_pad_replicate_bf16. Host-side only.
baracuda_kernels_pad_replicate_bf16_run
Pad replicate, bf16.
baracuda_kernels_pad_replicate_f16_can_implement
Implementability check for baracuda_kernels_pad_replicate_f16. Host-side only.
baracuda_kernels_pad_replicate_f16_run
Pad replicate, f16.
baracuda_kernels_pad_replicate_f32_can_implement
Implementability check for baracuda_kernels_pad_replicate_f32. Host-side only.
baracuda_kernels_pad_replicate_f32_run
Pad replicate, f32. Clamp to the edge value of the input.
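The circular, reflect, and replicate modes differ only in how an out-of-range position is mapped back into the input. The following one-axis sketch shows the three mappings as described in these entries (cyclic wrap, mirror without edge duplication, clamp to the edge). `PadMode`, `source_index`, and `pad_1d` are hypothetical names, and the kernels' actual argument layout is not shown here.

```rust
#[derive(Clone, Copy)]
enum PadMode {
    Circular,
    Reflect,
    Replicate,
}

/// Maps a position in the padded output to the input index it reads from,
/// for a single axis of extent `n` (n >= 1) with `pad_left` leading pad cells.
fn source_index(out_pos: usize, pad_left: usize, n: usize, mode: PadMode) -> usize {
    let n = n as i64;
    let i = out_pos as i64 - pad_left as i64; // may be negative or >= n
    let src = match mode {
        // Cyclic wrap from the opposite end of the axis.
        PadMode::Circular => i.rem_euclid(n),
        // Mirror across the boundary without repeating the edge element.
        // Only valid when each pad width is smaller than n.
        PadMode::Reflect => {
            if i < 0 {
                -i
            } else if i >= n {
                2 * (n - 1) - i
            } else {
                i
            }
        }
        // Clamp to the nearest edge element.
        PadMode::Replicate => i.clamp(0, n - 1),
    };
    src as usize
}

/// Pads a 1-D slice by gathering through `source_index`.
fn pad_1d(x: &[f32], pad_left: usize, pad_right: usize, mode: PadMode) -> Vec<f32> {
    (0..pad_left + x.len() + pad_right)
        .map(|p| x[source_index(p, pad_left, x.len(), mode)])
        .collect()
}
```

For a multi-axis tensor the same mapping would be applied independently per axis.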
baracuda_kernels_pad_replicate_f64_can_implement
Implementability check for baracuda_kernels_pad_replicate_f64. Host-side only.
baracuda_kernels_pad_replicate_f64_run
Pad replicate, f64.
baracuda_kernels_permute_bf16_can_implement
Pre-launch implementability check for permute_bf16.
baracuda_kernels_permute_bf16_run
Materialized permute, bf16. Pure element copy — no math.
baracuda_kernels_permute_bf16_strided_can_implement
Implementability check for permute_bf16_strided.
baracuda_kernels_permute_bf16_strided_run
Permute strided sibling, bf16.
baracuda_kernels_permute_f16_can_implement
Pre-launch implementability check for permute_f16.
baracuda_kernels_permute_f16_run
Materialized permute, f16. Pure element copy — no math.
baracuda_kernels_permute_f16_strided_can_implement
Implementability check for permute_f16_strided.
baracuda_kernels_permute_f16_strided_run
Permute strided sibling, f16.
baracuda_kernels_permute_f32_can_implement
Pre-launch implementability check for permute_f32.
baracuda_kernels_permute_f32_run
Materialized permute, f32.
baracuda_kernels_permute_f32_strided_can_implement
Implementability check for permute_f32_strided.
baracuda_kernels_permute_f32_strided_run
Permute strided sibling, f32.
baracuda_kernels_permute_f64_can_implement
Pre-launch implementability check for permute_f64.
baracuda_kernels_permute_f64_run
Materialized permute, f64. Pure element copy — no math.
baracuda_kernels_permute_f64_strided_can_implement
Implementability check for permute_f64_strided.
baracuda_kernels_permute_f64_strided_run
Permute strided sibling, f64.
baracuda_kernels_pixel_shuffle_bf16_can_implement
Implementability check for pixel_shuffle_bf16.
baracuda_kernels_pixel_shuffle_bf16_run
pixel_shuffle, bf16. Safety: same requirements as the f32 variant.
baracuda_kernels_pixel_shuffle_f16_can_implement
Implementability check for pixel_shuffle_f16.
baracuda_kernels_pixel_shuffle_f16_run
pixel_shuffle, f16. Safety: same requirements as the f32 variant.
baracuda_kernels_pixel_shuffle_f32_can_implement
Implementability check for pixel_shuffle_f32.
baracuda_kernels_pixel_shuffle_f32_run
pixel_shuffle(x, r): [N, C·r², H, W] → [N, C, H·r, W·r]. f32. Safety: see the crate-level requirements.
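The shape change can be made concrete with a host-side reference. This sketch assumes contiguous row-major storage and the common convention out[n, c, h·r + i, w·r + j] = in[n, c·r² + i·r + j, h, w]. The listing gives only the shapes, so that channel ordering is an assumption, and `pixel_shuffle_reference` is a hypothetical name.

```rust
/// Hypothetical reference: rearranges [n, c*r*r, h, w] into [n, c, h*r, w*r].
/// Assumes contiguous row-major data and the channel ordering
/// in_channel = c*r*r + i*r + j (an assumption, see the note above).
fn pixel_shuffle_reference(
    x: &[f32],
    n: usize,
    c: usize,
    h: usize,
    w: usize,
    r: usize,
) -> Vec<f32> {
    assert_eq!(x.len(), n * c * r * r * h * w);
    let (oh, ow) = (h * r, w * r);
    let mut out = vec![0.0f32; x.len()];
    for b in 0..n {
        for ch in 0..c {
            for y in 0..oh {
                for xo in 0..ow {
                    // Sub-pixel offsets inside the r x r block.
                    let (i, j) = (y % r, xo % r);
                    let in_ch = ch * r * r + i * r + j;
                    let src = ((b * c * r * r + in_ch) * h + y / r) * w + xo / r;
                    let dst = ((b * c + ch) * oh + y) * ow + xo;
                    out[dst] = x[src];
                }
            }
        }
    }
    out
}
```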
baracuda_kernels_pixel_shuffle_f64_can_implement
Implementability check for pixel_shuffle_f64.
baracuda_kernels_pixel_shuffle_f64_run
pixel_shuffle, f64. Safety: same requirements as the f32 variant.
baracuda_kernels_pixel_unshuffle_bf16_can_implement
Implementability check for pixel_unshuffle_bf16.
baracuda_kernels_pixel_unshuffle_bf16_run
pixel_unshuffle, bf16. Safety: same requirements as the f32 variant.
baracuda_kernels_pixel_unshuffle_f16_can_implement
Implementability check for pixel_unshuffle_f16.
baracuda_kernels_pixel_unshuffle_f16_run
pixel_unshuffle, f16. Safety: same requirements as the f32 variant.
baracuda_kernels_pixel_unshuffle_f32_can_implement
Implementability check for pixel_unshuffle_f32.
baracuda_kernels_pixel_unshuffle_f32_run
pixel_unshuffle(x, r): [N, C, H·r, W·r] → [N, C·r², H, W]. Inverse of pixel_shuffle; each op is the other's backward pass. f32.
baracuda_kernels_pixel_unshuffle_f64_can_implement
Implementability check for pixel_unshuffle_f64.
baracuda_kernels_pixel_unshuffle_f64_run
pixel_unshuffle, f64. Safety: same requirements as the f32 variant.
baracuda_kernels_prelu_backward_bf16_can_implement
Implementability check for baracuda_kernels_prelu_backward_bf16. Host-side only.
baracuda_kernels_prelu_backward_bf16_run
PReLU BW, bf16.
baracuda_kernels_prelu_backward_f16_can_implement
Implementability check for baracuda_kernels_prelu_backward_f16. Host-side only.
baracuda_kernels_prelu_backward_f16_run
PReLU BW, f16.
baracuda_kernels_prelu_backward_f32_can_implement
Implementability check for baracuda_kernels_prelu_backward_f32. Host-side only.
baracuda_kernels_prelu_backward_f32_run
PReLU BW, f32. ABI: (numel, channel_stride, channel_extent, scalar_weight, dy, x, weight, dx, dweight, workspace, workspace_bytes, stream).
baracuda_kernels_prelu_backward_f64_can_implement
Implementability check for baracuda_kernels_prelu_backward_f64. Host-side only.
baracuda_kernels_prelu_backward_f64_run
PReLU BW, f64.
baracuda_kernels_prelu_bf16_can_implement
Implementability check for prelu_bf16.
baracuda_kernels_prelu_bf16_run
PReLU FW, bf16.
baracuda_kernels_prelu_f16_can_implement
Implementability check for prelu_f16.
baracuda_kernels_prelu_f16_run
PReLU FW, f16.
baracuda_kernels_prelu_f32_can_implement
Implementability check for prelu_f32.
baracuda_kernels_prelu_f32_run
PReLU FW, f32. ABI: (numel, channel_stride, channel_extent, scalar_weight, x, weight, y, workspace, workspace_bytes, stream).
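As a reading aid for the ABI above, here is a host-side sketch of PReLU, y = x for x > 0 and weight · x otherwise. The parameter names come from the ABI, but their semantics are assumptions: the sketch takes the channel of flat element i to be (i / channel_stride) % channel_extent, and treats `scalar_weight` as a flag meaning one shared weight. `prelu_reference` itself is a hypothetical name.

```rust
/// Hypothetical reference for the PReLU forward pass on contiguous data.
/// Assumptions (not confirmed by the listing):
///   * channel index of element i is (i / channel_stride) % channel_extent;
///   * scalar_weight == true means weight[0] is shared by every element.
fn prelu_reference(
    x: &[f32],
    weight: &[f32],
    channel_stride: usize,
    channel_extent: usize,
    scalar_weight: bool,
) -> Vec<f32> {
    x.iter()
        .enumerate()
        .map(|(i, &v)| {
            let a = if scalar_weight {
                weight[0]
            } else {
                weight[(i / channel_stride) % channel_extent]
            };
            // Positive inputs pass through; the rest are scaled by the slope.
            if v > 0.0 { v } else { a * v }
        })
        .collect()
}
```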
baracuda_kernels_prelu_f64_can_implement
Implementability check for prelu_f64.
baracuda_kernels_prelu_f64_run
PReLU FW, f64.
baracuda_kernels_qr_f32_run
QR factorization (packed Householder output, m >= n required). a_inout is overwritten with R (upper triangle) + Householder reflectors (strict lower); tau_out is [min(m, n)].
baracuda_kernels_qr_f32_workspace_size
QR factorization workspace size in bytes for geqrf.
baracuda_kernels_qr_f64_run
QR factorization (packed Householder output, m >= n required). a_inout is overwritten with R (upper triangle) + Householder reflectors (strict lower); tau_out is [min(m, n)].
baracuda_kernels_qr_f64_workspace_size
QR factorization workspace size in bytes for geqrf.
baracuda_kernels_quantize_per_channel_backward_bf16_can_implement
Implementability check for quantize_per_channel_backward_bf16.
baracuda_kernels_quantize_per_channel_backward_bf16_run
quantize_per_channel_backward — bf16.
baracuda_kernels_quantize_per_channel_backward_f16_can_implement
Implementability check for quantize_per_channel_backward_f16.
baracuda_kernels_quantize_per_channel_backward_f16_run
quantize_per_channel_backward — f16.
baracuda_kernels_quantize_per_channel_backward_f32_can_implement
Implementability check for quantize_per_channel_backward_f32.
baracuda_kernels_quantize_per_channel_backward_f32_run
dx[i] = (dy[i] / scale[c]) * in_range_mask(x[i]). f32.
baracuda_kernels_quantize_per_channel_backward_f64_can_implement
Implementability check for quantize_per_channel_backward_f64.
baracuda_kernels_quantize_per_channel_backward_f64_run
quantize_per_channel_backward — f64.
baracuda_kernels_quantize_per_channel_bf16_s8_can_implement
Implementability check for quantize_per_channel_bf16_s8.
baracuda_kernels_quantize_per_channel_bf16_s8_run
quantize_per_channel — bf16 → s8.
baracuda_kernels_quantize_per_channel_bf16_u8_can_implement
Implementability check for quantize_per_channel_bf16_u8.
baracuda_kernels_quantize_per_channel_bf16_u8_run
quantize_per_channel — bf16 → u8.
baracuda_kernels_quantize_per_channel_f16_s8_can_implement
Implementability check for quantize_per_channel_f16_s8.
baracuda_kernels_quantize_per_channel_f16_s8_run
quantize_per_channel — f16 → s8.
baracuda_kernels_quantize_per_channel_f16_u8_can_implement
Implementability check for quantize_per_channel_f16_u8.
baracuda_kernels_quantize_per_channel_f16_u8_run
quantize_per_channel — f16 → u8.
baracuda_kernels_quantize_per_channel_f32_s8_can_implement
Implementability check for quantize_per_channel_f32_s8.
baracuda_kernels_quantize_per_channel_f32_s8_run
q[i] = clamp(round(x[i]/scale[c])+zp[c], qmin, qmax) where c = coord[axis]. f32 → s8.
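A host-side sketch of the formula above, for a contiguous row-major tensor, where c = coord[axis] reduces to (i / inner) % extent with inner the product of the dimensions after `axis`. Several details are assumptions: `quantize_per_channel_s8` is a hypothetical name, qmin and qmax are taken as the full i8 range, and `f32::round` (half away from zero) stands in for a rounding mode the listing does not specify.

```rust
/// Hypothetical reference: per-channel affine quantization, f32 to i8.
/// Assumptions: contiguous row-major layout, qmin/qmax = i8::MIN/i8::MAX,
/// and round-half-away-from-zero (the kernel's rounding mode is not stated).
fn quantize_per_channel_s8(
    x: &[f32],
    shape: &[usize],
    axis: usize,
    scale: &[f32],
    zp: &[i32],
) -> Vec<i8> {
    // Number of elements spanned by one step along `axis`.
    let inner: usize = shape[axis + 1..].iter().product();
    let extent = shape[axis];
    x.iter()
        .enumerate()
        .map(|(i, &v)| {
            let c = (i / inner) % extent; // c = coord[axis]
            let q = ((v / scale[c]).round() as i32).saturating_add(zp[c]);
            q.clamp(i8::MIN as i32, i8::MAX as i32) as i8
        })
        .collect()
}
```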
baracuda_kernels_quantize_per_channel_f32_u8_can_implement
Implementability check for quantize_per_channel_f32_u8.
baracuda_kernels_quantize_per_channel_f32_u8_run
quantize_per_channel — f32 → u8.
baracuda_kernels_quantize_per_channel_f64_s8_can_implement
Implementability check for quantize_per_channel_f64_s8.
baracuda_kernels_quantize_per_channel_f64_s8_run
quantize_per_channel — f64 → s8.
baracuda_kernels_quantize_per_channel_f64_u8_can_implement
Implementability check for quantize_per_channel_f64_u8.
baracuda_kernels_quantize_per_channel_f64_u8_run
quantize_per_channel — f64 → u8.
baracuda_kernels_quantize_per_group_backward_bf16_can_implement
Implementability check for quantize_per_group_backward_bf16.
baracuda_kernels_quantize_per_group_backward_bf16_run
quantize_per_group_backward (straight-through estimator, STE), bf16.
baracuda_kernels_quantize_per_group_backward_f16_can_implement
Implementability check for quantize_per_group_backward_f16.
baracuda_kernels_quantize_per_group_backward_f16_run
STE BW — f16.
baracuda_kernels_quantize_per_group_backward_f32_can_implement
Implementability check for quantize_per_group_backward_f32.
baracuda_kernels_quantize_per_group_backward_f32_run
STE BW — f32.
baracuda_kernels_quantize_per_group_backward_f64_can_implement
Implementability check for quantize_per_group_backward_f64.
baracuda_kernels_quantize_per_group_backward_f64_run
STE BW — f64.
baracuda_kernels_quantize_per_group_bf16_s8_can_implement
Implementability check for quantize_per_group_bf16_s8.
baracuda_kernels_quantize_per_group_bf16_s8_run
quantize_per_group — bf16 → s8.
baracuda_kernels_quantize_per_group_bf16_u8_can_implement
Implementability check for quantize_per_group_bf16_u8.
baracuda_kernels_quantize_per_group_bf16_u8_run
quantize_per_group — bf16 → u8.
baracuda_kernels_quantize_per_group_f16_s8_can_implement
Implementability check for quantize_per_group_f16_s8.
baracuda_kernels_quantize_per_group_f16_s8_run
quantize_per_group — f16 → s8.
baracuda_kernels_quantize_per_group_f16_u8_can_implement
Implementability check for quantize_per_group_f16_u8.
baracuda_kernels_quantize_per_group_f16_u8_run
quantize_per_group — f16 → u8.
baracuda_kernels_quantize_per_group_f32_s8_can_implement
Implementability check for quantize_per_group_f32_s8.
baracuda_kernels_quantize_per_group_f32_s8_run
quantize_per_group — f32 → s8.
baracuda_kernels_quantize_per_group_f32_u8_can_implement
Implementability check for quantize_per_group_f32_u8.
baracuda_kernels_quantize_per_group_f32_u8_run
quantize_per_group — f32 → u8.
baracuda_kernels_quantize_per_group_f64_s8_can_implement
Implementability check for quantize_per_group_f64_s8.
baracuda_kernels_quantize_per_group_f64_s8_run
quantize_per_group — f64 → s8.
baracuda_kernels_quantize_per_group_f64_u8_can_implement
Implementability check for quantize_per_group_f64_u8.
baracuda_kernels_quantize_per_group_f64_u8_run
quantize_per_group — f64 → u8.
baracuda_kernels_quantize_per_tensor_backward_bf16_can_implement
Implementability check for quantize_per_tensor_backward_bf16.
baracuda_kernels_quantize_per_tensor_backward_bf16_run
quantize_per_tensor_backward — bf16.
baracuda_kernels_quantize_per_tensor_backward_f16_can_implement
Implementability check for quantize_per_tensor_backward_f16.
baracuda_kernels_quantize_per_tensor_backward_f16_run
quantize_per_tensor_backward — f16.
baracuda_kernels_quantize_per_tensor_backward_f32_can_implement
Implementability check for quantize_per_tensor_backward_f32.
baracuda_kernels_quantize_per_tensor_backward_f32_run
dx = (dy / scale) * in_range_mask(x). f32.
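The backward formula can be sketched on the host as below. The listing does not define `in_range_mask`. This sketch assumes it is 1 where the quantized value round(x / scale) + zp lies within [qmin, qmax] inclusive and 0 where the forward pass would have clamped. `quantize_per_tensor_backward_reference` and its parameter list are hypothetical.

```rust
/// Hypothetical reference for dx = (dy / scale) * in_range_mask(x).
/// Assumption: the mask is 1 when round(x / scale) + zp falls inside
/// [qmin, qmax] (inclusive), i.e. when the forward pass did not clamp.
fn quantize_per_tensor_backward_reference(
    dy: &[f32],
    x: &[f32],
    scale: f32,
    zp: i32,
    qmin: i32,
    qmax: i32,
) -> Vec<f32> {
    dy.iter()
        .zip(x)
        .map(|(&g, &v)| {
            let q = ((v / scale).round() as i32).saturating_add(zp);
            // Clamped elements receive no gradient.
            if q >= qmin && q <= qmax { g / scale } else { 0.0 }
        })
        .collect()
}
```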
baracuda_kernels_quantize_per_tensor_backward_f64_can_implement
Implementability check for quantize_per_tensor_backward_f64.
baracuda_kernels_quantize_per_tensor_backward_f64_run
quantize_per_tensor_backward — f64 (f64 scale).
baracuda_kernels_quantize_per_tensor_bf16_s8_can_implement
Implementability check for quantize_per_tensor_bf16_s8.
baracuda_kernels_quantize_per_tensor_bf16_s8_run
quantize_per_tensor — bf16 → s8.
baracuda_kernels_quantize_per_tensor_bf16_u8_can_implement
Implementability check for quantize_per_tensor_bf16_u8.
baracuda_kernels_quantize_per_tensor_bf16_u8_run
quantize_per_tensor — bf16 → u8.
baracuda_kernels_quantize_per_tensor_f16_s8_can_implement
Implementability check for quantize_per_tensor_f16_s8.
baracuda_kernels_quantize_per_tensor_f16_s8_run
quantize_per_tensor — f16 → s8.
baracuda_kernels_quantize_per_tensor_f16_u8_can_implement
Implementability check for quantize_per_tensor_f16_u8.
baracuda_kernels_quantize_per_tensor_f16_u8_run
quantize_per_tensor — f16 → u8.
baracuda_kernels_quantize_per_tensor_f32_s8_can_implement
Implementability check for quantize_per_tensor_f32_s8.
baracuda_kernels_quantize_per_tensor_f32_s8_run
q = clamp(round(x/scale)+zp, qmin, qmax). f32 input, s8 output.
baracuda_kernels_quantize_per_tensor_f32_u8_can_implement
Implementability check for quantize_per_tensor_f32_u8.
baracuda_kernels_quantize_per_tensor_f32_u8_run
quantize_per_tensor — f32 → u8.
baracuda_kernels_quantize_per_tensor_f64_s8_can_implement
Implementability check for quantize_per_tensor_f64_s8.
baracuda_kernels_quantize_per_tensor_f64_s8_run
quantize_per_tensor — f64 → s8 (f64 scale).
baracuda_kernels_quantize_per_tensor_f64_u8_can_implement
Implementability check for quantize_per_tensor_f64_u8.
baracuda_kernels_quantize_per_tensor_f64_u8_run
quantize_per_tensor — f64 → u8 (f64 scale).
baracuda_kernels_quantize_per_token_backward_bf16_can_implement
Implementability check for quantize_per_token_backward_bf16.
baracuda_kernels_quantize_per_token_backward_bf16_run
quantize_per_token_backward (straight-through estimator, STE), bf16.
baracuda_kernels_quantize_per_token_backward_f16_can_implement
Implementability check for quantize_per_token_backward_f16.
baracuda_kernels_quantize_per_token_backward_f16_run
STE backward — f16.
baracuda_kernels_quantize_per_token_backward_f32_can_implement
Implementability check for quantize_per_token_backward_f32.
baracuda_kernels_quantize_per_token_backward_f32_run
STE backward — f32.
baracuda_kernels_quantize_per_token_backward_f64_can_implement
Implementability check for quantize_per_token_backward_f64.
baracuda_kernels_quantize_per_token_backward_f64_run
STE backward — f64.
baracuda_kernels_quantize_per_token_bf16_s8_can_implement
Implementability check for quantize_per_token_bf16_s8.
baracuda_kernels_quantize_per_token_bf16_s8_run
quantize_per_token — bf16 → s8.
baracuda_kernels_quantize_per_token_bf16_u8_can_implement
Implementability check for quantize_per_token_bf16_u8.
baracuda_kernels_quantize_per_token_bf16_u8_run
quantize_per_token — bf16 → u8.
baracuda_kernels_quantize_per_token_f16_s8_can_implement
Implementability check for quantize_per_token_f16_s8.
baracuda_kernels_quantize_per_token_f16_s8_run
quantize_per_token — f16 → s8.
baracuda_kernels_quantize_per_token_f16_u8_can_implement
Implementability check for quantize_per_token_f16_u8.
baracuda_kernels_quantize_per_token_f16_u8_run
quantize_per_token — f16 → u8.
baracuda_kernels_quantize_per_token_f32_s8_can_implement
Implementability check for quantize_per_token_f32_s8.
baracuda_kernels_quantize_per_token_f32_s8_run
quantize_per_token, f32 → s8. Status codes as in the crate-level table.
baracuda_kernels_quantize_per_token_f32_u8_can_implement
Implementability check for quantize_per_token_f32_u8.
baracuda_kernels_quantize_per_token_f32_u8_run
quantize_per_token — f32 → u8.
baracuda_kernels_quantize_per_token_f64_s8_can_implement
Implementability check for quantize_per_token_f64_s8.
baracuda_kernels_quantize_per_token_f64_s8_run
quantize_per_token — f64 → s8.
baracuda_kernels_quantize_per_token_f64_u8_can_implement
Implementability check for quantize_per_token_f64_u8.
baracuda_kernels_quantize_per_token_f64_u8_run
quantize_per_token — f64 → u8.
baracuda_kernels_quantize_q8_1_bf16_can_implement
Implementability check for quantize_q8_1_bf16.
baracuda_kernels_quantize_q8_1_bf16_run
Q8_1 activation staging, bf16 source. Safety: same requirements as the f32 variant.
baracuda_kernels_quantize_q8_1_f16_can_implement
Implementability check for quantize_q8_1_f16.
baracuda_kernels_quantize_q8_1_f16_run
Q8_1 activation staging, f16 source. Safety: same requirements as the f32 variant.
baracuda_kernels_quantize_q8_1_f32_can_implement
Implementability check for quantize_q8_1_f32.
baracuda_kernels_quantize_q8_1_f32_run
Q8_1 activation staging — f32 source.
baracuda_kernels_quantize_q8_1_workspace_bytes
Returns the workspace size in bytes needed to stage ny × kx activations into Q8_1: ny * ceil(kx / 32) * 36. Returns 0 on invalid (non-positive) arguments.
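The size formula is easy to mirror on the host, for example to pre-size an allocation before calling into the FFI. This is a sketch: `q8_1_workspace_bytes` is a hypothetical stand-in, and the `i64` argument types are an assumption because the real signature is not shown in this listing.

```rust
/// Hypothetical host-side mirror of the documented formula:
/// ny * ceil(kx / 32) * 36, or 0 when either argument is non-positive.
/// Assumption: arguments are signed integers (the real types are not shown).
fn q8_1_workspace_bytes(ny: i64, kx: i64) -> usize {
    if ny <= 0 || kx <= 0 {
        return 0;
    }
    // Each row is split into blocks of 32 activations, rounded up.
    let blocks_per_row = (kx as usize + 31) / 32;
    ny as usize * blocks_per_row * 36
}
```

For example, a row of 33 activations needs two blocks, so two such rows take 2 · 2 · 36 = 144 bytes.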
baracuda_kernels_quantized_linear_w8a8_f32_can_implement
Implementability check for quantized_linear_w8a8_f32.
baracuda_kernels_quantized_linear_w8a8_f32_run
quantized_linear_w8a8 — TIn = f32.
baracuda_kernels_quantized_linear_w8a8_f64_can_implement
Implementability check for quantized_linear_w8a8_f64.
baracuda_kernels_quantized_linear_w8a8_f64_run
quantized_linear_w8a8 — TIn = f64.
baracuda_kernels_reduce_all_bf16_can_implement
Pre-launch implementability check for reduce_all_bf16.
baracuda_kernels_reduce_all_bf16_run
all(x, axis=k) with bf16 input, uint8_t Bool output.
baracuda_kernels_reduce_all_bool_can_implement
Pre-launch implementability check for reduce_all_bool.
baracuda_kernels_reduce_all_bool_run
all(x, axis=k) with Bool (uint8_t) input, uint8_t Bool output.
baracuda_kernels_reduce_all_f16_can_implement
Pre-launch implementability check for reduce_all_f16.
baracuda_kernels_reduce_all_f16_run
all(x, axis=k) with f16 input, uint8_t Bool output.
baracuda_kernels_reduce_all_f32_can_implement
Pre-launch implementability check for reduce_all_f32.
baracuda_kernels_reduce_all_f32_run
all(x, axis=k) with f32 input, uint8_t Bool output.
baracuda_kernels_reduce_all_f64_can_implement
Pre-launch implementability check for reduce_all_f64.
baracuda_kernels_reduce_all_f64_run
all(x, axis=k) with f64 input, uint8_t Bool output.
baracuda_kernels_reduce_all_i32_can_implement
Pre-launch implementability check for reduce_all_i32.
baracuda_kernels_reduce_all_i32_run
all(x, axis=k) with i32 input, uint8_t Bool output.
baracuda_kernels_reduce_all_i64_can_implement
Pre-launch implementability check for reduce_all_i64.
baracuda_kernels_reduce_all_i64_run
all(x, axis=k) with i64 input, uint8_t Bool output.
baracuda_kernels_reduce_any_bf16_can_implement
Pre-launch implementability check for reduce_any_bf16.
baracuda_kernels_reduce_any_bf16_run
any(x, axis=k) with bf16 input, uint8_t Bool output.
baracuda_kernels_reduce_any_bool_can_implement
Pre-launch implementability check for reduce_any_bool.
baracuda_kernels_reduce_any_bool_run
any(x, axis=k) with Bool (uint8_t) input, uint8_t Bool output.
baracuda_kernels_reduce_any_f16_can_implement
Pre-launch implementability check for reduce_any_f16.
baracuda_kernels_reduce_any_f16_run
any(x, axis=k) with f16 input, uint8_t Bool output.
baracuda_kernels_reduce_any_f32_can_implement
Pre-launch implementability check for reduce_any_f32.
baracuda_kernels_reduce_any_f32_run
any(x, axis=k) with f32 input, uint8_t Bool output.
baracuda_kernels_reduce_any_f64_can_implement
Pre-launch implementability check for reduce_any_f64.
baracuda_kernels_reduce_any_f64_run
any(x, axis=k) with f64 input, uint8_t Bool output.
baracuda_kernels_reduce_any_i32_can_implement
Pre-launch implementability check for reduce_any_i32.
baracuda_kernels_reduce_any_i32_run
any(x, axis=k) with i32 input, uint8_t Bool output.
baracuda_kernels_reduce_any_i64_can_implement
Pre-launch implementability check for reduce_any_i64.
baracuda_kernels_reduce_any_i64_run
any(x, axis=k) with i64 input, uint8_t Bool output.
baracuda_kernels_reduce_count_nonzero_bf16_can_implement
Pre-launch implementability check for reduce_count_nonzero_bf16.
baracuda_kernels_reduce_count_nonzero_bf16_run
count_nonzero(x, axis=k) with bf16 input, i64 output.
baracuda_kernels_reduce_count_nonzero_bool_can_implement
Pre-launch implementability check for reduce_count_nonzero_bool.
baracuda_kernels_reduce_count_nonzero_bool_run
count_nonzero(x, axis=k) with Bool (uint8_t) input, i64 output.
baracuda_kernels_reduce_count_nonzero_f16_can_implement
Pre-launch implementability check for reduce_count_nonzero_f16.
baracuda_kernels_reduce_count_nonzero_f16_run
count_nonzero(x, axis=k) with f16 input, i64 output.
baracuda_kernels_reduce_count_nonzero_f32_can_implement
Pre-launch implementability check for reduce_count_nonzero_f32.
baracuda_kernels_reduce_count_nonzero_f32_run
count_nonzero(x, axis=k) with f32 input, i64 output.
baracuda_kernels_reduce_count_nonzero_f64_can_implement
Pre-launch implementability check for reduce_count_nonzero_f64.
baracuda_kernels_reduce_count_nonzero_f64_run
count_nonzero(x, axis=k) with f64 input, i64 output.
baracuda_kernels_reduce_count_nonzero_i32_can_implement
Pre-launch implementability check for reduce_count_nonzero_i32.
baracuda_kernels_reduce_count_nonzero_i32_run
count_nonzero(x, axis=k) with i32 input, i64 output.
baracuda_kernels_reduce_count_nonzero_i64_can_implement
Pre-launch implementability check for reduce_count_nonzero_i64.
baracuda_kernels_reduce_count_nonzero_i64_run
count_nonzero(x, axis=k) with i64 input, i64 output.
baracuda_kernels_reduce_logsumexp_backward_bf16_can_implement
Pre-launch implementability check for reduce_logsumexp_backward_bf16.
baracuda_kernels_reduce_logsumexp_backward_bf16_run
LogSumExp reduction backward, bf16.
baracuda_kernels_reduce_logsumexp_backward_f16_can_implement
Pre-launch implementability check for reduce_logsumexp_backward_f16.
baracuda_kernels_reduce_logsumexp_backward_f16_run
LogSumExp reduction backward, f16.
baracuda_kernels_reduce_logsumexp_backward_f32_can_implement
Pre-launch implementability check for reduce_logsumexp_backward_f32.
baracuda_kernels_reduce_logsumexp_backward_f32_run
LogSumExp reduction backward, f32.
baracuda_kernels_reduce_logsumexp_backward_f64_can_implement
Pre-launch implementability check for reduce_logsumexp_backward_f64.
baracuda_kernels_reduce_logsumexp_backward_f64_run
LogSumExp reduction backward, f64.
baracuda_kernels_reduce_logsumexp_bf16_can_implement
Implementability check for baracuda_kernels_reduce_logsumexp_bf16. Host-side only.
baracuda_kernels_reduce_logsumexp_bf16_run
LogSumExp reduction along one axis, bf16 (f32-detour throughout).
baracuda_kernels_reduce_logsumexp_f16_can_implement
Implementability check for baracuda_kernels_reduce_logsumexp_f16. Host-side only.
baracuda_kernels_reduce_logsumexp_f16_run
LogSumExp reduction along one axis, f16 (f32-detour throughout).
baracuda_kernels_reduce_logsumexp_f32_can_implement
Implementability check for baracuda_kernels_reduce_logsumexp_f32. Host-side only.
baracuda_kernels_reduce_logsumexp_f32_run
LogSumExp reduction along one axis, f32 — numerically stable two-pass max-then-sum-exp. Shares the simple-reduce parameter shape so the Rust dispatcher can reach it through the same FFI signature; the kernel internally performs two passes over the reduce axis.
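The two-pass scheme can be sketched for a single reduce row as follows. This is a host-side illustration, not the kernel: `logsumexp` is a hypothetical name, and returning negative infinity for an empty row is an assumption, as the listing does not say what the kernel does in that case.

```rust
/// Hypothetical reference: numerically stable log(sum(exp(x))) over one row.
fn logsumexp(x: &[f32]) -> f32 {
    // Pass 1: find the maximum so the exponentials cannot overflow.
    let m = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    if m.is_infinite() {
        // Empty or all -inf rows give -inf; a +inf element gives +inf.
        // (The empty-row result is an assumption, see the note above.)
        return m;
    }
    // Pass 2: sum exp(x - max), where every term is at most 1.
    let s: f32 = x.iter().map(|&v| (v - m).exp()).sum();
    m + s.ln()
}
```

Subtracting the maximum first is what keeps inputs such as 1000.0 from overflowing `exp` in f32.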
baracuda_kernels_reduce_logsumexp_f64_can_implement
Implementability check for baracuda_kernels_reduce_logsumexp_f64. Host-side only.
baracuda_kernels_reduce_logsumexp_f64_run
LogSumExp reduction along one axis, f64.
baracuda_kernels_reduce_max_bf16_can_implement
Pre-launch implementability check for reduce_max_bf16.
baracuda_kernels_reduce_max_bf16_run
Max reduction along one axis, bf16 (f32-detour fmaxf).
baracuda_kernels_reduce_max_f16_can_implement
Pre-launch implementability check for reduce_max_f16.
baracuda_kernels_reduce_max_f16_run
Max reduction along one axis, f16 (f32-detour fmaxf).
baracuda_kernels_reduce_max_f32_can_implement
Pre-launch implementability check for reduce_max_f32.
baracuda_kernels_reduce_max_f32_run
Max reduction along one axis, f32. init = -INFINITY, fmaxf.
baracuda_kernels_reduce_max_f64_can_implement
Pre-launch implementability check for reduce_max_f64.
baracuda_kernels_reduce_max_f64_run
Max reduction along one axis, f64.
baracuda_kernels_reduce_max_i8_can_implement
Pre-launch implementability check for reduce_max_i8.
baracuda_kernels_reduce_max_i8_run
max(x, axis=k) with i8 input/output (init = INT8_MIN).
baracuda_kernels_reduce_max_i16_can_implement
Pre-launch implementability check for reduce_max_i16.
baracuda_kernels_reduce_max_i16_run
max(x, axis=k) with i16 input/output (init = INT16_MIN).
baracuda_kernels_reduce_max_i32_can_implement
Pre-launch implementability check for reduce_max_i32.
baracuda_kernels_reduce_max_i32_run
max(x, axis=k) with i32 input/output (init = INT32_MIN).
baracuda_kernels_reduce_max_i64_can_implement
Pre-launch implementability check for reduce_max_i64.
baracuda_kernels_reduce_max_i64_run
max(x, axis=k) with i64 input/output (init = INT64_MIN).
baracuda_kernels_reduce_max_min_backward_bf16_can_implement
Pre-launch implementability check for reduce_max_min_backward_bf16.
baracuda_kernels_reduce_max_min_backward_bf16_run
Max/Min reduction backward, bf16.
baracuda_kernels_reduce_max_min_backward_f16_can_implement
Pre-launch implementability check for reduce_max_min_backward_f16.
baracuda_kernels_reduce_max_min_backward_f16_run
Max/Min reduction backward, f16.
baracuda_kernels_reduce_max_min_backward_f32_can_implement
Pre-launch implementability check for reduce_max_min_backward_f32.
baracuda_kernels_reduce_max_min_backward_f32_run
Max/Min reduction backward, f32.
baracuda_kernels_reduce_max_min_backward_f64_can_implement
Pre-launch implementability check for reduce_max_min_backward_f64.
baracuda_kernels_reduce_max_min_backward_f64_run
Max/Min reduction backward, f64.
baracuda_kernels_reduce_max_to_bf16_can_implement
Pre-launch implementability check for reduce_max_to_bf16.
baracuda_kernels_reduce_max_to_bf16_run
reduce_max_to, bf16.
baracuda_kernels_reduce_max_to_f16_can_implement
Pre-launch implementability check for reduce_max_to_f16.
baracuda_kernels_reduce_max_to_f16_run
reduce_max_to, f16. Identity is -FLT_MAX in f32 accumulator space, narrowed back to f16 on store.
baracuda_kernels_reduce_max_to_f32_can_implement
Pre-launch implementability check for reduce_max_to_f32.
baracuda_kernels_reduce_max_to_f32_run
reduce_max_to, f32. Identity is -FLT_MAX when the broadcast set is empty.
baracuda_kernels_reduce_max_to_f64_can_implement
Pre-launch implementability check for reduce_max_to_f64.
baracuda_kernels_reduce_max_to_f64_run
reduce_max_to, f64. Identity is -DBL_MAX.
baracuda_kernels_reduce_max_u8_can_implement
Pre-launch implementability check for reduce_max_u8.
baracuda_kernels_reduce_max_u8_run
max(x, axis=k) with u8 input/output (init = 0).
baracuda_kernels_reduce_max_u32_can_implement
Pre-launch implementability check for reduce_max_u32.
baracuda_kernels_reduce_max_u32_run
max(x, axis=k) with u32 input/output (init = 0).
baracuda_kernels_reduce_mean_backward_bf16_can_implement
Pre-launch implementability check for reduce_mean_backward_bf16.
baracuda_kernels_reduce_mean_backward_bf16_run
Mean reduction backward, bf16.
baracuda_kernels_reduce_mean_backward_f16_can_implement
Pre-launch implementability check for reduce_mean_backward_f16.
baracuda_kernels_reduce_mean_backward_f16_run
Mean reduction backward, f16.
baracuda_kernels_reduce_mean_backward_f32_can_implement
Pre-launch implementability check for reduce_mean_backward_f32.
baracuda_kernels_reduce_mean_backward_f32_run
Mean reduction backward, f32. Same as Sum BW with extra 1/k scale (inv_extent is 1.0 / reduced_extent computed in f64 on the host).
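The relationship to Sum BW described above can be sketched on the host as follows. The `[outer, k, inner]` layout, the function name, and the choice to multiply in f64 before narrowing are assumptions for illustration; only the f64 computation of `inv_extent` comes from the description.

```rust
/// Hypothetical CPU reference for mean backward over the middle axis
/// of a contiguous `[outer, k, inner]` tensor.
fn mean_backward_ref(dy: &[f32], outer: usize, k: usize, inner: usize) -> Vec<f32> {
    assert_eq!(dy.len(), outer * inner);
    // 1 / reduced_extent, computed in f64 on the host.
    let inv_extent = 1.0f64 / k as f64;
    let mut dx = vec![0.0f32; outer * k * inner];
    for o in 0..outer {
        for j in 0..k {
            for i in 0..inner {
                // Broadcast dy along the reduced axis (as in Sum BW), then scale by 1/k.
                // Assumption: the product is formed in f64 and narrowed afterwards.
                dx[(o * k + j) * inner + i] = (dy[o * inner + i] as f64 * inv_extent) as f32;
            }
        }
    }
    dx
}
```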
baracuda_kernels_reduce_mean_backward_f64_can_implement
Pre-launch implementability check for reduce_mean_backward_f64.
baracuda_kernels_reduce_mean_backward_f64_run
Mean reduction backward, f64.
baracuda_kernels_reduce_mean_bf16_can_implement
Pre-launch implementability check for reduce_mean_bf16.
baracuda_kernels_reduce_mean_bf16_run
Mean reduction along one axis, bf16 (f32-detour for sum + divide).
baracuda_kernels_reduce_mean_f16_can_implement
Pre-launch implementability check for reduce_mean_f16.
baracuda_kernels_reduce_mean_f16_run
Mean reduction along one axis, f16 (f32-detour for sum + divide).
baracuda_kernels_reduce_mean_f32_can_implement
Pre-launch implementability check for reduce_mean_f32.
baracuda_kernels_reduce_mean_f32_run
Mean reduction along one axis, f32. Sum then divide by extent.
baracuda_kernels_reduce_mean_f64_can_implement
Pre-launch implementability check for reduce_mean_f64.
baracuda_kernels_reduce_mean_f64_run
Mean reduction along one axis, f64.
baracuda_kernels_reduce_min_bf16_can_implement
Pre-launch implementability check for reduce_min_bf16.
baracuda_kernels_reduce_min_bf16_run
Min reduction along one axis, bf16 (f32-detour fminf).
baracuda_kernels_reduce_min_f16_can_implement
Pre-launch implementability check for reduce_min_f16.
baracuda_kernels_reduce_min_f16_run
Min reduction along one axis, f16 (f32-detour fminf).
baracuda_kernels_reduce_min_f32_can_implement
Pre-launch implementability check for reduce_min_f32.
baracuda_kernels_reduce_min_f32_run
Min reduction along one axis, f32. init = +INFINITY, fminf.
baracuda_kernels_reduce_min_f64_can_implement
Pre-launch implementability check for reduce_min_f64.
baracuda_kernels_reduce_min_f64_run
Min reduction along one axis, f64.
baracuda_kernels_reduce_min_i8_can_implement
Pre-launch implementability check for reduce_min_i8.
baracuda_kernels_reduce_min_i8_run
min(x, axis=k) with i8 input/output (init = INT8_MAX).
baracuda_kernels_reduce_min_i16_can_implement
Pre-launch implementability check for reduce_min_i16.
baracuda_kernels_reduce_min_i16_run
min(x, axis=k) with i16 input/output (init = INT16_MAX).
baracuda_kernels_reduce_min_i32_can_implement
Pre-launch implementability check for reduce_min_i32.
baracuda_kernels_reduce_min_i32_run
min(x, axis=k) with i32 input/output (init = INT32_MAX).
baracuda_kernels_reduce_min_i64_can_implement
Pre-launch implementability check for reduce_min_i64.
baracuda_kernels_reduce_min_i64_run
min(x, axis=k) with i64 input/output (init = INT64_MAX).
baracuda_kernels_reduce_min_to_bf16_can_implement
Pre-launch implementability check for reduce_min_to_bf16.
baracuda_kernels_reduce_min_to_bf16_run
reduce_min_to, bf16.
baracuda_kernels_reduce_min_to_f16_can_implement
Pre-launch implementability check for reduce_min_to_f16.
baracuda_kernels_reduce_min_to_f16_run
reduce_min_to, f16. Accumulator widens to f32; identity is +FLT_MAX in f32 accumulator space, narrowing to +inf on store.
baracuda_kernels_reduce_min_to_f32_can_implement
Pre-launch implementability check for reduce_min_to_f32.
baracuda_kernels_reduce_min_to_f32_run
reduce_min_to, f32. Identity is +FLT_MAX when the broadcast set is empty.
baracuda_kernels_reduce_min_to_f64_can_implement
Pre-launch implementability check for reduce_min_to_f64.
baracuda_kernels_reduce_min_to_f64_run
reduce_min_to, f64. Identity is +DBL_MAX.
baracuda_kernels_reduce_min_u8_can_implement
Pre-launch implementability check for reduce_min_u8.
baracuda_kernels_reduce_min_u8_run
min(x, axis=k) with u8 input/output (same-dtype, init = UINT8_MAX).
baracuda_kernels_reduce_min_u32_can_implement
Pre-launch implementability check for reduce_min_u32.
baracuda_kernels_reduce_min_u32_run
min(x, axis=k) with u32 input/output (init = UINT32_MAX).
baracuda_kernels_reduce_norm2_backward_bf16_can_implement
Pre-launch implementability check for reduce_norm2_backward_bf16.
baracuda_kernels_reduce_norm2_backward_bf16_run
Norm2 reduction backward, bf16.
baracuda_kernels_reduce_norm2_backward_f16_can_implement
Pre-launch implementability check for reduce_norm2_backward_f16.
baracuda_kernels_reduce_norm2_backward_f16_run
Norm2 reduction backward, f16.
baracuda_kernels_reduce_norm2_backward_f32_can_implement
Pre-launch implementability check for reduce_norm2_backward_f32.
baracuda_kernels_reduce_norm2_backward_f32_run
Norm2 reduction backward, f32.
baracuda_kernels_reduce_norm2_backward_f64_can_implement
Pre-launch implementability check for reduce_norm2_backward_f64.
baracuda_kernels_reduce_norm2_backward_f64_run
Norm2 reduction backward, f64.
baracuda_kernels_reduce_norm2_bf16_can_implement
Pre-launch implementability check for reduce_norm2_bf16.
baracuda_kernels_reduce_norm2_bf16_run
Norm2 reduction along one axis, bf16 (f32-detour functor + sqrt).
baracuda_kernels_reduce_norm2_f16_can_implement
Pre-launch implementability check for reduce_norm2_f16.
baracuda_kernels_reduce_norm2_f16_run
Norm2 reduction along one axis, f16 (f32-detour functor + sqrt).
baracuda_kernels_reduce_norm2_f32_can_implement
Pre-launch implementability check for reduce_norm2_f32.
baracuda_kernels_reduce_norm2_f32_run
Norm2 reduction along one axis, f32. y = sqrt(sum(x*x)) — shares the simple-reduce parameter shape.
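The formula y = sqrt(sum(x*x)) can be checked against a small CPU reference like the one below. The `[outer, k, inner]` layout and the name `norm2_ref` are illustrative assumptions, not the kernel's interface.

```rust
/// Hypothetical CPU reference: L2 norm over the middle axis of `[outer, k, inner]`.
fn norm2_ref(x: &[f32], outer: usize, k: usize, inner: usize) -> Vec<f32> {
    assert_eq!(x.len(), outer * k * inner);
    let mut y = vec![0.0f32; outer * inner];
    for o in 0..outer {
        for j in 0..k {
            for i in 0..inner {
                let v = x[(o * k + j) * inner + i];
                // Accumulate the sum of squares first.
                y[o * inner + i] += v * v;
            }
        }
    }
    // Take the square root once per output cell.
    for cell in y.iter_mut() {
        *cell = cell.sqrt();
    }
    y
}
```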
baracuda_kernels_reduce_norm2_f64_can_implement
Pre-launch implementability check for reduce_norm2_f64.
baracuda_kernels_reduce_norm2_f64_run
Norm2 reduction along one axis, f64.
baracuda_kernels_reduce_prod_backward_bf16_can_implement
Pre-launch implementability check for reduce_prod_backward_bf16.
baracuda_kernels_reduce_prod_backward_bf16_run
Prod reduction backward, bf16.
baracuda_kernels_reduce_prod_backward_f16_can_implement
Pre-launch implementability check for reduce_prod_backward_f16.
baracuda_kernels_reduce_prod_backward_f16_run
Prod reduction backward, f16.
baracuda_kernels_reduce_prod_backward_f32_can_implement
Pre-launch implementability check for reduce_prod_backward_f32.
baracuda_kernels_reduce_prod_backward_f32_run
Prod reduction backward, f32.
baracuda_kernels_reduce_prod_backward_f64_can_implement
Pre-launch implementability check for reduce_prod_backward_f64.
baracuda_kernels_reduce_prod_backward_f64_run
Prod reduction backward, f64.
baracuda_kernels_reduce_prod_bf16_can_implement
Pre-launch implementability check for reduce_prod_bf16.
baracuda_kernels_reduce_prod_bf16_run
Product reduction along one axis, bf16 (f32-detour multiply).
baracuda_kernels_reduce_prod_f16_can_implement
Pre-launch implementability check for reduce_prod_f16.
baracuda_kernels_reduce_prod_f16_run
Product reduction along one axis, f16 (f32-detour multiply).
baracuda_kernels_reduce_prod_f32_can_implement
Pre-launch implementability check for reduce_prod_f32.
baracuda_kernels_reduce_prod_f32_run
Product reduction along one axis, f32. init = 1, op = *.
baracuda_kernels_reduce_prod_f64_can_implement
Pre-launch implementability check for reduce_prod_f64.
baracuda_kernels_reduce_prod_f64_run
Product reduction along one axis, f64.
baracuda_kernels_reduce_prod_i8_can_implement
Pre-launch implementability check for reduce_prod_i8.
baracuda_kernels_reduce_prod_i8_run
prod(x, axis=k) with i8 input/output (wider i64 accumulator).
baracuda_kernels_reduce_prod_i16_can_implement
Pre-launch implementability check for reduce_prod_i16.
baracuda_kernels_reduce_prod_i16_run
prod(x, axis=k) with i16 input/output (wider i64 accumulator).
baracuda_kernels_reduce_prod_i32_can_implement
Pre-launch implementability check for reduce_prod_i32.
baracuda_kernels_reduce_prod_i32_run
prod(x, axis=k) with i32 input/output (wider i64 accumulator).
baracuda_kernels_reduce_prod_i64_can_implement
Pre-launch implementability check for reduce_prod_i64.
baracuda_kernels_reduce_prod_i64_run
prod(x, axis=k) with i64 input/output. Modulo-2^64 wrap.
baracuda_kernels_reduce_prod_to_bf16_can_implement
Pre-launch implementability check for reduce_prod_to_bf16.
baracuda_kernels_reduce_prod_to_bf16_run
reduce_prod_to, bf16.
baracuda_kernels_reduce_prod_to_f16_can_implement
Pre-launch implementability check for reduce_prod_to_f16.
baracuda_kernels_reduce_prod_to_f16_run
reduce_prod_to, f16. Cumulative product overflows fast in half-precision; callers should keep values close to 1.
baracuda_kernels_reduce_prod_to_f32_can_implement
Pre-launch implementability check for reduce_prod_to_f32.
baracuda_kernels_reduce_prod_to_f32_run
reduce_prod_to, f32. Identity is 1 (multiplicative). Half dtypes accumulate in f32 then narrow on store.
baracuda_kernels_reduce_prod_to_f64_can_implement
Pre-launch implementability check for reduce_prod_to_f64.
baracuda_kernels_reduce_prod_to_f64_run
reduce_prod_to, f64.
baracuda_kernels_reduce_prod_u8_can_implement
Pre-launch implementability check for reduce_prod_u8.
baracuda_kernels_reduce_prod_u8_run
prod(x, axis=k) with u8 input/output (wider u64 accumulator, wrap-on-overflow narrow on store).
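A minimal sketch of the "wider accumulator, wrap on narrow" behaviour described above, for a single reduced row. The function name is hypothetical, and the use of wrapping multiplication inside the u64 accumulator is an assumption; either way the stored u8 equals the true product modulo 256.

```rust
/// Hypothetical CPU reference: product of one row of u8 values.
fn prod_u8_ref(xs: &[u8]) -> u8 {
    // Accumulate in u64; wrapping_mul keeps the arithmetic well defined on overflow.
    let mut acc: u64 = 1;
    for &x in xs {
        acc = acc.wrapping_mul(x as u64);
    }
    // Narrowing with `as` keeps the low 8 bits, i.e. the product modulo 256.
    acc as u8
}
```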
baracuda_kernels_reduce_prod_u32_can_implement
Pre-launch implementability check for reduce_prod_u32.
baracuda_kernels_reduce_prod_u32_run
prod(x, axis=k) with u32 input/output (wider u64 accumulator).
baracuda_kernels_reduce_std_backward_bf16_can_implement
Pre-launch implementability check for reduce_std_backward_bf16.
baracuda_kernels_reduce_std_backward_bf16_run
Std-dev reduction backward, bf16.
baracuda_kernels_reduce_std_backward_f16_can_implement
Pre-launch implementability check for reduce_std_backward_f16.
baracuda_kernels_reduce_std_backward_f16_run
Std-dev reduction backward, f16.
baracuda_kernels_reduce_std_backward_f32_can_implement
Pre-launch implementability check for reduce_std_backward_f32.
baracuda_kernels_reduce_std_backward_f32_run
Std-dev reduction backward, f32 (Welford BW + sqrt term).
baracuda_kernels_reduce_std_backward_f64_can_implement
Pre-launch implementability check for reduce_std_backward_f64.
baracuda_kernels_reduce_std_backward_f64_run
Std-dev reduction backward, f64 (Welford BW in f64 + sqrt term).
baracuda_kernels_reduce_std_bf16_can_implement
Pre-launch implementability check for reduce_std_bf16.
baracuda_kernels_reduce_std_bf16_run
Std-dev along one axis, bf16.
baracuda_kernels_reduce_std_f16_can_implement
Pre-launch implementability check for reduce_std_f16.
baracuda_kernels_reduce_std_f16_run
Std-dev along one axis, f16.
baracuda_kernels_reduce_std_f32_can_implement
Pre-launch implementability check for reduce_std_f32.
baracuda_kernels_reduce_std_f32_run
Std-dev along one axis, f32, Welford + sqrt.
baracuda_kernels_reduce_std_f64_can_implement
Pre-launch implementability check for reduce_std_f64.
baracuda_kernels_reduce_std_f64_run
Std-dev along one axis, f64 (Welford in f64 + sqrt).
baracuda_kernels_reduce_sum_backward_bf16_can_implement
Pre-launch implementability check for reduce_sum_backward_bf16.
baracuda_kernels_reduce_sum_backward_bf16_run
Sum reduction backward, bf16.
baracuda_kernels_reduce_sum_backward_f16_can_implement
Pre-launch implementability check for reduce_sum_backward_f16.
baracuda_kernels_reduce_sum_backward_f16_run
Sum reduction backward, f16.
baracuda_kernels_reduce_sum_backward_f32_can_implement
Pre-launch implementability check for reduce_sum_backward_f32.
baracuda_kernels_reduce_sum_backward_f32_run
Sum reduction backward, f32. dx[c] = dy[c_with_reduce_axis_0] realized via stride-0 broadcast on the reduce axis.
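The stride-0 broadcast mentioned above can be illustrated with a CPU reference that indexes `dy` through a stride table whose entry for the reduce axis is zero. The row-major layout, the function name, and the parameter list are assumptions made for this sketch.

```rust
/// Hypothetical CPU reference: sum backward for a row-major tensor of `shape`,
/// where `dy` has the same shape with `reduce_axis` collapsed to extent 1.
fn sum_backward_ref(dy: &[f32], shape: &[usize], reduce_axis: usize) -> Vec<f32> {
    let rank = shape.len();
    assert!(reduce_axis < rank);
    // Row-major strides for dy; the reduce axis keeps stride 0 so every
    // coordinate along it maps to the same dy element.
    let mut dy_strides = vec![0usize; rank];
    let mut running = 1usize;
    for d in (0..rank).rev() {
        if d != reduce_axis {
            dy_strides[d] = running;
            running *= shape[d];
        }
    }
    assert_eq!(dy.len(), running);
    let total: usize = shape.iter().product();
    let mut dx = vec![0.0f32; total];
    for lin in 0..total {
        // Decompose the linear dx index into coordinates, innermost axis first.
        let mut rem = lin;
        let mut dy_off = 0usize;
        for d in (0..rank).rev() {
            dy_off += (rem % shape[d]) * dy_strides[d];
            rem /= shape[d];
        }
        dx[lin] = dy[dy_off];
    }
    dx
}
```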
baracuda_kernels_reduce_sum_backward_f64_can_implement
Pre-launch implementability check for reduce_sum_backward_f64.
baracuda_kernels_reduce_sum_backward_f64_run
Sum reduction backward, f64.
baracuda_kernels_reduce_sum_bf16_can_implement
Pre-launch implementability check for reduce_sum_bf16.
baracuda_kernels_reduce_sum_bf16_run
Sum reduction along one axis, bf16 (f32-detour functor).
baracuda_kernels_reduce_sum_f16_can_implement
Pre-launch implementability check for reduce_sum_f16.
baracuda_kernels_reduce_sum_f16_run
Sum reduction along one axis, f16.
baracuda_kernels_reduce_sum_f32_can_implement
Pre-launch implementability check for reduce_sum_f32.
baracuda_kernels_reduce_sum_f32_run
Sum reduction along one axis, f32, naive thread-per-output-cell.
baracuda_kernels_reduce_sum_f64_can_implement
Pre-launch implementability check for reduce_sum_f64.
baracuda_kernels_reduce_sum_f64_run
Sum reduction along one axis, f64.
baracuda_kernels_reduce_sum_i8_can_implement
Pre-launch implementability check for reduce_sum_i8.
baracuda_kernels_reduce_sum_i8_run
sum(x, axis=k) with i8 input/output (wider i64 accumulator).
baracuda_kernels_reduce_sum_i16_can_implement
Pre-launch implementability check for reduce_sum_i16.
baracuda_kernels_reduce_sum_i16_run
sum(x, axis=k) with i16 input/output (wider i64 accumulator).
baracuda_kernels_reduce_sum_i32_can_implement
Pre-launch implementability check for reduce_sum_i32.
baracuda_kernels_reduce_sum_i32_run
sum(x, axis=k) with i32 input/output (wider i64 accumulator).
baracuda_kernels_reduce_sum_i64_can_implement
Pre-launch implementability check for reduce_sum_i64.
baracuda_kernels_reduce_sum_i64_run
sum(x, axis=k) with i64 input/output. Accumulator and output share dtype; modulo-2^64 wrap is the natural device behaviour.
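On the host, the modulo-2^64 wrap described above corresponds to Rust's `wrapping_add`. The sketch below sums a single reduced row; the function name is hypothetical.

```rust
/// Hypothetical CPU reference: wrapping sum of one row of i64 values.
fn sum_i64_ref(xs: &[i64]) -> i64 {
    // wrapping_add is two's-complement addition modulo 2^64, so overflow never panics.
    xs.iter().fold(0i64, |acc, &x| acc.wrapping_add(x))
}
```

A plain `+` would panic on overflow in debug builds, so a host-side comparison against device output should use the wrapping form.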
baracuda_kernels_reduce_sum_to_bf16_can_implement
Pre-launch implementability check for reduce_sum_to_bf16.
baracuda_kernels_reduce_sum_to_bf16_run
reduce_sum_to, bf16. Accumulator widens to f32.
baracuda_kernels_reduce_sum_to_f16_can_implement
Pre-launch implementability check for reduce_sum_to_f16.
baracuda_kernels_reduce_sum_to_f16_run
reduce_sum_to, f16. Accumulator widens to f32 per the rest of the family’s convention.
baracuda_kernels_reduce_sum_to_f32_can_implement
Pre-launch implementability check for reduce_sum_to_f32.
baracuda_kernels_reduce_sum_to_f32_run
reduce_sum_to, f32. Broadcast-reverse Σ: sums the input down to a shape it could have been broadcast from. Phase 31.
baracuda_kernels_reduce_sum_to_f64_can_implement
Pre-launch implementability check for reduce_sum_to_f64.
baracuda_kernels_reduce_sum_to_f64_run
reduce_sum_to, f64.
baracuda_kernels_reduce_sum_u8_can_implement
Pre-launch implementability check for reduce_sum_u8.
baracuda_kernels_reduce_sum_u8_run
sum(x, axis=k) with u8 input/output (wider u64 accumulator, wrap-on-overflow narrow on store).
baracuda_kernels_reduce_sum_u32_can_implement
Pre-launch implementability check for reduce_sum_u32.
baracuda_kernels_reduce_sum_u32_run
sum(x, axis=k) with u32 input/output (wider u64 accumulator).
baracuda_kernels_reduce_var_backward_bf16_can_implement
Pre-launch implementability check for reduce_var_backward_bf16.
baracuda_kernels_reduce_var_backward_bf16_run
Variance reduction backward, bf16.
baracuda_kernels_reduce_var_backward_f16_can_implement
Pre-launch implementability check for reduce_var_backward_f16.
baracuda_kernels_reduce_var_backward_f16_run
Variance reduction backward, f16.
baracuda_kernels_reduce_var_backward_f32_can_implement
Pre-launch implementability check for reduce_var_backward_f32.
baracuda_kernels_reduce_var_backward_f32_run
Variance reduction backward, f32 (Welford BW).
baracuda_kernels_reduce_var_backward_f64_can_implement
Pre-launch implementability check for reduce_var_backward_f64.
baracuda_kernels_reduce_var_backward_f64_run
Variance reduction backward, f64 (Welford BW in f64).
baracuda_kernels_reduce_var_bf16_can_implement
Pre-launch implementability check for reduce_var_bf16.
baracuda_kernels_reduce_var_bf16_run
Variance reduction along one axis, bf16.
baracuda_kernels_reduce_var_f16_can_implement
Pre-launch implementability check for reduce_var_f16.
baracuda_kernels_reduce_var_f16_run
Variance reduction along one axis, f16.
baracuda_kernels_reduce_var_f32_can_implement
Pre-launch implementability check for reduce_var_f32.
baracuda_kernels_reduce_var_f32_run
Variance reduction along one axis, f32, Welford one-pass. correction = 1 for Bessel-corrected sample variance, 0 for population variance.
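For reference, a one-pass Welford update with a `correction` term can be sketched on the CPU as below. It handles one reduced row; the function name and the integer type of `correction` are assumptions, and the kernel's own accumulation order may differ.

```rust
/// Hypothetical CPU reference: one-pass Welford variance of a single row.
/// `correction = 1` gives the Bessel-corrected sample variance,
/// `correction = 0` the population variance.
fn welford_var_ref(xs: &[f32], correction: u32) -> f32 {
    let mut count = 0.0f32;
    let mut mean = 0.0f32;
    let mut m2 = 0.0f32; // running sum of squared deviations from the mean
    for &x in xs {
        count += 1.0;
        let delta = x - mean;
        mean += delta / count;
        // Uses the updated mean, which is what makes the update numerically stable.
        m2 += delta * (x - mean);
    }
    // Not finite when count <= correction, mirroring a division by zero.
    m2 / (count - correction as f32)
}
```

Taking the square root of the result gives the corresponding standard deviation.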
baracuda_kernels_reduce_var_f64_can_implement
Pre-launch implementability check for reduce_var_f64.
baracuda_kernels_reduce_var_f64_run
Variance reduction along one axis, f64 (Welford in f64).
baracuda_kernels_repeat_backward_bf16_can_implement
Pre-launch implementability check for repeat_backward_bf16.
baracuda_kernels_repeat_backward_bf16_run
Repeat backward (gather-adjoint sum), bf16. Accumulates in float.
baracuda_kernels_repeat_backward_f16_can_implement
Pre-launch implementability check for repeat_backward_f16.
baracuda_kernels_repeat_backward_f16_run
Repeat backward (gather-adjoint sum), f16. Accumulates in float.
baracuda_kernels_repeat_backward_f32_can_implement
Pre-launch implementability check for repeat_backward_f32.
baracuda_kernels_repeat_backward_f32_run
Repeat backward (gather-adjoint sum), f32.
baracuda_kernels_repeat_backward_f64_can_implement
Pre-launch implementability check for repeat_backward_f64.
baracuda_kernels_repeat_backward_f64_run
Repeat backward (gather-adjoint sum), f64.
baracuda_kernels_repeat_bf16_can_implement
Pre-launch implementability check for repeat_bf16.
baracuda_kernels_repeat_bf16_run
Repeat (per-axis tile), bf16.
baracuda_kernels_repeat_f16_can_implement
Pre-launch implementability check for repeat_f16.
baracuda_kernels_repeat_f16_run
Repeat (per-axis tile), f16. Same parameter shape as the f32 variant — pure copy, no arithmetic.
baracuda_kernels_repeat_f32_can_implement
Pre-launch implementability check for repeat_f32.
baracuda_kernels_repeat_f32_run
Repeat (per-axis tile), f32. output.shape[d] = input.shape[d] * repeats[d]. Kernel computes input_coord[d] = output_coord[d] % input.shape[d].
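The coordinate mapping above (`input_coord[d] = output_coord[d] % input.shape[d]`) can be exercised with a small CPU reference. The row-major layout and the name `repeat_ref` are assumptions for illustration.

```rust
/// Hypothetical CPU reference: per-axis tiling of a row-major tensor.
fn repeat_ref(input: &[f32], shape: &[usize], repeats: &[usize]) -> Vec<f32> {
    assert_eq!(shape.len(), repeats.len());
    assert_eq!(input.len(), shape.iter().product::<usize>());
    let rank = shape.len();
    // output.shape[d] = input.shape[d] * repeats[d]
    let out_shape: Vec<usize> = shape.iter().zip(repeats).map(|(s, r)| s * r).collect();
    let total: usize = out_shape.iter().product();
    let mut out = Vec::with_capacity(total);
    for lin in 0..total {
        let mut rem = lin;
        let mut in_off = 0usize;
        let mut in_stride = 1usize;
        for d in (0..rank).rev() {
            let out_coord = rem % out_shape[d];
            rem /= out_shape[d];
            // Wrap the output coordinate back into the input extent.
            in_off += (out_coord % shape[d]) * in_stride;
            in_stride *= shape[d];
        }
        out.push(input[in_off]);
    }
    out
}
```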
baracuda_kernels_repeat_f64_can_implement
Pre-launch implementability check for repeat_f64.
baracuda_kernels_repeat_f64_run
Repeat (per-axis tile), f64.
baracuda_kernels_rfft_1d_f32_run
1-D R2C FFT (real → Hermitian-half complex). Unnormalized (matches PyTorch’s norm="backward").
baracuda_kernels_rfft_1d_f32_workspace_size
1-D R2C FFT workspace size in bytes — always 0.
baracuda_kernels_rfft_1d_f64_run
1-D R2C FFT (real → Hermitian-half complex). Unnormalized (matches PyTorch’s norm="backward").
baracuda_kernels_rfft_1d_f64_workspace_size
1-D R2C FFT workspace size in bytes — always 0.
baracuda_kernels_rfft_nd_f32_run
ND R2C FFT (real → Hermitian-half complex). Unnormalized. dims[..rank] are real-side extents; complex output has dims[rank-1] / 2 + 1 on the last transformed axis.
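When sizing the complex output buffer, the Hermitian-half rule above can be applied with a helper like the following. The helper name is hypothetical and is not part of this crate.

```rust
/// Hypothetical helper: complex-side extents for a real-to-complex transform
/// whose real-side extents are `dims`.
fn rfft_output_dims(dims: &[usize]) -> Vec<usize> {
    let mut out = dims.to_vec();
    // Only the last axis shrinks: dims[rank-1] / 2 + 1 (integer division).
    if let Some(last) = out.last_mut() {
        *last = *last / 2 + 1;
    }
    out
}
```

For example, real extents `[4, 8]` give complex extents `[4, 5]`, so the output holds 20 complex elements.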
baracuda_kernels_rfft_nd_f32_workspace_size
ND R2C FFT workspace size in bytes — always 0.
baracuda_kernels_rfft_nd_f64_run
ND R2C FFT (real → Hermitian-half complex). Unnormalized. dims[..rank] are real-side extents; complex output has dims[rank-1] / 2 + 1 on the last transformed axis.
baracuda_kernels_rfft_nd_f64_workspace_size
ND R2C FFT workspace size in bytes — always 0.
baracuda_kernels_rms_norm_backward_bf16_can_implement
Pre-launch implementability check for rms_norm_backward_bf16.
baracuda_kernels_rms_norm_backward_bf16_run
RMSNorm BW, bf16.
baracuda_kernels_rms_norm_backward_bf16_strided_can_implement
Pre-launch implementability check for rms_norm_backward_bf16_strided.
baracuda_kernels_rms_norm_backward_bf16_strided_run
RMSNorm BW strided sibling, bf16.
baracuda_kernels_rms_norm_backward_f16_can_implement
Pre-launch implementability check for rms_norm_backward_f16.
baracuda_kernels_rms_norm_backward_f16_run
RMSNorm BW, f16.
baracuda_kernels_rms_norm_backward_f16_strided_can_implement
Pre-launch implementability check for rms_norm_backward_f16_strided.
baracuda_kernels_rms_norm_backward_f16_strided_run
RMSNorm BW strided sibling, f16.
baracuda_kernels_rms_norm_backward_f32_can_implement
Pre-launch implementability check for rms_norm_backward_f32.
baracuda_kernels_rms_norm_backward_f32_run
RMSNorm BW, f32. Computes dx and (when dgamma != null) dgamma[i] = Σ over outer cells dy[..., i] · (x[..., i] / rms[..., 0]) where i ranges over the joint normalized region of length norm_total_extent.
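The dgamma sum above can be written as a short CPU reference. The sketch assumes the outer cells are flattened into `rows` and the normalized region into `n` contiguous elements, with one `rms` value per row; these names and the layout are illustrative. It covers only dgamma, since the description does not spell out the dx formula.

```rust
/// Hypothetical CPU reference for the dgamma part of RMSNorm backward.
/// `dy` and `x` are `[rows, n]` row-major; `rms` holds one value per row.
fn rms_norm_dgamma_ref(dy: &[f32], x: &[f32], rms: &[f32], rows: usize, n: usize) -> Vec<f32> {
    assert_eq!(dy.len(), rows * n);
    assert_eq!(x.len(), rows * n);
    assert_eq!(rms.len(), rows);
    let mut dgamma = vec![0.0f32; n];
    for r in 0..rows {
        for i in 0..n {
            // dgamma[i] = sum over rows of dy[r, i] * (x[r, i] / rms[r])
            dgamma[i] += dy[r * n + i] * (x[r * n + i] / rms[r]);
        }
    }
    dgamma
}
```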
baracuda_kernels_rms_norm_backward_f32_strided_can_implement
Pre-launch implementability check for rms_norm_backward_f32_strided.
baracuda_kernels_rms_norm_backward_f32_strided_run
RMSNorm BW strided sibling, f32. Same contract as baracuda_kernels_rms_norm_backward_f32_run; identical underlying launcher.
baracuda_kernels_rms_norm_backward_f64_can_implement
Pre-launch implementability check for rms_norm_backward_f64.
baracuda_kernels_rms_norm_backward_f64_run
RMSNorm BW, f64.
baracuda_kernels_rms_norm_backward_f64_strided_can_implement
Pre-launch implementability check for rms_norm_backward_f64_strided.
baracuda_kernels_rms_norm_backward_f64_strided_run
RMSNorm BW strided sibling, f64.
baracuda_kernels_rms_norm_bf16_can_implement
Pre-launch implementability check for rms_norm_bf16.
baracuda_kernels_rms_norm_bf16_run
RMSNorm FW, bf16. f32 accumulator inside the kernel.
baracuda_kernels_rms_norm_bf16_strided_can_implement
Pre-launch implementability check for rms_norm_bf16_strided.
baracuda_kernels_rms_norm_bf16_strided_run
RMSNorm FW strided sibling, bf16. See rms_norm_f32_strided_run.
baracuda_kernels_rms_norm_f16_can_implement
Pre-launch implementability check for rms_norm_f16.
baracuda_kernels_rms_norm_f16_run
RMSNorm FW, f16. f32 accumulator inside the kernel.
baracuda_kernels_rms_norm_f16_strided_can_implement
Pre-launch implementability check for rms_norm_f16_strided.
baracuda_kernels_rms_norm_f16_strided_run
RMSNorm FW strided sibling, f16. See rms_norm_f32_strided_run.
baracuda_kernels_rms_norm_f32_can_implement
Pre-launch implementability check for rms_norm_f32.
baracuda_kernels_rms_norm_f32_run
RMSNorm FW, f32. y = x / sqrt(mean(x², over norm_axes) + eps) * gamma. norm_axes_mask is a bitmask over input axes (suffix of [0, rank)); norm_total_extent is the product of those axes’ extents. gamma may be null (treated as 1). rms_out shape equals input shape with norm axes collapsed to 1; only the slot at inner_lin == 0 within each row is written.
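A CPU reference for the forward formula above, useful for checking kernel output on small inputs. It assumes the outer cells are flattened into `rows` and the normalized suffix axes into `n` contiguous elements, models a null `gamma` as `None`, and assumes `rms_out` stores `sqrt(mean(x²) + eps)`; all of these, like the function name, are illustrative assumptions.

```rust
/// Hypothetical CPU reference for RMSNorm forward on a `[rows, n]` row-major tensor.
/// Returns `(y, rms)` with one rms value per row.
fn rms_norm_ref(
    x: &[f32],
    rows: usize,
    n: usize,
    gamma: Option<&[f32]>,
    eps: f32,
) -> (Vec<f32>, Vec<f32>) {
    assert_eq!(x.len(), rows * n);
    let mut y = vec![0.0f32; rows * n];
    let mut rms = vec![0.0f32; rows];
    for r in 0..rows {
        let row = &x[r * n..(r + 1) * n];
        // mean(x^2) over the normalized region
        let mean_sq = row.iter().map(|v| v * v).sum::<f32>() / n as f32;
        let denom = (mean_sq + eps).sqrt();
        rms[r] = denom;
        for i in 0..n {
            // A missing gamma is treated as 1.
            let scale = gamma.map_or(1.0, |g| g[i]);
            y[r * n + i] = row[i] / denom * scale;
        }
    }
    (y, rms)
}
```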
baracuda_kernels_rms_norm_f32_strided_can_implement
Pre-launch implementability check for rms_norm_f32_strided.
baracuda_kernels_rms_norm_f32_strided_run
RMSNorm FW strided sibling, f32. Same contract as baracuda_kernels_rms_norm_f32_run; identical underlying launcher.
baracuda_kernels_rms_norm_f64_can_implement
Pre-launch implementability check for rms_norm_f64.
baracuda_kernels_rms_norm_f64_run
RMSNorm FW, f64.
baracuda_kernels_rms_norm_f64_strided_can_implement
Pre-launch implementability check for rms_norm_f64_strided.
baracuda_kernels_rms_norm_f64_strided_run
RMSNorm FW strided sibling, f64. See rms_norm_f32_strided_run.
baracuda_kernels_roi_align_backward_f32_can_implement
Pre-launch implementability check for roi_align_backward_f32.
baracuda_kernels_roi_align_backward_f32_run
roi_align BW, f32. Caller pre-zeros dinput. Safety: same requirements as the forward kernel.
baracuda_kernels_roi_align_backward_f64_can_implement
Pre-launch implementability check for roi_align_backward_f64.
baracuda_kernels_roi_align_backward_f64_run
roi_align BW, f64. Safety: same requirements as the f32 backward kernel.
baracuda_kernels_roi_align_f32_can_implement
Pre-launch implementability check for roi_align_f32.
baracuda_kernels_roi_align_f32_run
roi_align, f32. rois: [num_rois, 5] (batch_idx, x1, y1, x2, y2) in INPUT-pixel coords (scaled by spatial_scale inside the kernel). sampling_ratio == 0 selects adaptive sampling. aligned == 0 is PyTorch’s pre-0.6 convention.
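To make the `spatial_scale`, `aligned`, and `sampling_ratio` parameters concrete, the sketch below follows the convention used by torchvision's `roi_align` (half-pixel offset when aligned, minimum extent of 1 otherwise, adaptive grid of `ceil(roi_extent / pooled_extent)`). That this kernel matches torchvision in every detail is an assumption, and both helper names are hypothetical.

```rust
/// Hypothetical helper: maps one RoI side from input-pixel coordinates to
/// feature-map coordinates. Returns `(start, extent)`.
fn roi_side(lo: f32, hi: f32, spatial_scale: f32, aligned: bool) -> (f32, f32) {
    // Assumed convention: aligned mode shifts by half a pixel.
    let offset = if aligned { 0.5 } else { 0.0 };
    let start = lo * spatial_scale - offset;
    let end = hi * spatial_scale - offset;
    let mut extent = end - start;
    if !aligned {
        // Assumed legacy behaviour: force the RoI to be at least 1 unit wide.
        extent = extent.max(1.0);
    }
    (start, extent)
}

/// Hypothetical helper: number of sample points per output bin along one axis.
fn sampling_grid(sampling_ratio: i32, roi_extent: f32, pooled_extent: i32) -> i32 {
    if sampling_ratio > 0 {
        sampling_ratio
    } else {
        // Adaptive sampling, selected by sampling_ratio == 0.
        (roi_extent / pooled_extent as f32).ceil() as i32
    }
}
```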
baracuda_kernels_roi_align_f64_can_implement
Pre-launch implementability check for roi_align_f64.
baracuda_kernels_roi_align_f64_run
roi_align, f64. Safety: same requirements as the f32 variant.
baracuda_kernels_roi_pool_backward_f32_can_implement
Pre-launch implementability check for roi_pool_backward_f32.
baracuda_kernels_roi_pool_backward_f32_run
roi_pool BW, f32. Caller pre-zeros dinput. Safety: same requirements as the forward kernel.
baracuda_kernels_roi_pool_backward_f64_can_implement
Pre-launch implementability check for roi_pool_backward_f64.
baracuda_kernels_roi_pool_backward_f64_run
roi_pool BW, f64. Safety: same requirements as the f32 backward kernel.
baracuda_kernels_roi_pool_f32_can_implement
Pre-launch implementability check for roi_pool_f32.
baracuda_kernels_roi_pool_f32_run
roi_pool, f32. Writes output AND argmax (i32 linear plane-relative index per output cell; -1 for empty bins).
baracuda_kernels_roi_pool_f64_can_implement
Pre-launch implementability check for roi_pool_f64.
baracuda_kernels_roi_pool_f64_run
roi_pool, f64. Safety: same requirements as the f32 variant.
baracuda_kernels_roll_bf16_can_implement
Pre-launch implementability check for roll_bf16.
baracuda_kernels_roll_bf16_run
Roll, bf16. Pure element copy — no math.
baracuda_kernels_roll_bf16_strided_can_implement
Pre-launch implementability check for roll_bf16_strided.
baracuda_kernels_roll_bf16_strided_run
Roll strided sibling, bf16.
baracuda_kernels_roll_f16_can_implement
Pre-launch implementability check for roll_f16.
baracuda_kernels_roll_f16_run
Roll, f16. Pure element copy — no math.
baracuda_kernels_roll_f16_strided_can_implement
Pre-launch implementability check for roll_f16_strided.
baracuda_kernels_roll_f16_strided_run
Roll strided sibling, f16.
baracuda_kernels_roll_f32_can_implement
Pre-launch implementability check for roll_f32.
baracuda_kernels_roll_f32_run
Roll (cyclic shift along axes), f32. shifts[d] is the shift amount on axis d (positive or negative, mod shape[d]).
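A one-axis CPU sketch of the cyclic shift, reducing positive or negative shifts modulo the extent as described above. The direction (element `i` moves to `(i + shift) mod n`, as in `torch.roll`) and the function name are assumptions for illustration.

```rust
/// Hypothetical CPU reference: cyclic shift of a 1-D buffer.
fn roll_1d(input: &[f32], shift: i64) -> Vec<f32> {
    let n = input.len() as i64;
    let mut out = input.to_vec();
    if n == 0 {
        return out;
    }
    // rem_euclid maps negative shifts into [0, n), unlike `%`.
    let s = shift.rem_euclid(n);
    for i in 0..n {
        out[((i + s) % n) as usize] = input[i as usize];
    }
    out
}
```

A multi-axis roll applies the same mapping independently to each axis coordinate.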
baracuda_kernels_roll_f32_strided_can_implement
Pre-launch implementability check for roll_f32_strided.
baracuda_kernels_roll_f32_strided_run
Roll strided sibling, f32.
baracuda_kernels_roll_f64_can_implement
Pre-launch implementability check for roll_f64.
baracuda_kernels_roll_f64_run
Roll, f64. Pure element copy — no math.
baracuda_kernels_roll_f64_strided_can_implement
Pre-launch implementability check for roll_f64_strided.
baracuda_kernels_roll_f64_strided_run
Roll strided sibling, f64.
baracuda_kernels_rope_apply_backward_bf16_can_implement
Pre-launch implementability check for rope_apply_backward_bf16.
baracuda_kernels_rope_apply_backward_bf16_run
RoPE apply BW, bf16.
baracuda_kernels_rope_apply_backward_f16_can_implement
Pre-launch implementability check for rope_apply_backward_f16.
baracuda_kernels_rope_apply_backward_f16_run
RoPE apply BW, f16.
baracuda_kernels_rope_apply_backward_f32_can_implement
Pre-launch implementability check for rope_apply_backward_f32.
baracuda_kernels_rope_apply_backward_f32_run
RoPE apply BW, f32. Same cos/sin tables as FW; orthogonal-rotation reverse.
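The "orthogonal-rotation reverse" can be shown on a single rotated pair: the forward pass applies a 2-D rotation, and because a rotation matrix is orthogonal, the backward pass applies its transpose with the same cos/sin values. Which elements form a pair (half-split or interleaved) and how the tables are indexed are left out here; the function names are hypothetical.

```rust
/// Hypothetical forward rotation of one (x0, x1) pair.
fn rope_pair_forward(x0: f32, x1: f32, cos: f32, sin: f32) -> (f32, f32) {
    (x0 * cos - x1 * sin, x0 * sin + x1 * cos)
}

/// Hypothetical backward pass for the same pair: multiplies the incoming
/// gradient by the transpose of the rotation, i.e. rotates by the negative angle.
fn rope_pair_backward(dy0: f32, dy1: f32, cos: f32, sin: f32) -> (f32, f32) {
    (dy0 * cos + dy1 * sin, -dy0 * sin + dy1 * cos)
}
```

Applying the backward function to the forward output recovers the original pair (up to rounding), which is a convenient sanity check.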
baracuda_kernels_rope_apply_backward_f64_can_implement
Pre-launch implementability check for rope_apply_backward_f64.
baracuda_kernels_rope_apply_backward_f64_run
RoPE apply BW, f64.
baracuda_kernels_rope_apply_bf16_can_implement
Pre-launch implementability check for rope_apply_bf16.
baracuda_kernels_rope_apply_bf16_run
RoPE apply FW, bf16 (f32 trig table, f32 multiply detour).
baracuda_kernels_rope_apply_f16_can_implement
Pre-launch implementability check for rope_apply_f16.
baracuda_kernels_rope_apply_f16_run
RoPE apply FW, f16 (f32 trig table, f32 multiply detour).
baracuda_kernels_rope_apply_f32_can_implement
Implementability check for rope_apply_f32. Host-side only.
baracuda_kernels_rope_apply_f32_run
RoPE apply FW, f32. Cos/sin tables provided by caller.
baracuda_kernels_rope_apply_f64_can_implement
Pre-launch implementability check for rope_apply_f64.
baracuda_kernels_rope_apply_f64_run
RoPE apply FW, f64 (f32 trig table promoted to double at load).
baracuda_kernels_rope_apply_interleaved_backward_bf16_can_implement
Pre-launch implementability check for rope_apply_interleaved_backward_bf16.
baracuda_kernels_rope_apply_interleaved_backward_bf16_run
RoPE apply interleaved BW, bf16.
baracuda_kernels_rope_apply_interleaved_backward_f16_can_implement
Pre-launch implementability check for rope_apply_interleaved_backward_f16.
baracuda_kernels_rope_apply_interleaved_backward_f16_run
RoPE apply interleaved BW, f16.
baracuda_kernels_rope_apply_interleaved_backward_f32_can_implement
Pre-launch implementability check for rope_apply_interleaved_backward_f32.
baracuda_kernels_rope_apply_interleaved_backward_f32_run
RoPE apply interleaved BW, f32.
baracuda_kernels_rope_apply_interleaved_backward_f64_can_implement
baracuda_kernels_rope_apply_interleaved_backward_f64_can_implement (baracuda kernels rope apply interleaved backward f64 can implement).
baracuda_kernels_rope_apply_interleaved_backward_f64_run
RoPE apply interleaved BW, f64.
baracuda_kernels_rope_apply_interleaved_bf16_can_implement
Implementability check for rope_apply_interleaved_bf16.
baracuda_kernels_rope_apply_interleaved_bf16_run
RoPE apply interleaved FW, bf16.
baracuda_kernels_rope_apply_interleaved_f16_can_implement
Implementability check for rope_apply_interleaved_f16.
baracuda_kernels_rope_apply_interleaved_f16_run
RoPE apply interleaved FW, f16.
baracuda_kernels_rope_apply_interleaved_f32_can_implement
Implementability check for rope_apply_interleaved_f32.
baracuda_kernels_rope_apply_interleaved_f32_run
RoPE apply interleaved FW, f32.
baracuda_kernels_rope_apply_interleaved_f64_can_implement
Implementability check for rope_apply_interleaved_f64.
baracuda_kernels_rope_apply_interleaved_f64_run
RoPE apply interleaved FW, f64.
baracuda_kernels_rope_apply_thd_backward_bf16_can_implement
Implementability check for rope_apply_thd_backward_bf16.
baracuda_kernels_rope_apply_thd_backward_bf16_run
RoPE apply THD BW, bf16.
baracuda_kernels_rope_apply_thd_backward_f16_can_implement
Implementability check for rope_apply_thd_backward_f16.
baracuda_kernels_rope_apply_thd_backward_f16_run
RoPE apply THD BW, f16.
baracuda_kernels_rope_apply_thd_backward_f32_can_implement
Implementability check for rope_apply_thd_backward_f32.
baracuda_kernels_rope_apply_thd_backward_f32_run
RoPE apply THD BW, f32.
baracuda_kernels_rope_apply_thd_backward_f64_can_implement
Implementability check for rope_apply_thd_backward_f64.
baracuda_kernels_rope_apply_thd_backward_f64_run
RoPE apply THD BW, f64.
baracuda_kernels_rope_apply_thd_bf16_can_implement
Implementability check for rope_apply_thd_bf16.
baracuda_kernels_rope_apply_thd_bf16_run
RoPE apply THD FW, bf16.
baracuda_kernels_rope_apply_thd_f16_can_implement
Implementability check for rope_apply_thd_f16.
baracuda_kernels_rope_apply_thd_f16_run
RoPE apply THD FW, f16.
baracuda_kernels_rope_apply_thd_f32_can_implement
Implementability check for rope_apply_thd_f32.
baracuda_kernels_rope_apply_thd_f32_run
RoPE apply THD FW, f32.
baracuda_kernels_rope_apply_thd_f64_can_implement
Implementability check for rope_apply_thd_f64.
baracuda_kernels_rope_apply_thd_f64_run
RoPE apply THD FW, f64.
baracuda_kernels_rope_backward_bf16_can_implement
Implementability check for rope_backward_bf16. Host-side only.
baracuda_kernels_rope_backward_bf16_run
RoPE BW, bf16.
baracuda_kernels_rope_backward_bf16_strided_can_implement
Implementability check for rope_backward_bf16_strided. Host-side only.
baracuda_kernels_rope_backward_bf16_strided_run
RoPE BW strided, bf16.
baracuda_kernels_rope_backward_f16_can_implement
Implementability check for rope_backward_f16. Host-side only.
baracuda_kernels_rope_backward_f16_run
RoPE BW, f16.
baracuda_kernels_rope_backward_f16_strided_can_implement
Implementability check for rope_backward_f16_strided. Host-side only.
baracuda_kernels_rope_backward_f16_strided_run
RoPE BW strided, f16.
baracuda_kernels_rope_backward_f32_can_implement
Implementability check for rope_backward_f32. Host-side only.
baracuda_kernels_rope_backward_f32_run
RoPE BW, f32. Same shape as FW; computes dx from dy by rotating through the negated angle (the transpose of the forward rotation).
baracuda_kernels_rope_backward_f32_strided_can_implement
Implementability check for rope_backward_f32_strided. Host-side only.
baracuda_kernels_rope_backward_f32_strided_run
RoPE BW strided, f32. Strides apply to dy (input) and dx (output).
baracuda_kernels_rope_backward_f64_can_implement
Implementability check for rope_backward_f64. Host-side only.
baracuda_kernels_rope_backward_f64_run
RoPE BW, f64.
baracuda_kernels_rope_backward_f64_strided_can_implement
Implementability check for rope_backward_f64_strided. Host-side only.
baracuda_kernels_rope_backward_f64_strided_run
RoPE BW strided, f64.
baracuda_kernels_rope_bf16_can_implement
Implementability check for rope_bf16. Host-side only.
baracuda_kernels_rope_bf16_run
RoPE FW, bf16.
baracuda_kernels_rope_bf16_strided_can_implement
Implementability check for rope_bf16_strided. Host-side only.
baracuda_kernels_rope_bf16_strided_run
RoPE FW strided, bf16.
baracuda_kernels_rope_f16_can_implement
Implementability check for rope_f16. Host-side only.
baracuda_kernels_rope_f16_run
RoPE FW, f16 (f32 trig detour internally).
baracuda_kernels_rope_f16_strided_can_implement
Implementability check for rope_f16_strided. Host-side only.
baracuda_kernels_rope_f16_strided_run
RoPE FW strided, f16.
baracuda_kernels_rope_f32_can_implement
Implementability check for rope_f32. Host-side only.
baracuda_kernels_rope_f32_run
RoPE FW, f32. Input/output are [B, H, S, D] contiguous row-major; head_dim (D) must be even. When pos_default_flag != 0, the kernel ignores positions and uses position index = sequence index; otherwise positions is int64_t[seq].
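As a rough illustration of the rotation described above, here is a CPU reference for a single [S, D] slice (one batch, one head). It is a sketch, not the kernel: the half-split pairing (i, i + D/2), the `base^(-2i/D)` frequency schedule, and the names `rotate_pair` and `rope_slice` are all assumptions made for the example. Passing `None` for `positions` mirrors the `pos_default_flag != 0` behaviour.

```rust
/// Rotate one (x0, x1) pair by `theta` radians.
fn rotate_pair(x0: f32, x1: f32, theta: f32) -> (f32, f32) {
    let (s, c) = theta.sin_cos();
    (x0 * c - x1 * s, x0 * s + x1 * c)
}

/// CPU reference for one [S, D] slice.
/// ASSUMPTIONS (not taken from the crate docs): half-split pairing
/// (i, i + D/2) and frequencies base^(-2i/D).
fn rope_slice(
    x: &[f32],
    seq: usize,
    head_dim: usize,
    positions: Option<&[i64]>, // None = use the sequence index as position
    base: f32,
) -> Vec<f32> {
    assert!(head_dim % 2 == 0, "head_dim (D) must be even");
    assert_eq!(x.len(), seq * head_dim);
    let half = head_dim / 2;
    let mut y = vec![0.0f32; x.len()];
    for s in 0..seq {
        let pos = match positions {
            Some(p) => p[s] as f32,
            None => s as f32,
        };
        for i in 0..half {
            let freq = base.powf(-2.0 * i as f32 / head_dim as f32);
            let lo = s * head_dim + i;
            let hi = lo + half;
            let (a, b) = rotate_pair(x[lo], x[hi], pos * freq);
            y[lo] = a;
            y[hi] = b;
        }
    }
    y
}
```

Rotating a pair by `theta` and then by `-theta` returns the original values, which is why the backward pass can reuse the same kernel shape with the angle negated.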
baracuda_kernels_rope_f32_strided_can_implement
Implementability check for rope_f32_strided. Host-side only.
baracuda_kernels_rope_f32_strided_run
RoPE FW strided, f32.
baracuda_kernels_rope_f64_can_implement
Implementability check for rope_f64. Host-side only.
baracuda_kernels_rope_f64_run
RoPE FW, f64.
baracuda_kernels_rope_f64_strided_can_implement
Implementability check for rope_f64_strided. Host-side only.
baracuda_kernels_rope_f64_strided_run
RoPE FW strided, f64.
baracuda_kernels_scale_inplace_c32_can_implement
Implementability check for baracuda_kernels_scale_inplace_c32. Host-side only.
baracuda_kernels_scale_inplace_c32_run
In-place scale of a cufftComplex buffer by a real scalar: y[i].x *= scale; y[i].y *= scale;. Applied after cufftExecC2C in the inverse direction to bake in the 1/N normalization PyTorch expects.
baracuda_kernels_scale_inplace_c64_can_implement
Implementability check for baracuda_kernels_scale_inplace_c64. Host-side only.
baracuda_kernels_scale_inplace_c64_run
In-place scale of a cufftDoubleComplex buffer by a real scalar. f64 analogue of baracuda_kernels_scale_inplace_c32_run.
baracuda_kernels_scale_inplace_real_f32_can_implement
Implementability check for baracuda_kernels_scale_inplace_real_f32. Host-side only.
baracuda_kernels_scale_inplace_real_f32_run
In-place scale of a real f32 buffer. Used to bake the 1/N normalization into the output of cufftExecC2R (IRFFT).
baracuda_kernels_scale_inplace_real_f64_can_implement
Implementability check for baracuda_kernels_scale_inplace_real_f64. Host-side only.
baracuda_kernels_scale_inplace_real_f64_run
In-place scale of a real f64 buffer. f64 analogue.
baracuda_kernels_scan_cummax_backward_bf16_can_implement
Pre-launch implementability check for scan_cummax_backward_bf16.
baracuda_kernels_scan_cummax_backward_bf16_run
Cummax backward, bf16.
baracuda_kernels_scan_cummax_backward_f16_can_implement
Pre-launch implementability check for scan_cummax_backward_f16.
baracuda_kernels_scan_cummax_backward_f16_run
Cummax backward, f16.
baracuda_kernels_scan_cummax_backward_f32_can_implement
Pre-launch implementability check for scan_cummax_backward_f32.
baracuda_kernels_scan_cummax_backward_f32_run
Cummax backward, f32. Walks the forward scan tracking first-occurrence argmax; gradient flows to the source position.
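The first-occurrence argmax walk can be sketched on the CPU for a 1-D scan axis as follows. The function name `cummax_backward` is hypothetical, and the sketch covers only the forward scan direction; NaN handling is not modelled.

```rust
/// CPU reference for cummax backward along a 1-D axis (forward direction only).
/// At each step k the running argmax is updated, and dy[k] is routed to the
/// element that produced the running max.
fn cummax_backward(x: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(x.len(), dy.len());
    let mut dx = vec![0.0f32; x.len()];
    let mut arg = 0usize;
    for k in 0..x.len() {
        // Strict `>` keeps the first occurrence when values tie.
        if x[k] > x[arg] {
            arg = k;
        }
        dx[arg] += dy[k];
    }
    dx
}
```

For `x = [1, 3, 2, 3, 5]` the running argmax is `[0, 1, 1, 1, 4]`, so the second `3` receives no gradient.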
baracuda_kernels_scan_cummax_backward_f64_can_implement
Pre-launch implementability check for scan_cummax_backward_f64.
baracuda_kernels_scan_cummax_backward_f64_run
Cummax backward, f64.
baracuda_kernels_scan_cummax_bf16_can_implement
Pre-launch implementability check for scan_cummax_bf16.
baracuda_kernels_scan_cummax_bf16_run
Cummax, bf16.
baracuda_kernels_scan_cummax_f16_can_implement
Pre-launch implementability check for scan_cummax_f16.
baracuda_kernels_scan_cummax_f16_run
Cummax, f16.
baracuda_kernels_scan_cummax_f32_can_implement
Pre-launch implementability check for scan_cummax_f32.
baracuda_kernels_scan_cummax_f32_run
Cummax (inclusive prefix running max), f32.
baracuda_kernels_scan_cummax_f64_can_implement
Pre-launch implementability check for scan_cummax_f64.
baracuda_kernels_scan_cummax_f64_run
Cummax, f64.
baracuda_kernels_scan_cummin_backward_bf16_can_implement
Pre-launch implementability check for scan_cummin_backward_bf16.
baracuda_kernels_scan_cummin_backward_bf16_run
Cummin backward, bf16.
baracuda_kernels_scan_cummin_backward_f16_can_implement
Pre-launch implementability check for scan_cummin_backward_f16.
baracuda_kernels_scan_cummin_backward_f16_run
Cummin backward, f16.
baracuda_kernels_scan_cummin_backward_f32_can_implement
Pre-launch implementability check for scan_cummin_backward_f32.
baracuda_kernels_scan_cummin_backward_f32_run
Cummin backward, f32. Same kernel shape as Cummax BW with < instead of > for the tie-tracking comparison.
baracuda_kernels_scan_cummin_backward_f64_can_implement
Pre-launch implementability check for scan_cummin_backward_f64.
baracuda_kernels_scan_cummin_backward_f64_run
Cummin backward, f64.
baracuda_kernels_scan_cummin_bf16_can_implement
Pre-launch implementability check for scan_cummin_bf16.
baracuda_kernels_scan_cummin_bf16_run
Cummin, bf16.
baracuda_kernels_scan_cummin_f16_can_implement
Pre-launch implementability check for scan_cummin_f16.
baracuda_kernels_scan_cummin_f16_run
Cummin, f16.
baracuda_kernels_scan_cummin_f32_can_implement
Pre-launch implementability check for scan_cummin_f32.
baracuda_kernels_scan_cummin_f32_run
Cummin (inclusive prefix running min), f32.
baracuda_kernels_scan_cummin_f64_can_implement
Pre-launch implementability check for scan_cummin_f64.
baracuda_kernels_scan_cummin_f64_run
Cummin, f64.
baracuda_kernels_scan_cumprod_backward_bf16_can_implement
Pre-launch implementability check for scan_cumprod_backward_bf16.
baracuda_kernels_scan_cumprod_backward_bf16_run
Cumprod backward, bf16.
baracuda_kernels_scan_cumprod_backward_f16_can_implement
Pre-launch implementability check for scan_cumprod_backward_f16.
baracuda_kernels_scan_cumprod_backward_f16_run
Cumprod backward, f16. f32-detour accumulator.
baracuda_kernels_scan_cumprod_backward_f32_can_implement
Pre-launch implementability check for scan_cumprod_backward_f32.
baracuda_kernels_scan_cumprod_backward_f32_run
Cumprod backward, f32. Per-cell suffix accumulator of dy[i] * y[i] / x[j]. Caller must ensure x has no zeros along the scan axis.
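A minimal CPU sketch of that suffix accumulator for a 1-D forward scan, assuming `dx[j] = (Σ_{i ≥ j} dy[i] * y[i]) / x[j]`. The function names are hypothetical, and the division is exactly why zeros in `x` are not allowed.

```rust
/// Inclusive prefix product, used here only to produce the saved `y`.
fn cumprod(x: &[f32]) -> Vec<f32> {
    let mut acc = 1.0f32;
    x.iter()
        .map(|&v| {
            acc *= v;
            acc
        })
        .collect()
}

/// CPU reference for cumprod backward (forward scan direction only).
/// Walks the axis from the end, keeping a running suffix sum of dy[i] * y[i].
fn cumprod_backward(x: &[f32], y: &[f32], dy: &[f32]) -> Vec<f32> {
    let n = x.len();
    assert!(y.len() == n && dy.len() == n);
    let mut dx = vec![0.0f32; n];
    let mut suffix = 0.0f32;
    for j in (0..n).rev() {
        suffix += dy[j] * y[j];
        dx[j] = suffix / x[j]; // undefined if x[j] == 0
    }
    dx
}
```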
baracuda_kernels_scan_cumprod_backward_f64_can_implement
Pre-launch implementability check for scan_cumprod_backward_f64.
baracuda_kernels_scan_cumprod_backward_f64_run
Cumprod backward, f64.
baracuda_kernels_scan_cumprod_bf16_can_implement
Pre-launch implementability check for scan_cumprod_bf16.
baracuda_kernels_scan_cumprod_bf16_run
Cumprod, bf16.
baracuda_kernels_scan_cumprod_f16_can_implement
Pre-launch implementability check for scan_cumprod_f16.
baracuda_kernels_scan_cumprod_f16_run
Cumprod, f16. f32-detour accumulator.
baracuda_kernels_scan_cumprod_f32_can_implement
Pre-launch implementability check for scan_cumprod_f32.
baracuda_kernels_scan_cumprod_f32_run
Cumprod (inclusive prefix product), f32. Same ABI as cumsum.
baracuda_kernels_scan_cumprod_f64_can_implement
Pre-launch implementability check for scan_cumprod_f64.
baracuda_kernels_scan_cumprod_f64_run
Cumprod, f64.
baracuda_kernels_scan_cumsum_bf16_can_implement
Pre-launch implementability check for scan_cumsum_bf16.
baracuda_kernels_scan_cumsum_bf16_run
Cumsum, bf16.
baracuda_kernels_scan_cumsum_f16_can_implement
Pre-launch implementability check for scan_cumsum_f16.
baracuda_kernels_scan_cumsum_f16_run
Cumsum, f16. f32-detour accumulator inside the kernel.
baracuda_kernels_scan_cumsum_f32_can_implement
Pre-launch implementability check for scan_cumsum_f32.
baracuda_kernels_scan_cumsum_f32_run
Inclusive prefix sum (cumsum) along a single axis, f32. reverse != 0 flips the scan direction.
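The effect of the `reverse` flag can be illustrated with a small CPU reference over a 1-D axis. The function name and the `bool` parameter are stand-ins for the raw integer flag of the extern entry point.

```rust
/// CPU reference for an inclusive prefix sum along a 1-D axis.
/// `reverse = true` scans from the last element towards the first.
fn cumsum(x: &[f32], reverse: bool) -> Vec<f32> {
    let n = x.len();
    let mut y = vec![0.0f32; n];
    let mut acc = 0.0f32;
    for step in 0..n {
        let i = if reverse { n - 1 - step } else { step };
        acc += x[i];
        y[i] = acc;
    }
    y
}
```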
baracuda_kernels_scan_cumsum_f64_can_implement
Pre-launch implementability check for scan_cumsum_f64.
baracuda_kernels_scan_cumsum_f64_run
Cumsum, f64.
baracuda_kernels_scan_log_cumsum_exp_backward_bf16_can_implement
Pre-launch implementability check for scan_log_cumsum_exp_backward_bf16.
baracuda_kernels_scan_log_cumsum_exp_backward_bf16_run
LogCumsumExp BW, bf16.
baracuda_kernels_scan_log_cumsum_exp_backward_f16_can_implement
Pre-launch implementability check for scan_log_cumsum_exp_backward_f16.
baracuda_kernels_scan_log_cumsum_exp_backward_f16_run
LogCumsumExp BW, f16. f32-detour accumulator.
baracuda_kernels_scan_log_cumsum_exp_backward_f32_can_implement
Pre-launch implementability check for scan_log_cumsum_exp_backward_f32.
baracuda_kernels_scan_log_cumsum_exp_backward_f32_run
LogCumsumExp BW, f32. Per-cell accumulator of Σ dy[i] * exp(x[k] - y[i]) over the FW-direction-dependent i range. Needs both saved x and saved y (same shape since scans are length-preserving). Stable by construction: x[k] - y[i] ≤ 0 so exp(.) ∈ [0, 1].
baracuda_kernels_scan_log_cumsum_exp_backward_f64_can_implement
Pre-launch implementability check for scan_log_cumsum_exp_backward_f64.
baracuda_kernels_scan_log_cumsum_exp_backward_f64_run
LogCumsumExp BW, f64.
baracuda_kernels_scan_log_cumsum_exp_bf16_can_implement
Pre-launch implementability check for scan_log_cumsum_exp_bf16.
baracuda_kernels_scan_log_cumsum_exp_bf16_run
LogCumsumExp FW, bf16.
baracuda_kernels_scan_log_cumsum_exp_f16_can_implement
Pre-launch implementability check for scan_log_cumsum_exp_f16.
baracuda_kernels_scan_log_cumsum_exp_f16_run
LogCumsumExp FW, f16. f32-detour accumulator inside the kernel.
baracuda_kernels_scan_log_cumsum_exp_f32_can_implement
Pre-launch implementability check for scan_log_cumsum_exp_f32.
baracuda_kernels_scan_log_cumsum_exp_f32_run
LogCumsumExp FW, f32. y[k] = log(Σ_{j ≤ k} exp(x[j])) (or suffix-LSE when reverse != 0). Numerically stable via the online running-max algorithm. Same ABI as cumsum.
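One common way to realise an online running-max log-sum-exp scan is to carry a running max `m` and a sum `s` of `exp(x[j] - m)`, rescaling `s` whenever `m` grows. The sketch below shows that idea on the CPU for a forward 1-D scan. It is an assumption about the general technique, not a transcription of the kernel, and the function name is hypothetical.

```rust
/// CPU reference for y[k] = log(sum_{j <= k} exp(x[j])), forward direction.
/// Keeps a running max `m` and a rescaled sum `s = sum exp(x[j] - m)`.
fn log_cumsum_exp(x: &[f32]) -> Vec<f32> {
    let mut m = f32::NEG_INFINITY;
    let mut s = 0.0f32;
    let mut y = Vec::with_capacity(x.len());
    for &v in x {
        let m_new = m.max(v);
        if m_new == f32::NEG_INFINITY {
            // Every input so far was -inf, so the prefix sum of exps is 0.
            y.push(f32::NEG_INFINITY);
            continue;
        }
        // Rescale the old sum to the new max, then add the new term.
        s = s * (m - m_new).exp() + (v - m_new).exp();
        m = m_new;
        y.push(m + s.ln());
    }
    y
}
```

Because every exponent is at most 0, inputs around 1000 stay finite even though `exp(1000.0)` overflows f32.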
baracuda_kernels_scan_log_cumsum_exp_f64_can_implement
Pre-launch implementability check for scan_log_cumsum_exp_f64.
baracuda_kernels_scan_log_cumsum_exp_f64_run
LogCumsumExp FW, f64.
baracuda_kernels_scatter_add_f32_can_implement
Implementability check for scatter_add_f32.
baracuda_kernels_scatter_add_f32_run
out[..., index[..., j, ...], ...] += updates[..., j, ...] along scatter_dim. f32 (atomicAdd).
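For intuition, the 1-D case of this accumulation looks like the CPU sketch below. The function name is hypothetical. The loop is sequential, whereas the kernel uses atomicAdd, so floating-point summation order (and therefore rounding) may differ on the device. How the kernel treats out-of-range indices is not stated here; this sketch simply panics on them.

```rust
/// CPU reference for 1-D scatter_add: out[index[j]] += updates[j].
/// Duplicate indices accumulate rather than overwrite.
fn scatter_add_1d(out: &mut [f32], index: &[i32], updates: &[f32]) {
    assert_eq!(index.len(), updates.len());
    for (&i, &u) in index.iter().zip(updates) {
        out[i as usize] += u; // panics if the index is out of range
    }
}
```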
baracuda_kernels_scatter_add_f64_can_implement
Implementability check for scatter_add_f64.
baracuda_kernels_scatter_add_f64_run
scatter_add — f64 (atomicAdd).
baracuda_kernels_scatter_add_i64idx_f32_can_implement
Implementability check for scatter_add_i64idx_f32.
baracuda_kernels_scatter_add_i64idx_f32_run
scatter_add — f32, i64 indices (atomicAdd).
baracuda_kernels_scatter_add_i64idx_f64_can_implement
Implementability check for scatter_add_i64idx_f64.
baracuda_kernels_scatter_add_i64idx_f64_run
scatter_add — f64, i64 indices.
baracuda_kernels_scatter_bf16_can_implement
Implementability check for scatter_bf16.
baracuda_kernels_scatter_bf16_run
scatter — bf16, i32 idx.
baracuda_kernels_scatter_f16_can_implement
Implementability check for scatter_f16.
baracuda_kernels_scatter_f16_run
scatter — f16, i32 idx.
baracuda_kernels_scatter_f32_can_implement
Implementability check for scatter_f32.
baracuda_kernels_scatter_f32_run
scatter: out[index] = updates, f32, i32 idx. No accumulation.
baracuda_kernels_scatter_f64_can_implement
Implementability check for scatter_f64.
baracuda_kernels_scatter_f64_run
scatter — f64, i32 idx.
baracuda_kernels_scatter_i8_can_implement
Implementability check for scatter_i8.
baracuda_kernels_scatter_i8_run
scatter, i8, i32 idx.
baracuda_kernels_scatter_i16_can_implement
Implementability check for scatter_i16.
baracuda_kernels_scatter_i16_run
scatter, i16, i32 idx.
baracuda_kernels_scatter_i32_can_implement
Implementability check for scatter_i32.
baracuda_kernels_scatter_i32_run
scatter, i32, i32 idx.
baracuda_kernels_scatter_i64_can_implement
Implementability check for scatter_i64.
baracuda_kernels_scatter_i64_run
scatter, i64, i32 idx.
baracuda_kernels_scatter_i64idx_bf16_can_implement
Implementability check for scatter_i64idx_bf16.
baracuda_kernels_scatter_i64idx_bf16_run
scatter — bf16, i64 idx.
baracuda_kernels_scatter_i64idx_f16_can_implement
Implementability check for scatter_i64idx_f16.
baracuda_kernels_scatter_i64idx_f16_run
scatter — f16, i64 idx.
baracuda_kernels_scatter_i64idx_f32_can_implement
Implementability check for scatter_i64idx_f32.
baracuda_kernels_scatter_i64idx_f32_run
scatter — f32, i64 idx.
baracuda_kernels_scatter_i64idx_f64_can_implement
Implementability check for scatter_i64idx_f64.
baracuda_kernels_scatter_i64idx_f64_run
scatter — f64, i64 idx.
baracuda_kernels_scatter_i64idx_i8_can_implement
Implementability check for scatter_i64idx_i8.
baracuda_kernels_scatter_i64idx_i8_run
scatter, i8, i64 idx.
baracuda_kernels_scatter_i64idx_i16_can_implement
Implementability check for scatter_i64idx_i16.
baracuda_kernels_scatter_i64idx_i16_run
scatter, i16, i64 idx.
baracuda_kernels_scatter_i64idx_i32_can_implement
Implementability check for scatter_i64idx_i32.
baracuda_kernels_scatter_i64idx_i32_run
scatter, i32, i64 idx.
baracuda_kernels_scatter_i64idx_i64_can_implement
Implementability check for scatter_i64idx_i64.
baracuda_kernels_scatter_i64idx_i64_run
scatter, i64, i64 idx.
baracuda_kernels_scatter_i64idx_u8_can_implement
Implementability check for scatter_i64idx_u8.
baracuda_kernels_scatter_i64idx_u8_run
scatter, u8, i64 idx.
baracuda_kernels_scatter_i64idx_u16_can_implement
Implementability check for scatter_i64idx_u16.
baracuda_kernels_scatter_i64idx_u16_run
scatter, u16, i64 idx.
baracuda_kernels_scatter_i64idx_u32_can_implement
Implementability check for scatter_i64idx_u32.
baracuda_kernels_scatter_i64idx_u32_run
scatter, u32, i64 idx.
baracuda_kernels_scatter_u8_can_implement
Implementability check for scatter_u8.
baracuda_kernels_scatter_u8_run
scatter, u8, i32 idx.
baracuda_kernels_scatter_u16_can_implement
Implementability check for scatter_u16.
baracuda_kernels_scatter_u16_run
scatter, u16, i32 idx.
baracuda_kernels_scatter_u32_can_implement
Implementability check for scatter_u32.
baracuda_kernels_scatter_u32_run
scatter, u32, i32 idx.
baracuda_kernels_sdpa_backward_bf16_can_implement
Implementability check for sdpa_backward_bf16. Host-side only.
baracuda_kernels_sdpa_backward_bf16_run
SDPA BW, bf16.
baracuda_kernels_sdpa_backward_bf16_strided_can_implement
Implementability check for sdpa_backward_bf16_strided. Host-side only.
baracuda_kernels_sdpa_backward_bf16_strided_run
SDPA BW strided, bf16.
baracuda_kernels_sdpa_backward_f16_can_implement
Implementability check for sdpa_backward_f16. Host-side only.
baracuda_kernels_sdpa_backward_f16_run
SDPA BW, f16.
baracuda_kernels_sdpa_backward_f16_strided_can_implement
Implementability check for sdpa_backward_f16_strided. Host-side only.
baracuda_kernels_sdpa_backward_f16_strided_run
SDPA BW strided, f16.
baracuda_kernels_sdpa_backward_f32_can_implement
Implementability check for sdpa_backward_f32. Host-side only.
baracuda_kernels_sdpa_backward_f32_run
SDPA BW, f32. Given the FW-saved attn ([B, H, Q, K]), Q, K, V, and upstream dy, computes dQ, dK, dV. The dscores_ws argument is a caller-allocated [B, H, Q, K] scratch buffer reused as the dattn → dscores intermediate; size matches the FW attn tensor.
baracuda_kernels_sdpa_backward_f32_strided_can_implement
Implementability check for sdpa_backward_f32_strided. Host-side only.
baracuda_kernels_sdpa_backward_f32_strided_run
SDPA BW strided, f32.
baracuda_kernels_sdpa_backward_f64_can_implement
Implementability check for sdpa_backward_f64. Host-side only.
baracuda_kernels_sdpa_backward_f64_run
SDPA BW, f64.
baracuda_kernels_sdpa_backward_f64_strided_can_implement
Implementability check for sdpa_backward_f64_strided. Host-side only.
baracuda_kernels_sdpa_backward_f64_strided_run
SDPA BW strided, f64.
baracuda_kernels_sdpa_bf16_arbmask_can_implement
Arbitrary-mask SDPA host-side can-implement, bf16.
baracuda_kernels_sdpa_bf16_arbmask_run
Arbitrary additive-mask SDPA FW, bf16 (f32 accumulators).
baracuda_kernels_sdpa_bf16_can_implement
Implementability check for sdpa_bf16. Host-side only.
baracuda_kernels_sdpa_bf16_run
SDPA FW, bf16 (f32 accumulators).
baracuda_kernels_sdpa_bf16_strided_can_implement
Implementability check for sdpa_bf16_strided. Host-side only.
baracuda_kernels_sdpa_bf16_strided_run
SDPA FW strided, bf16.
baracuda_kernels_sdpa_f16_arbmask_can_implement
Arbitrary-mask SDPA host-side can-implement, f16.
baracuda_kernels_sdpa_f16_arbmask_run
Arbitrary additive-mask SDPA FW, f16 (f32 accumulators).
baracuda_kernels_sdpa_f16_can_implement
Implementability check for sdpa_f16. Host-side only.
baracuda_kernels_sdpa_f16_run
SDPA FW, f16 (f32 accumulators).
baracuda_kernels_sdpa_f16_strided_can_implement
Implementability check for sdpa_f16_strided. Host-side only.
baracuda_kernels_sdpa_f16_strided_run
SDPA FW strided, f16.
baracuda_kernels_sdpa_f32_arbmask_can_implement
Arbitrary-mask SDPA host-side can-implement, f32.
baracuda_kernels_sdpa_f32_arbmask_run
Arbitrary additive-mask SDPA FW, f32. mask shape [B, H, Q, K] f32, applied as an additive bias on the score tile before softmax.
baracuda_kernels_sdpa_f32_can_implement
Implementability check for sdpa_f32. Host-side only.
baracuda_kernels_sdpa_f32_run
SDPA FW, f32. Computes y = softmax(Q·K^T·scale + mask) · V. The attn buffer ([B, H, Q, K]) doubles as the scores intermediate and is overwritten in place with the softmax output (saved for BW). Pass has_mask = 0 and mask = nullptr to skip the mask add. is_causal = 1 applies an upper-triangular -inf mask inside the scores kernel.
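The formula can be checked against a small CPU reference for a single (batch, head) slice. This is a sketch under assumptions: the name `sdpa_slice` is hypothetical, V is given the same head dimension as Q and K, and the causal mask is aligned so that key `j` is hidden from query `i` whenever `j > i`. As in the description, the `attn` buffer first holds the scores and is then overwritten with the softmax output.

```rust
/// CPU reference for y = softmax(Q·K^T·scale + mask) · V on one slice.
/// q: [q_len, d], k: [k_len, d], v: [k_len, d], mask: [q_len, k_len] (optional).
/// Returns (attn, y) with attn: [q_len, k_len] and y: [q_len, d].
fn sdpa_slice(
    q: &[f32],
    k: &[f32],
    v: &[f32],
    mask: Option<&[f32]>,
    q_len: usize,
    k_len: usize,
    d: usize,
    scale: f32,
    is_causal: bool,
) -> (Vec<f32>, Vec<f32>) {
    let mut attn = vec![0.0f32; q_len * k_len];
    for i in 0..q_len {
        let row = &mut attn[i * k_len..(i + 1) * k_len];
        // Scores: scaled dot product plus optional additive mask.
        for j in 0..k_len {
            let dot: f32 = (0..d).map(|t| q[i * d + t] * k[j * d + t]).sum();
            let mut s = dot * scale;
            if let Some(m) = mask {
                s += m[i * k_len + j];
            }
            if is_causal && j > i {
                s = f32::NEG_INFINITY; // ASSUMED alignment: hide keys j > i
            }
            row[j] = s;
        }
        // In-place softmax with max subtraction. A row whose scores are all
        // -inf would produce NaN here; that case is not handled.
        let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let mut sum = 0.0f32;
        for s in row.iter_mut() {
            *s = (*s - max).exp();
            sum += *s;
        }
        for s in row.iter_mut() {
            *s /= sum;
        }
    }
    // y = attn · V
    let mut y = vec![0.0f32; q_len * d];
    for i in 0..q_len {
        for j in 0..k_len {
            let a = attn[i * k_len + j];
            for t in 0..d {
                y[i * d + t] += a * v[j * d + t];
            }
        }
    }
    (attn, y)
}
```

With all-zero queries every score is 0, so each output row is the plain average of the visible V rows, which makes the causal case easy to verify by hand.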
baracuda_kernels_sdpa_f32_strided_can_implement
Implementability check for sdpa_f32_strided. Host-side only.
baracuda_kernels_sdpa_f32_strided_run
SDPA FW strided, f32.
baracuda_kernels_sdpa_f64_arbmask_can_implement
Arbitrary-mask SDPA host-side can-implement, f64.
baracuda_kernels_sdpa_f64_arbmask_run
Arbitrary additive-mask SDPA FW, f64.
baracuda_kernels_sdpa_f64_can_implement
Implementability check for sdpa_f64. Host-side only.
baracuda_kernels_sdpa_f64_run
SDPA FW, f64.
baracuda_kernels_sdpa_f64_strided_can_implement
Implementability check for sdpa_f64_strided. Host-side only.
baracuda_kernels_sdpa_f64_strided_run
SDPA FW strided, f64.
baracuda_kernels_searchsorted_f32_can_implement
Implementability check for searchsorted_f32.
baracuda_kernels_searchsorted_f32_run
searchsorted, f32. right == 0 selects lower_bound; right == 1 selects upper_bound.
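The two bounds differ only in how ties are handled, which a CPU sketch built on the standard library's `partition_point` makes explicit. The function name and the `bool` parameter are stand-ins for the raw flag, and the kernel's output index type is not assumed here.

```rust
/// CPU reference for searchsorted over an ascending slice.
/// right = false: first index i with sorted[i] >= value (lower_bound).
/// right = true:  first index i with sorted[i] >  value (upper_bound).
fn searchsorted(sorted: &[f32], value: f32, right: bool) -> usize {
    if right {
        sorted.partition_point(|&e| e <= value)
    } else {
        sorted.partition_point(|&e| e < value)
    }
}
```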
baracuda_kernels_searchsorted_f64_can_implement
Implementability check for searchsorted_f64.
baracuda_kernels_searchsorted_f64_run
searchsorted, f64.
baracuda_kernels_searchsorted_i32_can_implement
Implementability check for searchsorted_i32.
baracuda_kernels_searchsorted_i32_run
searchsorted, i32.
baracuda_kernels_searchsorted_i64_can_implement
Implementability check for searchsorted_i64.
baracuda_kernels_searchsorted_i64_run
searchsorted, i64.
baracuda_kernels_segment_max_backward_f32_can_implement
Implementability check for segment_max_backward_f32.
baracuda_kernels_segment_max_backward_f32_run
d_input[k, d] = d_output[seg, d] iff k is the (first) max-argument of the segment in column d, else 0. Sorted seg ids. f32.
baracuda_kernels_segment_max_backward_f64_can_implement
Implementability check for segment_max_backward_f64.
baracuda_kernels_segment_max_backward_f64_run
segment_max_backward — f64.
baracuda_kernels_segment_max_f32_can_implement
Implementability check for segment_max_f32.
baracuda_kernels_segment_max_f32_run
out[s, d] = max_{n : seg[n] == s} input[n, d] — sorted. f32.
baracuda_kernels_segment_max_f64_can_implement
Implementability check for segment_max_f64.
baracuda_kernels_segment_max_f64_run
segment_max — f64.
baracuda_kernels_segment_max_i64idx_f32_can_implement
Implementability check for segment_max_i64idx_f32.
baracuda_kernels_segment_max_i64idx_f32_run
segment_max, f32, i64 segment ids.
baracuda_kernels_segment_max_i64idx_f64_can_implement
Implementability check for segment_max_i64idx_f64.
baracuda_kernels_segment_max_i64idx_f64_run
segment_max, f64, i64 segment ids.
baracuda_kernels_segment_mean_backward_f32_can_implement
Implementability check for segment_mean_backward_f32.
baracuda_kernels_segment_mean_backward_f32_run
d_input[n, d] = d_output[seg[n], d] / count[seg[n]]. Workspace: num_segments * sizeof(i32). f32.
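A CPU sketch of this formula, with the per-segment counts held in a `Vec<i32>` of length `num_segments` that plays the role of the workspace. The function name and the flat row-major layout are assumptions made for the example.

```rust
/// CPU reference for segment_mean backward.
/// d_output: [num_segments, dim], seg: [N], result d_input: [N, dim].
fn segment_mean_backward(
    d_output: &[f32],
    seg: &[i32],
    num_segments: usize,
    dim: usize,
) -> Vec<f32> {
    // Count rows per segment (the device version keeps this in workspace).
    let mut count = vec![0i32; num_segments];
    for &s in seg {
        count[s as usize] += 1;
    }
    let mut d_input = vec![0.0f32; seg.len() * dim];
    for (n, &s) in seg.iter().enumerate() {
        let s = s as usize;
        for d in 0..dim {
            d_input[n * dim + d] = d_output[s * dim + d] / count[s] as f32;
        }
    }
    d_input
}
```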
baracuda_kernels_segment_mean_backward_f64_can_implement
Implementability check for segment_mean_backward_f64.
baracuda_kernels_segment_mean_backward_f64_run
segment_mean_backward — f64.
baracuda_kernels_segment_mean_backward_i64idx_f32_can_implement
Implementability check for segment_mean_backward_i64idx_f32.
baracuda_kernels_segment_mean_backward_i64idx_f32_run
segment_mean_backward, f32, i64 segment ids.
baracuda_kernels_segment_mean_backward_i64idx_f64_can_implement
Implementability check for segment_mean_backward_i64idx_f64.
baracuda_kernels_segment_mean_backward_i64idx_f64_run
segment_mean_backward, f64, i64 segment ids.
baracuda_kernels_segment_mean_f32_can_implement
Implementability check for segment_mean_f32.
baracuda_kernels_segment_mean_f32_run
out[s, d] = mean_{n : seg[n] == s} input[n, d] — sorted. f32.
baracuda_kernels_segment_mean_f64_can_implement
Implementability check for segment_mean_f64.
baracuda_kernels_segment_mean_f64_run
segment_mean — f64.
baracuda_kernels_segment_mean_i64idx_f32_can_implement
Implementability check for segment_mean_i64idx_f32.
baracuda_kernels_segment_mean_i64idx_f32_run
segment_mean, f32, i64 segment ids.
baracuda_kernels_segment_mean_i64idx_f64_can_implement
Implementability check for segment_mean_i64idx_f64.
baracuda_kernels_segment_mean_i64idx_f64_run
segment_mean, f64, i64 segment ids.
baracuda_kernels_segment_min_backward_f32_can_implement
Implementability check for segment_min_backward_f32.
baracuda_kernels_segment_min_backward_f32_run
segment_min_backward — f32.
baracuda_kernels_segment_min_backward_f64_can_implement
Implementability check for segment_min_backward_f64.
baracuda_kernels_segment_min_backward_f64_run
segment_min_backward — f64.
baracuda_kernels_segment_min_f32_can_implement
Implementability check for segment_min_f32.
baracuda_kernels_segment_min_f32_run
out[s, d] = min_{n : seg[n] == s} input[n, d] — sorted. f32.
baracuda_kernels_segment_min_f64_can_implement
Implementability check for segment_min_f64.
baracuda_kernels_segment_min_f64_run
segment_min — f64.
baracuda_kernels_segment_min_i64idx_f32_can_implement
Implementability check for segment_min_i64idx_f32.
baracuda_kernels_segment_min_i64idx_f32_run
segment_min, f32, i64 segment ids.
baracuda_kernels_segment_min_i64idx_f64_can_implement
Implementability check for segment_min_i64idx_f64.
baracuda_kernels_segment_min_i64idx_f64_run
segment_min, f64, i64 segment ids.
baracuda_kernels_segment_prod_backward_f32_can_implement
Implementability check for segment_prod_backward_f32.
baracuda_kernels_segment_prod_backward_f32_run
segment_prod_backward — f32.
baracuda_kernels_segment_prod_backward_f64_can_implement
Implementability check for segment_prod_backward_f64.
baracuda_kernels_segment_prod_backward_f64_run
segment_prod_backward — f64.
baracuda_kernels_segment_prod_f32_can_implement
Implementability check for segment_prod_f32.
baracuda_kernels_segment_prod_f32_run
out[s, d] = prod_{n : seg[n] == s} input[n, d] — sorted. f32.
baracuda_kernels_segment_prod_f64_can_implement
Implementability check for segment_prod_f64.
baracuda_kernels_segment_prod_f64_run
segment_prod — f64.
baracuda_kernels_segment_prod_i64idx_f32_can_implement
Implementability check for segment_prod_i64idx_f32.
baracuda_kernels_segment_prod_i64idx_f32_run
segment_prod, f32, i64 segment ids.
baracuda_kernels_segment_prod_i64idx_f64_can_implement
Implementability check for segment_prod_i64idx_f64.
baracuda_kernels_segment_prod_i64idx_f64_run
segment_prod, f64, i64 segment ids.
baracuda_kernels_segment_sum_backward_f32_can_implement
Implementability check for segment_sum_backward_f32.
baracuda_kernels_segment_sum_backward_f32_run
d_input[n, d] = d_output[seg[n], d]. f32.
baracuda_kernels_segment_sum_backward_f64_can_implement
Implementability check for segment_sum_backward_f64.
baracuda_kernels_segment_sum_backward_f64_run
segment_sum_backward — f64.
baracuda_kernels_segment_sum_backward_i64idx_f32_can_implement
Implementability check for segment_sum_backward_i64idx_f32.
baracuda_kernels_segment_sum_backward_i64idx_f32_run
segment_sum_backward, f32, i64idx variant.
baracuda_kernels_segment_sum_backward_i64idx_f64_can_implement
Implementability check for segment_sum_backward_i64idx_f64.
baracuda_kernels_segment_sum_backward_i64idx_f64_run
segment_sum_backward, f64, i64idx variant.
baracuda_kernels_segment_sum_f32_can_implement
Implementability check for segment_sum_f32.
baracuda_kernels_segment_sum_f32_run
out[s, d] = Σ_{n : seg[n] == s} input[n, d] — sorted seg ids (monotonically non-decreasing). f32.
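For testing a device result against the formula above, a host-side reference might look like the following. The helper name `segment_sum_ref`, the row-major layout, and the i32 segment-id type are assumptions for illustration; the real entry point takes raw device pointers.

```rust
/// CPU reference for sorted segment-sum (hypothetical helper, not part of the crate).
/// `input` is assumed row-major [n, d]; `seg` has length n and must be non-decreasing.
/// Returns `out` as row-major [num_segments, d]; segments with no rows stay zero.
fn segment_sum_ref(input: &[f32], seg: &[i32], d: usize, num_segments: usize) -> Vec<f32> {
    assert_eq!(input.len(), seg.len() * d);
    // The kernel's contract is sorted ids; the reference checks it in debug builds.
    debug_assert!(seg.windows(2).all(|w| w[0] <= w[1]));
    let mut out = vec![0.0f32; num_segments * d];
    for (n, &s) in seg.iter().enumerate() {
        let s = s as usize;
        for j in 0..d {
            out[s * d + j] += input[n * d + j];
        }
    }
    out
}
```

With `seg = [0, 0, 2]` the first two rows are summed into segment 0, segment 1 stays zero, and the last row lands in segment 2.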
baracuda_kernels_segment_sum_f64_can_implement
Implementability check for segment_sum_f64.
baracuda_kernels_segment_sum_f64_run
segment_sum — f64.
baracuda_kernels_segment_sum_i64idx_f32_can_implement
Implementability check for segment_sum_i64idx_f32.
baracuda_kernels_segment_sum_i64idx_f32_run
segment_sum, f32, i64idx variant.
baracuda_kernels_segment_sum_i64idx_f64_can_implement
Implementability check for segment_sum_i64idx_f64.
baracuda_kernels_segment_sum_i64idx_f64_run
segment_sum, f64, i64idx variant.
baracuda_kernels_softmax_backward_bf16_can_implement
Implementability check for softmax_backward_bf16.
baracuda_kernels_softmax_backward_bf16_run
Softmax BW, bf16.
baracuda_kernels_softmax_backward_bf16_strided_can_implement
softmax_backward_bf16_strided_can_implement companion.
baracuda_kernels_softmax_backward_bf16_strided_run
Softmax BW strided sibling, bf16.
baracuda_kernels_softmax_backward_f16_can_implement
Implementability check for softmax_backward_f16.
baracuda_kernels_softmax_backward_f16_run
Softmax BW, f16.
baracuda_kernels_softmax_backward_f16_strided_can_implement
softmax_backward_f16_strided_can_implement companion.
baracuda_kernels_softmax_backward_f16_strided_run
Softmax BW strided sibling, f16.
baracuda_kernels_softmax_backward_f32_can_implement
Implementability check for softmax_backward_f32.
baracuda_kernels_softmax_backward_f32_run
Softmax BW, f32. dx[k] = y[k] * (dy[k] - Σ_j y[j] * dy[j]). Caller passes the saved forward output y.
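The backward formula above needs only the saved forward output y and the upstream gradient dy. A minimal single-row sketch on the host, where `softmax_backward_ref` is a hypothetical helper name and not a function in this crate:

```rust
/// CPU reference for one softmax row's backward pass (hypothetical helper).
/// Implements dx[k] = y[k] * (dy[k] - sum_j y[j] * dy[j]).
fn softmax_backward_ref(y: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(y.len(), dy.len());
    // The inner product is shared by every output element of the row.
    let dot: f32 = y.iter().zip(dy).map(|(&yj, &dyj)| yj * dyj).sum();
    y.iter().zip(dy).map(|(&yk, &dyk)| yk * (dyk - dot)).collect()
}
```

When y sums to 1, the entries of dx sum to zero, which is a cheap sanity check on a device result.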
baracuda_kernels_softmax_backward_f32_strided_can_implement
softmax_backward_f32_strided_can_implement companion.
baracuda_kernels_softmax_backward_f32_strided_run
Softmax BW strided sibling, f32.
baracuda_kernels_softmax_backward_f64_can_implement
Implementability check for softmax_backward_f64.
baracuda_kernels_softmax_backward_f64_run
Softmax BW, f64.
baracuda_kernels_softmax_backward_f64_strided_can_implement
softmax_backward_f64_strided_can_implement companion.
baracuda_kernels_softmax_backward_f64_strided_run
Softmax BW strided sibling, f64.
baracuda_kernels_softmax_bf16_can_implement
Implementability check for softmax_bf16.
baracuda_kernels_softmax_bf16_run
Softmax FW, bf16.
baracuda_kernels_softmax_bf16_strided_can_implement
softmax_bf16_strided_can_implement companion.
baracuda_kernels_softmax_bf16_strided_run
Softmax FW strided sibling, bf16.
baracuda_kernels_softmax_f16_can_implement
Implementability check for softmax_f16.
baracuda_kernels_softmax_f16_run
Softmax FW, f16. f32 accumulator inside the kernel.
baracuda_kernels_softmax_f16_strided_can_implement
softmax_f16_strided_can_implement companion.
baracuda_kernels_softmax_f16_strided_run
Softmax FW strided sibling, f16.
baracuda_kernels_softmax_f32_can_implement
Implementability check for softmax_f32.
baracuda_kernels_softmax_f32_run
Softmax FW, f32. y[k] = exp(x[k] - max) / Σ exp(x[j] - max) along softmax_axis. Numerically stable.
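Subtracting the row maximum before exponentiating is what keeps the formula above stable for large inputs. A minimal single-row host sketch, with `softmax_ref` as a hypothetical helper name; it does not model the kernel's `softmax_axis` handling:

```rust
/// CPU reference for a numerically stable softmax over one row (hypothetical helper).
fn softmax_ref(x: &[f32]) -> Vec<f32> {
    // Shift by the row maximum so the largest exponent is exp(0) = 1.
    let max = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = x.iter().map(|&v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```

Without the shift, an input such as `[1000.0, 1001.0]` would overflow `exp` in f32; with it, the result is finite and depends only on the differences between entries.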
baracuda_kernels_softmax_f32_strided_can_implement
softmax_f32_strided_can_implement companion.
baracuda_kernels_softmax_f32_strided_run
Softmax FW strided sibling, f32. Same contract as baracuda_kernels_softmax_f32_run; identical underlying launcher.
baracuda_kernels_softmax_f64_can_implement
Implementability check for softmax_f64.
baracuda_kernels_softmax_f64_run
Softmax FW, f64.
baracuda_kernels_softmax_f64_strided_can_implement
softmax_f64_strided_can_implement companion.
baracuda_kernels_softmax_f64_strided_run
Softmax FW strided sibling, f64.
baracuda_kernels_solve_f32_run
Linear-system solve A · X = B via fused getrf + getrs. a_inout is overwritten with packed LU factors; b_inout is overwritten with the solution X. pivots_out is [n] (1-based per LAPACK convention).
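Under the LAPACK convention, entry i of the pivot array records that row i was swapped with row `pivots[i]` (1-based) during factorization, and the swaps apply in order. After copying `pivots_out` back to the host, the swaps can be folded into an explicit 0-based row permutation. The helper name `pivots_to_permutation` and the i32 element type are assumptions for this sketch:

```rust
/// Converts 1-based LAPACK-style pivots into a 0-based row permutation
/// (hypothetical helper). `perm[i]` is the original row that ends up in row i.
fn pivots_to_permutation(pivots: &[i32]) -> Vec<usize> {
    let mut perm: Vec<usize> = (0..pivots.len()).collect();
    for (i, &p) in pivots.iter().enumerate() {
        // Apply the interchanges sequentially, exactly as the factorization did.
        perm.swap(i, (p - 1) as usize);
    }
    perm
}
```

A pivot array of `[1, 2, ..., n]` therefore means no rows were exchanged.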
baracuda_kernels_solve_f32_workspace_size
Solve workspace size — uses the getrf query (cuSOLVER’s getrs is workspace-free).
baracuda_kernels_solve_f64_run
Linear-system solve A · X = B via fused getrf + getrs. a_inout is overwritten with packed LU factors; b_inout is overwritten with the solution X. pivots_out is [n] (1-based per LAPACK convention).
baracuda_kernels_solve_f64_workspace_size
Solve workspace size — uses the getrf query (cuSOLVER’s getrs is workspace-free).
baracuda_kernels_sort_backward_f32_can_implement
Implementability check for sort_backward_f32.
baracuda_kernels_sort_backward_f32_run
Sort BW, f32. dx[indices[i]] = dy[i]; launcher zeros dx first.
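The scatter above routes each sorted-position gradient back to the slot its value came from. A single-row host sketch, assuming i32 saved indices and the hypothetical helper name `sort_backward_ref`:

```rust
/// CPU reference for sort backward on one row (hypothetical helper).
/// `indices[i]` is the original position of the i-th sorted element.
fn sort_backward_ref(dy: &[f32], indices: &[i32], row_len: usize) -> Vec<f32> {
    assert_eq!(dy.len(), indices.len());
    // Zero-initialize first, mirroring what the launcher is documented to do.
    let mut dx = vec![0.0f32; row_len];
    for (i, &src) in indices.iter().enumerate() {
        dx[src as usize] = dy[i];
    }
    dx
}
```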
baracuda_kernels_sort_backward_f64_can_implement
Implementability check for sort_backward_f64.
baracuda_kernels_sort_backward_f64_run
Sort BW, f64.
baracuda_kernels_sort_f32_can_implement
Implementability check for sort_f32.
baracuda_kernels_sort_f32_run
Block-bitonic sort, f32. Emits sorted values + sorted indices (saved-indices contract for BW). descending == 0 selects ascending order.
baracuda_kernels_sort_f64_can_implement
Implementability check for sort_f64.
baracuda_kernels_sort_f64_run
Block-bitonic sort, f64.
baracuda_kernels_sort_i32_can_implement
Implementability check for sort_i32.
baracuda_kernels_sort_i32_run
Block-bitonic sort, i32.
baracuda_kernels_sort_i64_can_implement
Implementability check for sort_i64.
baracuda_kernels_sort_i64_run
Block-bitonic sort, i64.
baracuda_kernels_sparsemax_backward_bf16_can_implement
Implementability check for sparsemax_backward_bf16.
baracuda_kernels_sparsemax_backward_bf16_run
Sparsemax BW, bf16.
baracuda_kernels_sparsemax_backward_f16_can_implement
Implementability check for sparsemax_backward_f16.
baracuda_kernels_sparsemax_backward_f16_run
Sparsemax BW, f16.
baracuda_kernels_sparsemax_backward_f32_can_implement
Implementability check for sparsemax_backward_f32.
baracuda_kernels_sparsemax_backward_f32_run
Sparsemax BW, f32. dx[i] = dy[i] - sum_dy_active / n_active for active positions (y > 0), 0 elsewhere.
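In the formula above, the gradient is mean-centered over the active set and zeroed everywhere else. A single-row host sketch follows; the helper name `sparsemax_backward_ref` is hypothetical, and returning all zeros for an empty active set is a choice made for this sketch, not documented kernel behaviour.

```rust
/// CPU reference for sparsemax backward on one row (hypothetical helper).
/// Active positions are those where the saved forward output y is > 0.
fn sparsemax_backward_ref(y: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(y.len(), dy.len());
    let mut sum_dy_active = 0.0f32;
    let mut n_active = 0usize;
    for (&yi, &dyi) in y.iter().zip(dy) {
        if yi > 0.0 {
            sum_dy_active += dyi;
            n_active += 1;
        }
    }
    if n_active == 0 {
        // Sketch-only guard against dividing by zero.
        return vec![0.0; y.len()];
    }
    let mean = sum_dy_active / n_active as f32;
    y.iter()
        .zip(dy)
        .map(|(&yi, &dyi)| if yi > 0.0 { dyi - mean } else { 0.0 })
        .collect()
}
```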
baracuda_kernels_sparsemax_backward_f64_can_implement
Implementability check for sparsemax_backward_f64.
baracuda_kernels_sparsemax_backward_f64_run
Sparsemax BW, f64.
baracuda_kernels_sparsemax_bf16_can_implement
Implementability check for sparsemax_bf16.
baracuda_kernels_sparsemax_bf16_run
Sparsemax FW, bf16.
baracuda_kernels_sparsemax_f16_can_implement
Implementability check for sparsemax_f16.
baracuda_kernels_sparsemax_f16_run
Sparsemax FW, f16.
baracuda_kernels_sparsemax_f32_can_implement
Implementability check for sparsemax_f32.
baracuda_kernels_sparsemax_f32_run
Sparsemax FW, f32. y = ProjSimplex(x) via threshold τ found after sorting the row descending. Row extent limited to 64.
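One common way to find the threshold τ is the sort-based simplex projection: sort descending, keep the longest prefix whose entries stay above the running threshold, then clip. The sketch below follows that standard formulation on the host. Whether the kernel uses exactly these steps is an assumption, and `sparsemax_ref` is a hypothetical helper name.

```rust
/// CPU reference for sparsemax on one row via the sort-based threshold search
/// (hypothetical helper; the kernel's 64-element row limit is not enforced here).
fn sparsemax_ref(x: &[f32]) -> Vec<f32> {
    let mut z = x.to_vec();
    z.sort_by(|a, b| b.total_cmp(a)); // descending
    let mut cumsum = 0.0f32;
    let mut tau = 0.0f32;
    for (j, &zj) in z.iter().enumerate() {
        cumsum += zj;
        let k = (j + 1) as f32;
        // The support condition holds for a prefix; the last hit fixes tau.
        if 1.0 + k * zj > cumsum {
            tau = (cumsum - 1.0) / k;
        }
    }
    x.iter().map(|&v| (v - tau).max(0.0)).collect()
}
```

Unlike softmax, the output can contain exact zeros: for `[2.0, 1.0, 0.0]` all of the mass goes to the first entry.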
baracuda_kernels_sparsemax_f64_can_implement
Implementability check for sparsemax_f64.
baracuda_kernels_sparsemax_f64_run
Sparsemax FW, f64.
baracuda_kernels_svd_batched_f32_run
Batched Jacobi-SVD on square input. Returns V (not V^T). When jobz == 0, u_out / v_out may be null.
baracuda_kernels_svd_batched_f32_workspace_size
Batched Jacobi-SVD workspace size in bytes.
baracuda_kernels_svd_batched_f64_run
Batched Jacobi-SVD on square input. Returns V (not V^T). When jobz == 0, u_out / v_out may be null.
baracuda_kernels_svd_batched_f64_workspace_size
Batched Jacobi-SVD workspace size in bytes.
baracuda_kernels_svd_f32_run
SVD A = U · diag(S) · V^T. Requires m >= n. a_inout is overwritten by cuSOLVER as scratch.
baracuda_kernels_svd_f32_workspace_size
SVD workspace size in bytes for gesvd.
baracuda_kernels_svd_f64_run
SVD A = U · diag(S) · V^T. Requires m >= n. a_inout is overwritten by cuSOLVER as scratch.
baracuda_kernels_svd_f64_workspace_size
SVD workspace size in bytes for gesvd.
baracuda_kernels_svda_batched_f32_run
Approximate (Jacobi-bidiagonal) batched SVD on rectangular input. Returns V (not V^T). The h_r_nrm_f_out buffer is host-resident and receives per-slot residual Frobenius norms (cuSOLVER signature). Null is nominally accepted to discard the norms, but cuSOLVER may still dereference the pointer, so callers should pass a real buffer of batch_size f64s.
baracuda_kernels_svda_batched_f32_workspace_size
Approximate batched SVD workspace size in bytes.
baracuda_kernels_svda_batched_f64_run
Approximate (Jacobi-bidiagonal) batched SVD on rectangular input. Returns V (not V^T). The h_r_nrm_f_out buffer is host-resident and receives per-slot residual Frobenius norms (cuSOLVER signature). Null is nominally accepted to discard the norms, but cuSOLVER may still dereference the pointer, so callers should pass a real buffer of batch_size f64s.
baracuda_kernels_svda_batched_f64_workspace_size
Approximate batched SVD workspace size in bytes.
baracuda_kernels_ternary_addcdiv_backward_bf16_can_implement
Pre-launch check for ternary_addcdiv_backward_bf16.
baracuda_kernels_ternary_addcdiv_backward_bf16_run
Addcdiv backward, bf16.
baracuda_kernels_ternary_addcdiv_backward_f16_can_implement
Pre-launch check for ternary_addcdiv_backward_f16.
baracuda_kernels_ternary_addcdiv_backward_f16_run
Addcdiv backward, f16.
baracuda_kernels_ternary_addcdiv_backward_f32_can_implement
Pre-launch check for ternary_addcdiv_backward_f32.
baracuda_kernels_ternary_addcdiv_backward_f32_run
Addcdiv backward, f32. Reads desc.scale. Writes da = dy, db = dy*scale/c, dc = -dy*scale*b/c².
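The three gradients above are the partial derivatives of y = a + scale * (b / c) with respect to a, b, and c, each multiplied by dy. A per-element host sketch, with `addcdiv_backward_ref` as a hypothetical helper name:

```rust
/// Per-element CPU reference for addcdiv backward (hypothetical helper).
/// Returns (da, db, dc) for y = a + scale * (b / c).
fn addcdiv_backward_ref(dy: f32, b: f32, c: f32, scale: f32) -> (f32, f32, f32) {
    let da = dy; // dy/da = 1
    let db = dy * scale / c; // dy/db = scale / c
    let dc = -dy * scale * b / (c * c); // dy/dc = -scale * b / c^2
    (da, db, dc)
}
```

Note that da does not depend on a, so only b and c need to be available to the backward pass.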
baracuda_kernels_ternary_addcdiv_backward_f64_can_implement
Pre-launch check for ternary_addcdiv_backward_f64.
baracuda_kernels_ternary_addcdiv_backward_f64_run
Addcdiv backward, f64.
baracuda_kernels_ternary_addcdiv_bf16_can_implement
Pre-launch check for addcdiv_bf16.
baracuda_kernels_ternary_addcdiv_bf16_run
addcdiv, bf16, contig.
baracuda_kernels_ternary_addcdiv_bf16_strided_can_implement
Pre-launch check for addcdiv_bf16_strided.
baracuda_kernels_ternary_addcdiv_bf16_strided_run
addcdiv, bf16, strided.
baracuda_kernels_ternary_addcdiv_f16_can_implement
Pre-launch check for addcdiv_f16.
baracuda_kernels_ternary_addcdiv_f16_run
addcdiv, f16, contig.
baracuda_kernels_ternary_addcdiv_f16_strided_can_implement
Pre-launch check for addcdiv_f16_strided.
baracuda_kernels_ternary_addcdiv_f16_strided_run
addcdiv, f16, strided.
baracuda_kernels_ternary_addcdiv_f32_can_implement
Pre-launch check for addcdiv_f32.
baracuda_kernels_ternary_addcdiv_f32_run
y = a + scale * (b / c), f32, contig.
baracuda_kernels_ternary_addcdiv_f32_strided_can_implement
Pre-launch check for addcdiv_f32_strided.
baracuda_kernels_ternary_addcdiv_f32_strided_run
addcdiv, f32, strided.
baracuda_kernels_ternary_addcdiv_f64_can_implement
Pre-launch check for addcdiv_f64.
baracuda_kernels_ternary_addcdiv_f64_run
addcdiv, f64, contig.
baracuda_kernels_ternary_addcdiv_f64_strided_can_implement
Pre-launch check for addcdiv_f64_strided.
baracuda_kernels_ternary_addcdiv_f64_strided_run
addcdiv, f64, strided.
baracuda_kernels_ternary_addcmul_backward_bf16_can_implement
Pre-launch check for ternary_addcmul_backward_bf16.
baracuda_kernels_ternary_addcmul_backward_bf16_run
Addcmul backward, bf16.
baracuda_kernels_ternary_addcmul_backward_f16_can_implement
Pre-launch check for ternary_addcmul_backward_f16.
baracuda_kernels_ternary_addcmul_backward_f16_run
Addcmul backward, f16.
baracuda_kernels_ternary_addcmul_backward_f32_can_implement
Pre-launch check for ternary_addcmul_backward_f32.
baracuda_kernels_ternary_addcmul_backward_f32_run
Addcmul backward, f32. Reads desc.scale. Writes da = dy, db = dy*scale*c, dc = dy*scale*b.
baracuda_kernels_ternary_addcmul_backward_f64_can_implement
Pre-launch check for ternary_addcmul_backward_f64.
baracuda_kernels_ternary_addcmul_backward_f64_run
Addcmul backward, f64.
baracuda_kernels_ternary_addcmul_bf16_can_implement
Pre-launch check for addcmul_bf16.
baracuda_kernels_ternary_addcmul_bf16_run
addcmul, bf16, contig.
baracuda_kernels_ternary_addcmul_bf16_strided_can_implement
Pre-launch check for addcmul_bf16_strided.
baracuda_kernels_ternary_addcmul_bf16_strided_run
addcmul, bf16, strided.
baracuda_kernels_ternary_addcmul_f16_can_implement
Pre-launch check for addcmul_f16.
baracuda_kernels_ternary_addcmul_f16_run
addcmul, f16, contig.
baracuda_kernels_ternary_addcmul_f16_strided_can_implement
Pre-launch check for addcmul_f16_strided.
baracuda_kernels_ternary_addcmul_f16_strided_run
addcmul, f16, strided.
baracuda_kernels_ternary_addcmul_f32_can_implement
Pre-launch implementability check for addcmul_f32.
baracuda_kernels_ternary_addcmul_f32_run
y = a + scale * b * c, f32, contig fast path.
baracuda_kernels_ternary_addcmul_f32_strided_can_implement
Pre-launch check for addcmul_f32_strided.
baracuda_kernels_ternary_addcmul_f32_strided_run
y = a + scale * b * c, f32, strided / broadcast path.
baracuda_kernels_ternary_addcmul_f64_can_implement
Pre-launch check for addcmul_f64.
baracuda_kernels_ternary_addcmul_f64_run
addcmul, f64, contig.
baracuda_kernels_ternary_addcmul_f64_strided_can_implement
Pre-launch check for addcmul_f64_strided.
baracuda_kernels_ternary_addcmul_f64_strided_run
addcmul, f64, strided.
baracuda_kernels_ternary_clamp_backward_bf16_can_implement
Pre-launch check for ternary_clamp_backward_bf16.
baracuda_kernels_ternary_clamp_backward_bf16_run
Clamp backward, bf16.
baracuda_kernels_ternary_clamp_backward_f16_can_implement
Pre-launch check for ternary_clamp_backward_f16.
baracuda_kernels_ternary_clamp_backward_f16_run
Clamp backward, f16.
baracuda_kernels_ternary_clamp_backward_f32_can_implement
Pre-launch check for ternary_clamp_backward_f32.
baracuda_kernels_ternary_clamp_backward_f32_run
Clamp backward, f32. Writes mask × dy per axis (a/b/c).
baracuda_kernels_ternary_clamp_backward_f64_can_implement
Pre-launch check for ternary_clamp_backward_f64.
baracuda_kernels_ternary_clamp_backward_f64_run
Clamp backward, f64.
baracuda_kernels_ternary_clamp_bf16_can_implement
Pre-launch implementability check for ternary_clamp_bf16.
baracuda_kernels_ternary_clamp_bf16_run
Ternary elementwise clamp, bf16, contig fast path.
baracuda_kernels_ternary_clamp_bf16_strided_can_implement
Pre-launch implementability check for ternary_clamp_bf16_strided.
baracuda_kernels_ternary_clamp_bf16_strided_run
Ternary elementwise clamp, bf16, strided / broadcast path.
baracuda_kernels_ternary_clamp_f16_can_implement
Pre-launch implementability check for ternary_clamp_f16.
baracuda_kernels_ternary_clamp_f16_run
Ternary elementwise clamp, f16, contig fast path.
baracuda_kernels_ternary_clamp_f16_strided_can_implement
Pre-launch implementability check for ternary_clamp_f16_strided.
baracuda_kernels_ternary_clamp_f16_strided_run
Ternary elementwise clamp, f16, strided / broadcast path.
baracuda_kernels_ternary_clamp_f32_can_implement
Pre-launch implementability check for ternary_clamp_f32.
baracuda_kernels_ternary_clamp_f32_run
Ternary elementwise clamp, f32, contig fast path.
baracuda_kernels_ternary_clamp_f32_strided_can_implement
Pre-launch implementability check for ternary_clamp_f32_strided.
baracuda_kernels_ternary_clamp_f32_strided_run
Ternary elementwise clamp, f32, strided / broadcast path. This is the ternary-strided trailblazer — its safety contract (including aliasing) carries over to every ternary strided launcher across all dtypes.
baracuda_kernels_ternary_clamp_f64_can_implement
Pre-launch implementability check for ternary_clamp_f64.
baracuda_kernels_ternary_clamp_f64_run
Ternary elementwise clamp, f64, contig fast path.
baracuda_kernels_ternary_clamp_f64_strided_can_implement
Pre-launch implementability check for ternary_clamp_f64_strided.
baracuda_kernels_ternary_clamp_f64_strided_run
Ternary elementwise clamp, f64, strided / broadcast path.
baracuda_kernels_ternary_fma_backward_bf16_can_implement
Pre-launch check for ternary_fma_backward_bf16.
baracuda_kernels_ternary_fma_backward_bf16_run
Fma backward, bf16.
baracuda_kernels_ternary_fma_backward_f16_can_implement
Pre-launch check for ternary_fma_backward_f16.
baracuda_kernels_ternary_fma_backward_f16_run
Fma backward, f16.
baracuda_kernels_ternary_fma_backward_f32_can_implement
Pre-launch check for ternary_fma_backward_f32.
baracuda_kernels_ternary_fma_backward_f32_run
Fma backward, f32. Writes da = dy*b, db = dy*a, dc = dy.
baracuda_kernels_ternary_fma_backward_f64_can_implement
Pre-launch check for ternary_fma_backward_f64.
baracuda_kernels_ternary_fma_backward_f64_run
Fma backward, f64.
baracuda_kernels_ternary_fma_bf16_can_implement
Pre-launch implementability check for ternary_fma_bf16.
baracuda_kernels_ternary_fma_bf16_run
Ternary elementwise fma, bf16, contig fast path.
baracuda_kernels_ternary_fma_bf16_strided_can_implement
Pre-launch implementability check for ternary_fma_bf16_strided.
baracuda_kernels_ternary_fma_bf16_strided_run
Ternary elementwise fma, bf16, strided / broadcast path.
baracuda_kernels_ternary_fma_f16_can_implement
Pre-launch implementability check for ternary_fma_f16.
baracuda_kernels_ternary_fma_f16_run
Ternary elementwise fma, f16, contig fast path.
baracuda_kernels_ternary_fma_f16_strided_can_implement
Pre-launch implementability check for ternary_fma_f16_strided.
baracuda_kernels_ternary_fma_f16_strided_run
Ternary elementwise fma, f16, strided / broadcast path.
baracuda_kernels_ternary_fma_f32_can_implement
Pre-launch implementability check for ternary_fma_f32.
baracuda_kernels_ternary_fma_f32_run
Ternary elementwise fma, f32, contig fast path.
baracuda_kernels_ternary_fma_f32_strided_can_implement
Pre-launch implementability check for ternary_fma_f32_strided.
baracuda_kernels_ternary_fma_f32_strided_run
Ternary elementwise fma, f32, strided / broadcast path.
baracuda_kernels_ternary_fma_f64_can_implement
Pre-launch implementability check for ternary_fma_f64.
baracuda_kernels_ternary_fma_f64_run
Ternary elementwise fma, f64, contig fast path.
baracuda_kernels_ternary_fma_f64_strided_can_implement
Pre-launch implementability check for ternary_fma_f64_strided.
baracuda_kernels_ternary_fma_f64_strided_run
Ternary elementwise fma, f64, strided / broadcast path.
baracuda_kernels_topk_backward_f32_can_implement
Implementability check for topk_backward_f32.
baracuda_kernels_topk_backward_f32_run
Top-k BW, f32. Scatter k-wide dy into row_len-wide dx (zero-init) via saved indices.
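For a batch of rows, the scatter above writes k gradients per row into an otherwise zero row of width row_len. A host sketch, assuming row-major layouts, i32 saved indices, and the hypothetical helper name `topk_backward_ref`:

```rust
/// CPU reference for batched top-k backward (hypothetical helper).
/// `dy` and `indices` are assumed row-major [rows, k]; the result is [rows, row_len].
fn topk_backward_ref(dy: &[f32], indices: &[i32], rows: usize, k: usize, row_len: usize) -> Vec<f32> {
    assert_eq!(dy.len(), rows * k);
    assert_eq!(indices.len(), rows * k);
    let mut dx = vec![0.0f32; rows * row_len]; // zero-init, as in the description
    for r in 0..rows {
        for j in 0..k {
            let col = indices[r * k + j] as usize;
            dx[r * row_len + col] = dy[r * k + j];
        }
    }
    dx
}
```

Positions that were not selected in the forward pass keep a zero gradient.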
baracuda_kernels_topk_backward_f64_can_implement
Implementability check for topk_backward_f64.
baracuda_kernels_topk_backward_f64_run
Top-k BW, f64.
baracuda_kernels_topk_f32_can_implement
Implementability check for topk_f32.
baracuda_kernels_topk_f32_run
Block-bitonic top-k, f32. Caps k ≤ 64 and row_len ≤ 1024. largest == 1 selects the k largest values; largest == 0 selects the k smallest.
baracuda_kernels_topk_f64_can_implement
Implementability check for topk_f64.
baracuda_kernels_topk_f64_run
Block-bitonic top-k, f64.
baracuda_kernels_trace_bf16_can_implement
Implementability check for trace_bf16.
baracuda_kernels_trace_bf16_run
Trace, bf16 (f32-detour accumulator).
baracuda_kernels_trace_f16_can_implement
Implementability check for trace_f16.
baracuda_kernels_trace_f16_run
Trace, f16 (f32-detour accumulator).
baracuda_kernels_trace_f32_can_implement
Implementability check for trace_f32.
baracuda_kernels_trace_f32_run
Trace of a 2-D square matrix, f32. y[0] = Σ x[i * stride_row + i * stride_col] for i in 0..rows. Output is a single scalar.
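Because the index expression above uses both strides, the same formula covers row-major and column-major storage. A host sketch with element-unit strides, where `trace_ref` is a hypothetical helper name:

```rust
/// CPU reference for a strided trace (hypothetical helper).
/// Strides are in elements, not bytes.
fn trace_ref(x: &[f32], rows: usize, stride_row: usize, stride_col: usize) -> f32 {
    (0..rows).map(|i| x[i * stride_row + i * stride_col]).sum()
}
```

For a 3x3 matrix, `(stride_row, stride_col) = (3, 1)` reads it as row-major and `(1, 3)` as column-major; both visit elements 0, 4, and 8 of the buffer.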
baracuda_kernels_trace_f64_can_implement
Implementability check for trace_f64.
baracuda_kernels_trace_f64_run
Trace, f64.
baracuda_kernels_tril_bf16_can_implement
Implementability check for tril_bf16.
baracuda_kernels_tril_bf16_run
Tril, bf16.
baracuda_kernels_tril_bf16_strided_can_implement
Implementability check for tril_bf16_strided.
baracuda_kernels_tril_bf16_strided_run
Tril strided, bf16.
baracuda_kernels_tril_bool_can_implement
Implementability check for tril_bool.
baracuda_kernels_tril_bool_run
Tril, Bool (storage = u8).
baracuda_kernels_tril_bool_strided_can_implement
Implementability check for tril_bool_strided.
baracuda_kernels_tril_bool_strided_run
Tril strided, Bool (storage = u8).
baracuda_kernels_tril_f16_can_implement
Implementability check for tril_f16.
baracuda_kernels_tril_f16_run
Tril, f16.
baracuda_kernels_tril_f16_strided_can_implement
Implementability check for tril_f16_strided.
baracuda_kernels_tril_f16_strided_run
Tril strided, f16.
baracuda_kernels_tril_f32_can_implement
Implementability check for tril_f32.
baracuda_kernels_tril_f32_run
Tril, f32.
baracuda_kernels_tril_f32_strided_can_implement
Implementability check for tril_f32_strided.
baracuda_kernels_tril_f32_strided_run
Tril strided, f32.
baracuda_kernels_tril_f64_can_implement
Implementability check for tril_f64.
baracuda_kernels_tril_f64_run
Tril, f64.
baracuda_kernels_tril_f64_strided_can_implement
Implementability check for tril_f64_strided.
baracuda_kernels_tril_f64_strided_run
Tril strided, f64.
baracuda_kernels_tril_i32_can_implement
Implementability check for tril_i32.
baracuda_kernels_tril_i32_run
Tril, i32.
baracuda_kernels_tril_i32_strided_can_implement
Implementability check for tril_i32_strided.
baracuda_kernels_tril_i32_strided_run
Tril strided, i32.
baracuda_kernels_tril_i64_can_implement
Implementability check for tril_i64.
baracuda_kernels_tril_i64_run
Tril, i64.
baracuda_kernels_tril_i64_strided_can_implement
Implementability check for tril_i64_strided.
baracuda_kernels_tril_i64_strided_run
Tril strided, i64.
baracuda_kernels_triu_bf16_can_implement
Implementability check for triu_bf16.
baracuda_kernels_triu_bf16_run
Triu, bf16.
baracuda_kernels_triu_bf16_strided_can_implement
Implementability check for triu_bf16_strided.
baracuda_kernels_triu_bf16_strided_run
Triu strided, bf16.
baracuda_kernels_triu_bool_can_implement
Implementability check for triu_bool.
baracuda_kernels_triu_bool_run
Triu, Bool (storage = u8).
baracuda_kernels_triu_bool_strided_can_implement
Implementability check for triu_bool_strided.
baracuda_kernels_triu_bool_strided_run
Triu strided, Bool (storage = u8).
baracuda_kernels_triu_f16_can_implement
Implementability check for triu_f16.
baracuda_kernels_triu_f16_run
Triu, f16.
baracuda_kernels_triu_f16_strided_can_implement
Implementability check for triu_f16_strided.
baracuda_kernels_triu_f16_strided_run
Triu strided, f16.
baracuda_kernels_triu_f32_can_implement
Implementability check for triu_f32.
baracuda_kernels_triu_f32_run
Triu, f32. This is the triu trailblazer — its aliasing contract carries over to every other triu_<dt>_run, triu_<dt>_strided_run, and the sibling tril_* family.
baracuda_kernels_triu_f32_strided_can_implement
Implementability check for triu_f32_strided.
baracuda_kernels_triu_f32_strided_run
Triu strided, f32.
baracuda_kernels_triu_f64_can_implement
Implementability check for triu_f64.
baracuda_kernels_triu_f64_run
Triu, f64.
baracuda_kernels_triu_f64_strided_can_implement
Implementability check for triu_f64_strided.
baracuda_kernels_triu_f64_strided_run
Triu strided, f64.
baracuda_kernels_triu_i32_can_implement
Implementability check for triu_i32.
baracuda_kernels_triu_i32_run
Triu, i32.
baracuda_kernels_triu_i32_strided_can_implement
Implementability check for triu_i32_strided.
baracuda_kernels_triu_i32_strided_run
Triu strided, i32.
baracuda_kernels_triu_i64_can_implement
Implementability check for triu_i64.
baracuda_kernels_triu_i64_run
Triu, i64.
baracuda_kernels_triu_i64_strided_can_implement
Implementability check for triu_i64_strided.
baracuda_kernels_triu_i64_strided_run
Triu strided, i64.
baracuda_kernels_unary_abs_bf16_can_implement
Pre-launch implementability check for unary_abs_bf16.
baracuda_kernels_unary_abs_bf16_run
Unary elementwise abs, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_abs_bf16_strided_can_implement
Pre-launch implementability check for unary_abs_bf16_strided.
baracuda_kernels_unary_abs_bf16_strided_run
Unary elementwise abs, bf16 dtype, strided path.
baracuda_kernels_unary_abs_f16_can_implement
Pre-launch implementability check for unary_abs_f16.
baracuda_kernels_unary_abs_f16_run
Unary elementwise abs, f16 dtype, contiguous fast path.
baracuda_kernels_unary_abs_f16_strided_can_implement
Pre-launch implementability check for unary_abs_f16_strided.
baracuda_kernels_unary_abs_f16_strided_run
Unary elementwise abs, f16 dtype, strided path.
baracuda_kernels_unary_abs_f32_can_implement
Pre-launch implementability check for unary_abs_f32.
baracuda_kernels_unary_abs_f32_run
Unary elementwise abs, f32 dtype, contiguous fast path.
baracuda_kernels_unary_abs_f32_strided_can_implement
Pre-launch implementability check for unary_abs_f32_strided.
baracuda_kernels_unary_abs_f32_strided_run
Unary elementwise abs, f32 dtype, strided path.
baracuda_kernels_unary_abs_f64_can_implement
Pre-launch implementability check for unary_abs_f64.
baracuda_kernels_unary_abs_f64_run
Unary elementwise abs, f64 dtype, contiguous fast path.
baracuda_kernels_unary_abs_f64_strided_can_implement
Pre-launch implementability check for unary_abs_f64_strided.
baracuda_kernels_unary_abs_f64_strided_run
Unary elementwise abs, f64 dtype, strided path.
baracuda_kernels_unary_acos_backward_bf16_can_implement
Pre-launch implementability check for unary_acos_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_acos_backward_bf16_run
Acos backward, bf16.
baracuda_kernels_unary_acos_backward_f16_can_implement
Pre-launch implementability check for unary_acos_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_acos_backward_f16_run
Acos backward, f16.
baracuda_kernels_unary_acos_backward_f32_can_implement
Pre-launch implementability check for unary_acos_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_acos_backward_f32_run
Acos backward, f32. dx = -dy / sqrt(1 - x²). Saved-x. Domain: |x| < 1.
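The domain restriction matters because the denominator vanishes at |x| = 1 and is undefined beyond it. A per-element host sketch of the formula above, with `acos_backward_ref` as a hypothetical helper name. It follows plain IEEE arithmetic outside the domain, which may or may not match what the kernel does there.

```rust
/// Per-element CPU reference for acos backward (hypothetical helper).
/// `x` is the saved forward input; valid for |x| < 1.
fn acos_backward_ref(x: f32, dy: f32) -> f32 {
    -dy / (1.0 - x * x).sqrt()
}
```

At x = 0 the gradient is simply -dy; in this reference, |x| = 1 yields an infinity and |x| > 1 yields NaN.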
baracuda_kernels_unary_acos_backward_f64_can_implement
Pre-launch implementability check for unary_acos_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_acos_backward_f64_run
Acos backward, f64.
baracuda_kernels_unary_acos_bf16_can_implement
Pre-launch implementability check for unary_acos_bf16.
baracuda_kernels_unary_acos_bf16_run
Unary elementwise acos, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_acos_bf16_strided_can_implement
Pre-launch implementability check for unary_acos_bf16_strided.
baracuda_kernels_unary_acos_bf16_strided_run
Unary elementwise acos, bf16 dtype, strided path.
baracuda_kernels_unary_acos_f16_can_implement
Pre-launch implementability check for unary_acos_f16.
baracuda_kernels_unary_acos_f16_run
Unary elementwise acos, f16 dtype, contiguous fast path.
baracuda_kernels_unary_acos_f16_strided_can_implement
Pre-launch implementability check for unary_acos_f16_strided.
baracuda_kernels_unary_acos_f16_strided_run
Unary elementwise acos, f16 dtype, strided path.
baracuda_kernels_unary_acos_f32_can_implement
Pre-launch implementability check for unary_acos_f32.
baracuda_kernels_unary_acos_f32_run
Unary elementwise acos, f32 dtype, contiguous fast path.
baracuda_kernels_unary_acos_f32_strided_can_implement
Pre-launch implementability check for unary_acos_f32_strided.
baracuda_kernels_unary_acos_f32_strided_run
Unary elementwise acos, f32 dtype, strided path.
baracuda_kernels_unary_acos_f64_can_implement
Pre-launch implementability check for unary_acos_f64.
baracuda_kernels_unary_acos_f64_run
Unary elementwise acos, f64 dtype, contiguous fast path.
baracuda_kernels_unary_acos_f64_strided_can_implement
Pre-launch implementability check for unary_acos_f64_strided.
baracuda_kernels_unary_acos_f64_strided_run
Unary elementwise acos, f64 dtype, strided path.
baracuda_kernels_unary_acosh_backward_bf16_can_implement
Pre-launch implementability check for unary_acosh_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_acosh_backward_bf16_run
Acosh backward, bf16.
baracuda_kernels_unary_acosh_backward_f16_can_implement
Pre-launch implementability check for unary_acosh_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_acosh_backward_f16_run
Acosh backward, f16.
baracuda_kernels_unary_acosh_backward_f32_can_implement
Pre-launch implementability check for unary_acosh_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_acosh_backward_f32_run
Acosh backward, f32. dx = dy / sqrt(x² - 1). Saved-x. Domain: x > 1.
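A quick way to validate a gradient formula like the one above is to compare it with a central finite difference of the forward function. The sketch below does this in f64 on the host to keep rounding error small; both helper names are hypothetical and the step size is an arbitrary choice for illustration.

```rust
/// Analytic acosh gradient from the formula above, evaluated in f64 (hypothetical helper).
fn acosh_backward_ref(x: f64, dy: f64) -> f64 {
    dy / (x * x - 1.0).sqrt()
}

/// Central finite difference of acosh at x, scaled by the upstream gradient dy
/// (hypothetical helper). Requires x - h > 1 so both samples stay in the domain.
fn acosh_backward_numeric(x: f64, dy: f64, h: f64) -> f64 {
    dy * ((x + h).acosh() - (x - h).acosh()) / (2.0 * h)
}
```

The two values should agree to several digits for any x comfortably above 1; agreement degrades as x approaches 1, where the derivative grows without bound.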
baracuda_kernels_unary_acosh_backward_f64_can_implement
Pre-launch implementability check for unary_acosh_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_acosh_backward_f64_run
Acosh backward, f64.
baracuda_kernels_unary_acosh_bf16_can_implement
Pre-launch implementability check for unary_acosh_bf16.
baracuda_kernels_unary_acosh_bf16_run
Unary elementwise acosh, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_acosh_bf16_strided_can_implement
Pre-launch implementability check for unary_acosh_bf16_strided.
baracuda_kernels_unary_acosh_bf16_strided_run
Unary elementwise acosh, bf16 dtype, strided path.
baracuda_kernels_unary_acosh_f16_can_implement
Pre-launch implementability check for unary_acosh_f16.
baracuda_kernels_unary_acosh_f16_run
Unary elementwise acosh, f16 dtype, contiguous fast path.
baracuda_kernels_unary_acosh_f16_strided_can_implement
Pre-launch implementability check for unary_acosh_f16_strided.
baracuda_kernels_unary_acosh_f16_strided_run
Unary elementwise acosh, f16 dtype, strided path.
baracuda_kernels_unary_acosh_f32_can_implement
Pre-launch implementability check for unary_acosh_f32.
baracuda_kernels_unary_acosh_f32_run
Unary elementwise acosh, f32 dtype, contiguous fast path.
baracuda_kernels_unary_acosh_f32_strided_can_implement
Pre-launch implementability check for unary_acosh_f32_strided.
baracuda_kernels_unary_acosh_f32_strided_run
Unary elementwise acosh, f32 dtype, strided path.
baracuda_kernels_unary_acosh_f64_can_implement
Pre-launch implementability check for unary_acosh_f64.
baracuda_kernels_unary_acosh_f64_run
Unary elementwise acosh, f64 dtype, contiguous fast path.
baracuda_kernels_unary_acosh_f64_strided_can_implement
Pre-launch implementability check for unary_acosh_f64_strided.
baracuda_kernels_unary_acosh_f64_strided_run
Unary elementwise acosh, f64 dtype, strided path.
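Every `*_can_implement` and `*_run` entry above returns the i32 status described in the crate-level "Status codes" section. A small sketch of decoding that value on the Rust side follows. The `KernelStatus` enum and the `decode_status` and `check` functions are hypothetical names introduced for illustration; only the numeric values come from the crate documentation.

```rust
/// Status values as listed in the crate-level "Status codes" section.
/// The type and variant names are hypothetical; only the numbers are documented.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelStatus {
    Success,           // 0
    MisalignedOperand, // 1
    InvalidProblem,    // 2
    NotSupported,      // 3
    WorkspaceTooSmall, // 4 (too small, or null when required)
    InternalError,     // 5 (typically a launch failure)
    /// Any value outside 0..=5, kept so unexpected codes are not lost.
    Unknown(i32),
}

/// Map a raw i32 status to the enum above.
pub fn decode_status(code: i32) -> KernelStatus {
    match code {
        0 => KernelStatus::Success,
        1 => KernelStatus::MisalignedOperand,
        2 => KernelStatus::InvalidProblem,
        3 => KernelStatus::NotSupported,
        4 => KernelStatus::WorkspaceTooSmall,
        5 => KernelStatus::InternalError,
        other => KernelStatus::Unknown(other),
    }
}

/// Turn a raw status into a Result so callers can use `?`.
pub fn check(code: i32) -> Result<(), KernelStatus> {
    match decode_status(code) {
        KernelStatus::Success => Ok(()),
        other => Err(other),
    }
}
```

A caller might pass the result of a `*_can_implement` call through `check` and only invoke the matching `*_run` entry on `Ok(())`, which keeps invalid shapes from reaching a kernel launch.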
baracuda_kernels_unary_asin_backward_bf16_can_implement
Pre-launch implementability check for unary_asin_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_asin_backward_bf16_run
Asin backward, bf16.
baracuda_kernels_unary_asin_backward_f16_can_implement
Pre-launch implementability check for unary_asin_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_asin_backward_f16_run
Asin backward, f16.
baracuda_kernels_unary_asin_backward_f32_can_implement
Pre-launch implementability check for unary_asin_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_asin_backward_f32_run
Asin backward, f32. dx = dy / sqrt(1 - x²). Saved-x. Domain: |x| < 1.
baracuda_kernels_unary_asin_backward_f64_can_implement
Pre-launch implementability check for unary_asin_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_asin_backward_f64_run
Asin backward, f64.
baracuda_kernels_unary_asin_bf16_can_implement
Pre-launch implementability check for unary_asin_bf16.
baracuda_kernels_unary_asin_bf16_run
Unary elementwise asin, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_asin_bf16_strided_can_implement
Pre-launch implementability check for unary_asin_bf16_strided.
baracuda_kernels_unary_asin_bf16_strided_run
Unary elementwise asin, bf16 dtype, strided path.
baracuda_kernels_unary_asin_f16_can_implement
Pre-launch implementability check for unary_asin_f16.
baracuda_kernels_unary_asin_f16_run
Unary elementwise asin, f16 dtype, contiguous fast path.
baracuda_kernels_unary_asin_f16_strided_can_implement
Pre-launch implementability check for unary_asin_f16_strided.
baracuda_kernels_unary_asin_f16_strided_run
Unary elementwise asin, f16 dtype, strided path.
baracuda_kernels_unary_asin_f32_can_implement
Pre-launch implementability check for unary_asin_f32.
baracuda_kernels_unary_asin_f32_run
Unary elementwise asin, f32 dtype, contiguous fast path.
baracuda_kernels_unary_asin_f32_strided_can_implement
Pre-launch implementability check for unary_asin_f32_strided.
baracuda_kernels_unary_asin_f32_strided_run
Unary elementwise asin, f32 dtype, strided path.
baracuda_kernels_unary_asin_f64_can_implement
Pre-launch implementability check for unary_asin_f64.
baracuda_kernels_unary_asin_f64_run
Unary elementwise asin, f64 dtype, contiguous fast path.
baracuda_kernels_unary_asin_f64_strided_can_implement
Pre-launch implementability check for unary_asin_f64_strided.
baracuda_kernels_unary_asin_f64_strided_run
Unary elementwise asin, f64 dtype, strided path.
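The listing distinguishes a contiguous fast path from a strided path for each dtype, but does not spell out the stride convention. The sketch below shows what a strided elementwise op computes on the host, assuming strides are counted in elements (not bytes), are non-negative, and that the output is written contiguously in row-major order. These are assumptions, and `unary_strided_ref` is a hypothetical helper, not a function of this crate.

```rust
/// Apply `f` to every element of a strided N-d view of `src` and return the
/// results in contiguous row-major order.
/// Assumptions (not confirmed by the crate docs): strides are in elements and
/// non-negative. Hypothetical helper for illustration only.
pub fn unary_strided_ref(
    src: &[f64],
    shape: &[usize],
    strides: &[usize],
    f: impl Fn(f64) -> f64,
) -> Vec<f64> {
    assert_eq!(shape.len(), strides.len(), "shape and strides must have equal rank");
    let numel: usize = shape.iter().product();
    let mut out = Vec::with_capacity(numel);
    let mut idx = vec![0usize; shape.len()];
    for _ in 0..numel {
        // Linear offset of the current multi-index into the source buffer.
        let offset: usize = idx.iter().zip(strides).map(|(i, s)| i * s).sum();
        out.push(f(src[offset]));
        // Advance the multi-index, last dimension fastest.
        for d in (0..shape.len()).rev() {
            idx[d] += 1;
            if idx[d] < shape[d] {
                break;
            }
            idx[d] = 0;
        }
    }
    out
}
```

With `shape = [3, 2]` and `strides = [1, 3]` the helper walks a transposed view of a 2×3 row-major buffer, which is the kind of layout a contiguous fast path cannot handle without a prior copy.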
baracuda_kernels_unary_asinh_backward_bf16_can_implement
Pre-launch implementability check for unary_asinh_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_asinh_backward_bf16_run
Asinh backward, bf16.
baracuda_kernels_unary_asinh_backward_f16_can_implement
Pre-launch implementability check for unary_asinh_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_asinh_backward_f16_run
Asinh backward, f16.
baracuda_kernels_unary_asinh_backward_f32_can_implement
Pre-launch implementability check for unary_asinh_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_asinh_backward_f32_run
Asinh backward, f32. dx = dy / sqrt(1 + x²). Saved-x.
baracuda_kernels_unary_asinh_backward_f64_can_implement
Pre-launch implementability check for unary_asinh_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_asinh_backward_f64_run
Asinh backward, f64.
baracuda_kernels_unary_asinh_bf16_can_implement
Pre-launch implementability check for unary_asinh_bf16.
baracuda_kernels_unary_asinh_bf16_run
Unary elementwise asinh, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_asinh_bf16_strided_can_implement
Pre-launch implementability check for unary_asinh_bf16_strided.
baracuda_kernels_unary_asinh_bf16_strided_run
Unary elementwise asinh, bf16 dtype, strided path.
baracuda_kernels_unary_asinh_f16_can_implement
Pre-launch implementability check for unary_asinh_f16.
baracuda_kernels_unary_asinh_f16_run
Unary elementwise asinh, f16 dtype, contiguous fast path.
baracuda_kernels_unary_asinh_f16_strided_can_implement
Pre-launch implementability check for unary_asinh_f16_strided.
baracuda_kernels_unary_asinh_f16_strided_run
Unary elementwise asinh, f16 dtype, strided path.
baracuda_kernels_unary_asinh_f32_can_implement
Pre-launch implementability check for unary_asinh_f32.
baracuda_kernels_unary_asinh_f32_run
Unary elementwise asinh, f32 dtype, contiguous fast path.
baracuda_kernels_unary_asinh_f32_strided_can_implement
Pre-launch implementability check for unary_asinh_f32_strided.
baracuda_kernels_unary_asinh_f32_strided_run
Unary elementwise asinh, f32 dtype, strided path.
baracuda_kernels_unary_asinh_f64_can_implement
Pre-launch implementability check for unary_asinh_f64.
baracuda_kernels_unary_asinh_f64_run
Unary elementwise asinh, f64 dtype, contiguous fast path.
baracuda_kernels_unary_asinh_f64_strided_can_implement
Pre-launch implementability check for unary_asinh_f64_strided.
baracuda_kernels_unary_asinh_f64_strided_run
Unary elementwise asinh, f64 dtype, strided path.
baracuda_kernels_unary_atan_backward_bf16_can_implement
Pre-launch implementability check for unary_atan_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_atan_backward_bf16_run
Atan backward, bf16.
baracuda_kernels_unary_atan_backward_f16_can_implement
Pre-launch implementability check for unary_atan_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_atan_backward_f16_run
Atan backward, f16.
baracuda_kernels_unary_atan_backward_f32_can_implement
Pre-launch implementability check for unary_atan_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_atan_backward_f32_run
Atan backward, f32. dx = dy / (1 + x²).
baracuda_kernels_unary_atan_backward_f64_can_implement
Pre-launch implementability check for unary_atan_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_atan_backward_f64_run
Atan backward, f64.
baracuda_kernels_unary_atan_bf16_can_implement
Pre-launch implementability check for unary_atan_bf16.
baracuda_kernels_unary_atan_bf16_run
Unary elementwise atan, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_atan_bf16_strided_can_implement
Pre-launch implementability check for unary_atan_bf16_strided.
baracuda_kernels_unary_atan_bf16_strided_run
Unary elementwise atan, bf16 dtype, strided path.
baracuda_kernels_unary_atan_f16_can_implement
Pre-launch implementability check for unary_atan_f16.
baracuda_kernels_unary_atan_f16_run
Unary elementwise atan, f16 dtype, contiguous fast path.
baracuda_kernels_unary_atan_f16_strided_can_implement
Pre-launch implementability check for unary_atan_f16_strided.
baracuda_kernels_unary_atan_f16_strided_run
Unary elementwise atan, f16 dtype, strided path.
baracuda_kernels_unary_atan_f32_can_implement
Pre-launch implementability check for unary_atan_f32.
baracuda_kernels_unary_atan_f32_run
Unary elementwise atan, f32 dtype, contiguous fast path.
baracuda_kernels_unary_atan_f32_strided_can_implement
Pre-launch implementability check for unary_atan_f32_strided.
baracuda_kernels_unary_atan_f32_strided_run
Unary elementwise atan, f32 dtype, strided path.
baracuda_kernels_unary_atan_f64_can_implement
Pre-launch implementability check for unary_atan_f64.
baracuda_kernels_unary_atan_f64_run
Unary elementwise atan, f64 dtype, contiguous fast path.
baracuda_kernels_unary_atan_f64_strided_can_implement
Pre-launch implementability check for unary_atan_f64_strided.
baracuda_kernels_unary_atan_f64_strided_run
Unary elementwise atan, f64 dtype, strided path.
baracuda_kernels_unary_atanh_backward_bf16_can_implement
Pre-launch implementability check for unary_atanh_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_atanh_backward_bf16_run
Atanh backward, bf16.
baracuda_kernels_unary_atanh_backward_f16_can_implement
Pre-launch implementability check for unary_atanh_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_atanh_backward_f16_run
Atanh backward, f16.
baracuda_kernels_unary_atanh_backward_f32_can_implement
Pre-launch implementability check for unary_atanh_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_atanh_backward_f32_run
Atanh backward, f32. dx = dy / (1 - x²). Saved-x. Domain: |x| < 1.
baracuda_kernels_unary_atanh_backward_f64_can_implement
Pre-launch implementability check for unary_atanh_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_atanh_backward_f64_run
Atanh backward, f64.
baracuda_kernels_unary_atanh_bf16_can_implement
Pre-launch implementability check for unary_atanh_bf16.
baracuda_kernels_unary_atanh_bf16_run
Unary elementwise atanh, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_atanh_bf16_strided_can_implement
Pre-launch implementability check for unary_atanh_bf16_strided.
baracuda_kernels_unary_atanh_bf16_strided_run
Unary elementwise atanh, bf16 dtype, strided path.
baracuda_kernels_unary_atanh_f16_can_implement
Pre-launch implementability check for unary_atanh_f16.
baracuda_kernels_unary_atanh_f16_run
Unary elementwise atanh, f16 dtype, contiguous fast path.
baracuda_kernels_unary_atanh_f16_strided_can_implement
Pre-launch implementability check for unary_atanh_f16_strided.
baracuda_kernels_unary_atanh_f16_strided_run
Unary elementwise atanh, f16 dtype, strided path.
baracuda_kernels_unary_atanh_f32_can_implement
Pre-launch implementability check for unary_atanh_f32.
baracuda_kernels_unary_atanh_f32_run
Unary elementwise atanh, f32 dtype, contiguous fast path.
baracuda_kernels_unary_atanh_f32_strided_can_implement
Pre-launch implementability check for unary_atanh_f32_strided.
baracuda_kernels_unary_atanh_f32_strided_run
Unary elementwise atanh, f32 dtype, strided path.
baracuda_kernels_unary_atanh_f64_can_implement
Pre-launch implementability check for unary_atanh_f64.
baracuda_kernels_unary_atanh_f64_run
Unary elementwise atanh, f64 dtype, contiguous fast path.
baracuda_kernels_unary_atanh_f64_strided_can_implement
Pre-launch implementability check for unary_atanh_f64_strided.
baracuda_kernels_unary_atanh_f64_strided_run
Unary elementwise atanh, f64 dtype, strided path.
baracuda_kernels_unary_cbrt_bf16_can_implement
Pre-launch implementability check for unary_cbrt_bf16.
baracuda_kernels_unary_cbrt_bf16_run
Unary elementwise cbrt, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_cbrt_bf16_strided_can_implement
Pre-launch implementability check for unary_cbrt_bf16_strided.
baracuda_kernels_unary_cbrt_bf16_strided_run
Unary elementwise cbrt, bf16 dtype, strided path.
baracuda_kernels_unary_cbrt_f16_can_implement
Pre-launch implementability check for unary_cbrt_f16.
baracuda_kernels_unary_cbrt_f16_run
Unary elementwise cbrt, f16 dtype, contiguous fast path.
baracuda_kernels_unary_cbrt_f16_strided_can_implement
Pre-launch implementability check for unary_cbrt_f16_strided.
baracuda_kernels_unary_cbrt_f16_strided_run
Unary elementwise cbrt, f16 dtype, strided path.
baracuda_kernels_unary_cbrt_f32_can_implement
Pre-launch implementability check for unary_cbrt_f32.
baracuda_kernels_unary_cbrt_f32_run
Unary elementwise cbrt, f32 dtype, contiguous fast path.
baracuda_kernels_unary_cbrt_f32_strided_can_implement
Pre-launch implementability check for unary_cbrt_f32_strided.
baracuda_kernels_unary_cbrt_f32_strided_run
Unary elementwise cbrt, f32 dtype, strided path.
baracuda_kernels_unary_cbrt_f64_can_implement
Pre-launch implementability check for unary_cbrt_f64.
baracuda_kernels_unary_cbrt_f64_run
Unary elementwise cbrt, f64 dtype, contiguous fast path.
baracuda_kernels_unary_cbrt_f64_strided_can_implement
Pre-launch implementability check for unary_cbrt_f64_strided.
baracuda_kernels_unary_cbrt_f64_strided_run
Unary elementwise cbrt, f64 dtype, strided path.
baracuda_kernels_unary_ceil_bf16_can_implement
Pre-launch implementability check for unary_ceil_bf16.
baracuda_kernels_unary_ceil_bf16_run
Unary elementwise ceil, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_ceil_bf16_strided_can_implement
Pre-launch implementability check for unary_ceil_bf16_strided.
baracuda_kernels_unary_ceil_bf16_strided_run
Unary elementwise ceil, bf16 dtype, strided path.
baracuda_kernels_unary_ceil_f16_can_implement
Pre-launch implementability check for unary_ceil_f16.
baracuda_kernels_unary_ceil_f16_run
Unary elementwise ceil, f16 dtype, contiguous fast path.
baracuda_kernels_unary_ceil_f16_strided_can_implement
Pre-launch implementability check for unary_ceil_f16_strided.
baracuda_kernels_unary_ceil_f16_strided_run
Unary elementwise ceil, f16 dtype, strided path.
baracuda_kernels_unary_ceil_f32_can_implement
Pre-launch implementability check for unary_ceil_f32.
baracuda_kernels_unary_ceil_f32_run
Unary elementwise ceil, f32 dtype, contiguous fast path.
baracuda_kernels_unary_ceil_f32_strided_can_implement
Pre-launch implementability check for unary_ceil_f32_strided.
baracuda_kernels_unary_ceil_f32_strided_run
Unary elementwise ceil, f32 dtype, strided path.
baracuda_kernels_unary_ceil_f64_can_implement
Pre-launch implementability check for unary_ceil_f64.
baracuda_kernels_unary_ceil_f64_run
Unary elementwise ceil, f64 dtype, contiguous fast path.
baracuda_kernels_unary_ceil_f64_strided_can_implement
Pre-launch implementability check for unary_ceil_f64_strided.
baracuda_kernels_unary_ceil_f64_strided_run
Unary elementwise ceil, f64 dtype, strided path.
baracuda_kernels_unary_cos_backward_bf16_can_implement
Pre-launch implementability check for unary_cos_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_cos_backward_bf16_run
Cos backward, bf16.
baracuda_kernels_unary_cos_backward_f16_can_implement
Pre-launch implementability check for unary_cos_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_cos_backward_f16_run
Cos backward, f16.
baracuda_kernels_unary_cos_backward_f32_can_implement
Pre-launch implementability check for unary_cos_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_cos_backward_f32_run
Cos backward, f32. dx = -dy * sin(x). Saved-x.
baracuda_kernels_unary_cos_backward_f64_can_implement
Pre-launch implementability check for unary_cos_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_cos_backward_f64_run
Cos backward, f64.
baracuda_kernels_unary_cos_bf16_can_implement
Pre-launch implementability check for unary_cos_bf16.
baracuda_kernels_unary_cos_bf16_run
Unary elementwise cos, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_cos_bf16_strided_can_implement
Pre-launch implementability check for unary_cos_bf16_strided.
baracuda_kernels_unary_cos_bf16_strided_run
Unary elementwise cos, bf16 dtype, strided path.
baracuda_kernels_unary_cos_f16_can_implement
Pre-launch implementability check for unary_cos_f16.
baracuda_kernels_unary_cos_f16_run
Unary elementwise cos, f16 dtype, contiguous fast path.
baracuda_kernels_unary_cos_f16_strided_can_implement
Pre-launch implementability check for unary_cos_f16_strided.
baracuda_kernels_unary_cos_f16_strided_run
Unary elementwise cos, f16 dtype, strided path.
baracuda_kernels_unary_cos_f32_can_implement
Pre-launch implementability check for unary_cos_f32.
baracuda_kernels_unary_cos_f32_run
Unary elementwise cos, f32 dtype, contiguous fast path.
baracuda_kernels_unary_cos_f32_strided_can_implement
Pre-launch implementability check for unary_cos_f32_strided.
baracuda_kernels_unary_cos_f32_strided_run
Unary elementwise cos, f32 dtype, strided path.
baracuda_kernels_unary_cos_f64_can_implement
Pre-launch implementability check for unary_cos_f64.
baracuda_kernels_unary_cos_f64_run
Unary elementwise cos, f64 dtype, contiguous fast path.
baracuda_kernels_unary_cos_f64_strided_can_implement
Pre-launch implementability check for unary_cos_f64_strided.
baracuda_kernels_unary_cos_f64_strided_run
Unary elementwise cos, f64 dtype, strided path.
baracuda_kernels_unary_cosh_backward_bf16_can_implement
Pre-launch implementability check for unary_cosh_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_cosh_backward_bf16_run
Cosh backward, bf16.
baracuda_kernels_unary_cosh_backward_f16_can_implement
Pre-launch implementability check for unary_cosh_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_cosh_backward_f16_run
Cosh backward, f16.
baracuda_kernels_unary_cosh_backward_f32_can_implement
Pre-launch implementability check for unary_cosh_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_cosh_backward_f32_run
Cosh backward, f32. dx = dy * sinh(x). Saved-x.
baracuda_kernels_unary_cosh_backward_f64_can_implement
Pre-launch implementability check for unary_cosh_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_cosh_backward_f64_run
Cosh backward, f64.
baracuda_kernels_unary_cosh_bf16_can_implement
Pre-launch implementability check for unary_cosh_bf16.
baracuda_kernels_unary_cosh_bf16_run
Unary elementwise cosh, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_cosh_bf16_strided_can_implement
Pre-launch implementability check for unary_cosh_bf16_strided.
baracuda_kernels_unary_cosh_bf16_strided_run
Unary elementwise cosh, bf16 dtype, strided path.
baracuda_kernels_unary_cosh_f16_can_implement
Pre-launch implementability check for unary_cosh_f16.
baracuda_kernels_unary_cosh_f16_run
Unary elementwise cosh, f16 dtype, contiguous fast path.
baracuda_kernels_unary_cosh_f16_strided_can_implement
Pre-launch implementability check for unary_cosh_f16_strided.
baracuda_kernels_unary_cosh_f16_strided_run
Unary elementwise cosh, f16 dtype, strided path.
baracuda_kernels_unary_cosh_f32_can_implement
Pre-launch implementability check for unary_cosh_f32.
baracuda_kernels_unary_cosh_f32_run
Unary elementwise cosh, f32 dtype, contiguous fast path.
baracuda_kernels_unary_cosh_f32_strided_can_implement
Pre-launch implementability check for unary_cosh_f32_strided.
baracuda_kernels_unary_cosh_f32_strided_run
Unary elementwise cosh, f32 dtype, strided path.
baracuda_kernels_unary_cosh_f64_can_implement
Pre-launch implementability check for unary_cosh_f64.
baracuda_kernels_unary_cosh_f64_run
Unary elementwise cosh, f64 dtype, contiguous fast path.
baracuda_kernels_unary_cosh_f64_strided_can_implement
Pre-launch implementability check for unary_cosh_f64_strided.
baracuda_kernels_unary_cosh_f64_strided_run
Unary elementwise cosh, f64 dtype, strided path.
baracuda_kernels_unary_cube_backward_bf16_can_implement
Pre-launch implementability check for unary_cube_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_cube_backward_bf16_run
Cube backward, bf16.
baracuda_kernels_unary_cube_backward_f16_can_implement
Pre-launch implementability check for unary_cube_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_cube_backward_f16_run
Cube backward, f16.
baracuda_kernels_unary_cube_backward_f32_can_implement
Pre-launch implementability check for unary_cube_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_cube_backward_f32_run
Cube backward, f32. dx = dy * 3 * x².
baracuda_kernels_unary_cube_backward_f64_can_implement
Pre-launch implementability check for unary_cube_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_cube_backward_f64_run
Cube backward, f64.
baracuda_kernels_unary_cube_bf16_can_implement
Pre-launch implementability check for unary_cube_bf16.
baracuda_kernels_unary_cube_bf16_run
Unary elementwise cube, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_cube_bf16_strided_can_implement
Pre-launch implementability check for unary_cube_bf16_strided.
baracuda_kernels_unary_cube_bf16_strided_run
Unary elementwise cube, bf16 dtype, strided path.
baracuda_kernels_unary_cube_f16_can_implement
Pre-launch implementability check for unary_cube_f16.
baracuda_kernels_unary_cube_f16_run
Unary elementwise cube, f16 dtype, contiguous fast path.
baracuda_kernels_unary_cube_f16_strided_can_implement
Pre-launch implementability check for unary_cube_f16_strided.
baracuda_kernels_unary_cube_f16_strided_run
Unary elementwise cube, f16 dtype, strided path.
baracuda_kernels_unary_cube_f32_can_implement
Pre-launch implementability check for unary_cube_f32.
baracuda_kernels_unary_cube_f32_run
Unary elementwise cube, f32 dtype, contiguous fast path.
baracuda_kernels_unary_cube_f32_strided_can_implement
Pre-launch implementability check for unary_cube_f32_strided.
baracuda_kernels_unary_cube_f32_strided_run
Unary elementwise cube, f32 dtype, strided path.
baracuda_kernels_unary_cube_f64_can_implement
Pre-launch implementability check for unary_cube_f64.
baracuda_kernels_unary_cube_f64_run
Unary elementwise cube, f64 dtype, contiguous fast path.
baracuda_kernels_unary_cube_f64_strided_can_implement
Pre-launch implementability check for unary_cube_f64_strided.
baracuda_kernels_unary_cube_f64_strided_run
Unary elementwise cube, f64 dtype, strided path.
baracuda_kernels_unary_elu_backward_bf16_can_implement
Pre-launch implementability check for unary_elu_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_elu_backward_bf16_run
ELU backward, bf16.
baracuda_kernels_unary_elu_backward_f16_can_implement
Pre-launch implementability check for unary_elu_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_elu_backward_f16_run
ELU backward, f16.
baracuda_kernels_unary_elu_backward_f32_can_implement
Pre-launch implementability check for unary_elu_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_elu_backward_f32_run
ELU backward, f32. dx = (x > 0) ? dy : dy·α·exp(x) with α=1.0. Saved-x.
baracuda_kernels_unary_elu_backward_f64_can_implement
Pre-launch implementability check for unary_elu_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_elu_backward_f64_run
ELU backward, f64.
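The ELU backward entries above document dx = (x > 0) ? dy : dy·α·exp(x) with α = 1.0 and the forward input x saved, and the forward entries document elu(x; α) = x if x > 0 else α·(exp(x) - 1). A host-side sketch of both follows. `elu_ref` and `elu_backward_ref` are hypothetical helpers, and α is exposed as a parameter here purely for illustration.

```rust
/// CPU reference for elu(x; alpha) = x if x > 0 else alpha * (exp(x) - 1).
/// Hypothetical helper; not part of baracuda-kernels-sys.
pub fn elu_ref(x: f32, alpha: f32) -> f32 {
    if x > 0.0 {
        x
    } else {
        alpha * (x.exp() - 1.0)
    }
}

/// CPU reference for the documented backward formula:
/// dx = dy if x > 0, otherwise dy * alpha * exp(x).
/// The listing states alpha = 1.0 for the backward kernels.
pub fn elu_backward_ref(dy: &[f32], x: &[f32], alpha: f32) -> Vec<f32> {
    assert_eq!(dy.len(), x.len(), "dy and x must have the same length");
    dy.iter()
        .zip(x)
        .map(|(&g, &v)| if v > 0.0 { g } else { g * alpha * v.exp() })
        .collect()
}
```

Note that x = 0 takes the exponential branch in both functions, matching the strict `x > 0` comparison in the documented formulas.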
baracuda_kernels_unary_elu_bf16_can_implement
Pre-launch implementability check for unary_elu_bf16.
baracuda_kernels_unary_elu_bf16_run
Unary elementwise elu(x; α), bf16 dtype, contiguous fast path.
baracuda_kernels_unary_elu_bf16_strided_can_implement
Pre-launch implementability check for unary_elu_bf16_strided. Host-side validation; no kernel launch.
baracuda_kernels_unary_elu_bf16_strided_run
Unary elementwise elu(x; α), bf16 dtype, strided path.
baracuda_kernels_unary_elu_f16_can_implement
Pre-launch implementability check for unary_elu_f16.
baracuda_kernels_unary_elu_f16_run
Unary elementwise elu(x; α), f16 dtype, contiguous fast path.
baracuda_kernels_unary_elu_f16_strided_can_implement
Pre-launch implementability check for unary_elu_f16_strided. Host-side validation; no kernel launch.
baracuda_kernels_unary_elu_f16_strided_run
Unary elementwise elu(x; α), f16 dtype, strided path.
baracuda_kernels_unary_elu_f32_can_implement
Pre-launch implementability check for unary_elu_f32.
baracuda_kernels_unary_elu_f32_run
Unary elementwise elu(x; α) = x if x > 0 else α·(exp(x) - 1), f32 dtype, contiguous fast path.
baracuda_kernels_unary_elu_f32_strided_can_implement
Pre-launch implementability check for unary_elu_f32_strided. Host-side validation; no kernel launch.
baracuda_kernels_unary_elu_f32_strided_run
Unary elementwise elu(x; α), f32 dtype, strided path.
baracuda_kernels_unary_elu_f64_can_implement
Pre-launch implementability check for unary_elu_f64.
baracuda_kernels_unary_elu_f64_run
Unary elementwise elu(x; α), f64 dtype, contiguous fast path.
baracuda_kernels_unary_elu_f64_strided_can_implement
Pre-launch implementability check for unary_elu_f64_strided. Host-side validation; no kernel launch.
baracuda_kernels_unary_elu_f64_strided_run
Unary elementwise elu(x; α), f64 dtype, strided path.
baracuda_kernels_unary_erf_backward_bf16_can_implement
Pre-launch implementability check for unary_erf_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_erf_backward_bf16_run
Erf backward, bf16.
baracuda_kernels_unary_erf_backward_f16_can_implement
Pre-launch implementability check for unary_erf_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_erf_backward_f16_run
Erf backward, f16.
baracuda_kernels_unary_erf_backward_f32_can_implement
Pre-launch implementability check for unary_erf_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_erf_backward_f32_run
Erf backward, f32. dx = dy * (2/√π) * exp(-x²).
baracuda_kernels_unary_erf_backward_f64_can_implement
Pre-launch implementability check for unary_erf_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_erf_backward_f64_run
Erf backward, f64.
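The erf backward entries above document dx = dy * (2/√π) * exp(-x²). Rust's standard library has no `erf`, but the derivative needs only `exp` and the constant `FRAC_2_SQRT_PI`, so a host-side reference is short. The helpers `erf_backward_ref` and `erfc_backward_ref` below are hypothetical and not part of this crate. The second one applies the sign flip that the listing gives for erfc backward.

```rust
use std::f64::consts::FRAC_2_SQRT_PI; // 2 / sqrt(pi)

/// CPU reference for the documented formula dx = dy * (2/sqrt(pi)) * exp(-x^2).
/// Hypothetical helper; not part of baracuda-kernels-sys.
pub fn erf_backward_ref(dy: &[f64], x: &[f64]) -> Vec<f64> {
    assert_eq!(dy.len(), x.len(), "dy and x must have the same length");
    dy.iter()
        .zip(x)
        .map(|(&g, &v)| g * FRAC_2_SQRT_PI * (-v * v).exp())
        .collect()
}

/// Same expression with the sign flipped, as listed for erfc backward.
pub fn erfc_backward_ref(dy: &[f64], x: &[f64]) -> Vec<f64> {
    erf_backward_ref(dy, x).into_iter().map(|d| -d).collect()
}
```

At x = 0 the derivative reduces to dy * 2/√π, which makes a convenient spot check.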
baracuda_kernels_unary_erf_bf16_can_implement
Pre-launch implementability check for unary_erf_bf16.
baracuda_kernels_unary_erf_bf16_run
Unary elementwise erf, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_erf_bf16_strided_can_implement
Pre-launch implementability check for unary_erf_bf16_strided.
baracuda_kernels_unary_erf_bf16_strided_run
Unary elementwise erf, bf16 dtype, strided path.
baracuda_kernels_unary_erf_f16_can_implement
Pre-launch implementability check for unary_erf_f16.
baracuda_kernels_unary_erf_f16_run
Unary elementwise erf, f16 dtype, contiguous fast path.
baracuda_kernels_unary_erf_f16_strided_can_implement
Pre-launch implementability check for unary_erf_f16_strided.
baracuda_kernels_unary_erf_f16_strided_run
Unary elementwise erf, f16 dtype, strided path.
baracuda_kernels_unary_erf_f32_can_implement
Pre-launch implementability check for unary_erf_f32.
baracuda_kernels_unary_erf_f32_run
Unary elementwise erf, f32 dtype, contiguous fast path.
baracuda_kernels_unary_erf_f32_strided_can_implement
Pre-launch implementability check for unary_erf_f32_strided.
baracuda_kernels_unary_erf_f32_strided_run
Unary elementwise erf, f32 dtype, strided path.
baracuda_kernels_unary_erf_f64_can_implement
Pre-launch implementability check for unary_erf_f64.
baracuda_kernels_unary_erf_f64_run
Unary elementwise erf, f64 dtype, contiguous fast path.
baracuda_kernels_unary_erf_f64_strided_can_implement
Pre-launch implementability check for unary_erf_f64_strided.
baracuda_kernels_unary_erf_f64_strided_run
Unary elementwise erf, f64 dtype, strided path.
baracuda_kernels_unary_erfc_backward_bf16_can_implement
Pre-launch implementability check for unary_erfc_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_erfc_backward_bf16_run
Erfc backward, bf16.
baracuda_kernels_unary_erfc_backward_f16_can_implement
Pre-launch implementability check for unary_erfc_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_erfc_backward_f16_run
Erfc backward, f16.
baracuda_kernels_unary_erfc_backward_f32_can_implement
Pre-launch implementability check for unary_erfc_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_erfc_backward_f32_run
Erfc backward, f32. dx = -dy * (2/√π) * exp(-x²).
baracuda_kernels_unary_erfc_backward_f64_can_implement
Pre-launch implementability check for unary_erfc_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_erfc_backward_f64_run
Erfc backward, f64.
baracuda_kernels_unary_erfc_bf16_can_implement
Pre-launch implementability check for unary_erfc_bf16.
baracuda_kernels_unary_erfc_bf16_run
Unary elementwise erfc, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_erfc_bf16_strided_can_implement
Pre-launch implementability check for unary_erfc_bf16_strided.
baracuda_kernels_unary_erfc_bf16_strided_run
Unary elementwise erfc, bf16 dtype, strided path.
baracuda_kernels_unary_erfc_f16_can_implement
Pre-launch implementability check for unary_erfc_f16.
baracuda_kernels_unary_erfc_f16_run
Unary elementwise erfc, f16 dtype, contiguous fast path.
baracuda_kernels_unary_erfc_f16_strided_can_implement
Pre-launch implementability check for unary_erfc_f16_strided.
baracuda_kernels_unary_erfc_f16_strided_run
Unary elementwise erfc, f16 dtype, strided path.
baracuda_kernels_unary_erfc_f32_can_implement
Pre-launch implementability check for unary_erfc_f32.
baracuda_kernels_unary_erfc_f32_run
Unary elementwise erfc, f32 dtype, contiguous fast path.
baracuda_kernels_unary_erfc_f32_strided_can_implement
Pre-launch implementability check for unary_erfc_f32_strided.
baracuda_kernels_unary_erfc_f32_strided_run
Unary elementwise erfc, f32 dtype, strided path.
baracuda_kernels_unary_erfc_f64_can_implement
Pre-launch implementability check for unary_erfc_f64.
baracuda_kernels_unary_erfc_f64_run
Unary elementwise erfc, f64 dtype, contiguous fast path.
baracuda_kernels_unary_erfc_f64_strided_can_implement
Pre-launch implementability check for unary_erfc_f64_strided.
baracuda_kernels_unary_erfc_f64_strided_run
Unary elementwise erfc, f64 dtype, strided path.
baracuda_kernels_unary_exp2_backward_bf16_can_implement
Pre-launch implementability check for unary_exp2_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_exp2_backward_bf16_run
Exp2 backward, bf16.
baracuda_kernels_unary_exp2_backward_f16_can_implement
Pre-launch implementability check for unary_exp2_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_exp2_backward_f16_run
Exp2 backward, f16.
baracuda_kernels_unary_exp2_backward_f32_can_implement
Pre-launch implementability check for unary_exp2_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_exp2_backward_f32_run
Exp2 backward, f32. dx = dy * y * ln(2), where y = 2^x is the forward output.
baracuda_kernels_unary_exp2_backward_f64_can_implement
Pre-launch implementability check for unary_exp2_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_exp2_backward_f64_run
Exp2 backward, f64.
baracuda_kernels_unary_exp2_bf16_can_implement
Pre-launch implementability check for unary_exp2_bf16.
baracuda_kernels_unary_exp2_bf16_run
Unary elementwise exp2, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_exp2_bf16_strided_can_implement
Pre-launch implementability check for unary_exp2_bf16_strided.
baracuda_kernels_unary_exp2_bf16_strided_run
Unary elementwise exp2, bf16 dtype, strided path.
baracuda_kernels_unary_exp2_f16_can_implement
Pre-launch implementability check for unary_exp2_f16.
baracuda_kernels_unary_exp2_f16_run
Unary elementwise exp2, f16 dtype, contiguous fast path.
baracuda_kernels_unary_exp2_f16_strided_can_implement
Pre-launch implementability check for unary_exp2_f16_strided.
baracuda_kernels_unary_exp2_f16_strided_run
Unary elementwise exp2, f16 dtype, strided path.
baracuda_kernels_unary_exp2_f32_can_implement
Pre-launch implementability check for unary_exp2_f32.
baracuda_kernels_unary_exp2_f32_run
Unary elementwise exp2, f32 dtype, contiguous fast path.
baracuda_kernels_unary_exp2_f32_strided_can_implement
Pre-launch implementability check for unary_exp2_f32_strided.
baracuda_kernels_unary_exp2_f32_strided_run
Unary elementwise exp2, f32 dtype, strided path.
baracuda_kernels_unary_exp2_f64_can_implement
Pre-launch implementability check for unary_exp2_f64.
baracuda_kernels_unary_exp2_f64_run
Unary elementwise exp2, f64 dtype, contiguous fast path.
baracuda_kernels_unary_exp2_f64_strided_can_implement
Pre-launch implementability check for unary_exp2_f64_strided.
baracuda_kernels_unary_exp2_f64_strided_run
Unary elementwise exp2, f64 dtype, strided path.
baracuda_kernels_unary_exp_backward_bf16_can_implement
Pre-launch implementability check for unary_exp_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_exp_backward_bf16_run
Exp backward, bf16.
baracuda_kernels_unary_exp_backward_f16_can_implement
Pre-launch implementability check for unary_exp_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_exp_backward_f16_run
Exp backward, f16.
baracuda_kernels_unary_exp_backward_f32_can_implement
Pre-launch implementability check for unary_exp_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_exp_backward_f32_run
Exp backward, f32. dx = dy * y. Caller must pass the saved forward output y.
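A host-side reference for the `dx = dy * y` rule can be useful when checking kernel output. This is a sketch only; `exp_backward_ref` is a hypothetical helper, not a function exported by this crate.

```rust
/// Hypothetical host reference for exp backward: dx[i] = dy[i] * y[i],
/// where `y` is the saved forward output exp(x).
pub fn exp_backward_ref(dy: &[f32], y: &[f32]) -> Vec<f32> {
    assert_eq!(dy.len(), y.len(), "dy and y must have the same length");
    dy.iter().zip(y).map(|(&g, &out)| g * out).collect()
}
```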
baracuda_kernels_unary_exp_backward_f64_can_implement
Pre-launch implementability check for unary_exp_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_exp_backward_f64_run
Exp backward, f64.
baracuda_kernels_unary_exp_bf16_can_implement
Pre-launch implementability check for unary_exp_bf16.
baracuda_kernels_unary_exp_bf16_run
Unary elementwise exp, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_exp_bf16_strided_can_implement
Pre-launch implementability check for unary_exp_bf16_strided.
baracuda_kernels_unary_exp_bf16_strided_run
Unary elementwise exp, bf16 dtype, strided path.
baracuda_kernels_unary_exp_f16_can_implement
Pre-launch implementability check for unary_exp_f16.
baracuda_kernels_unary_exp_f16_run
Unary elementwise exp, f16 dtype, contiguous fast path.
baracuda_kernels_unary_exp_f16_strided_can_implement
Pre-launch implementability check for unary_exp_f16_strided.
baracuda_kernels_unary_exp_f16_strided_run
Unary elementwise exp, f16 dtype, strided path.
baracuda_kernels_unary_exp_f32_can_implement
Pre-launch implementability check for unary_exp_f32.
baracuda_kernels_unary_exp_f32_run
Unary elementwise exp, f32 dtype, contiguous fast path.
baracuda_kernels_unary_exp_f32_strided_can_implement
Pre-launch implementability check for unary_exp_f32_strided.
baracuda_kernels_unary_exp_f32_strided_run
Unary elementwise exp, f32 dtype, strided path.
baracuda_kernels_unary_exp_f64_can_implement
Pre-launch implementability check for unary_exp_f64.
baracuda_kernels_unary_exp_f64_run
Unary elementwise exp, f64 dtype, contiguous fast path.
baracuda_kernels_unary_exp_f64_strided_can_implement
Pre-launch implementability check for unary_exp_f64_strided.
baracuda_kernels_unary_exp_f64_strided_run
Unary elementwise exp, f64 dtype, strided path.
baracuda_kernels_unary_expm1_backward_bf16_can_implement
Pre-launch implementability check for unary_expm1_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_expm1_backward_bf16_run
Expm1 backward, bf16.
baracuda_kernels_unary_expm1_backward_f16_can_implement
Pre-launch implementability check for unary_expm1_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_expm1_backward_f16_run
Expm1 backward, f16.
baracuda_kernels_unary_expm1_backward_f32_can_implement
Pre-launch implementability check for unary_expm1_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_expm1_backward_f32_run
Expm1 backward, f32. dx = dy * (y + 1). Saved-y.
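Because `y = exp(x) - 1`, the derivative `exp(x)` equals `y + 1`, which is why the saved output suffices. A hypothetical host-side reference (the name `expm1_backward_ref` is an assumption for illustration):

```rust
/// Hypothetical host reference for expm1 backward: dx[i] = dy[i] * (y[i] + 1),
/// where `y` is the saved forward output expm1(x).
pub fn expm1_backward_ref(dy: &[f32], y: &[f32]) -> Vec<f32> {
    assert_eq!(dy.len(), y.len(), "dy and y must have the same length");
    dy.iter().zip(y).map(|(&g, &out)| g * (out + 1.0)).collect()
}
```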
baracuda_kernels_unary_expm1_backward_f64_can_implement
Pre-launch implementability check for unary_expm1_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_expm1_backward_f64_run
Expm1 backward, f64.
baracuda_kernels_unary_expm1_bf16_can_implement
Pre-launch implementability check for unary_expm1_bf16.
baracuda_kernels_unary_expm1_bf16_run
Unary elementwise expm1, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_expm1_bf16_strided_can_implement
Pre-launch implementability check for unary_expm1_bf16_strided.
baracuda_kernels_unary_expm1_bf16_strided_run
Unary elementwise expm1, bf16 dtype, strided path.
baracuda_kernels_unary_expm1_f16_can_implement
Pre-launch implementability check for unary_expm1_f16.
baracuda_kernels_unary_expm1_f16_run
Unary elementwise expm1, f16 dtype, contiguous fast path.
baracuda_kernels_unary_expm1_f16_strided_can_implement
Pre-launch implementability check for unary_expm1_f16_strided.
baracuda_kernels_unary_expm1_f16_strided_run
Unary elementwise expm1, f16 dtype, strided path.
baracuda_kernels_unary_expm1_f32_can_implement
Pre-launch implementability check for unary_expm1_f32.
baracuda_kernels_unary_expm1_f32_run
Unary elementwise expm1, f32 dtype, contiguous fast path.
baracuda_kernels_unary_expm1_f32_strided_can_implement
Pre-launch implementability check for unary_expm1_f32_strided.
baracuda_kernels_unary_expm1_f32_strided_run
Unary elementwise expm1, f32 dtype, strided path.
baracuda_kernels_unary_expm1_f64_can_implement
Pre-launch implementability check for unary_expm1_f64.
baracuda_kernels_unary_expm1_f64_run
Unary elementwise expm1, f64 dtype, contiguous fast path.
baracuda_kernels_unary_expm1_f64_strided_can_implement
Pre-launch implementability check for unary_expm1_f64_strided.
baracuda_kernels_unary_expm1_f64_strided_run
Unary elementwise expm1, f64 dtype, strided path.
baracuda_kernels_unary_floor_bf16_can_implement
Pre-launch implementability check for unary_floor_bf16.
baracuda_kernels_unary_floor_bf16_run
Unary elementwise floor, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_floor_bf16_strided_can_implement
Pre-launch implementability check for unary_floor_bf16_strided.
baracuda_kernels_unary_floor_bf16_strided_run
Unary elementwise floor, bf16 dtype, strided path.
baracuda_kernels_unary_floor_f16_can_implement
Pre-launch implementability check for unary_floor_f16.
baracuda_kernels_unary_floor_f16_run
Unary elementwise floor, f16 dtype, contiguous fast path.
baracuda_kernels_unary_floor_f16_strided_can_implement
Pre-launch implementability check for unary_floor_f16_strided.
baracuda_kernels_unary_floor_f16_strided_run
Unary elementwise floor, f16 dtype, strided path.
baracuda_kernels_unary_floor_f32_can_implement
Pre-launch implementability check for unary_floor_f32.
baracuda_kernels_unary_floor_f32_run
Unary elementwise floor, f32 dtype, contiguous fast path.
baracuda_kernels_unary_floor_f32_strided_can_implement
Pre-launch implementability check for unary_floor_f32_strided.
baracuda_kernels_unary_floor_f32_strided_run
Unary elementwise floor, f32 dtype, strided path.
baracuda_kernels_unary_floor_f64_can_implement
Pre-launch implementability check for unary_floor_f64.
baracuda_kernels_unary_floor_f64_run
Unary elementwise floor, f64 dtype, contiguous fast path.
baracuda_kernels_unary_floor_f64_strided_can_implement
Pre-launch implementability check for unary_floor_f64_strided.
baracuda_kernels_unary_floor_f64_strided_run
Unary elementwise floor, f64 dtype, strided path.
baracuda_kernels_unary_frac_bf16_can_implement
Pre-launch implementability check for unary_frac_bf16.
baracuda_kernels_unary_frac_bf16_run
Unary elementwise frac, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_frac_bf16_strided_can_implement
Pre-launch implementability check for unary_frac_bf16_strided.
baracuda_kernels_unary_frac_bf16_strided_run
Unary elementwise frac, bf16 dtype, strided path.
baracuda_kernels_unary_frac_f16_can_implement
Pre-launch implementability check for unary_frac_f16.
baracuda_kernels_unary_frac_f16_run
Unary elementwise frac, f16 dtype, contiguous fast path.
baracuda_kernels_unary_frac_f16_strided_can_implement
Pre-launch implementability check for unary_frac_f16_strided.
baracuda_kernels_unary_frac_f16_strided_run
Unary elementwise frac, f16 dtype, strided path.
baracuda_kernels_unary_frac_f32_can_implement
Pre-launch implementability check for unary_frac_f32.
baracuda_kernels_unary_frac_f32_run
Unary elementwise frac, f32 dtype, contiguous fast path.
baracuda_kernels_unary_frac_f32_strided_can_implement
Pre-launch implementability check for unary_frac_f32_strided.
baracuda_kernels_unary_frac_f32_strided_run
Unary elementwise frac, f32 dtype, strided path.
baracuda_kernels_unary_frac_f64_can_implement
Pre-launch implementability check for unary_frac_f64.
baracuda_kernels_unary_frac_f64_run
Unary elementwise frac, f64 dtype, contiguous fast path.
baracuda_kernels_unary_frac_f64_strided_can_implement
Pre-launch implementability check for unary_frac_f64_strided.
baracuda_kernels_unary_frac_f64_strided_run
Unary elementwise frac, f64 dtype, strided path.
baracuda_kernels_unary_gelu_backward_bf16_can_implement
Pre-launch implementability check for unary_gelu_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_gelu_backward_bf16_run
GELU (erf-based) backward, bf16.
baracuda_kernels_unary_gelu_backward_f16_can_implement
Pre-launch implementability check for unary_gelu_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_gelu_backward_f16_run
GELU (erf-based) backward, f16.
baracuda_kernels_unary_gelu_backward_f32_can_implement
Pre-launch implementability check for unary_gelu_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_gelu_backward_f32_run
GELU (exact / erf-based) backward, f32. dx = dy * (Φ(x) + x*φ(x)). Saved-x.
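In the formula above, Φ is the standard normal CDF and φ its density. The sketch below evaluates it on the host in f64. Stable Rust has no `erf` in the standard library, so the sketch substitutes the Abramowitz and Stegun 7.1.26 polynomial approximation (absolute error about 1.5e-7); both function names are hypothetical and the kernel's own erf implementation may differ.

```rust
/// Abramowitz-Stegun 7.1.26 approximation of erf (stand-in, not the kernel's).
fn erf_approx(x: f64) -> f64 {
    let sign = if x < 0.0 { -1.0 } else { 1.0 };
    let x = x.abs();
    let t = 1.0 / (1.0 + 0.3275911 * x);
    let poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
        - 0.284496736) * t
        + 0.254829592) * t;
    sign * (1.0 - poly * (-x * x).exp())
}

/// Hypothetical host reference: dx = dy * (Phi(x) + x * phi(x)).
pub fn gelu_backward_ref(dy: f64, x: f64) -> f64 {
    // Standard normal CDF via erf.
    let cdf = 0.5 * (1.0 + erf_approx(x / std::f64::consts::SQRT_2));
    // Standard normal density.
    let pdf = (-0.5 * x * x).exp() / (2.0 * std::f64::consts::PI).sqrt();
    dy * (cdf + x * pdf)
}
```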
baracuda_kernels_unary_gelu_backward_f64_can_implement
Pre-launch implementability check for unary_gelu_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_gelu_backward_f64_run
GELU (erf-based) backward, f64.
baracuda_kernels_unary_gelu_bf16_can_implement
Pre-launch implementability check for unary_gelu_bf16.
baracuda_kernels_unary_gelu_bf16_run
Unary elementwise gelu, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_gelu_bf16_strided_can_implement
Pre-launch implementability check for unary_gelu_bf16_strided.
baracuda_kernels_unary_gelu_bf16_strided_run
Unary elementwise gelu, bf16 dtype, strided path.
baracuda_kernels_unary_gelu_erf_bf16_can_implement
Pre-launch implementability check for unary_gelu_erf_bf16.
baracuda_kernels_unary_gelu_erf_bf16_run
Unary elementwise gelu_erf, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_gelu_erf_bf16_strided_can_implement
Pre-launch implementability check for unary_gelu_erf_bf16_strided.
baracuda_kernels_unary_gelu_erf_bf16_strided_run
Unary elementwise gelu_erf, bf16 dtype, strided path.
baracuda_kernels_unary_gelu_erf_f16_can_implement
Pre-launch implementability check for unary_gelu_erf_f16.
baracuda_kernels_unary_gelu_erf_f16_run
Unary elementwise gelu_erf, f16 dtype, contiguous fast path.
baracuda_kernels_unary_gelu_erf_f16_strided_can_implement
Pre-launch implementability check for unary_gelu_erf_f16_strided.
baracuda_kernels_unary_gelu_erf_f16_strided_run
Unary elementwise gelu_erf, f16 dtype, strided path.
baracuda_kernels_unary_gelu_erf_f32_can_implement
Pre-launch implementability check for unary_gelu_erf_f32.
baracuda_kernels_unary_gelu_erf_f32_run
Unary elementwise gelu_erf, f32 dtype, contiguous fast path.
baracuda_kernels_unary_gelu_erf_f32_strided_can_implement
Pre-launch implementability check for unary_gelu_erf_f32_strided.
baracuda_kernels_unary_gelu_erf_f32_strided_run
Unary elementwise gelu_erf, f32 dtype, strided path.
baracuda_kernels_unary_gelu_erf_f64_can_implement
Pre-launch implementability check for unary_gelu_erf_f64.
baracuda_kernels_unary_gelu_erf_f64_run
Unary elementwise gelu_erf, f64 dtype, contiguous fast path.
baracuda_kernels_unary_gelu_erf_f64_strided_can_implement
Pre-launch implementability check for unary_gelu_erf_f64_strided.
baracuda_kernels_unary_gelu_erf_f64_strided_run
Unary elementwise gelu_erf, f64 dtype, strided path.
baracuda_kernels_unary_gelu_f16_can_implement
Pre-launch implementability check for unary_gelu_f16.
baracuda_kernels_unary_gelu_f16_run
Unary elementwise gelu, f16 dtype, contiguous fast path.
baracuda_kernels_unary_gelu_f16_strided_can_implement
Pre-launch implementability check for unary_gelu_f16_strided.
baracuda_kernels_unary_gelu_f16_strided_run
Unary elementwise gelu, f16 dtype, strided path.
baracuda_kernels_unary_gelu_f32_can_implement
Pre-launch implementability check for unary_gelu_f32.
baracuda_kernels_unary_gelu_f32_run
Unary elementwise gelu, f32 dtype, contiguous fast path.
baracuda_kernels_unary_gelu_f32_strided_can_implement
Pre-launch implementability check for unary_gelu_f32_strided.
baracuda_kernels_unary_gelu_f32_strided_run
Unary elementwise gelu, f32 dtype, strided path.
baracuda_kernels_unary_gelu_f64_can_implement
Pre-launch implementability check for unary_gelu_f64.
baracuda_kernels_unary_gelu_f64_run
Unary elementwise gelu, f64 dtype, contiguous fast path.
baracuda_kernels_unary_gelu_f64_strided_can_implement
Pre-launch implementability check for unary_gelu_f64_strided.
baracuda_kernels_unary_gelu_f64_strided_run
Unary elementwise gelu, f64 dtype, strided path.
baracuda_kernels_unary_gelu_tanh_backward_bf16_can_implement
Pre-launch implementability check for unary_gelu_tanh_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_gelu_tanh_backward_bf16_run
GELU (tanh approximation) backward, bf16.
baracuda_kernels_unary_gelu_tanh_backward_f16_can_implement
Pre-launch implementability check for unary_gelu_tanh_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_gelu_tanh_backward_f16_run
GELU (tanh approximation) backward, f16.
baracuda_kernels_unary_gelu_tanh_backward_f32_can_implement
Pre-launch implementability check for unary_gelu_tanh_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_gelu_tanh_backward_f32_run
GELU (tanh approximation) backward, f32. Saved-x.
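The entry above does not spell out the gradient. Assuming the commonly used tanh approximation `0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x³)))`, the derivative follows from the product and chain rules. The sketch below is a host-side reference under that assumption; the function names are hypothetical and the constants used by the kernel are not confirmed by this page.

```rust
/// Forward pass of the assumed tanh-approximated GELU.
pub fn gelu_tanh_ref(x: f64) -> f64 {
    let k = (2.0 / std::f64::consts::PI).sqrt();
    let u = k * (x + 0.044715 * x * x * x);
    0.5 * x * (1.0 + u.tanh())
}

/// Hypothetical host reference for the backward pass, using saved x.
pub fn gelu_tanh_backward_ref(dy: f64, x: f64) -> f64 {
    const C: f64 = 0.044715;
    let k = (2.0 / std::f64::consts::PI).sqrt();
    let u = k * (x + C * x * x * x);
    let t = u.tanh();
    // du/dx for the inner polynomial.
    let du_dx = k * (1.0 + 3.0 * C * x * x);
    // d/dx [0.5 * x * (1 + tanh u)] = 0.5 * (1 + t) + 0.5 * x * (1 - t^2) * du/dx
    dy * (0.5 * (1.0 + t) + 0.5 * x * (1.0 - t * t) * du_dx)
}
```

Comparing the analytic gradient against a central finite difference of `gelu_tanh_ref` is a quick way to sanity-check either side.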
baracuda_kernels_unary_gelu_tanh_backward_f64_can_implement
Pre-launch implementability check for unary_gelu_tanh_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_gelu_tanh_backward_f64_run
GELU (tanh approximation) backward, f64.
baracuda_kernels_unary_gelu_tanh_bf16_can_implement
Pre-launch implementability check for unary_gelu_tanh_bf16.
baracuda_kernels_unary_gelu_tanh_bf16_run
Unary elementwise gelu_tanh, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_gelu_tanh_bf16_strided_can_implement
Pre-launch implementability check for unary_gelu_tanh_bf16_strided.
baracuda_kernels_unary_gelu_tanh_bf16_strided_run
Unary elementwise gelu_tanh, bf16 dtype, strided path.
baracuda_kernels_unary_gelu_tanh_f16_can_implement
Pre-launch implementability check for unary_gelu_tanh_f16.
baracuda_kernels_unary_gelu_tanh_f16_run
Unary elementwise gelu_tanh, f16 dtype, contiguous fast path.
baracuda_kernels_unary_gelu_tanh_f16_strided_can_implement
Pre-launch implementability check for unary_gelu_tanh_f16_strided.
baracuda_kernels_unary_gelu_tanh_f16_strided_run
Unary elementwise gelu_tanh, f16 dtype, strided path.
baracuda_kernels_unary_gelu_tanh_f32_can_implement
Pre-launch implementability check for unary_gelu_tanh_f32.
baracuda_kernels_unary_gelu_tanh_f32_run
Unary elementwise gelu_tanh, f32 dtype, contiguous fast path.
baracuda_kernels_unary_gelu_tanh_f32_strided_can_implement
Pre-launch implementability check for unary_gelu_tanh_f32_strided.
baracuda_kernels_unary_gelu_tanh_f32_strided_run
Unary elementwise gelu_tanh, f32 dtype, strided path.
baracuda_kernels_unary_gelu_tanh_f64_can_implement
Pre-launch implementability check for unary_gelu_tanh_f64.
baracuda_kernels_unary_gelu_tanh_f64_run
Unary elementwise gelu_tanh, f64 dtype, contiguous fast path.
baracuda_kernels_unary_gelu_tanh_f64_strided_can_implement
Pre-launch implementability check for unary_gelu_tanh_f64_strided.
baracuda_kernels_unary_gelu_tanh_f64_strided_run
Unary elementwise gelu_tanh, f64 dtype, strided path.
baracuda_kernels_unary_hardshrink_backward_bf16_can_implement
Pre-launch implementability check for unary_hardshrink_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardshrink_backward_bf16_run
Hardshrink backward, bf16.
baracuda_kernels_unary_hardshrink_backward_f16_can_implement
Pre-launch implementability check for unary_hardshrink_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardshrink_backward_f16_run
Hardshrink backward, f16.
baracuda_kernels_unary_hardshrink_backward_f32_can_implement
Pre-launch implementability check for unary_hardshrink_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardshrink_backward_f32_run
Hardshrink backward, f32. dx = (|x| > λ) ? dy : 0 with λ=0.5. Saved-x.
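The hardshrink gradient is a mask on the saved input. A hypothetical host-side reference with the fixed λ = 0.5 stated above (the helper name is an assumption for illustration):

```rust
/// Fixed threshold documented for the hardshrink kernels.
pub const HARDSHRINK_LAMBDA: f32 = 0.5;

/// Hypothetical host reference: dx[i] = dy[i] if |x[i]| > lambda, else 0.
pub fn hardshrink_backward_ref(dy: &[f32], x: &[f32]) -> Vec<f32> {
    assert_eq!(dy.len(), x.len(), "dy and x must have the same length");
    dy.iter()
        .zip(x)
        .map(|(&g, &xi)| if xi.abs() > HARDSHRINK_LAMBDA { g } else { 0.0 })
        .collect()
}
```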
baracuda_kernels_unary_hardshrink_backward_f64_can_implement
Pre-launch implementability check for unary_hardshrink_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardshrink_backward_f64_run
Hardshrink backward, f64.
baracuda_kernels_unary_hardshrink_bf16_can_implement
Pre-launch implementability check for unary_hardshrink_bf16.
baracuda_kernels_unary_hardshrink_bf16_run
Unary elementwise hardshrink (λ=0.5), bf16, contig.
baracuda_kernels_unary_hardshrink_bf16_strided_can_implement
Pre-launch implementability check for unary_hardshrink_bf16_strided.
baracuda_kernels_unary_hardshrink_bf16_strided_run
Unary elementwise hardshrink (λ=0.5), bf16, strided.
baracuda_kernels_unary_hardshrink_f16_can_implement
Pre-launch implementability check for unary_hardshrink_f16.
baracuda_kernels_unary_hardshrink_f16_run
Unary elementwise hardshrink (λ=0.5), f16, contig.
baracuda_kernels_unary_hardshrink_f16_strided_can_implement
Pre-launch implementability check for unary_hardshrink_f16_strided.
baracuda_kernels_unary_hardshrink_f16_strided_run
Unary elementwise hardshrink (λ=0.5), f16, strided.
baracuda_kernels_unary_hardshrink_f32_can_implement
Pre-launch implementability check for unary_hardshrink_f32.
baracuda_kernels_unary_hardshrink_f32_run
Unary elementwise hardshrink (λ=0.5), f32, contig.
baracuda_kernels_unary_hardshrink_f32_strided_can_implement
Pre-launch implementability check for unary_hardshrink_f32_strided.
baracuda_kernels_unary_hardshrink_f32_strided_run
Unary elementwise hardshrink (λ=0.5), f32, strided.
baracuda_kernels_unary_hardshrink_f64_can_implement
Pre-launch implementability check for unary_hardshrink_f64.
baracuda_kernels_unary_hardshrink_f64_run
Unary elementwise hardshrink (λ=0.5), f64, contig.
baracuda_kernels_unary_hardshrink_f64_strided_can_implement
Pre-launch implementability check for unary_hardshrink_f64_strided.
baracuda_kernels_unary_hardshrink_f64_strided_run
Unary elementwise hardshrink (λ=0.5), f64, strided.
baracuda_kernels_unary_hardsigmoid_backward_bf16_can_implement
Pre-launch implementability check for unary_hardsigmoid_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardsigmoid_backward_bf16_run
Hardsigmoid backward, bf16.
baracuda_kernels_unary_hardsigmoid_backward_f16_can_implement
Pre-launch implementability check for unary_hardsigmoid_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardsigmoid_backward_f16_run
Hardsigmoid backward, f16.
baracuda_kernels_unary_hardsigmoid_backward_f32_can_implement
Pre-launch implementability check for unary_hardsigmoid_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardsigmoid_backward_f32_run
Hardsigmoid backward, f32. dx = (-3 < x < 3) ? dy / 6 : 0. Saved-x.
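Inside the open interval (-3, 3) the forward function has slope 1/6, and it is flat elsewhere. A hypothetical host-side reference for that rule (the helper name is not part of this crate):

```rust
/// Hypothetical host reference: dx[i] = dy[i] / 6 if -3 < x[i] < 3, else 0.
pub fn hardsigmoid_backward_ref(dy: &[f32], x: &[f32]) -> Vec<f32> {
    assert_eq!(dy.len(), x.len(), "dy and x must have the same length");
    dy.iter()
        .zip(x)
        .map(|(&g, &xi)| if xi > -3.0 && xi < 3.0 { g / 6.0 } else { 0.0 })
        .collect()
}
```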
baracuda_kernels_unary_hardsigmoid_backward_f64_can_implement
Pre-launch implementability check for unary_hardsigmoid_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardsigmoid_backward_f64_run
Hardsigmoid backward, f64.
baracuda_kernels_unary_hardsigmoid_bf16_can_implement
Pre-launch implementability check for unary_hardsigmoid_bf16.
baracuda_kernels_unary_hardsigmoid_bf16_run
Unary elementwise hardsigmoid, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_hardsigmoid_bf16_strided_can_implement
Pre-launch implementability check for unary_hardsigmoid_bf16_strided.
baracuda_kernels_unary_hardsigmoid_bf16_strided_run
Unary elementwise hardsigmoid, bf16 dtype, strided path.
baracuda_kernels_unary_hardsigmoid_f16_can_implement
Pre-launch implementability check for unary_hardsigmoid_f16.
baracuda_kernels_unary_hardsigmoid_f16_run
Unary elementwise hardsigmoid, f16 dtype, contiguous fast path.
baracuda_kernels_unary_hardsigmoid_f16_strided_can_implement
Pre-launch implementability check for unary_hardsigmoid_f16_strided.
baracuda_kernels_unary_hardsigmoid_f16_strided_run
Unary elementwise hardsigmoid, f16 dtype, strided path.
baracuda_kernels_unary_hardsigmoid_f32_can_implement
Pre-launch implementability check for unary_hardsigmoid_f32.
baracuda_kernels_unary_hardsigmoid_f32_run
Unary elementwise hardsigmoid, f32 dtype, contiguous fast path.
baracuda_kernels_unary_hardsigmoid_f32_strided_can_implement
Pre-launch implementability check for unary_hardsigmoid_f32_strided.
baracuda_kernels_unary_hardsigmoid_f32_strided_run
Unary elementwise hardsigmoid, f32 dtype, strided path.
baracuda_kernels_unary_hardsigmoid_f64_can_implement
Pre-launch implementability check for unary_hardsigmoid_f64.
baracuda_kernels_unary_hardsigmoid_f64_run
Unary elementwise hardsigmoid, f64 dtype, contiguous fast path.
baracuda_kernels_unary_hardsigmoid_f64_strided_can_implement
Pre-launch implementability check for unary_hardsigmoid_f64_strided.
baracuda_kernels_unary_hardsigmoid_f64_strided_run
Unary elementwise hardsigmoid, f64 dtype, strided path.
baracuda_kernels_unary_hardswish_backward_bf16_can_implement
Pre-launch implementability check for unary_hardswish_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardswish_backward_bf16_run
Hardswish backward, bf16.
baracuda_kernels_unary_hardswish_backward_f16_can_implement
Pre-launch implementability check for unary_hardswish_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardswish_backward_f16_run
Hardswish backward, f16.
baracuda_kernels_unary_hardswish_backward_f32_can_implement
Pre-launch implementability check for unary_hardswish_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardswish_backward_f32_run
Hardswish backward, f32. Three-region piecewise: dx = 0 below -3, dx = dy above 3, and dx = dy * (2x + 3) / 6 in the middle region. Saved-x.
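A host-side sketch of the three-region gradient follows. The page does not say which branch owns the boundary points x = -3 and x = 3; this sketch assigns both to the middle branch, which is an assumption, and the helper name is hypothetical.

```rust
/// Hypothetical host reference for hardswish backward using saved x.
/// Boundary handling at x = -3 and x = 3 is an assumption (middle branch).
pub fn hardswish_backward_ref(dy: &[f32], x: &[f32]) -> Vec<f32> {
    assert_eq!(dy.len(), x.len(), "dy and x must have the same length");
    dy.iter()
        .zip(x)
        .map(|(&g, &xi)| {
            if xi < -3.0 {
                0.0 // flat region: forward output is 0
            } else if xi > 3.0 {
                g // identity region: forward output is x
            } else {
                g * (2.0 * xi + 3.0) / 6.0 // d/dx [x * (x + 3) / 6]
            }
        })
        .collect()
}
```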
baracuda_kernels_unary_hardswish_backward_f64_can_implement
Pre-launch implementability check for unary_hardswish_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardswish_backward_f64_run
Hardswish backward, f64.
baracuda_kernels_unary_hardswish_bf16_can_implement
Pre-launch implementability check for unary_hardswish_bf16.
baracuda_kernels_unary_hardswish_bf16_run
Unary elementwise hardswish, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_hardswish_bf16_strided_can_implement
Pre-launch implementability check for unary_hardswish_bf16_strided.
baracuda_kernels_unary_hardswish_bf16_strided_run
Unary elementwise hardswish, bf16 dtype, strided path.
baracuda_kernels_unary_hardswish_f16_can_implement
Pre-launch implementability check for unary_hardswish_f16.
baracuda_kernels_unary_hardswish_f16_run
Unary elementwise hardswish, f16 dtype, contiguous fast path.
baracuda_kernels_unary_hardswish_f16_strided_can_implement
Pre-launch implementability check for unary_hardswish_f16_strided.
baracuda_kernels_unary_hardswish_f16_strided_run
Unary elementwise hardswish, f16 dtype, strided path.
baracuda_kernels_unary_hardswish_f32_can_implement
Pre-launch implementability check for unary_hardswish_f32.
baracuda_kernels_unary_hardswish_f32_run
Unary elementwise hardswish, f32 dtype, contiguous fast path.
baracuda_kernels_unary_hardswish_f32_strided_can_implement
Pre-launch implementability check for unary_hardswish_f32_strided.
baracuda_kernels_unary_hardswish_f32_strided_run
Unary elementwise hardswish, f32 dtype, strided path.
baracuda_kernels_unary_hardswish_f64_can_implement
Pre-launch implementability check for unary_hardswish_f64.
baracuda_kernels_unary_hardswish_f64_run
Unary elementwise hardswish, f64 dtype, contiguous fast path.
baracuda_kernels_unary_hardswish_f64_strided_can_implement
Pre-launch implementability check for unary_hardswish_f64_strided.
baracuda_kernels_unary_hardswish_f64_strided_run
Unary elementwise hardswish, f64 dtype, strided path.
baracuda_kernels_unary_hardtanh_backward_bf16_can_implement
Pre-launch implementability check for unary_hardtanh_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardtanh_backward_bf16_run
Hardtanh backward, bf16.
baracuda_kernels_unary_hardtanh_backward_f16_can_implement
Pre-launch implementability check for unary_hardtanh_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardtanh_backward_f16_run
Hardtanh backward, f16.
baracuda_kernels_unary_hardtanh_backward_f32_can_implement
Pre-launch implementability check for unary_hardtanh_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardtanh_backward_f32_run
Hardtanh backward, f32. dx = (-1 < x < 1) ? dy : 0. Saved-x.
baracuda_kernels_unary_hardtanh_backward_f64_can_implement
Pre-launch implementability check for unary_hardtanh_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardtanh_backward_f64_run
Hardtanh backward, f64.
baracuda_kernels_unary_hardtanh_bf16_can_implement
Pre-launch implementability check for unary_hardtanh_bf16.
baracuda_kernels_unary_hardtanh_bf16_run
Unary elementwise hardtanh, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_hardtanh_bf16_strided_can_implement
Pre-launch implementability check for unary_hardtanh_bf16_strided.
baracuda_kernels_unary_hardtanh_bf16_strided_run
Unary elementwise hardtanh, bf16 dtype, strided path.
baracuda_kernels_unary_hardtanh_f16_can_implement
Pre-launch implementability check for unary_hardtanh_f16.
baracuda_kernels_unary_hardtanh_f16_run
Unary elementwise hardtanh, f16 dtype, contiguous fast path.
baracuda_kernels_unary_hardtanh_f16_strided_can_implement
Pre-launch implementability check for unary_hardtanh_f16_strided.
baracuda_kernels_unary_hardtanh_f16_strided_run
Unary elementwise hardtanh, f16 dtype, strided path.
baracuda_kernels_unary_hardtanh_f32_can_implement
Pre-launch implementability check for unary_hardtanh_f32.
baracuda_kernels_unary_hardtanh_f32_run
Unary elementwise hardtanh, f32 dtype, contiguous fast path.
baracuda_kernels_unary_hardtanh_f32_strided_can_implement
Pre-launch implementability check for unary_hardtanh_f32_strided.
baracuda_kernels_unary_hardtanh_f32_strided_run
Unary elementwise hardtanh, f32 dtype, strided path.
baracuda_kernels_unary_hardtanh_f64_can_implement
Pre-launch implementability check for unary_hardtanh_f64.
baracuda_kernels_unary_hardtanh_f64_run
Unary elementwise hardtanh, f64 dtype, contiguous fast path.
baracuda_kernels_unary_hardtanh_f64_strided_can_implement
Pre-launch implementability check for unary_hardtanh_f64_strided.
baracuda_kernels_unary_hardtanh_f64_strided_run
Unary elementwise hardtanh, f64 dtype, strided path.
baracuda_kernels_unary_leaky_relu_backward_bf16_can_implement
Pre-launch implementability check for unary_leaky_relu_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_leaky_relu_backward_bf16_run
LeakyReLU backward, bf16.
baracuda_kernels_unary_leaky_relu_backward_f16_can_implement
Pre-launch implementability check for unary_leaky_relu_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_leaky_relu_backward_f16_run
LeakyReLU backward, f16.
baracuda_kernels_unary_leaky_relu_backward_f32_can_implement
Pre-launch implementability check for unary_leaky_relu_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_leaky_relu_backward_f32_run
LeakyReLU backward, f32. dx = (x > 0) ? dy : dy·α with α=0.01. Saved-x.
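The rule above scales the incoming gradient by α on the non-positive side. A hypothetical host-side reference with the fixed α = 0.01 (the constant and function names are assumptions for illustration):

```rust
/// Fixed negative slope documented for the leaky_relu kernels.
pub const LEAKY_RELU_ALPHA: f32 = 0.01;

/// Hypothetical host reference: dx[i] = dy[i] if x[i] > 0, else dy[i] * alpha.
pub fn leaky_relu_backward_ref(dy: &[f32], x: &[f32]) -> Vec<f32> {
    assert_eq!(dy.len(), x.len(), "dy and x must have the same length");
    dy.iter()
        .zip(x)
        .map(|(&g, &xi)| if xi > 0.0 { g } else { g * LEAKY_RELU_ALPHA })
        .collect()
}
```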
baracuda_kernels_unary_leaky_relu_backward_f64_can_implement
Pre-launch implementability check for unary_leaky_relu_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_leaky_relu_backward_f64_run
LeakyReLU backward, f64.
baracuda_kernels_unary_leaky_relu_bf16_can_implement
Pre-launch implementability check for unary_leaky_relu_bf16.
baracuda_kernels_unary_leaky_relu_bf16_run
Unary elementwise leaky_relu (α=0.01), bf16, contig.
baracuda_kernels_unary_leaky_relu_bf16_strided_can_implement
Pre-launch implementability check for unary_leaky_relu_bf16_strided.
baracuda_kernels_unary_leaky_relu_bf16_strided_run
Unary elementwise leaky_relu (α=0.01), bf16, strided.
baracuda_kernels_unary_leaky_relu_f16_can_implement
Pre-launch implementability check for unary_leaky_relu_f16.
baracuda_kernels_unary_leaky_relu_f16_run
Unary elementwise leaky_relu (α=0.01), f16, contig.
baracuda_kernels_unary_leaky_relu_f16_strided_can_implement
Pre-launch implementability check for unary_leaky_relu_f16_strided.
baracuda_kernels_unary_leaky_relu_f16_strided_run
Unary elementwise leaky_relu (α=0.01), f16, strided.
baracuda_kernels_unary_leaky_relu_f32_can_implement
Pre-launch implementability check for unary_leaky_relu_f32.
baracuda_kernels_unary_leaky_relu_f32_run
Unary elementwise leaky_relu (α=0.01), f32, contig.
baracuda_kernels_unary_leaky_relu_f32_strided_can_implement
Pre-launch implementability check for unary_leaky_relu_f32_strided.
baracuda_kernels_unary_leaky_relu_f32_strided_run
Unary elementwise leaky_relu (α=0.01), f32, strided.
baracuda_kernels_unary_leaky_relu_f64_can_implement
Pre-launch implementability check for unary_leaky_relu_f64.
baracuda_kernels_unary_leaky_relu_f64_run
Unary elementwise leaky_relu (α=0.01), f64, contig.
baracuda_kernels_unary_leaky_relu_f64_strided_can_implement
Pre-launch implementability check for unary_leaky_relu_f64_strided.
baracuda_kernels_unary_leaky_relu_f64_strided_run
Unary elementwise leaky_relu (α=0.01), f64, strided.
baracuda_kernels_unary_lgamma_bf16_can_implement
Pre-launch implementability check for unary_lgamma_bf16.
baracuda_kernels_unary_lgamma_bf16_run
Unary elementwise lgamma, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_lgamma_bf16_strided_can_implement
Pre-launch implementability check for unary_lgamma_bf16_strided.
baracuda_kernels_unary_lgamma_bf16_strided_run
Unary elementwise lgamma, bf16 dtype, strided path.
baracuda_kernels_unary_lgamma_f16_can_implement
Pre-launch implementability check for unary_lgamma_f16.
baracuda_kernels_unary_lgamma_f16_run
Unary elementwise lgamma, f16 dtype, contiguous fast path.
baracuda_kernels_unary_lgamma_f16_strided_can_implement
Pre-launch implementability check for unary_lgamma_f16_strided.
baracuda_kernels_unary_lgamma_f16_strided_run
Unary elementwise lgamma, f16 dtype, strided path.
baracuda_kernels_unary_lgamma_f32_can_implement
Pre-launch implementability check for unary_lgamma_f32.
baracuda_kernels_unary_lgamma_f32_run
Unary elementwise lgamma, f32 dtype, contiguous fast path.
baracuda_kernels_unary_lgamma_f32_strided_can_implement
Pre-launch implementability check for unary_lgamma_f32_strided.
baracuda_kernels_unary_lgamma_f32_strided_run
Unary elementwise lgamma, f32 dtype, strided path.
baracuda_kernels_unary_lgamma_f64_can_implement
Pre-launch implementability check for unary_lgamma_f64.
baracuda_kernels_unary_lgamma_f64_run
Unary elementwise lgamma, f64 dtype, contiguous fast path.
baracuda_kernels_unary_lgamma_f64_strided_can_implement
Pre-launch implementability check for unary_lgamma_f64_strided.
baracuda_kernels_unary_lgamma_f64_strided_run
Unary elementwise lgamma, f64 dtype, strided path.
baracuda_kernels_unary_log1p_backward_bf16_can_implement
Pre-launch implementability check for unary_log1p_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_log1p_backward_bf16_run
Log1p backward, bf16.
baracuda_kernels_unary_log1p_backward_f16_can_implement
Pre-launch implementability check for unary_log1p_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_log1p_backward_f16_run
Log1p backward, f16.
baracuda_kernels_unary_log1p_backward_f32_can_implement
Pre-launch implementability check for unary_log1p_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_log1p_backward_f32_run
Log1p backward, f32. dx = dy / (1 + x).
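Since the formula depends on `x`, the sketch below takes the forward input rather than the forward output. It is a hypothetical host-side reference, not a function of this crate; note that `x = -1` yields an infinite or NaN gradient, matching the pole of `log1p`.

```rust
/// Hypothetical host reference for log1p backward: dx[i] = dy[i] / (1 + x[i]).
pub fn log1p_backward_ref(dy: &[f32], x: &[f32]) -> Vec<f32> {
    assert_eq!(dy.len(), x.len(), "dy and x must have the same length");
    dy.iter().zip(x).map(|(&g, &xi)| g / (1.0 + xi)).collect()
}
```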
baracuda_kernels_unary_log1p_backward_f64_can_implement
Pre-launch implementability check for unary_log1p_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_log1p_backward_f64_run
Log1p backward, f64.
baracuda_kernels_unary_log1p_bf16_can_implement
Pre-launch implementability check for unary_log1p_bf16.
baracuda_kernels_unary_log1p_bf16_run
Unary elementwise log1p, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_log1p_bf16_strided_can_implement
Pre-launch implementability check for unary_log1p_bf16_strided.
baracuda_kernels_unary_log1p_bf16_strided_run
Unary elementwise log1p, bf16 dtype, strided path.
baracuda_kernels_unary_log1p_f16_can_implement
Pre-launch implementability check for unary_log1p_f16.
baracuda_kernels_unary_log1p_f16_run
Unary elementwise log1p, f16 dtype, contiguous fast path.
baracuda_kernels_unary_log1p_f16_strided_can_implement
Pre-launch implementability check for unary_log1p_f16_strided.
baracuda_kernels_unary_log1p_f16_strided_run
Unary elementwise log1p, f16 dtype, strided path.
baracuda_kernels_unary_log1p_f32_can_implement
Pre-launch implementability check for unary_log1p_f32.
baracuda_kernels_unary_log1p_f32_run
Unary elementwise log1p, f32 dtype, contiguous fast path.
baracuda_kernels_unary_log1p_f32_strided_can_implement
Pre-launch implementability check for unary_log1p_f32_strided.
baracuda_kernels_unary_log1p_f32_strided_run
Unary elementwise log1p, f32 dtype, strided path.
baracuda_kernels_unary_log1p_f64_can_implement
Pre-launch implementability check for unary_log1p_f64.
baracuda_kernels_unary_log1p_f64_run
Unary elementwise log1p, f64 dtype, contiguous fast path.
baracuda_kernels_unary_log1p_f64_strided_can_implement
Pre-launch implementability check for unary_log1p_f64_strided.
baracuda_kernels_unary_log1p_f64_strided_run
Unary elementwise log1p, f64 dtype, strided path.
baracuda_kernels_unary_log2_backward_bf16_can_implement
Pre-launch implementability check for unary_log2_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_log2_backward_bf16_run
Log2 backward, bf16.
baracuda_kernels_unary_log2_backward_f16_can_implement
Pre-launch implementability check for unary_log2_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_log2_backward_f16_run
Log2 backward, f16.
baracuda_kernels_unary_log2_backward_f32_can_implement
Pre-launch implementability check for unary_log2_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_log2_backward_f32_run
Log2 backward, f32. dx = dy / (x * ln(2)).
baracuda_kernels_unary_log2_backward_f64_can_implement
Pre-launch implementability check for unary_log2_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_log2_backward_f64_run
Log2 backward, f64.
baracuda_kernels_unary_log2_bf16_can_implement
Pre-launch implementability check for unary_log2_bf16.
baracuda_kernels_unary_log2_bf16_run
Unary elementwise log2, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_log2_bf16_strided_can_implement
Pre-launch implementability check for unary_log2_bf16_strided.
baracuda_kernels_unary_log2_bf16_strided_run
Unary elementwise log2, bf16 dtype, strided path.
baracuda_kernels_unary_log2_f16_can_implement
Pre-launch implementability check for unary_log2_f16.
baracuda_kernels_unary_log2_f16_run
Unary elementwise log2, f16 dtype, contiguous fast path.
baracuda_kernels_unary_log2_f16_strided_can_implement
Pre-launch implementability check for unary_log2_f16_strided.
baracuda_kernels_unary_log2_f16_strided_run
Unary elementwise log2, f16 dtype, strided path.
baracuda_kernels_unary_log2_f32_can_implement
Pre-launch implementability check for unary_log2_f32.
baracuda_kernels_unary_log2_f32_run
Unary elementwise log2, f32 dtype, contiguous fast path.
baracuda_kernels_unary_log2_f32_strided_can_implement
Pre-launch implementability check for unary_log2_f32_strided.
baracuda_kernels_unary_log2_f32_strided_run
Unary elementwise log2, f32 dtype, strided path.
baracuda_kernels_unary_log2_f64_can_implement
Pre-launch implementability check for unary_log2_f64.
baracuda_kernels_unary_log2_f64_run
Unary elementwise log2, f64 dtype, contiguous fast path.
baracuda_kernels_unary_log2_f64_strided_can_implement
Pre-launch implementability check for unary_log2_f64_strided.
baracuda_kernels_unary_log2_f64_strided_run
Unary elementwise log2, f64 dtype, strided path.
baracuda_kernels_unary_log10_backward_bf16_can_implement
Pre-launch implementability check for unary_log10_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_log10_backward_bf16_run
Log10 backward, bf16.
baracuda_kernels_unary_log10_backward_f16_can_implement
Pre-launch implementability check for unary_log10_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_log10_backward_f16_run
Log10 backward, f16.
baracuda_kernels_unary_log10_backward_f32_can_implement
Pre-launch implementability check for unary_log10_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_log10_backward_f32_run
Log10 backward, f32. dx = dy / (x * ln(10)).
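The log2 and log10 backward formulas differ only in the constant ln(b), and with base e the same expression reduces to dy / x. A minimal host-side sketch under that observation follows; `log_base_backward_ref` and its `base` parameter are hypothetical and do not correspond to any entry point in this crate.

```rust
/// Hypothetical CPU reference for log-base-b backward:
/// dx[i] = dy[i] / (x[i] * ln(b)). Not part of baracuda-kernels-sys.
pub fn log_base_backward_ref(x: &[f32], dy: &[f32], base: f32) -> Vec<f32> {
    assert_eq!(x.len(), dy.len(), "x and dy must have the same length");
    // ln(b) is constant across elements, so compute it once.
    let ln_base = base.ln();
    x.iter()
        .zip(dy)
        .map(|(&xi, &dyi)| dyi / (xi * ln_base))
        .collect()
}
```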
baracuda_kernels_unary_log10_backward_f64_can_implement
Pre-launch implementability check for unary_log10_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_log10_backward_f64_run
Log10 backward, f64.
baracuda_kernels_unary_log10_bf16_can_implement
Pre-launch implementability check for unary_log10_bf16.
baracuda_kernels_unary_log10_bf16_run
Unary elementwise log10, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_log10_bf16_strided_can_implement
Pre-launch implementability check for unary_log10_bf16_strided.
baracuda_kernels_unary_log10_bf16_strided_run
Unary elementwise log10, bf16 dtype, strided path.
baracuda_kernels_unary_log10_f16_can_implement
Pre-launch implementability check for unary_log10_f16.
baracuda_kernels_unary_log10_f16_run
Unary elementwise log10, f16 dtype, contiguous fast path.
baracuda_kernels_unary_log10_f16_strided_can_implement
Pre-launch implementability check for unary_log10_f16_strided.
baracuda_kernels_unary_log10_f16_strided_run
Unary elementwise log10, f16 dtype, strided path.
baracuda_kernels_unary_log10_f32_can_implement
Pre-launch implementability check for unary_log10_f32.
baracuda_kernels_unary_log10_f32_run
Unary elementwise log10, f32 dtype, contiguous fast path.
baracuda_kernels_unary_log10_f32_strided_can_implement
Pre-launch implementability check for unary_log10_f32_strided.
baracuda_kernels_unary_log10_f32_strided_run
Unary elementwise log10, f32 dtype, strided path.
baracuda_kernels_unary_log10_f64_can_implement
Pre-launch implementability check for unary_log10_f64.
baracuda_kernels_unary_log10_f64_run
Unary elementwise log10, f64 dtype, contiguous fast path.
baracuda_kernels_unary_log10_f64_strided_can_implement
Pre-launch implementability check for unary_log10_f64_strided.
baracuda_kernels_unary_log10_f64_strided_run
Unary elementwise log10, f64 dtype, strided path.
baracuda_kernels_unary_log_backward_bf16_can_implement
Pre-launch implementability check for unary_log_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_log_backward_bf16_run
Log backward, bf16.
baracuda_kernels_unary_log_backward_f16_can_implement
Pre-launch implementability check for unary_log_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_log_backward_f16_run
Log backward, f16.
baracuda_kernels_unary_log_backward_f32_can_implement
Pre-launch implementability check for unary_log_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_log_backward_f32_run
Log backward, f32. dx = dy / x.
baracuda_kernels_unary_log_backward_f64_can_implement
Pre-launch implementability check for unary_log_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_log_backward_f64_run
Log backward, f64.
baracuda_kernels_unary_log_bf16_can_implement
Pre-launch implementability check for unary_log_bf16.
baracuda_kernels_unary_log_bf16_run
Unary elementwise log, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_log_bf16_strided_can_implement
Pre-launch implementability check for unary_log_bf16_strided.
baracuda_kernels_unary_log_bf16_strided_run
Unary elementwise log, bf16 dtype, strided path.
baracuda_kernels_unary_log_f16_can_implement
Pre-launch implementability check for unary_log_f16.
baracuda_kernels_unary_log_f16_run
Unary elementwise log, f16 dtype, contiguous fast path.
baracuda_kernels_unary_log_f16_strided_can_implement
Pre-launch implementability check for unary_log_f16_strided.
baracuda_kernels_unary_log_f16_strided_run
Unary elementwise log, f16 dtype, strided path.
baracuda_kernels_unary_log_f32_can_implement
Pre-launch implementability check for unary_log_f32.
baracuda_kernels_unary_log_f32_run
Unary elementwise log, f32 dtype, contiguous fast path.
baracuda_kernels_unary_log_f32_strided_can_implement
Pre-launch implementability check for unary_log_f32_strided.
baracuda_kernels_unary_log_f32_strided_run
Unary elementwise log, f32 dtype, strided path.
baracuda_kernels_unary_log_f64_can_implement
Pre-launch implementability check for unary_log_f64.
baracuda_kernels_unary_log_f64_run
Unary elementwise log, f64 dtype, contiguous fast path.
baracuda_kernels_unary_log_f64_strided_can_implement
Pre-launch implementability check for unary_log_f64_strided.
baracuda_kernels_unary_log_f64_strided_run
Unary elementwise log, f64 dtype, strided path.
baracuda_kernels_unary_logit_backward_bf16_can_implement
Pre-launch implementability check for unary_logit_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_logit_backward_bf16_run
Logit backward, bf16.
baracuda_kernels_unary_logit_backward_f16_can_implement
Pre-launch implementability check for unary_logit_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_logit_backward_f16_run
Logit backward, f16.
baracuda_kernels_unary_logit_backward_f32_can_implement
Pre-launch implementability check for unary_logit_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_logit_backward_f32_run
Logit backward, f32. dx = dy / (x * (1 - x)). Domain 0 < x < 1.
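As a sanity check of the formula above, here is a minimal host-side sketch. `logit_backward_ref` is a hypothetical name, and the sketch does not enforce the stated domain 0 < x < 1; how the kernel behaves outside that domain is not specified here.

```rust
/// Hypothetical CPU reference for the documented formula
/// dx[i] = dy[i] / (x[i] * (1 - x[i])). Not part of baracuda-kernels-sys.
/// Inputs are assumed to satisfy 0 < x < 1.
pub fn logit_backward_ref(x: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(x.len(), dy.len(), "x and dy must have the same length");
    x.iter()
        .zip(dy)
        .map(|(&xi, &dyi)| dyi / (xi * (1.0 - xi)))
        .collect()
}
```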
baracuda_kernels_unary_logit_backward_f64_can_implement
Pre-launch implementability check for unary_logit_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_logit_backward_f64_run
Logit backward, f64.
baracuda_kernels_unary_logit_bf16_can_implement
Pre-launch implementability check for unary_logit_bf16.
baracuda_kernels_unary_logit_bf16_run
Unary elementwise logit, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_logit_bf16_strided_can_implement
Pre-launch implementability check for unary_logit_bf16_strided.
baracuda_kernels_unary_logit_bf16_strided_run
Unary elementwise logit, bf16 dtype, strided path.
baracuda_kernels_unary_logit_f16_can_implement
Pre-launch implementability check for unary_logit_f16.
baracuda_kernels_unary_logit_f16_run
Unary elementwise logit, f16 dtype, contiguous fast path.
baracuda_kernels_unary_logit_f16_strided_can_implement
Pre-launch implementability check for unary_logit_f16_strided.
baracuda_kernels_unary_logit_f16_strided_run
Unary elementwise logit, f16 dtype, strided path.
baracuda_kernels_unary_logit_f32_can_implement
Pre-launch implementability check for unary_logit_f32.
baracuda_kernels_unary_logit_f32_run
Unary elementwise logit, f32 dtype, contiguous fast path.
baracuda_kernels_unary_logit_f32_strided_can_implement
Pre-launch implementability check for unary_logit_f32_strided.
baracuda_kernels_unary_logit_f32_strided_run
Unary elementwise logit, f32 dtype, strided path.
baracuda_kernels_unary_logit_f64_can_implement
Pre-launch implementability check for unary_logit_f64.
baracuda_kernels_unary_logit_f64_run
Unary elementwise logit, f64 dtype, contiguous fast path.
baracuda_kernels_unary_logit_f64_strided_can_implement
Pre-launch implementability check for unary_logit_f64_strided.
baracuda_kernels_unary_logit_f64_strided_run
Unary elementwise logit, f64 dtype, strided path.
baracuda_kernels_unary_mish_backward_bf16_can_implement
Pre-launch implementability check for unary_mish_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_mish_backward_bf16_run
Mish backward, bf16.
baracuda_kernels_unary_mish_backward_f16_can_implement
Pre-launch implementability check for unary_mish_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_mish_backward_f16_run
Mish backward, f16.
baracuda_kernels_unary_mish_backward_f32_can_implement
Pre-launch implementability check for unary_mish_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_mish_backward_f32_run
Mish backward, f32. dx = dy * (tanh(sp) + x * s * (1 - tanh(sp)^2)), where sp = softplus(x) and s = sigmoid(x). Saved-x.
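The mish gradient can be verified numerically on the host. The sketch below assumes mish(x) = x * tanh(softplus(x)) and takes the `s` term in the formula to be sigmoid(x), which is the derivative of softplus. All function names are hypothetical, and it uses f64 for a tighter finite-difference comparison even though the entry above is the f32 kernel.

```rust
/// Numerically stable softplus: ln(1 + e^x).
fn softplus(x: f64) -> f64 {
    x.max(0.0) + (-x.abs()).exp().ln_1p()
}

/// Logistic sigmoid, the derivative of softplus.
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

/// Forward mish, used only to check the gradient below.
pub fn mish(x: f64) -> f64 {
    x * softplus(x).tanh()
}

/// Hypothetical CPU reference for the documented saved-x formula:
/// dx = dy * (tanh(sp) + x * s * (1 - tanh(sp)^2)).
pub fn mish_backward_ref(x: f64, dy: f64) -> f64 {
    let t = softplus(x).tanh();
    dy * (t + x * sigmoid(x) * (1.0 - t * t))
}
```

Comparing `mish_backward_ref(x, 1.0)` with a central difference of `mish` is a quick way to confirm the formula before comparing against device output.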
baracuda_kernels_unary_mish_backward_f64_can_implement
Pre-launch implementability check for unary_mish_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_mish_backward_f64_run
Mish backward, f64.
baracuda_kernels_unary_mish_bf16_can_implement
Pre-launch implementability check for unary_mish_bf16.
baracuda_kernels_unary_mish_bf16_run
Unary elementwise mish, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_mish_bf16_strided_can_implement
Pre-launch implementability check for unary_mish_bf16_strided.
baracuda_kernels_unary_mish_bf16_strided_run
Unary elementwise mish, bf16 dtype, strided path.
baracuda_kernels_unary_mish_f16_can_implement
Pre-launch implementability check for unary_mish_f16.
baracuda_kernels_unary_mish_f16_run
Unary elementwise mish, f16 dtype, contiguous fast path.
baracuda_kernels_unary_mish_f16_strided_can_implement
Pre-launch implementability check for unary_mish_f16_strided.
baracuda_kernels_unary_mish_f16_strided_run
Unary elementwise mish, f16 dtype, strided path.
baracuda_kernels_unary_mish_f32_can_implement
Pre-launch implementability check for unary_mish_f32.
baracuda_kernels_unary_mish_f32_run
Unary elementwise mish, f32 dtype, contiguous fast path.
baracuda_kernels_unary_mish_f32_strided_can_implement
Pre-launch implementability check for unary_mish_f32_strided.
baracuda_kernels_unary_mish_f32_strided_run
Unary elementwise mish, f32 dtype, strided path.
baracuda_kernels_unary_mish_f64_can_implement
Pre-launch implementability check for unary_mish_f64.
baracuda_kernels_unary_mish_f64_run
Unary elementwise mish, f64 dtype, contiguous fast path.
baracuda_kernels_unary_mish_f64_strided_can_implement
Pre-launch implementability check for unary_mish_f64_strided.
baracuda_kernels_unary_mish_f64_strided_run
Unary elementwise mish, f64 dtype, strided path.
baracuda_kernels_unary_neg_bf16_can_implement
Pre-launch implementability check for unary_neg_bf16.
baracuda_kernels_unary_neg_bf16_run
Unary elementwise neg, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_neg_bf16_strided_can_implement
Pre-launch implementability check for unary_neg_bf16_strided.
baracuda_kernels_unary_neg_bf16_strided_run
Unary elementwise neg, bf16 dtype, strided path.
baracuda_kernels_unary_neg_f16_can_implement
Pre-launch implementability check for unary_neg_f16.
baracuda_kernels_unary_neg_f16_run
Unary elementwise neg, f16 dtype, contiguous fast path.
baracuda_kernels_unary_neg_f16_strided_can_implement
Pre-launch implementability check for unary_neg_f16_strided.
baracuda_kernels_unary_neg_f16_strided_run
Unary elementwise neg, f16 dtype, strided path.
baracuda_kernels_unary_neg_f32_can_implement
Pre-launch implementability check for unary_neg_f32. Validates the problem size without launching a kernel. Returns the standard status code mapping.
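The status mapping referred to here is the one listed in the crate-level docs (0 through 5). A caller might wrap the raw `i32` in a typed value before acting on it. The sketch below is one possible shape for that: `KernelStatus`, `from_raw`, and `into_result` are hypothetical names and are not provided by this crate.

```rust
/// Hypothetical typed view of the documented i32 status codes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelStatus {
    Success,
    MisalignedOperand,
    InvalidProblem,
    NotSupported,
    WorkspaceTooSmall,
    InternalError,
    /// Any value outside the documented 0..=5 range.
    Unknown(i32),
}

impl KernelStatus {
    /// Map a raw status returned by a `*_can_implement` or `*_run` call.
    pub fn from_raw(code: i32) -> Self {
        match code {
            0 => Self::Success,
            1 => Self::MisalignedOperand,
            2 => Self::InvalidProblem,
            3 => Self::NotSupported,
            4 => Self::WorkspaceTooSmall,
            5 => Self::InternalError,
            other => Self::Unknown(other),
        }
    }

    /// Convert to a `Result` so callers can use `?`.
    pub fn into_result(self) -> Result<(), Self> {
        if self == Self::Success {
            Ok(())
        } else {
            Err(self)
        }
    }
}
```

With a wrapper like this, a caller can check `can_implement` first and only proceed to the unsafe `run` call on `Ok(())`.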
baracuda_kernels_unary_neg_f32_run
Unary elementwise neg, f32 dtype, contiguous fast path. This is the unary-pointwise trailblazer — its safety contract carries over to every plain unary launcher (neg, abs, sqr, sqrt, rsqrt, recip, exp, log, sin, cos, tan, sign, floor, ceil, round, erf, relu, silu, gelu, tanh, sigmoid, etc.) AND every parameterized-unary launcher (unary_param_* family: powi, threshold, elu, prelu, lerp, etc.) across all dtypes. See also binary_add_f32_run for the binary contig aliasing contract and ternary_clamp_f32_run for the ternary one.
baracuda_kernels_unary_neg_f32_strided_can_implement
Pre-launch implementability check for unary_neg_f32_strided.
baracuda_kernels_unary_neg_f32_strided_run
Unary elementwise neg, f32 dtype, strided path. This is the unary-strided trailblazer — its safety contract (including aliasing) carries over to every other unary strided launcher AND every parameterized-unary strided launcher (powi, threshold, elu, prelu, lerp) across all dtypes.
baracuda_kernels_unary_neg_f64_can_implement
Pre-launch implementability check for unary_neg_f64.
baracuda_kernels_unary_neg_f64_run
Unary elementwise neg, f64 dtype, contiguous fast path.
baracuda_kernels_unary_neg_f64_strided_can_implement
Pre-launch implementability check for unary_neg_f64_strided.
baracuda_kernels_unary_neg_f64_strided_run
Unary elementwise neg, f64 dtype, strided path.
baracuda_kernels_unary_powf_bf16_can_implement
Implementability check for unary_powf_bf16.
baracuda_kernels_unary_powf_bf16_run
unary_powf, bf16, contig.
baracuda_kernels_unary_powf_bf16_strided_can_implement
Implementability check for baracuda_kernels_unary_powf_bf16_strided. Host-side only.
baracuda_kernels_unary_powf_bf16_strided_run
unary_powf, bf16, strided sibling.
baracuda_kernels_unary_powf_f16_can_implement
Implementability check for unary_powf_f16.
baracuda_kernels_unary_powf_f16_run
unary_powf, f16, contig. Computed via an f32 detour.
baracuda_kernels_unary_powf_f16_strided_can_implement
Implementability check for baracuda_kernels_unary_powf_f16_strided. Host-side only.
baracuda_kernels_unary_powf_f16_strided_run
unary_powf, f16, strided sibling.
baracuda_kernels_unary_powf_f32_can_implement
Implementability check for unary_powf_f32.
baracuda_kernels_unary_powf_f32_run
Unary elementwise pow(x, exponent), f32, contig.
baracuda_kernels_unary_powf_f32_strided_can_implement
Implementability check for baracuda_kernels_unary_powf_f32_strided. Host-side only.
baracuda_kernels_unary_powf_f32_strided_run
unary_powf, f32, strided sibling.
baracuda_kernels_unary_powf_f64_can_implement
Implementability check for unary_powf_f64.
baracuda_kernels_unary_powf_f64_run
unary_powf, f64, contig. pow (libdevice) is full-double precision; the f32 exponent is widened once at kernel entry.
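The exponent widening described above can be illustrated on the host. This is a sketch only: `powf_f64_ref` is a hypothetical helper, and it uses Rust's `f64::powf` rather than libdevice `pow`.

```rust
/// Hypothetical CPU reference: f64 data raised to an f32 exponent.
/// Not part of baracuda-kernels-sys.
pub fn powf_f64_ref(x: &[f64], exponent: f32) -> Vec<f64> {
    // Widen the f32 exponent to f64 once, outside the per-element loop.
    let e = f64::from(exponent);
    x.iter().map(|&xi| xi.powf(e)).collect()
}
```

Note that widening preserves the f32 value exactly, so an exponent such as `0.1_f32` becomes the f64 nearest to that f32 value, not `0.1_f64`.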
baracuda_kernels_unary_powf_f64_strided_can_implement
Implementability check for baracuda_kernels_unary_powf_f64_strided. Host-side only.
baracuda_kernels_unary_powf_f64_strided_run
unary_powf, f64, strided sibling.
baracuda_kernels_unary_powi_backward_bf16_can_implement
Pre-launch implementability check for unary_powi_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_powi_backward_bf16_run
powi BW, bf16.
baracuda_kernels_unary_powi_backward_bf16_strided_can_implement
Pre-launch implementability check for unary_powi_backward_bf16_strided. Host-side validation; no kernel launch.
baracuda_kernels_unary_powi_backward_bf16_strided_run
powi BW, bf16, strided.
baracuda_kernels_unary_powi_backward_f16_can_implement
Pre-launch implementability check for unary_powi_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_powi_backward_f16_run
powi BW, f16.
baracuda_kernels_unary_powi_backward_f16_strided_can_implement
Pre-launch implementability check for unary_powi_backward_f16_strided. Host-side validation; no kernel launch.
baracuda_kernels_unary_powi_backward_f16_strided_run
powi BW, f16, strided.
baracuda_kernels_unary_powi_backward_f32_can_implement
Pre-launch implementability check for unary_powi_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_powi_backward_f32_run
powi backward: dx = n · x^(n-1) · dy, f32. Saved-x.
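A minimal host-side sketch of the saved-x formula above, useful as a test oracle. `powi_backward_ref` is a hypothetical name. The sketch uses Rust's `f32::powi` and makes no claim about how the kernel evaluates x^(n-1).

```rust
/// Hypothetical CPU reference for the documented formula
/// dx[i] = n * x[i]^(n-1) * dy[i]. Not part of baracuda-kernels-sys.
pub fn powi_backward_ref(x: &[f32], dy: &[f32], n: i32) -> Vec<f32> {
    assert_eq!(x.len(), dy.len(), "x and dy must have the same length");
    x.iter()
        .zip(dy)
        .map(|(&xi, &dyi)| n as f32 * xi.powi(n - 1) * dyi)
        .collect()
}
```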
baracuda_kernels_unary_powi_backward_f32_strided_can_implement
Pre-launch implementability check for unary_powi_backward_f32_strided. Host-side validation; no kernel launch.
baracuda_kernels_unary_powi_backward_f32_strided_run
powi BW, f32, strided.
baracuda_kernels_unary_powi_backward_f64_can_implement
Pre-launch implementability check for unary_powi_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_powi_backward_f64_run
powi BW, f64.
baracuda_kernels_unary_powi_backward_f64_strided_can_implement
Pre-launch implementability check for unary_powi_backward_f64_strided. Host-side validation; no kernel launch.
baracuda_kernels_unary_powi_backward_f64_strided_run
powi BW, f64, strided.
baracuda_kernels_unary_powi_bf16_can_implement
Implementability check for baracuda_kernels_unary_powi_bf16. Host-side only.
baracuda_kernels_unary_powi_bf16_run
powi FW, bf16.
baracuda_kernels_unary_powi_bf16_strided_can_implement
Implementability check for baracuda_kernels_unary_powi_bf16_strided. Host-side only.
baracuda_kernels_unary_powi_bf16_strided_run
powi FW, bf16, strided.
baracuda_kernels_unary_powi_f16_can_implement
Implementability check for baracuda_kernels_unary_powi_f16. Host-side only.
baracuda_kernels_unary_powi_f16_run
powi FW, f16.
baracuda_kernels_unary_powi_f16_strided_can_implement
Implementability check for baracuda_kernels_unary_powi_f16_strided. Host-side only.
baracuda_kernels_unary_powi_f16_strided_run
powi FW, f16, strided.
baracuda_kernels_unary_powi_f32_can_implement
Implementability check for baracuda_kernels_unary_powi_f32. Host-side only.
baracuda_kernels_unary_powi_f32_run
Unary elementwise powi(x; n) = x^n (integer exponent), f32, contig.
baracuda_kernels_unary_powi_f32_strided_can_implement
Implementability check for baracuda_kernels_unary_powi_f32_strided. Host-side only.
baracuda_kernels_unary_powi_f32_strided_run
powi FW, f32, strided.
baracuda_kernels_unary_powi_f64_can_implement
Implementability check for baracuda_kernels_unary_powi_f64. Host-side only.
baracuda_kernels_unary_powi_f64_run
powi FW, f64.
baracuda_kernels_unary_powi_f64_strided_can_implement
Implementability check for baracuda_kernels_unary_powi_f64_strided. Host-side only.
baracuda_kernels_unary_powi_f64_strided_run
powi FW, f64, strided.
baracuda_kernels_unary_reciprocal_backward_bf16_can_implement
Pre-launch implementability check for unary_reciprocal_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_reciprocal_backward_bf16_run
Reciprocal backward, bf16.
baracuda_kernels_unary_reciprocal_backward_f16_can_implement
Pre-launch implementability check for unary_reciprocal_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_reciprocal_backward_f16_run
Reciprocal backward, f16.
baracuda_kernels_unary_reciprocal_backward_f32_can_implement
Pre-launch implementability check for unary_reciprocal_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_reciprocal_backward_f32_run
Reciprocal backward, f32. dx = -dy / x². Domain x != 0.
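The formula above, written as a small host-side reference. `reciprocal_backward_ref` is a hypothetical name. The sketch assumes x != 0 as stated and does not model the kernel's behavior at x = 0.

```rust
/// Hypothetical CPU reference for the documented formula
/// dx[i] = -dy[i] / x[i]^2. Not part of baracuda-kernels-sys.
pub fn reciprocal_backward_ref(x: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(x.len(), dy.len(), "x and dy must have the same length");
    x.iter()
        .zip(dy)
        .map(|(&xi, &dyi)| -dyi / (xi * xi))
        .collect()
}
```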
baracuda_kernels_unary_reciprocal_backward_f64_can_implement
Pre-launch implementability check for unary_reciprocal_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_reciprocal_backward_f64_run
Reciprocal backward, f64.
baracuda_kernels_unary_reciprocal_bf16_can_implement
Pre-launch implementability check for unary_reciprocal_bf16.
baracuda_kernels_unary_reciprocal_bf16_run
Unary elementwise reciprocal, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_reciprocal_bf16_strided_can_implement
Pre-launch implementability check for unary_reciprocal_bf16_strided.
baracuda_kernels_unary_reciprocal_bf16_strided_run
Unary elementwise reciprocal, bf16 dtype, strided path.
baracuda_kernels_unary_reciprocal_f16_can_implement
Pre-launch implementability check for unary_reciprocal_f16.
baracuda_kernels_unary_reciprocal_f16_run
Unary elementwise reciprocal, f16 dtype, contiguous fast path.
baracuda_kernels_unary_reciprocal_f16_strided_can_implement
Pre-launch implementability check for unary_reciprocal_f16_strided.
baracuda_kernels_unary_reciprocal_f16_strided_run
Unary elementwise reciprocal, f16 dtype, strided path.
baracuda_kernels_unary_reciprocal_f32_can_implement
Pre-launch implementability check for unary_reciprocal_f32.
baracuda_kernels_unary_reciprocal_f32_run
Unary elementwise reciprocal, f32 dtype, contiguous fast path.
baracuda_kernels_unary_reciprocal_f32_strided_can_implement
Pre-launch implementability check for unary_reciprocal_f32_strided.
baracuda_kernels_unary_reciprocal_f32_strided_run
Unary elementwise reciprocal, f32 dtype, strided path.
baracuda_kernels_unary_reciprocal_f64_can_implement
Pre-launch implementability check for unary_reciprocal_f64.
baracuda_kernels_unary_reciprocal_f64_run
Unary elementwise reciprocal, f64 dtype, contiguous fast path.
baracuda_kernels_unary_reciprocal_f64_strided_can_implement
Pre-launch implementability check for unary_reciprocal_f64_strided.
baracuda_kernels_unary_reciprocal_f64_strided_run
Unary elementwise reciprocal, f64 dtype, strided path.
baracuda_kernels_unary_relu6_backward_bf16_can_implement
Pre-launch implementability check for unary_relu6_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_relu6_backward_bf16_run
ReLU6 backward, bf16.
baracuda_kernels_unary_relu6_backward_f16_can_implement
Pre-launch implementability check for unary_relu6_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_relu6_backward_f16_run
ReLU6 backward, f16.
baracuda_kernels_unary_relu6_backward_f32_can_implement
Pre-launch implementability check for unary_relu6_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_relu6_backward_f32_run
ReLU6 backward, f32. dx = (0 < x < 6) ? dy : 0. Saved-x.
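The gating condition above uses strict inequalities, so the gradient is zero at both x = 0 and x = 6. A minimal host-side sketch, with the hypothetical name `relu6_backward_ref`:

```rust
/// Hypothetical CPU reference for the documented saved-x formula
/// dx[i] = if 0 < x[i] < 6 { dy[i] } else { 0 }.
/// Not part of baracuda-kernels-sys.
pub fn relu6_backward_ref(x: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(x.len(), dy.len(), "x and dy must have the same length");
    x.iter()
        .zip(dy)
        .map(|(&xi, &dyi)| if xi > 0.0 && xi < 6.0 { dyi } else { 0.0 })
        .collect()
}
```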
baracuda_kernels_unary_relu6_backward_f64_can_implement
Pre-launch implementability check for unary_relu6_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_relu6_backward_f64_run
ReLU6 backward, f64.
baracuda_kernels_unary_relu6_bf16_can_implement
Pre-launch implementability check for unary_relu6_bf16.
baracuda_kernels_unary_relu6_bf16_run
Unary elementwise relu6, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_relu6_bf16_strided_can_implement
Pre-launch implementability check for unary_relu6_bf16_strided.
baracuda_kernels_unary_relu6_bf16_strided_run
Unary elementwise relu6, bf16 dtype, strided path.
baracuda_kernels_unary_relu6_f16_can_implement
Pre-launch implementability check for unary_relu6_f16.
baracuda_kernels_unary_relu6_f16_run
Unary elementwise relu6, f16 dtype, contiguous fast path.
baracuda_kernels_unary_relu6_f16_strided_can_implement
Pre-launch implementability check for unary_relu6_f16_strided.
baracuda_kernels_unary_relu6_f16_strided_run
Unary elementwise relu6, f16 dtype, strided path.
baracuda_kernels_unary_relu6_f32_can_implement
Pre-launch implementability check for unary_relu6_f32.
baracuda_kernels_unary_relu6_f32_run
Unary elementwise relu6, f32 dtype, contiguous fast path.
baracuda_kernels_unary_relu6_f32_strided_can_implement
Pre-launch implementability check for unary_relu6_f32_strided.
baracuda_kernels_unary_relu6_f32_strided_run
Unary elementwise relu6, f32 dtype, strided path.
baracuda_kernels_unary_relu6_f64_can_implement
Pre-launch implementability check for unary_relu6_f64.
baracuda_kernels_unary_relu6_f64_run
Unary elementwise relu6, f64 dtype, contiguous fast path.
baracuda_kernels_unary_relu6_f64_strided_can_implement
Pre-launch implementability check for unary_relu6_f64_strided.
baracuda_kernels_unary_relu6_f64_strided_run
Unary elementwise relu6, f64 dtype, strided path.
baracuda_kernels_unary_relu_backward_bf16_can_implement
Pre-launch implementability check for unary_relu_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_relu_backward_bf16_run
ReLU backward, bf16.
baracuda_kernels_unary_relu_backward_f16_can_implement
Pre-launch implementability check for unary_relu_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_relu_backward_f16_run
ReLU backward, f16.
baracuda_kernels_unary_relu_backward_f32_can_implement
Pre-launch implementability check for unary_relu_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_relu_backward_f32_run
ReLU backward, f32. dx = (x > 0) ? dy : 0. Saved-x. This is the activation-BW trailblazer — its aliasing contract carries over to every other unary_<op>_backward_<dt>_run (gelu, silu, tanh, sigmoid, elu, leaky_relu, mish, hardswish, hardsigmoid, gelu_tanh, erf, erfc, etc.) across all dtypes, both saved-x and saved-y variants.
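To illustrate the saved-x and saved-y terminology: a saved-x backward reads the forward input, while a saved-y backward reads the forward output. For ReLU the two give the same gradient, since y > 0 exactly when x > 0. The sketch below shows this on the host. Both function names are hypothetical, and the sketch says nothing about the aliasing contract itself.

```rust
/// Hypothetical saved-x reference: dx[i] = if x[i] > 0 { dy[i] } else { 0 }.
pub fn relu_backward_saved_x(x: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(x.len(), dy.len(), "x and dy must have the same length");
    x.iter()
        .zip(dy)
        .map(|(&xi, &dyi)| if xi > 0.0 { dyi } else { 0.0 })
        .collect()
}

/// Hypothetical saved-y variant: gates on the forward output y = relu(x).
pub fn relu_backward_saved_y(y: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(y.len(), dy.len(), "y and dy must have the same length");
    y.iter()
        .zip(dy)
        .map(|(&yi, &dyi)| if yi > 0.0 { dyi } else { 0.0 })
        .collect()
}
```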
baracuda_kernels_unary_relu_backward_f64_can_implement
Pre-launch implementability check for unary_relu_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_relu_backward_f64_run
ReLU backward, f64.
baracuda_kernels_unary_relu_bf16_can_implement
Pre-launch implementability check for unary_relu_bf16.
baracuda_kernels_unary_relu_bf16_run
Unary elementwise relu, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_relu_bf16_strided_can_implement
Pre-launch implementability check for unary_relu_bf16_strided.
baracuda_kernels_unary_relu_bf16_strided_run
Unary elementwise relu, bf16 dtype, strided path.
baracuda_kernels_unary_relu_f16_can_implement
Pre-launch implementability check for unary_relu_f16.
baracuda_kernels_unary_relu_f16_run
Unary elementwise relu, f16 dtype, contiguous fast path.
baracuda_kernels_unary_relu_f16_strided_can_implement
Pre-launch implementability check for unary_relu_f16_strided.
baracuda_kernels_unary_relu_f16_strided_run
Unary elementwise relu, f16 dtype, strided path.
baracuda_kernels_unary_relu_f32_can_implement
Pre-launch implementability check for unary_relu_f32.
baracuda_kernels_unary_relu_f32_run
Unary elementwise relu, f32 dtype, contiguous fast path.
baracuda_kernels_unary_relu_f32_strided_can_implement
Pre-launch implementability check for unary_relu_f32_strided.
baracuda_kernels_unary_relu_f32_strided_run
Unary elementwise relu, f32 dtype, strided path.
baracuda_kernels_unary_relu_f64_can_implement
Pre-launch implementability check for unary_relu_f64.
baracuda_kernels_unary_relu_f64_run
Unary elementwise relu, f64 dtype, contiguous fast path.
baracuda_kernels_unary_relu_f64_strided_can_implement
Pre-launch implementability check for unary_relu_f64_strided.
baracuda_kernels_unary_relu_f64_strided_run
Unary elementwise relu, f64 dtype, strided path.
baracuda_kernels_unary_round_bf16_can_implement
Pre-launch implementability check for unary_round_bf16.
baracuda_kernels_unary_round_bf16_run
Unary elementwise round, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_round_bf16_strided_can_implement
Pre-launch implementability check for unary_round_bf16_strided.
baracuda_kernels_unary_round_bf16_strided_run
Unary elementwise round, bf16 dtype, strided path.
baracuda_kernels_unary_round_f16_can_implement
Pre-launch implementability check for unary_round_f16.
baracuda_kernels_unary_round_f16_run
Unary elementwise round, f16 dtype, contiguous fast path.
baracuda_kernels_unary_round_f16_strided_can_implement
Pre-launch implementability check for unary_round_f16_strided.
baracuda_kernels_unary_round_f16_strided_run
Unary elementwise round, f16 dtype, strided path.
baracuda_kernels_unary_round_f32_can_implement
Pre-launch implementability check for unary_round_f32.
baracuda_kernels_unary_round_f32_run
Unary elementwise round, f32 dtype, contiguous fast path.
baracuda_kernels_unary_round_f32_strided_can_implement
Pre-launch implementability check for unary_round_f32_strided.
baracuda_kernels_unary_round_f32_strided_run
Unary elementwise round, f32 dtype, strided path.
baracuda_kernels_unary_round_f64_can_implement
Pre-launch implementability check for unary_round_f64.
baracuda_kernels_unary_round_f64_run
Unary elementwise round, f64 dtype, contiguous fast path.
baracuda_kernels_unary_round_f64_strided_can_implement
Pre-launch implementability check for unary_round_f64_strided.
baracuda_kernels_unary_round_f64_strided_run
Unary elementwise round, f64 dtype, strided path.
baracuda_kernels_unary_rsqrt_backward_bf16_can_implement
Pre-launch implementability check for unary_rsqrt_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_rsqrt_backward_bf16_run
Rsqrt backward, bf16.
baracuda_kernels_unary_rsqrt_backward_f16_can_implement
Pre-launch implementability check for unary_rsqrt_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_rsqrt_backward_f16_run
Rsqrt backward, f16.
baracuda_kernels_unary_rsqrt_backward_f32_can_implement
Pre-launch implementability check for unary_rsqrt_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_rsqrt_backward_f32_run
Rsqrt backward, f32. dx = -0.5 * dy * y³. Saved tensor is the forward output y.
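The rsqrt backward formula can be checked on the host with a small CPU reference. The sketch below is an illustration only: `rsqrt_backward_ref` is a hypothetical helper, not a function of this crate, and it makes no claim about the device entry point's signature.

```rust
/// CPU reference for rsqrt backward: dx[i] = -0.5 * dy[i] * y[i]^3,
/// where y is the saved forward output (y = 1 / sqrt(x)).
/// Hypothetical helper for checking results; not part of this crate.
fn rsqrt_backward_ref(dy: &[f32], y: &[f32]) -> Vec<f32> {
    assert_eq!(dy.len(), y.len(), "dy and y must have the same length");
    dy.iter()
        .zip(y)
        .map(|(&g, &out)| -0.5 * g * out * out * out)
        .collect()
}
```

For example, with x = 4 the forward output is y = 0.5, so an upstream gradient of 1 gives dx = -0.5 * 0.125 = -0.0625.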
baracuda_kernels_unary_rsqrt_backward_f64_can_implement
Pre-launch implementability check for unary_rsqrt_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_rsqrt_backward_f64_run
Rsqrt backward, f64.
baracuda_kernels_unary_rsqrt_bf16_can_implement
Pre-launch implementability check for unary_rsqrt_bf16.
baracuda_kernels_unary_rsqrt_bf16_run
Unary elementwise rsqrt, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_rsqrt_bf16_strided_can_implement
Pre-launch implementability check for unary_rsqrt_bf16_strided.
baracuda_kernels_unary_rsqrt_bf16_strided_run
Unary elementwise rsqrt, bf16 dtype, strided path.
baracuda_kernels_unary_rsqrt_f16_can_implement
Pre-launch implementability check for unary_rsqrt_f16.
baracuda_kernels_unary_rsqrt_f16_run
Unary elementwise rsqrt, f16 dtype, contiguous fast path.
baracuda_kernels_unary_rsqrt_f16_strided_can_implement
Pre-launch implementability check for unary_rsqrt_f16_strided.
baracuda_kernels_unary_rsqrt_f16_strided_run
Unary elementwise rsqrt, f16 dtype, strided path.
baracuda_kernels_unary_rsqrt_f32_can_implement
Pre-launch implementability check for unary_rsqrt_f32.
baracuda_kernels_unary_rsqrt_f32_run
Unary elementwise rsqrt, f32 dtype, contiguous fast path.
baracuda_kernels_unary_rsqrt_f32_strided_can_implement
Pre-launch implementability check for unary_rsqrt_f32_strided.
baracuda_kernels_unary_rsqrt_f32_strided_run
Unary elementwise rsqrt, f32 dtype, strided path.
baracuda_kernels_unary_rsqrt_f64_can_implement
Pre-launch implementability check for unary_rsqrt_f64.
baracuda_kernels_unary_rsqrt_f64_run
Unary elementwise rsqrt, f64 dtype, contiguous fast path.
baracuda_kernels_unary_rsqrt_f64_strided_can_implement
Pre-launch implementability check for unary_rsqrt_f64_strided.
baracuda_kernels_unary_rsqrt_f64_strided_run
Unary elementwise rsqrt, f64 dtype, strided path.
baracuda_kernels_unary_selu_backward_bf16_can_implement
Pre-launch implementability check for unary_selu_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_selu_backward_bf16_run
SELU backward, bf16.
baracuda_kernels_unary_selu_backward_f16_can_implement
Pre-launch implementability check for unary_selu_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_selu_backward_f16_run
SELU backward, f16.
baracuda_kernels_unary_selu_backward_f32_can_implement
Pre-launch implementability check for unary_selu_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_selu_backward_f32_run
SELU backward, f32. dx = dy * scale for x > 0; dx = dy * scale * alpha * exp(x) for x <= 0. Saved tensor is the forward input x.
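A host-side sketch of the two SELU backward branches follows. The listing does not state the values of `scale` and `alpha`; the constants below are the commonly used SELU values and are an assumption here, as is the helper name `selu_backward_ref`.

```rust
// Assumed constants: the widely used SELU values. The kernels' actual
// constants are not stated in this listing.
const SELU_ALPHA: f32 = 1.673_263_2;
const SELU_SCALE: f32 = 1.050_701;

/// CPU reference for SELU backward, using the saved forward input x.
/// Hypothetical helper for checking results; not part of this crate.
fn selu_backward_ref(dy: &[f32], x: &[f32]) -> Vec<f32> {
    assert_eq!(dy.len(), x.len(), "dy and x must have the same length");
    dy.iter()
        .zip(x)
        .map(|(&g, &xi)| {
            if xi > 0.0 {
                g * SELU_SCALE
            } else {
                g * SELU_SCALE * SELU_ALPHA * xi.exp()
            }
        })
        .collect()
}
```

Note that x = 0 falls into the second branch, where exp(0) = 1 and the gradient is dy * scale * alpha.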
baracuda_kernels_unary_selu_backward_f64_can_implement
Pre-launch implementability check for unary_selu_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_selu_backward_f64_run
SELU backward, f64.
baracuda_kernels_unary_selu_bf16_can_implement
Pre-launch implementability check for unary_selu_bf16.
baracuda_kernels_unary_selu_bf16_run
Unary elementwise selu, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_selu_bf16_strided_can_implement
Pre-launch implementability check for unary_selu_bf16_strided.
baracuda_kernels_unary_selu_bf16_strided_run
Unary elementwise selu, bf16 dtype, strided path.
baracuda_kernels_unary_selu_f16_can_implement
Pre-launch implementability check for unary_selu_f16.
baracuda_kernels_unary_selu_f16_run
Unary elementwise selu, f16 dtype, contiguous fast path.
baracuda_kernels_unary_selu_f16_strided_can_implement
Pre-launch implementability check for unary_selu_f16_strided.
baracuda_kernels_unary_selu_f16_strided_run
Unary elementwise selu, f16 dtype, strided path.
baracuda_kernels_unary_selu_f32_can_implement
Pre-launch implementability check for unary_selu_f32.
baracuda_kernels_unary_selu_f32_run
Unary elementwise selu, f32 dtype, contiguous fast path.
baracuda_kernels_unary_selu_f32_strided_can_implement
Pre-launch implementability check for unary_selu_f32_strided.
baracuda_kernels_unary_selu_f32_strided_run
Unary elementwise selu, f32 dtype, strided path.
baracuda_kernels_unary_selu_f64_can_implement
Pre-launch implementability check for unary_selu_f64.
baracuda_kernels_unary_selu_f64_run
Unary elementwise selu, f64 dtype, contiguous fast path.
baracuda_kernels_unary_selu_f64_strided_can_implement
Pre-launch implementability check for unary_selu_f64_strided.
baracuda_kernels_unary_selu_f64_strided_run
Unary elementwise selu, f64 dtype, strided path.
baracuda_kernels_unary_sigmoid_backward_bf16_can_implement
Pre-launch implementability check for unary_sigmoid_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_sigmoid_backward_bf16_run
Sigmoid backward, bf16.
baracuda_kernels_unary_sigmoid_backward_f16_can_implement
Pre-launch implementability check for unary_sigmoid_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_sigmoid_backward_f16_run
Sigmoid backward, f16.
baracuda_kernels_unary_sigmoid_backward_f32_can_implement
Pre-launch implementability check for unary_sigmoid_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_sigmoid_backward_f32_run
Sigmoid backward, f32. dx = dy * y * (1 - y). Saved tensor is the forward output y.
baracuda_kernels_unary_sigmoid_backward_f64_can_implement
Pre-launch implementability check for unary_sigmoid_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_sigmoid_backward_f64_run
Sigmoid backward, f64.
baracuda_kernels_unary_sigmoid_bf16_can_implement
Pre-launch implementability check for unary_sigmoid_bf16.
baracuda_kernels_unary_sigmoid_bf16_run
Unary elementwise sigmoid, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_sigmoid_bf16_strided_can_implement
Pre-launch implementability check for unary_sigmoid_bf16_strided.
baracuda_kernels_unary_sigmoid_bf16_strided_run
Unary elementwise sigmoid, bf16 dtype, strided path.
baracuda_kernels_unary_sigmoid_f16_can_implement
Pre-launch implementability check for unary_sigmoid_f16.
baracuda_kernels_unary_sigmoid_f16_run
Unary elementwise sigmoid, f16 dtype, contiguous fast path.
baracuda_kernels_unary_sigmoid_f16_strided_can_implement
Pre-launch implementability check for unary_sigmoid_f16_strided.
baracuda_kernels_unary_sigmoid_f16_strided_run
Unary elementwise sigmoid, f16 dtype, strided path.
baracuda_kernels_unary_sigmoid_f32_can_implement
Pre-launch implementability check for unary_sigmoid_f32.
baracuda_kernels_unary_sigmoid_f32_run
Unary elementwise sigmoid, f32 dtype, contiguous fast path.
baracuda_kernels_unary_sigmoid_f32_strided_can_implement
Pre-launch implementability check for unary_sigmoid_f32_strided.
baracuda_kernels_unary_sigmoid_f32_strided_run
Unary elementwise sigmoid, f32 dtype, strided path.
baracuda_kernels_unary_sigmoid_f64_can_implement
Pre-launch implementability check for unary_sigmoid_f64.
baracuda_kernels_unary_sigmoid_f64_run
Unary elementwise sigmoid, f64 dtype, contiguous fast path.
baracuda_kernels_unary_sigmoid_f64_strided_can_implement
Pre-launch implementability check for unary_sigmoid_f64_strided.
baracuda_kernels_unary_sigmoid_f64_strided_run
Unary elementwise sigmoid, f64 dtype, strided path.
baracuda_kernels_unary_sign_bf16_can_implement
Pre-launch implementability check for unary_sign_bf16.
baracuda_kernels_unary_sign_bf16_run
Unary elementwise sign, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_sign_bf16_strided_can_implement
Pre-launch implementability check for unary_sign_bf16_strided.
baracuda_kernels_unary_sign_bf16_strided_run
Unary elementwise sign, bf16 dtype, strided path.
baracuda_kernels_unary_sign_f16_can_implement
Pre-launch implementability check for unary_sign_f16.
baracuda_kernels_unary_sign_f16_run
Unary elementwise sign, f16 dtype, contiguous fast path.
baracuda_kernels_unary_sign_f16_strided_can_implement
Pre-launch implementability check for unary_sign_f16_strided.
baracuda_kernels_unary_sign_f16_strided_run
Unary elementwise sign, f16 dtype, strided path.
baracuda_kernels_unary_sign_f32_can_implement
Pre-launch implementability check for unary_sign_f32.
baracuda_kernels_unary_sign_f32_run
Unary elementwise sign, f32 dtype, contiguous fast path.
baracuda_kernels_unary_sign_f32_strided_can_implement
Pre-launch implementability check for unary_sign_f32_strided.
baracuda_kernels_unary_sign_f32_strided_run
Unary elementwise sign, f32 dtype, strided path.
baracuda_kernels_unary_sign_f64_can_implement
Pre-launch implementability check for unary_sign_f64.
baracuda_kernels_unary_sign_f64_run
Unary elementwise sign, f64 dtype, contiguous fast path.
baracuda_kernels_unary_sign_f64_strided_can_implement
Pre-launch implementability check for unary_sign_f64_strided.
baracuda_kernels_unary_sign_f64_strided_run
Unary elementwise sign, f64 dtype, strided path.
baracuda_kernels_unary_silu_backward_bf16_can_implement
Pre-launch implementability check for unary_silu_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_silu_backward_bf16_run
SiLU backward, bf16.
baracuda_kernels_unary_silu_backward_f16_can_implement
Pre-launch implementability check for unary_silu_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_silu_backward_f16_run
SiLU backward, f16.
baracuda_kernels_unary_silu_backward_f32_can_implement
Pre-launch implementability check for unary_silu_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_silu_backward_f32_run
SiLU (Swish) backward, f32. dx = dy * s * (1 + x*(1-s)) with s = sigmoid(x). Saved tensor is the forward input x.
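The SiLU backward expression follows from differentiating x * sigmoid(x): the derivative is s + x * s * (1 - s), which factors to s * (1 + x * (1 - s)). A minimal CPU reference, assuming a hypothetical helper name `silu_backward_ref` that is not part of this crate:

```rust
/// CPU reference for SiLU backward:
/// dx[i] = dy[i] * s * (1 + x[i] * (1 - s)), with s = sigmoid(x[i]).
/// Hypothetical helper for checking results; not part of this crate.
fn silu_backward_ref(dy: &[f32], x: &[f32]) -> Vec<f32> {
    assert_eq!(dy.len(), x.len(), "dy and x must have the same length");
    dy.iter()
        .zip(x)
        .map(|(&g, &xi)| {
            let s = 1.0 / (1.0 + (-xi).exp());
            g * s * (1.0 + xi * (1.0 - s))
        })
        .collect()
}
```

At x = 0 the sigmoid is 0.5 and the x-dependent term vanishes, so dx is half of dy.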
baracuda_kernels_unary_silu_backward_f64_can_implement
Pre-launch implementability check for unary_silu_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_silu_backward_f64_run
SiLU backward, f64.
baracuda_kernels_unary_silu_bf16_can_implement
Pre-launch implementability check for unary_silu_bf16.
baracuda_kernels_unary_silu_bf16_run
Unary elementwise silu, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_silu_bf16_strided_can_implement
Pre-launch implementability check for unary_silu_bf16_strided.
baracuda_kernels_unary_silu_bf16_strided_run
Unary elementwise silu, bf16 dtype, strided path.
baracuda_kernels_unary_silu_f16_can_implement
Pre-launch implementability check for unary_silu_f16.
baracuda_kernels_unary_silu_f16_run
Unary elementwise silu, f16 dtype, contiguous fast path.
baracuda_kernels_unary_silu_f16_strided_can_implement
Pre-launch implementability check for unary_silu_f16_strided.
baracuda_kernels_unary_silu_f16_strided_run
Unary elementwise silu, f16 dtype, strided path.
baracuda_kernels_unary_silu_f32_can_implement
Pre-launch implementability check for unary_silu_f32.
baracuda_kernels_unary_silu_f32_run
Unary elementwise silu, f32 dtype, contiguous fast path.
baracuda_kernels_unary_silu_f32_strided_can_implement
Pre-launch implementability check for unary_silu_f32_strided.
baracuda_kernels_unary_silu_f32_strided_run
Unary elementwise silu, f32 dtype, strided path.
baracuda_kernels_unary_silu_f64_can_implement
Pre-launch implementability check for unary_silu_f64.
baracuda_kernels_unary_silu_f64_run
Unary elementwise silu, f64 dtype, contiguous fast path.
baracuda_kernels_unary_silu_f64_strided_can_implement
Pre-launch implementability check for unary_silu_f64_strided.
baracuda_kernels_unary_silu_f64_strided_run
Unary elementwise silu, f64 dtype, strided path.
baracuda_kernels_unary_sin_backward_bf16_can_implement
Pre-launch implementability check for unary_sin_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_sin_backward_bf16_run
Sin backward, bf16.
baracuda_kernels_unary_sin_backward_f16_can_implement
Pre-launch implementability check for unary_sin_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_sin_backward_f16_run
Sin backward, f16.
baracuda_kernels_unary_sin_backward_f32_can_implement
Pre-launch implementability check for unary_sin_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_sin_backward_f32_run
Sin backward, f32. dx = dy * cos(x). Caller must pass the forward input x as saved.
baracuda_kernels_unary_sin_backward_f64_can_implement
Pre-launch implementability check for unary_sin_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_sin_backward_f64_run
Sin backward, f64.
baracuda_kernels_unary_sin_bf16_can_implement
Pre-launch implementability check for unary_sin_bf16.
baracuda_kernels_unary_sin_bf16_run
Unary elementwise sin, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_sin_bf16_strided_can_implement
Pre-launch implementability check for unary_sin_bf16_strided.
baracuda_kernels_unary_sin_bf16_strided_run
Unary elementwise sin, bf16 dtype, strided path.
baracuda_kernels_unary_sin_f16_can_implement
Pre-launch implementability check for unary_sin_f16.
baracuda_kernels_unary_sin_f16_run
Unary elementwise sin, f16 dtype, contiguous fast path.
baracuda_kernels_unary_sin_f16_strided_can_implement
Pre-launch implementability check for unary_sin_f16_strided.
baracuda_kernels_unary_sin_f16_strided_run
Unary elementwise sin, f16 dtype, strided path.
baracuda_kernels_unary_sin_f32_can_implement
Pre-launch implementability check for unary_sin_f32.
baracuda_kernels_unary_sin_f32_run
Unary elementwise sin, f32 dtype, contiguous fast path.
baracuda_kernels_unary_sin_f32_strided_can_implement
Pre-launch implementability check for unary_sin_f32_strided.
baracuda_kernels_unary_sin_f32_strided_run
Unary elementwise sin, f32 dtype, strided path.
baracuda_kernels_unary_sin_f64_can_implement
Pre-launch implementability check for unary_sin_f64.
baracuda_kernels_unary_sin_f64_run
Unary elementwise sin, f64 dtype, contiguous fast path.
baracuda_kernels_unary_sin_f64_strided_can_implement
Pre-launch implementability check for unary_sin_f64_strided.
baracuda_kernels_unary_sin_f64_strided_run
Unary elementwise sin, f64 dtype, strided path.
baracuda_kernels_unary_sinh_backward_bf16_can_implement
Pre-launch implementability check for unary_sinh_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_sinh_backward_bf16_run
Sinh backward, bf16.
baracuda_kernels_unary_sinh_backward_f16_can_implement
Pre-launch implementability check for unary_sinh_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_sinh_backward_f16_run
Sinh backward, f16.
baracuda_kernels_unary_sinh_backward_f32_can_implement
Pre-launch implementability check for unary_sinh_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_sinh_backward_f32_run
Sinh backward, f32. dx = dy * cosh(x). Saved tensor is the forward input x.
baracuda_kernels_unary_sinh_backward_f64_can_implement
Pre-launch implementability check for unary_sinh_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_sinh_backward_f64_run
Sinh backward, f64.
baracuda_kernels_unary_sinh_bf16_can_implement
Pre-launch implementability check for unary_sinh_bf16.
baracuda_kernels_unary_sinh_bf16_run
Unary elementwise sinh, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_sinh_bf16_strided_can_implement
Pre-launch implementability check for unary_sinh_bf16_strided.
baracuda_kernels_unary_sinh_bf16_strided_run
Unary elementwise sinh, bf16 dtype, strided path.
baracuda_kernels_unary_sinh_f16_can_implement
Pre-launch implementability check for unary_sinh_f16.
baracuda_kernels_unary_sinh_f16_run
Unary elementwise sinh, f16 dtype, contiguous fast path.
baracuda_kernels_unary_sinh_f16_strided_can_implement
Pre-launch implementability check for unary_sinh_f16_strided.
baracuda_kernels_unary_sinh_f16_strided_run
Unary elementwise sinh, f16 dtype, strided path.
baracuda_kernels_unary_sinh_f32_can_implement
Pre-launch implementability check for unary_sinh_f32.
baracuda_kernels_unary_sinh_f32_run
Unary elementwise sinh, f32 dtype, contiguous fast path.
baracuda_kernels_unary_sinh_f32_strided_can_implement
Pre-launch implementability check for unary_sinh_f32_strided.
baracuda_kernels_unary_sinh_f32_strided_run
Unary elementwise sinh, f32 dtype, strided path.
baracuda_kernels_unary_sinh_f64_can_implement
Pre-launch implementability check for unary_sinh_f64.
baracuda_kernels_unary_sinh_f64_run
Unary elementwise sinh, f64 dtype, contiguous fast path.
baracuda_kernels_unary_sinh_f64_strided_can_implement
Pre-launch implementability check for unary_sinh_f64_strided.
baracuda_kernels_unary_sinh_f64_strided_run
Unary elementwise sinh, f64 dtype, strided path.
baracuda_kernels_unary_softplus_backward_bf16_can_implement
Pre-launch implementability check for unary_softplus_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_softplus_backward_bf16_run
Softplus backward, bf16.
baracuda_kernels_unary_softplus_backward_f16_can_implement
Pre-launch implementability check for unary_softplus_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_softplus_backward_f16_run
Softplus backward, f16.
baracuda_kernels_unary_softplus_backward_f32_can_implement
Pre-launch implementability check for unary_softplus_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_softplus_backward_f32_run
Softplus backward, f32. dx = dy / (1 + exp(-x)). Saved tensor is the forward input x.
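Softplus backward is the upstream gradient multiplied by sigmoid(x). The sketch below is a host-side reference under that reading; `softplus_backward_ref` is a hypothetical name, and whether the device kernel applies any large-|x| threshold is not stated in this listing.

```rust
/// CPU reference for softplus backward: dx[i] = dy[i] / (1 + exp(-x[i])).
/// Hypothetical helper for checking results; not part of this crate.
fn softplus_backward_ref(dy: &[f32], x: &[f32]) -> Vec<f32> {
    assert_eq!(dy.len(), x.len(), "dy and x must have the same length");
    dy.iter()
        .zip(x)
        // For very negative x, exp(-x) overflows to +inf and the quotient is 0.
        .map(|(&g, &xi)| g / (1.0 + (-xi).exp()))
        .collect()
}
```

In this form the expression degrades gracefully for very negative inputs: the denominator becomes infinite and the gradient is exactly zero rather than NaN.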
baracuda_kernels_unary_softplus_backward_f64_can_implement
Pre-launch implementability check for unary_softplus_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_softplus_backward_f64_run
Softplus backward, f64.
baracuda_kernels_unary_softplus_bf16_can_implement
Pre-launch implementability check for unary_softplus_bf16.
baracuda_kernels_unary_softplus_bf16_run
Unary elementwise softplus, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_softplus_bf16_strided_can_implement
Pre-launch implementability check for unary_softplus_bf16_strided.
baracuda_kernels_unary_softplus_bf16_strided_run
Unary elementwise softplus, bf16 dtype, strided path.
baracuda_kernels_unary_softplus_f16_can_implement
Pre-launch implementability check for unary_softplus_f16.
baracuda_kernels_unary_softplus_f16_run
Unary elementwise softplus, f16 dtype, contiguous fast path.
baracuda_kernels_unary_softplus_f16_strided_can_implement
Pre-launch implementability check for unary_softplus_f16_strided.
baracuda_kernels_unary_softplus_f16_strided_run
Unary elementwise softplus, f16 dtype, strided path.
baracuda_kernels_unary_softplus_f32_can_implement
Pre-launch implementability check for unary_softplus_f32.
baracuda_kernels_unary_softplus_f32_run
Unary elementwise softplus, f32 dtype, contiguous fast path.
baracuda_kernels_unary_softplus_f32_strided_can_implement
Pre-launch implementability check for unary_softplus_f32_strided.
baracuda_kernels_unary_softplus_f32_strided_run
Unary elementwise softplus, f32 dtype, strided path.
baracuda_kernels_unary_softplus_f64_can_implement
Pre-launch implementability check for unary_softplus_f64.
baracuda_kernels_unary_softplus_f64_run
Unary elementwise softplus, f64 dtype, contiguous fast path.
baracuda_kernels_unary_softplus_f64_strided_can_implement
Pre-launch implementability check for unary_softplus_f64_strided.
baracuda_kernels_unary_softplus_f64_strided_run
Unary elementwise softplus, f64 dtype, strided path.
baracuda_kernels_unary_softshrink_backward_bf16_can_implement
Pre-launch implementability check for unary_softshrink_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_softshrink_backward_bf16_run
Softshrink backward, bf16.
baracuda_kernels_unary_softshrink_backward_f16_can_implement
Pre-launch implementability check for unary_softshrink_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_softshrink_backward_f16_run
Softshrink backward, f16.
baracuda_kernels_unary_softshrink_backward_f32_can_implement
Pre-launch implementability check for unary_softshrink_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_softshrink_backward_f32_run
Softshrink backward, f32. dx = (|x| > λ) ? dy : 0 with λ=0.5. Saved tensor is the forward input x.
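Softshrink backward is a mask on the upstream gradient. A minimal CPU reference, assuming the fixed λ = 0.5 stated in the entry and a hypothetical helper name `softshrink_backward_ref`:

```rust
/// Fixed threshold, as stated in the listing entry.
const LAMBDA: f32 = 0.5;

/// CPU reference for softshrink backward:
/// dx[i] = dy[i] if |x[i]| > LAMBDA, else 0.
/// Hypothetical helper for checking results; not part of this crate.
fn softshrink_backward_ref(dy: &[f32], x: &[f32]) -> Vec<f32> {
    assert_eq!(dy.len(), x.len(), "dy and x must have the same length");
    dy.iter()
        .zip(x)
        .map(|(&g, &xi)| if xi.abs() > LAMBDA { g } else { 0.0 })
        .collect()
}
```

The comparison is strict, so an input exactly at the threshold (|x| = 0.5) receives a zero gradient.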
baracuda_kernels_unary_softshrink_backward_f64_can_implement
Pre-launch implementability check for unary_softshrink_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_softshrink_backward_f64_run
Softshrink backward, f64.
baracuda_kernels_unary_softshrink_bf16_can_implement
Pre-launch implementability check for unary_softshrink_bf16.
baracuda_kernels_unary_softshrink_bf16_run
Unary elementwise softshrink (λ=0.5), bf16 dtype, contiguous fast path.
baracuda_kernels_unary_softshrink_bf16_strided_can_implement
Pre-launch implementability check for unary_softshrink_bf16_strided.
baracuda_kernels_unary_softshrink_bf16_strided_run
Unary elementwise softshrink (λ=0.5), bf16 dtype, strided path.
baracuda_kernels_unary_softshrink_f16_can_implement
Pre-launch implementability check for unary_softshrink_f16.
baracuda_kernels_unary_softshrink_f16_run
Unary elementwise softshrink (λ=0.5), f16 dtype, contiguous fast path.
baracuda_kernels_unary_softshrink_f16_strided_can_implement
Pre-launch implementability check for unary_softshrink_f16_strided.
baracuda_kernels_unary_softshrink_f16_strided_run
Unary elementwise softshrink (λ=0.5), f16 dtype, strided path.
baracuda_kernels_unary_softshrink_f32_can_implement
Pre-launch implementability check for unary_softshrink_f32.
baracuda_kernels_unary_softshrink_f32_run
Unary elementwise softshrink (λ=0.5), f32 dtype, contiguous fast path.
baracuda_kernels_unary_softshrink_f32_strided_can_implement
Pre-launch implementability check for unary_softshrink_f32_strided.
baracuda_kernels_unary_softshrink_f32_strided_run
Unary elementwise softshrink (λ=0.5), f32 dtype, strided path.
baracuda_kernels_unary_softshrink_f64_can_implement
Pre-launch implementability check for unary_softshrink_f64.
baracuda_kernels_unary_softshrink_f64_run
Unary elementwise softshrink (λ=0.5), f64 dtype, contiguous fast path.
baracuda_kernels_unary_softshrink_f64_strided_can_implement
Pre-launch implementability check for unary_softshrink_f64_strided.
baracuda_kernels_unary_softshrink_f64_strided_run
Unary elementwise softshrink (λ=0.5), f64 dtype, strided path.
baracuda_kernels_unary_softsign_bf16_can_implement
Pre-launch implementability check for unary_softsign_bf16.
baracuda_kernels_unary_softsign_bf16_run
Unary elementwise softsign, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_softsign_bf16_strided_can_implement
Pre-launch implementability check for unary_softsign_bf16_strided.
baracuda_kernels_unary_softsign_bf16_strided_run
Unary elementwise softsign, bf16 dtype, strided path.
baracuda_kernels_unary_softsign_f16_can_implement
Pre-launch implementability check for unary_softsign_f16.
baracuda_kernels_unary_softsign_f16_run
Unary elementwise softsign, f16 dtype, contiguous fast path.
baracuda_kernels_unary_softsign_f16_strided_can_implement
Pre-launch implementability check for unary_softsign_f16_strided.
baracuda_kernels_unary_softsign_f16_strided_run
Unary elementwise softsign, f16 dtype, strided path.
baracuda_kernels_unary_softsign_f32_can_implement
Pre-launch implementability check for unary_softsign_f32.
baracuda_kernels_unary_softsign_f32_run
Unary elementwise softsign, f32 dtype, contiguous fast path.
baracuda_kernels_unary_softsign_f32_strided_can_implement
Pre-launch implementability check for unary_softsign_f32_strided.
baracuda_kernels_unary_softsign_f32_strided_run
Unary elementwise softsign, f32 dtype, strided path.
baracuda_kernels_unary_softsign_f64_can_implement
Pre-launch implementability check for unary_softsign_f64.
baracuda_kernels_unary_softsign_f64_run
Unary elementwise softsign, f64 dtype, contiguous fast path.
baracuda_kernels_unary_softsign_f64_strided_can_implement
Pre-launch implementability check for unary_softsign_f64_strided.
baracuda_kernels_unary_softsign_f64_strided_run
Unary elementwise softsign, f64 dtype, strided path.
baracuda_kernels_unary_sqrt_backward_bf16_can_implement
Pre-launch implementability check for unary_sqrt_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_sqrt_backward_bf16_run
Sqrt backward, bf16.
baracuda_kernels_unary_sqrt_backward_f16_can_implement
Pre-launch implementability check for unary_sqrt_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_sqrt_backward_f16_run
Sqrt backward, f16.
baracuda_kernels_unary_sqrt_backward_f32_can_implement
Pre-launch implementability check for unary_sqrt_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_sqrt_backward_f32_run
Sqrt backward, f32. dx = dy / (2 * y). Saved tensor is the forward output y; callers must ensure y[i] != 0.
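The non-zero requirement on y comes directly from the division in the formula. The CPU reference below is a sketch with a hypothetical helper name, `sqrt_backward_ref`. It shows what IEEE-754 arithmetic yields on the host when the precondition is violated; it does not describe the device kernel's behavior in that case.

```rust
/// CPU reference for sqrt backward: dx[i] = dy[i] / (2 * y[i]),
/// where y is the saved forward output (y = sqrt(x)).
/// Hypothetical helper for checking results; not part of this crate.
fn sqrt_backward_ref(dy: &[f32], y: &[f32]) -> Vec<f32> {
    assert_eq!(dy.len(), y.len(), "dy and y must have the same length");
    dy.iter()
        .zip(y)
        // No guard on y == 0: a nonzero dy then gives an infinite result.
        .map(|(&g, &out)| g / (2.0 * out))
        .collect()
}
```

With y = 2 and dy = 1 the result is 0.25. With y = 0 and a nonzero dy the host computation produces an infinity, which is why the entry puts the burden of checking on the caller.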
baracuda_kernels_unary_sqrt_backward_f64_can_implement
Pre-launch implementability check for unary_sqrt_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_sqrt_backward_f64_run
Sqrt backward, f64.
baracuda_kernels_unary_sqrt_bf16_can_implement
Pre-launch implementability check for unary_sqrt_bf16.
baracuda_kernels_unary_sqrt_bf16_run
Unary elementwise sqrt, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_sqrt_bf16_strided_can_implement
Pre-launch implementability check for unary_sqrt_bf16_strided.
baracuda_kernels_unary_sqrt_bf16_strided_run
Unary elementwise sqrt, bf16 dtype, strided path.
baracuda_kernels_unary_sqrt_f16_can_implement
Pre-launch implementability check for unary_sqrt_f16.
baracuda_kernels_unary_sqrt_f16_run
Unary elementwise sqrt, f16 dtype, contiguous fast path.
baracuda_kernels_unary_sqrt_f16_strided_can_implement
Pre-launch implementability check for unary_sqrt_f16_strided.
baracuda_kernels_unary_sqrt_f16_strided_run
Unary elementwise sqrt, f16 dtype, strided path.
baracuda_kernels_unary_sqrt_f32_can_implement
Pre-launch implementability check for unary_sqrt_f32.
baracuda_kernels_unary_sqrt_f32_run
Unary elementwise sqrt, f32 dtype, contiguous fast path.
baracuda_kernels_unary_sqrt_f32_strided_can_implement
Pre-launch implementability check for unary_sqrt_f32_strided.
baracuda_kernels_unary_sqrt_f32_strided_run
Unary elementwise sqrt, f32 dtype, strided path.
baracuda_kernels_unary_sqrt_f64_can_implement
Pre-launch implementability check for unary_sqrt_f64.
baracuda_kernels_unary_sqrt_f64_run
Unary elementwise sqrt, f64 dtype, contiguous fast path.
baracuda_kernels_unary_sqrt_f64_strided_can_implement
Pre-launch implementability check for unary_sqrt_f64_strided.
baracuda_kernels_unary_sqrt_f64_strided_run
Unary elementwise sqrt, f64 dtype, strided path.
baracuda_kernels_unary_square_backward_bf16_can_implement
Pre-launch implementability check for unary_square_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_square_backward_bf16_run
Square backward, bf16.
baracuda_kernels_unary_square_backward_f16_can_implement
Pre-launch implementability check for unary_square_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_square_backward_f16_run
Square backward, f16.
baracuda_kernels_unary_square_backward_f32_can_implement
Pre-launch implementability check for unary_square_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_square_backward_f32_run
Square backward, f32. dx = dy * 2 * x. Saved tensor is the forward input x.
baracuda_kernels_unary_square_backward_f64_can_implement
Pre-launch implementability check for unary_square_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_square_backward_f64_run
Square backward, f64.
baracuda_kernels_unary_square_bf16_can_implement
Pre-launch implementability check for unary_square_bf16.
baracuda_kernels_unary_square_bf16_run
Unary elementwise square, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_square_bf16_strided_can_implement
Pre-launch implementability check for unary_square_bf16_strided.
baracuda_kernels_unary_square_bf16_strided_run
Unary elementwise square, bf16 dtype, strided path.
baracuda_kernels_unary_square_f16_can_implement
Pre-launch implementability check for unary_square_f16.
baracuda_kernels_unary_square_f16_run
Unary elementwise square, f16 dtype, contiguous fast path.
baracuda_kernels_unary_square_f16_strided_can_implement
Pre-launch implementability check for unary_square_f16_strided.
baracuda_kernels_unary_square_f16_strided_run
Unary elementwise square, f16 dtype, strided path.
baracuda_kernels_unary_square_f32_can_implement
Pre-launch implementability check for unary_square_f32.
baracuda_kernels_unary_square_f32_run
Unary elementwise square, f32 dtype, contiguous fast path.
baracuda_kernels_unary_square_f32_strided_can_implement
Pre-launch implementability check for unary_square_f32_strided.
baracuda_kernels_unary_square_f32_strided_run
Unary elementwise square, f32 dtype, strided path.
baracuda_kernels_unary_square_f64_can_implement
Pre-launch implementability check for unary_square_f64.
baracuda_kernels_unary_square_f64_run
Unary elementwise square, f64 dtype, contiguous fast path.
baracuda_kernels_unary_square_f64_strided_can_implement
Pre-launch implementability check for unary_square_f64_strided.
baracuda_kernels_unary_square_f64_strided_run
Unary elementwise square, f64 dtype, strided path.
baracuda_kernels_unary_step_bf16_can_implement
Pre-launch implementability check for unary_step_bf16.
baracuda_kernels_unary_step_bf16_run
Unary elementwise step, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_step_bf16_strided_can_implement
Pre-launch implementability check for unary_step_bf16_strided.
baracuda_kernels_unary_step_bf16_strided_run
Unary elementwise step, bf16 dtype, strided path.
baracuda_kernels_unary_step_f16_can_implement
Pre-launch implementability check for unary_step_f16.
baracuda_kernels_unary_step_f16_run
Unary elementwise step, f16 dtype, contiguous fast path.
baracuda_kernels_unary_step_f16_strided_can_implement
Pre-launch implementability check for unary_step_f16_strided.
baracuda_kernels_unary_step_f16_strided_run
Unary elementwise step, f16 dtype, strided path.
baracuda_kernels_unary_step_f32_can_implement
Pre-launch implementability check for unary_step_f32.
baracuda_kernels_unary_step_f32_run
Unary elementwise step, f32 dtype, contiguous fast path.
baracuda_kernels_unary_step_f32_strided_can_implement
Pre-launch implementability check for unary_step_f32_strided.
baracuda_kernels_unary_step_f32_strided_run
Unary elementwise step, f32 dtype, strided path.
baracuda_kernels_unary_step_f64_can_implement
Pre-launch implementability check for unary_step_f64.
baracuda_kernels_unary_step_f64_run
Unary elementwise step, f64 dtype, contiguous fast path.
baracuda_kernels_unary_step_f64_strided_can_implement
Pre-launch implementability check for unary_step_f64_strided.
baracuda_kernels_unary_step_f64_strided_run
Unary elementwise step, f64 dtype, strided path.
baracuda_kernels_unary_tan_backward_bf16_can_implement
Pre-launch implementability check for unary_tan_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_tan_backward_bf16_run
Tan backward, bf16.
baracuda_kernels_unary_tan_backward_f16_can_implement
Pre-launch implementability check for unary_tan_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_tan_backward_f16_run
Tan backward, f16.
baracuda_kernels_unary_tan_backward_f32_can_implement
Pre-launch implementability check for unary_tan_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_tan_backward_f32_run
Tan backward, f32. dx = dy * (1 + tan(x)²). Saved-x.
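The formula above can be checked against a host-side reference when testing the launcher. The following is a minimal sketch, not part of this crate: `tan_backward_ref` is a hypothetical helper that works on plain host slices and applies dx = dy * (1 + tan(x)²) element by element.

```rust
/// Host-side reference for tan backward: dx = dy * (1 + tan(x)^2).
/// Hypothetical helper for testing; it is not an entry point of this crate.
fn tan_backward_ref(x: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(x.len(), dy.len(), "x and dy must have the same length");
    x.iter()
        .zip(dy)
        .map(|(&xi, &dyi)| {
            // "Saved-x": the gradient is recomputed from the forward input.
            let t = xi.tan();
            dyi * (1.0 + t * t)
        })
        .collect()
}
```

A device result copied back to the host would typically be compared against this with a small tolerance rather than bit-for-bit.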
baracuda_kernels_unary_tan_backward_f64_can_implement
Pre-launch implementability check for unary_tan_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_tan_backward_f64_run
Tan backward, f64.
baracuda_kernels_unary_tan_bf16_can_implement
Pre-launch implementability check for unary_tan_bf16.
baracuda_kernels_unary_tan_bf16_run
Unary elementwise tan, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_tan_bf16_strided_can_implement
Pre-launch implementability check for unary_tan_bf16_strided.
baracuda_kernels_unary_tan_bf16_strided_run
Unary elementwise tan, bf16 dtype, strided path.
baracuda_kernels_unary_tan_f16_can_implement
Pre-launch implementability check for unary_tan_f16.
baracuda_kernels_unary_tan_f16_run
Unary elementwise tan, f16 dtype, contiguous fast path.
baracuda_kernels_unary_tan_f16_strided_can_implement
Pre-launch implementability check for unary_tan_f16_strided.
baracuda_kernels_unary_tan_f16_strided_run
Unary elementwise tan, f16 dtype, strided path.
baracuda_kernels_unary_tan_f32_can_implement
Pre-launch implementability check for unary_tan_f32.
baracuda_kernels_unary_tan_f32_run
Unary elementwise tan, f32 dtype, contiguous fast path.
baracuda_kernels_unary_tan_f32_strided_can_implement
Pre-launch implementability check for unary_tan_f32_strided.
baracuda_kernels_unary_tan_f32_strided_run
Unary elementwise tan, f32 dtype, strided path.
baracuda_kernels_unary_tan_f64_can_implement
Pre-launch implementability check for unary_tan_f64.
baracuda_kernels_unary_tan_f64_run
Unary elementwise tan, f64 dtype, contiguous fast path.
baracuda_kernels_unary_tan_f64_strided_can_implement
Pre-launch implementability check for unary_tan_f64_strided.
baracuda_kernels_unary_tan_f64_strided_run
Unary elementwise tan, f64 dtype, strided path.
baracuda_kernels_unary_tanh_backward_bf16_can_implement
Pre-launch implementability check for unary_tanh_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_tanh_backward_bf16_run
Tanh backward, bf16.
baracuda_kernels_unary_tanh_backward_f16_can_implement
Pre-launch implementability check for unary_tanh_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_tanh_backward_f16_run
Tanh backward, f16.
baracuda_kernels_unary_tanh_backward_f32_can_implement
Pre-launch implementability check for unary_tanh_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_tanh_backward_f32_run
Tanh backward, f32. dx = dy * (1 - y²). Saved-y.
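Unlike tan backward, this formula uses the saved forward output y = tanh(x) rather than the input. A minimal host-side sketch follows; `tanh_backward_ref` is a hypothetical name used for illustration only and operates on host slices.

```rust
/// Host-side reference for tanh backward: dx = dy * (1 - y^2),
/// where y is the saved forward output tanh(x).
/// Hypothetical helper for testing; it is not an entry point of this crate.
fn tanh_backward_ref(y: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(y.len(), dy.len(), "y and dy must have the same length");
    y.iter()
        .zip(dy)
        .map(|(&yi, &dyi)| dyi * (1.0 - yi * yi))
        .collect()
}
```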
baracuda_kernels_unary_tanh_backward_f64_can_implement
Pre-launch implementability check for unary_tanh_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_tanh_backward_f64_run
Tanh backward, f64.
baracuda_kernels_unary_tanh_bf16_can_implement
Pre-launch implementability check for unary_tanh_bf16.
baracuda_kernels_unary_tanh_bf16_run
Unary elementwise tanh, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_tanh_bf16_strided_can_implement
Pre-launch implementability check for unary_tanh_bf16_strided.
baracuda_kernels_unary_tanh_bf16_strided_run
Unary elementwise tanh, bf16 dtype, strided path.
baracuda_kernels_unary_tanh_f16_can_implement
Pre-launch implementability check for unary_tanh_f16.
baracuda_kernels_unary_tanh_f16_run
Unary elementwise tanh, f16 dtype, contiguous fast path.
baracuda_kernels_unary_tanh_f16_strided_can_implement
Pre-launch implementability check for unary_tanh_f16_strided.
baracuda_kernels_unary_tanh_f16_strided_run
Unary elementwise tanh, f16 dtype, strided path.
baracuda_kernels_unary_tanh_f32_can_implement
Pre-launch implementability check for unary_tanh_f32.
baracuda_kernels_unary_tanh_f32_run
Unary elementwise tanh, f32 dtype, contiguous fast path.
baracuda_kernels_unary_tanh_f32_strided_can_implement
Pre-launch implementability check for unary_tanh_f32_strided.
baracuda_kernels_unary_tanh_f32_strided_run
Unary elementwise tanh, f32 dtype, strided path.
baracuda_kernels_unary_tanh_f64_can_implement
Pre-launch implementability check for unary_tanh_f64.
baracuda_kernels_unary_tanh_f64_run
Unary elementwise tanh, f64 dtype, contiguous fast path.
baracuda_kernels_unary_tanh_f64_strided_can_implement
Pre-launch implementability check for unary_tanh_f64_strided.
baracuda_kernels_unary_tanh_f64_strided_run
Unary elementwise tanh, f64 dtype, strided path.
baracuda_kernels_unary_tanhshrink_backward_bf16_can_implement
Pre-launch implementability check for unary_tanhshrink_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_tanhshrink_backward_bf16_run
Tanhshrink backward, bf16.
baracuda_kernels_unary_tanhshrink_backward_f16_can_implement
Pre-launch implementability check for unary_tanhshrink_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_tanhshrink_backward_f16_run
Tanhshrink backward, f16.
baracuda_kernels_unary_tanhshrink_backward_f32_can_implement
Pre-launch implementability check for unary_tanhshrink_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_tanhshrink_backward_f32_run
Tanhshrink backward, f32. dx = dy * tanh(x)².
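The tanh(x)² factor follows if tanhshrink is taken to be x - tanh(x), the usual definition (an assumption here, since this page does not spell it out): the derivative is 1 - (1 - tanh(x)²) = tanh(x)². The sketch below, with hypothetical helper names, states the reference formula in f64 so that it can be compared against a central finite difference.

```rust
/// Assumed forward definition: tanhshrink(x) = x - tanh(x).
fn tanhshrink(x: f64) -> f64 {
    x - x.tanh()
}

/// Host-side reference for tanhshrink backward: dx = dy * tanh(x)^2.
/// Hypothetical helper for testing; it is not an entry point of this crate.
fn tanhshrink_backward_ref(x: &[f64], dy: &[f64]) -> Vec<f64> {
    assert_eq!(x.len(), dy.len(), "x and dy must have the same length");
    x.iter()
        .zip(dy)
        .map(|(&xi, &dyi)| {
            let t = xi.tanh();
            dyi * t * t
        })
        .collect()
}

/// Central finite difference of the assumed forward function.
fn tanhshrink_numeric_grad(x: f64, h: f64) -> f64 {
    (tanhshrink(x + h) - tanhshrink(x - h)) / (2.0 * h)
}
```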
baracuda_kernels_unary_tanhshrink_backward_f64_can_implement
Pre-launch implementability check for unary_tanhshrink_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_tanhshrink_backward_f64_run
Tanhshrink backward, f64.
baracuda_kernels_unary_tanhshrink_bf16_can_implement
Pre-launch implementability check for unary_tanhshrink_bf16.
baracuda_kernels_unary_tanhshrink_bf16_run
Unary elementwise tanhshrink, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_tanhshrink_bf16_strided_can_implement
Pre-launch implementability check for unary_tanhshrink_bf16_strided.
baracuda_kernels_unary_tanhshrink_bf16_strided_run
Unary elementwise tanhshrink, bf16 dtype, strided path.
baracuda_kernels_unary_tanhshrink_f16_can_implement
Pre-launch implementability check for unary_tanhshrink_f16.
baracuda_kernels_unary_tanhshrink_f16_run
Unary elementwise tanhshrink, f16 dtype, contiguous fast path.
baracuda_kernels_unary_tanhshrink_f16_strided_can_implement
Pre-launch implementability check for unary_tanhshrink_f16_strided.
baracuda_kernels_unary_tanhshrink_f16_strided_run
Unary elementwise tanhshrink, f16 dtype, strided path.
baracuda_kernels_unary_tanhshrink_f32_can_implement
Pre-launch implementability check for unary_tanhshrink_f32.
baracuda_kernels_unary_tanhshrink_f32_run
Unary elementwise tanhshrink, f32 dtype, contiguous fast path.
baracuda_kernels_unary_tanhshrink_f32_strided_can_implement
Pre-launch implementability check for unary_tanhshrink_f32_strided.
baracuda_kernels_unary_tanhshrink_f32_strided_run
Unary elementwise tanhshrink, f32 dtype, strided path.
baracuda_kernels_unary_tanhshrink_f64_can_implement
Pre-launch implementability check for unary_tanhshrink_f64.
baracuda_kernels_unary_tanhshrink_f64_run
Unary elementwise tanhshrink, f64 dtype, contiguous fast path.
baracuda_kernels_unary_tanhshrink_f64_strided_can_implement
Pre-launch implementability check for unary_tanhshrink_f64_strided.
baracuda_kernels_unary_tanhshrink_f64_strided_run
Unary elementwise tanhshrink, f64 dtype, strided path.
baracuda_kernels_unary_threshold_backward_bf16_can_implement
Pre-launch implementability check for unary_threshold_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_threshold_backward_bf16_run
threshold BW, bf16.
baracuda_kernels_unary_threshold_backward_f16_can_implement
Pre-launch implementability check for unary_threshold_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_threshold_backward_f16_run
threshold BW, f16.
baracuda_kernels_unary_threshold_backward_f32_can_implement
Pre-launch implementability check for unary_threshold_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_threshold_backward_f32_run
threshold backward: dx = (x > t) ? dy : 0, f32. Saved-x.
baracuda_kernels_unary_threshold_backward_f64_can_implement
Pre-launch implementability check for unary_threshold_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_threshold_backward_f64_run
threshold BW, f64.
baracuda_kernels_unary_threshold_bf16_can_implement
Implementability check for baracuda_kernels_unary_threshold_bf16. Host-side only.
baracuda_kernels_unary_threshold_bf16_run
threshold FW, bf16.
baracuda_kernels_unary_threshold_f16_can_implement
Implementability check for baracuda_kernels_unary_threshold_f16. Host-side only.
baracuda_kernels_unary_threshold_f16_run
threshold FW, f16.
baracuda_kernels_unary_threshold_f32_can_implement
Implementability check for baracuda_kernels_unary_threshold_f32. Host-side only.
baracuda_kernels_unary_threshold_f32_run
Unary elementwise threshold(x; t, v) = (x > t) ? x : v, f32, contig.
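The forward rule above and the backward rule dx = (x > t) ? dy : 0 given for the threshold_backward entries can be written as a host-side reference. This is a minimal sketch with hypothetical helper names; it operates on host slices and does not call into this crate.

```rust
/// Host-side reference for threshold forward: (x > t) ? x : v.
/// Hypothetical helper for testing; it is not an entry point of this crate.
fn threshold_ref(x: &[f32], t: f32, v: f32) -> Vec<f32> {
    x.iter().map(|&xi| if xi > t { xi } else { v }).collect()
}

/// Host-side reference for threshold backward: dx = (x > t) ? dy : 0.
fn threshold_backward_ref(x: &[f32], dy: &[f32], t: f32) -> Vec<f32> {
    assert_eq!(x.len(), dy.len(), "x and dy must have the same length");
    x.iter()
        .zip(dy)
        .map(|(&xi, &dyi)| if xi > t { dyi } else { 0.0 })
        .collect()
}
```

Note that the comparison is strict, so an element exactly equal to t takes the replacement value v in the forward pass and a zero gradient in the backward pass.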
baracuda_kernels_unary_threshold_f64_can_implement
Implementability check for baracuda_kernels_unary_threshold_f64. Host-side only.
baracuda_kernels_unary_threshold_f64_run
threshold FW, f64. The f32 params widen to f64 losslessly.
baracuda_kernels_unary_trunc_bf16_can_implement
Pre-launch implementability check for unary_trunc_bf16.
baracuda_kernels_unary_trunc_bf16_run
Unary elementwise trunc, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_trunc_bf16_strided_can_implement
Pre-launch implementability check for unary_trunc_bf16_strided.
baracuda_kernels_unary_trunc_bf16_strided_run
Unary elementwise trunc, bf16 dtype, strided path.
baracuda_kernels_unary_trunc_f16_can_implement
Pre-launch implementability check for unary_trunc_f16.
baracuda_kernels_unary_trunc_f16_run
Unary elementwise trunc, f16 dtype, contiguous fast path.
baracuda_kernels_unary_trunc_f16_strided_can_implement
Pre-launch implementability check for unary_trunc_f16_strided.
baracuda_kernels_unary_trunc_f16_strided_run
Unary elementwise trunc, f16 dtype, strided path.
baracuda_kernels_unary_trunc_f32_can_implement
Pre-launch implementability check for unary_trunc_f32.
baracuda_kernels_unary_trunc_f32_run
Unary elementwise trunc, f32 dtype, contiguous fast path.
baracuda_kernels_unary_trunc_f32_strided_can_implement
Pre-launch implementability check for unary_trunc_f32_strided.
baracuda_kernels_unary_trunc_f32_strided_run
Unary elementwise trunc, f32 dtype, strided path.
baracuda_kernels_unary_trunc_f64_can_implement
Pre-launch implementability check for unary_trunc_f64.
baracuda_kernels_unary_trunc_f64_run
Unary elementwise trunc, f64 dtype, contiguous fast path.
baracuda_kernels_unary_trunc_f64_strided_can_implement
Pre-launch implementability check for unary_trunc_f64_strided.
baracuda_kernels_unary_trunc_f64_strided_run
Unary elementwise trunc, f64 dtype, strided path.
baracuda_kernels_unique_consecutive_f32_can_implement
Pre-launch implementability check for unique_consecutive_f32.
baracuda_kernels_unique_consecutive_f32_run
Unique-consecutive, f32. Emits one cell per run-start; output slot order is atomic-counter race order. counter[row] holds the actual unique count post-launch.
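For a single row, "one cell per run-start" can be sketched sequentially as below. `unique_consecutive_ref` is a hypothetical host-side helper, not part of this crate; its output length plays the role of counter[row].

```rust
/// Host-side reference for unique-consecutive on one row: keep the first
/// element of every run of equal adjacent values.
/// Hypothetical helper for testing; it is not an entry point of this crate.
fn unique_consecutive_ref(row: &[f32]) -> Vec<f32> {
    let mut out = Vec::new();
    for (i, &v) in row.iter().enumerate() {
        // A run starts at index 0 or wherever the value differs from its
        // left neighbour.
        if i == 0 || v != row[i - 1] {
            out.push(v);
        }
    }
    out
}
```

This reference emits run-starts in input order, whereas the kernel writes them in atomic-counter race order. A comparison against the device result would therefore treat both outputs as multisets (for example by sorting the first counter[row] device values and the reference output before comparing).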
baracuda_kernels_unique_consecutive_f64_can_implement
Pre-launch implementability check for unique_consecutive_f64.
baracuda_kernels_unique_consecutive_f64_run
Unique-consecutive, f64.
baracuda_kernels_unique_consecutive_i32_can_implement
Pre-launch implementability check for unique_consecutive_i32.
baracuda_kernels_unique_consecutive_i32_run
Unique-consecutive, i32.
baracuda_kernels_unsorted_segment_max_backward_f32_can_implement
Implementability check for unsorted_segment_max_backward_f32.
baracuda_kernels_unsorted_segment_max_backward_f32_run
unsorted_segment_max_backward — f32.
baracuda_kernels_unsorted_segment_max_backward_f64_can_implement
Implementability check for unsorted_segment_max_backward_f64.
baracuda_kernels_unsorted_segment_max_backward_f64_run
unsorted_segment_max_backward — f64.
baracuda_kernels_unsorted_segment_max_f32_can_implement
Implementability check for unsorted_segment_max_f32.
baracuda_kernels_unsorted_segment_max_f32_run
out[s, d] = max_{n : seg[n] == s} input[n, d] — unsorted; atomicMax-via-CAS. Output pre-initialized to -inf by the launcher. f32.
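The reduction above can be sketched on the host as follows. This is a reference under stated assumptions, not the crate's implementation: `unsorted_segment_max_ref` is a hypothetical helper, the input is assumed to be row-major [N, D], segment ids are assumed to be i32, and out-of-range ids simply panic here (how the kernel treats them is not stated on this page).

```rust
/// Host-side reference for unsorted_segment_max over a row-major [N, D] input.
/// Hypothetical helper for testing; it is not an entry point of this crate.
fn unsorted_segment_max_ref(
    input: &[f32],
    seg: &[i32],
    num_segments: usize,
    d: usize,
) -> Vec<f32> {
    assert_eq!(input.len(), seg.len() * d, "input must hold N * D elements");
    // Mirrors the launcher's pre-initialization of the output to -inf.
    let mut out = vec![f32::NEG_INFINITY; num_segments * d];
    for (n, &s) in seg.iter().enumerate() {
        assert!(s >= 0 && (s as usize) < num_segments, "segment id out of range");
        let s = s as usize;
        for j in 0..d {
            let v = input[n * d + j];
            let slot = &mut out[s * d + j];
            if v > *slot {
                *slot = v;
            }
        }
    }
    out
}
```

Segments that receive no rows keep the -inf initial value in this sketch.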
baracuda_kernels_unsorted_segment_max_f64_can_implement
Implementability check for unsorted_segment_max_f64.
baracuda_kernels_unsorted_segment_max_f64_run
unsorted_segment_max — f64.
baracuda_kernels_unsorted_segment_max_i64idx_f32_can_implement
Implementability check for unsorted_segment_max_i64idx_f32.
baracuda_kernels_unsorted_segment_max_i64idx_f32_run
unsorted_segment_max, i64-index variant, f32.
baracuda_kernels_unsorted_segment_max_i64idx_f64_can_implement
Implementability check for unsorted_segment_max_i64idx_f64.
baracuda_kernels_unsorted_segment_max_i64idx_f64_run
unsorted_segment_max, i64-index variant, f64.
baracuda_kernels_unsorted_segment_mean_backward_f32_can_implement
Implementability check for unsorted_segment_mean_backward_f32.
baracuda_kernels_unsorted_segment_mean_backward_f32_run
unsorted_segment_mean_backward — f32.
baracuda_kernels_unsorted_segment_mean_backward_f64_can_implement
Implementability check for unsorted_segment_mean_backward_f64.
baracuda_kernels_unsorted_segment_mean_backward_f64_run
unsorted_segment_mean_backward — f64.
baracuda_kernels_unsorted_segment_mean_backward_i64idx_f32_can_implement
Implementability check for unsorted_segment_mean_backward_i64idx_f32.
baracuda_kernels_unsorted_segment_mean_backward_i64idx_f32_run
unsorted_segment_mean_backward, i64-index variant, f32.
baracuda_kernels_unsorted_segment_mean_backward_i64idx_f64_can_implement
Implementability check for unsorted_segment_mean_backward_i64idx_f64.
baracuda_kernels_unsorted_segment_mean_backward_i64idx_f64_run
unsorted_segment_mean_backward, i64-index variant, f64.
baracuda_kernels_unsorted_segment_mean_f32_can_implement
Implementability check for unsorted_segment_mean_f32.
baracuda_kernels_unsorted_segment_mean_f32_run
out[s, d] = mean_{n : seg[n] == s} input[n, d] — unsorted. Workspace: num_segments * sizeof(i32) for per-segment counts. f32.
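The per-segment count buffer described above suggests a two-step scheme: accumulate sums and counts, then divide. The sketch below illustrates that idea on the host. `unsorted_segment_mean_ref` is a hypothetical helper, the [N, D] row-major layout and i32 segment ids are assumptions, and leaving empty segments at 0 is a choice made for this sketch only (the kernel's behavior for empty segments is not stated on this page).

```rust
/// Host-side reference for unsorted_segment_mean over a row-major [N, D] input.
/// Hypothetical helper for testing; it is not an entry point of this crate.
fn unsorted_segment_mean_ref(
    input: &[f32],
    seg: &[i32],
    num_segments: usize,
    d: usize,
) -> Vec<f32> {
    assert_eq!(input.len(), seg.len() * d, "input must hold N * D elements");
    let mut out = vec![0.0_f32; num_segments * d];
    // Plays the role of the num_segments * sizeof(i32) workspace.
    let mut counts = vec![0_i32; num_segments];
    for (n, &s) in seg.iter().enumerate() {
        assert!(s >= 0 && (s as usize) < num_segments, "segment id out of range");
        let s = s as usize;
        counts[s] += 1;
        for j in 0..d {
            out[s * d + j] += input[n * d + j];
        }
    }
    for s in 0..num_segments {
        if counts[s] > 0 {
            for j in 0..d {
                out[s * d + j] /= counts[s] as f32;
            }
        }
        // Assumption for this sketch: empty segments stay at 0.
    }
    out
}
```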
baracuda_kernels_unsorted_segment_mean_f64_can_implement
Implementability check for unsorted_segment_mean_f64.
baracuda_kernels_unsorted_segment_mean_f64_run
unsorted_segment_mean — f64.
baracuda_kernels_unsorted_segment_mean_i64idx_f32_can_implement
Implementability check for unsorted_segment_mean_i64idx_f32.
baracuda_kernels_unsorted_segment_mean_i64idx_f32_run
unsorted_segment_mean, i64-index variant, f32.
baracuda_kernels_unsorted_segment_mean_i64idx_f64_can_implement
Implementability check for unsorted_segment_mean_i64idx_f64.
baracuda_kernels_unsorted_segment_mean_i64idx_f64_run
unsorted_segment_mean, i64-index variant, f64.
baracuda_kernels_unsorted_segment_min_backward_f32_can_implement
Implementability check for unsorted_segment_min_backward_f32.
baracuda_kernels_unsorted_segment_min_backward_f32_run
unsorted_segment_min_backward — f32.
baracuda_kernels_unsorted_segment_min_backward_f64_can_implement
Implementability check for unsorted_segment_min_backward_f64.
baracuda_kernels_unsorted_segment_min_backward_f64_run
unsorted_segment_min_backward — f64.
baracuda_kernels_unsorted_segment_min_f32_can_implement
Implementability check for unsorted_segment_min_f32.
baracuda_kernels_unsorted_segment_min_f32_run
out[s, d] = min_{n : seg[n] == s} input[n, d] — unsorted. f32.
baracuda_kernels_unsorted_segment_min_f64_can_implement
Implementability check for unsorted_segment_min_f64.
baracuda_kernels_unsorted_segment_min_f64_run
unsorted_segment_min — f64.
baracuda_kernels_unsorted_segment_min_i64idx_f32_can_implement
Implementability check for unsorted_segment_min_i64idx_f32.
baracuda_kernels_unsorted_segment_min_i64idx_f32_run
unsorted_segment_min, i64-index variant, f32.
baracuda_kernels_unsorted_segment_min_i64idx_f64_can_implement
Implementability check for unsorted_segment_min_i64idx_f64.
baracuda_kernels_unsorted_segment_min_i64idx_f64_run
unsorted_segment_min, i64-index variant, f64.
baracuda_kernels_unsorted_segment_prod_backward_f32_can_implement
Implementability check for unsorted_segment_prod_backward_f32.
baracuda_kernels_unsorted_segment_prod_backward_f32_run
unsorted_segment_prod_backward — f32. Shares the kernel with the sorted variant; distinct symbol for SKU tagging.
baracuda_kernels_unsorted_segment_prod_backward_f64_can_implement
Implementability check for unsorted_segment_prod_backward_f64.
baracuda_kernels_unsorted_segment_prod_backward_f64_run
unsorted_segment_prod_backward — f64.
baracuda_kernels_unsorted_segment_prod_f32_can_implement
Implementability check for unsorted_segment_prod_f32.
baracuda_kernels_unsorted_segment_prod_f32_run
unsorted_segment_prod FW — f32.
baracuda_kernels_unsorted_segment_prod_f64_can_implement
Implementability check for unsorted_segment_prod_f64.
baracuda_kernels_unsorted_segment_prod_f64_run
unsorted_segment_prod FW — f64.
baracuda_kernels_unsorted_segment_sum_backward_f32_can_implement
Implementability check for unsorted_segment_sum_backward_f32.
baracuda_kernels_unsorted_segment_sum_backward_f32_run
Same kernel as segment_sum_backward_f32; distinct symbol for SKU-tagging differentiation.
baracuda_kernels_unsorted_segment_sum_backward_f64_can_implement
Implementability check for unsorted_segment_sum_backward_f64.
baracuda_kernels_unsorted_segment_sum_backward_f64_run
unsorted_segment_sum_backward — f64.
baracuda_kernels_unsorted_segment_sum_backward_i64idx_f32_can_implement
Implementability check for unsorted_segment_sum_backward_i64idx_f32.
baracuda_kernels_unsorted_segment_sum_backward_i64idx_f32_run
unsorted_segment_sum_backward, i64-index variant, f32.
baracuda_kernels_unsorted_segment_sum_backward_i64idx_f64_can_implement
Implementability check for unsorted_segment_sum_backward_i64idx_f64.
baracuda_kernels_unsorted_segment_sum_backward_i64idx_f64_run
unsorted_segment_sum_backward, i64-index variant, f64.
baracuda_kernels_unsorted_segment_sum_f32_can_implement
Implementability check for unsorted_segment_sum_f32.
baracuda_kernels_unsorted_segment_sum_f32_run
out[s, d] = Σ_{n : seg[n] == s} input[n, d] — unsorted seg ids; atomicAdd into output. Output pre-zeroed by the launcher. f32.
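Because the kernel accumulates with atomicAdd, the order of floating-point additions is not fixed, so a device result may differ from a sequential sum by rounding. The sketch below pairs a sequential host reference with a tolerance-based comparison. Both helpers are hypothetical names, and the [N, D] row-major layout and i32 segment ids are assumptions.

```rust
/// Host-side reference for unsorted_segment_sum over a row-major [N, D] input.
/// Hypothetical helper for testing; it is not an entry point of this crate.
fn unsorted_segment_sum_ref(
    input: &[f32],
    seg: &[i32],
    num_segments: usize,
    d: usize,
) -> Vec<f32> {
    assert_eq!(input.len(), seg.len() * d, "input must hold N * D elements");
    // Mirrors the launcher's pre-zeroing of the output.
    let mut out = vec![0.0_f32; num_segments * d];
    for (n, &s) in seg.iter().enumerate() {
        assert!(s >= 0 && (s as usize) < num_segments, "segment id out of range");
        let s = s as usize;
        for j in 0..d {
            out[s * d + j] += input[n * d + j];
        }
    }
    out
}

/// Relative-tolerance comparison, since atomic accumulation order can change
/// the rounding of a floating-point sum.
fn approx_eq(a: &[f32], b: &[f32], tol: f32) -> bool {
    a.len() == b.len()
        && a.iter()
            .zip(b)
            .all(|(x, y)| (x - y).abs() <= tol * (1.0 + x.abs().max(y.abs())))
}
```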
baracuda_kernels_unsorted_segment_sum_f64_can_implement
Implementability check for unsorted_segment_sum_f64.
baracuda_kernels_unsorted_segment_sum_f64_run
unsorted_segment_sum — f64.
baracuda_kernels_unsorted_segment_sum_i64idx_f32_can_implement
Implementability check for unsorted_segment_sum_i64idx_f32.
baracuda_kernels_unsorted_segment_sum_i64idx_f32_run
unsorted_segment_sum, i64-index variant, f32.
baracuda_kernels_unsorted_segment_sum_i64idx_f64_can_implement
Implementability check for unsorted_segment_sum_i64idx_f64.
baracuda_kernels_unsorted_segment_sum_i64idx_f64_run
unsorted_segment_sum, i64-index variant, f64.
baracuda_kernels_upsample_bilinear_2d_bw_bf16_run
Alias for baracuda_kernels_interpolate_bilinear_2d_backward_bf16_run.
baracuda_kernels_upsample_bilinear_2d_bw_f16_run
Alias for baracuda_kernels_interpolate_bilinear_2d_backward_f16_run.
baracuda_kernels_upsample_bilinear_2d_bw_f32_run
Alias for baracuda_kernels_interpolate_bilinear_2d_backward_f32_run.
baracuda_kernels_upsample_bilinear_2d_bw_f64_run
Alias for baracuda_kernels_interpolate_bilinear_2d_backward_f64_run.
baracuda_kernels_upsample_bilinear_2d_fw_bf16_run
Alias for baracuda_kernels_interpolate_bilinear_2d_bf16_run.
baracuda_kernels_upsample_bilinear_2d_fw_f16_run
Alias for baracuda_kernels_interpolate_bilinear_2d_f16_run.
baracuda_kernels_upsample_bilinear_2d_fw_f32_run
Alias for baracuda_kernels_interpolate_bilinear_2d_f32_run under the new Phase 19.2 upsample_* naming convention.
baracuda_kernels_upsample_bilinear_2d_fw_f64_run
Alias for baracuda_kernels_interpolate_bilinear_2d_f64_run.
baracuda_kernels_upsample_nearest_2d_bw_bf16_can_implement
Pre-launch implementability check for upsample_nearest_2d_bw_bf16.
baracuda_kernels_upsample_nearest_2d_bw_bf16_run
upsample_nearest_2d BW, bf16. Safety: same contract as the f32 BW launcher.
baracuda_kernels_upsample_nearest_2d_bw_f16_can_implement
Pre-launch implementability check for upsample_nearest_2d_bw_f16.
baracuda_kernels_upsample_nearest_2d_bw_f16_run
upsample_nearest_2d BW, f16. Safety: same contract as the f32 BW launcher. Uses the CAS-based baracuda::atomic::add<__half> helper.
baracuda_kernels_upsample_nearest_2d_bw_f32_can_implement
Pre-launch implementability check for upsample_nearest_2d_bw_f32.
baracuda_kernels_upsample_nearest_2d_bw_f32_run
upsample_nearest_2d BW, f32. Caller pre-zeros dinput.
baracuda_kernels_upsample_nearest_2d_bw_f64_can_implement
Pre-launch implementability check for upsample_nearest_2d_bw_f64.
baracuda_kernels_upsample_nearest_2d_bw_f64_run
upsample_nearest_2d BW, f64. Safety: same contract as the f32 BW launcher.
baracuda_kernels_upsample_nearest_2d_fw_bf16_can_implement
Pre-launch implementability check for upsample_nearest_2d_fw_bf16.
baracuda_kernels_upsample_nearest_2d_fw_bf16_run
upsample_nearest_2d FW, bf16. Safety: same contract as the f32 FW launcher.
baracuda_kernels_upsample_nearest_2d_fw_f16_can_implement
Pre-launch implementability check for upsample_nearest_2d_fw_f16.
baracuda_kernels_upsample_nearest_2d_fw_f16_run
upsample_nearest_2d FW, f16. Safety: same contract as the f32 FW launcher.
baracuda_kernels_upsample_nearest_2d_fw_f32_can_implement
Pre-launch implementability check for upsample_nearest_2d_fw_f32.
baracuda_kernels_upsample_nearest_2d_fw_f32_run
upsample(x, mode='nearest') FW, f32. input: [N, C, IH, IW]; output: [N, C, OH, OW]. NCHW. Coordinate mapping: nearest under align_corners=false.
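The coordinate mapping can be illustrated on the host. The sketch below assumes the common "nearest, align_corners=false" convention in which the source index is floor(dst * in_size / out_size), clamped to the last valid index. This page does not spell out the exact formula, so treat that mapping as an assumption; both function names are hypothetical.

```rust
/// Assumed nearest mapping: floor(dst * in_size / out_size), clamped to the
/// last valid source index. Computed in integer arithmetic.
fn nearest_src_index(dst: usize, in_size: usize, out_size: usize) -> usize {
    assert!(in_size > 0 && out_size > 0, "sizes must be positive");
    ((dst * in_size) / out_size).min(in_size - 1)
}

/// Host-side reference for nearest upsampling of an NCHW tensor.
/// Hypothetical helper for testing; it is not an entry point of this crate.
fn upsample_nearest_2d_ref(
    input: &[f32],
    (n, c, ih, iw): (usize, usize, usize, usize),
    (oh, ow): (usize, usize),
) -> Vec<f32> {
    assert_eq!(input.len(), n * c * ih * iw, "input must hold N*C*IH*IW elements");
    let mut out = Vec::with_capacity(n * c * oh * ow);
    // Each (n, c) plane is resampled independently.
    for plane in 0..n * c {
        let base = plane * ih * iw;
        for oy in 0..oh {
            let iy = nearest_src_index(oy, ih, oh);
            for ox in 0..ow {
                let ix = nearest_src_index(ox, iw, ow);
                out.push(input[base + iy * iw + ix]);
            }
        }
    }
    out
}
```

With a 2x2 input and a 4x4 output, each source pixel is replicated into a 2x2 block of the output.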
baracuda_kernels_upsample_nearest_2d_fw_f64_can_implement
Pre-launch implementability check for upsample_nearest_2d_fw_f64.
baracuda_kernels_upsample_nearest_2d_fw_f64_run
upsample_nearest_2d FW, f64. Safety: same contract as the f32 FW launcher.
baracuda_kernels_where_backward_bf16_can_implement
Pre-launch check for where_backward_bf16.
baracuda_kernels_where_backward_bf16_run
where backward, bf16.
baracuda_kernels_where_backward_f16_can_implement
Pre-launch check for where_backward_f16.
baracuda_kernels_where_backward_f16_run
where backward, f16.
baracuda_kernels_where_backward_f32_can_implement
Pre-launch check for where_backward_f32.
baracuda_kernels_where_backward_f32_run
where backward, f32. Writes da = cond ? dy : 0 and db = cond ? 0 : dy.
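The two gradient outputs partition dy between the branches. A minimal host-side sketch follows, assuming a u8 condition in which any nonzero byte counts as true (an assumption, as the truthiness rule is not stated here). `where_backward_ref` is a hypothetical helper.

```rust
/// Host-side reference for where backward:
/// da = cond ? dy : 0 and db = cond ? 0 : dy.
/// Hypothetical helper for testing; it is not an entry point of this crate.
fn where_backward_ref(cond: &[u8], dy: &[f32]) -> (Vec<f32>, Vec<f32>) {
    assert_eq!(cond.len(), dy.len(), "cond and dy must have the same length");
    let mut da = Vec::with_capacity(dy.len());
    let mut db = Vec::with_capacity(dy.len());
    for (&c, &g) in cond.iter().zip(dy) {
        // Assumption: any nonzero cond byte selects the `a` branch.
        if c != 0 {
            da.push(g);
            db.push(0.0);
        } else {
            da.push(0.0);
            db.push(g);
        }
    }
    (da, db)
}
```

At every position exactly one of da and db carries the incoming gradient, so da + db equals dy elementwise.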
baracuda_kernels_where_backward_f64_can_implement
Pre-launch check for where_backward_f64.
baracuda_kernels_where_backward_f64_run
where backward, f64.
baracuda_kernels_where_bf16_can_implement
Pre-launch check for where_bf16.
baracuda_kernels_where_bf16_run
where(cond, a, b), bf16 values + u8 cond, contig fast path.
baracuda_kernels_where_bf16_strided_can_implement
Pre-launch check for where_bf16_strided.
baracuda_kernels_where_bf16_strided_run
where(cond, a, b), bf16 values, strided / broadcast path.
baracuda_kernels_where_f16_can_implement
Pre-launch check for where_f16.
baracuda_kernels_where_f16_run
where(cond, a, b), f16 values + u8 cond, contig fast path.
baracuda_kernels_where_f16_strided_can_implement
Pre-launch check for where_f16_strided.
baracuda_kernels_where_f16_strided_run
where(cond, a, b), f16 values, strided / broadcast path.
baracuda_kernels_where_f32_can_implement
Pre-launch check for where_f32.
baracuda_kernels_where_f32_run
where(cond, a, b), f32 values + u8 cond, contig fast path. This is the where-ternary trailblazer — its safety + aliasing contract carries over to every other where-family launcher across all value dtypes and cond-dtype variants (where_u32cond_*, where_i64cond_*).
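The elementwise selection on the contiguous path can be sketched on the host as below. The helper is generic over the value type to mirror the many value dtypes in this family. `where_ref` is a hypothetical name, and treating any nonzero u8 as true is an assumption. The safety and aliasing contract of the real launcher is not modelled here.

```rust
/// Host-side reference for where(cond, a, b) with a u8 condition:
/// out[i] = cond[i] != 0 ? a[i] : b[i].
/// Hypothetical helper for testing; it is not an entry point of this crate.
fn where_ref<T: Copy>(cond: &[u8], a: &[T], b: &[T]) -> Vec<T> {
    assert_eq!(cond.len(), a.len(), "cond and a must have the same length");
    assert_eq!(cond.len(), b.len(), "cond and b must have the same length");
    cond.iter()
        .zip(a.iter().zip(b))
        // Assumption: any nonzero cond byte selects `a`.
        .map(|(&c, (&ai, &bi))| if c != 0 { ai } else { bi })
        .collect()
}
```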
baracuda_kernels_where_f32_strided_can_implement
Pre-launch check for where_f32_strided.
baracuda_kernels_where_f32_strided_run
where(cond, a, b), f32 values, strided / broadcast path.
baracuda_kernels_where_f64_can_implement
Pre-launch check for where_f64.
baracuda_kernels_where_f64_run
where(cond, a, b), f64 values + u8 cond, contig fast path.
baracuda_kernels_where_f64_strided_can_implement
Pre-launch check for where_f64_strided.
baracuda_kernels_where_f64_strided_run
where(cond, a, b), f64 values, strided / broadcast path.
baracuda_kernels_where_i64cond_bf16_can_implement
Pre-launch check for where_i64cond_bf16.
baracuda_kernels_where_i64cond_bf16_run
where(cond, a, b), i64 cond + bf16 values, contig fast path.
baracuda_kernels_where_i64cond_bf16_strided_can_implement
Pre-launch check for where_i64cond_bf16_strided.
baracuda_kernels_where_i64cond_bf16_strided_run
where(cond, a, b), i64 cond + bf16 values, strided / broadcast.
baracuda_kernels_where_i64cond_f16_can_implement
Pre-launch check for where_i64cond_f16.
baracuda_kernels_where_i64cond_f16_run
where(cond, a, b), i64 cond + f16 values, contig fast path.
baracuda_kernels_where_i64cond_f16_strided_can_implement
Pre-launch check for where_i64cond_f16_strided.
baracuda_kernels_where_i64cond_f16_strided_run
where(cond, a, b), i64 cond + f16 values, strided / broadcast.
baracuda_kernels_where_i64cond_f32_can_implement
Pre-launch check for where_i64cond_f32.
baracuda_kernels_where_i64cond_f32_run
where(cond, a, b), i64 cond + f32 values, contig fast path.
baracuda_kernels_where_i64cond_f32_strided_can_implement
Pre-launch check for where_i64cond_f32_strided.
baracuda_kernels_where_i64cond_f32_strided_run
where(cond, a, b), i64 cond + f32 values, strided / broadcast.
baracuda_kernels_where_i64cond_f64_can_implement
Pre-launch check for where_i64cond_f64.
baracuda_kernels_where_i64cond_f64_run
where(cond, a, b), i64 cond + f64 values, contig fast path.
baracuda_kernels_where_i64cond_f64_strided_can_implement
Pre-launch check for where_i64cond_f64_strided.
baracuda_kernels_where_i64cond_f64_strided_run
where(cond, a, b), i64 cond + f64 values, strided / broadcast.
baracuda_kernels_where_i64cond_fp8e4m3_can_implement
Pre-launch check for where_i64cond_fp8e4m3.
baracuda_kernels_where_i64cond_fp8e4m3_run
where(cond, a, b), i64 cond + Fp8E4M3 values, contig fast path.
baracuda_kernels_where_i64cond_fp8e4m3_strided_can_implement
Pre-launch check for where_i64cond_fp8e4m3_strided.
baracuda_kernels_where_i64cond_fp8e4m3_strided_run
where(cond, a, b), i64 cond + Fp8E4M3 values, strided / broadcast.
baracuda_kernels_where_i64cond_i8_can_implement
Pre-launch check for where_i64cond_i8.
baracuda_kernels_where_i64cond_i8_run
where(cond, a, b), i64 cond + i8 values, contig fast path.
baracuda_kernels_where_i64cond_i8_strided_can_implement
Pre-launch check for where_i64cond_i8_strided.
baracuda_kernels_where_i64cond_i8_strided_run
where(cond, a, b), i64 cond + i8 values, strided / broadcast.
baracuda_kernels_where_i64cond_i16_can_implement
Pre-launch check for where_i64cond_i16.
baracuda_kernels_where_i64cond_i16_run
where(cond, a, b), i64 cond + i16 values, contig fast path.
baracuda_kernels_where_i64cond_i16_strided_can_implement
Pre-launch check for where_i64cond_i16_strided.
baracuda_kernels_where_i64cond_i16_strided_run
where(cond, a, b), i64 cond + i16 values, strided / broadcast.
baracuda_kernels_where_i64cond_i32_can_implement
Pre-launch check for where_i64cond_i32.
baracuda_kernels_where_i64cond_i32_run
where(cond, a, b), i64 cond + i32 values, contig fast path.
baracuda_kernels_where_i64cond_i32_strided_can_implement
Pre-launch check for where_i64cond_i32_strided.
baracuda_kernels_where_i64cond_i32_strided_run
where(cond, a, b), i64 cond + i32 values, strided / broadcast.
baracuda_kernels_where_i64cond_i64_can_implement
Pre-launch check for where_i64cond_i64.
baracuda_kernels_where_i64cond_i64_run
where(cond, a, b), i64 cond + i64 values, contig fast path.
baracuda_kernels_where_i64cond_i64_strided_can_implement
Pre-launch check for where_i64cond_i64_strided.
baracuda_kernels_where_i64cond_i64_strided_run
where(cond, a, b), i64 cond + i64 values, strided / broadcast.
baracuda_kernels_where_i64cond_u8_can_implement
Pre-launch check for where_i64cond_u8.
baracuda_kernels_where_i64cond_u8_run
where(cond, a, b), i64 cond + u8 values, contig fast path.
baracuda_kernels_where_i64cond_u8_strided_can_implement
Pre-launch check for where_i64cond_u8_strided.
baracuda_kernels_where_i64cond_u8_strided_run
where(cond, a, b), i64 cond + u8 values, strided / broadcast.
baracuda_kernels_where_i64cond_u32_can_implement
Pre-launch check for where_i64cond_u32.
baracuda_kernels_where_i64cond_u32_run
where(cond, a, b), i64 cond + u32 values, contig fast path.
baracuda_kernels_where_i64cond_u32_strided_can_implement
Pre-launch check for where_i64cond_u32_strided.
baracuda_kernels_where_i64cond_u32_strided_run
where(cond, a, b), i64 cond + u32 values, strided / broadcast.
baracuda_kernels_where_u8cond_fp8e4m3_can_implement
Pre-launch check for where_u8cond_fp8e4m3.
baracuda_kernels_where_u8cond_fp8e4m3_run
where(cond, a, b), u8 cond + Fp8E4M3 values, contig fast path.
baracuda_kernels_where_u8cond_fp8e4m3_strided_can_implement
Pre-launch check for where_u8cond_fp8e4m3_strided.
baracuda_kernels_where_u8cond_fp8e4m3_strided_run
where(cond, a, b), u8 cond + Fp8E4M3 values, strided / broadcast.
baracuda_kernels_where_u8cond_i8_can_implement
Pre-launch check for where_u8cond_i8.
baracuda_kernels_where_u8cond_i8_run
where(cond, a, b), u8 cond + i8 values, contig fast path.
baracuda_kernels_where_u8cond_i8_strided_can_implement
Pre-launch check for where_u8cond_i8_strided.
baracuda_kernels_where_u8cond_i8_strided_run
where(cond, a, b), u8 cond + i8 values, strided / broadcast.
baracuda_kernels_where_u8cond_i16_can_implement
Pre-launch check for where_u8cond_i16.
baracuda_kernels_where_u8cond_i16_run
where(cond, a, b), u8 cond + i16 values, contig fast path.
baracuda_kernels_where_u8cond_i16_strided_can_implement
Pre-launch check companion.
baracuda_kernels_where_u8cond_i16_strided_run
where(cond, a, b), u8 cond + i16 values, strided / broadcast.
baracuda_kernels_where_u8cond_i32_can_implement
Pre-launch check for where_u8cond_i32.
baracuda_kernels_where_u8cond_i32_run
where(cond, a, b), u8 cond + i32 values, contig fast path.
baracuda_kernels_where_u8cond_i32_strided_can_implement
Pre-launch check companion.
baracuda_kernels_where_u8cond_i32_strided_run
where(cond, a, b), u8 cond + i32 values, strided / broadcast.
baracuda_kernels_where_u8cond_i64_can_implement
Pre-launch check for where_u8cond_i64.
baracuda_kernels_where_u8cond_i64_run
where(cond, a, b), u8 cond + i64 values, contig fast path.
baracuda_kernels_where_u8cond_i64_strided_can_implement
Pre-launch check companion.
baracuda_kernels_where_u8cond_i64_strided_run
where(cond, a, b), u8 cond + i64 values, strided / broadcast.
baracuda_kernels_where_u8cond_u8_can_implement
Pre-launch check for where_u8cond_u8.
baracuda_kernels_where_u8cond_u8_run
where(cond, a, b), u8 cond + u8 values, contig fast path.
baracuda_kernels_where_u8cond_u8_strided_can_implement
Pre-launch check companion.
baracuda_kernels_where_u8cond_u8_strided_run
where(cond, a, b), u8 cond + u8 values, strided / broadcast.
baracuda_kernels_where_u8cond_u32_can_implement
Pre-launch check for where_u8cond_u32.
baracuda_kernels_where_u8cond_u32_run
where(cond, a, b), u8 cond + u32 values, contig fast path.
baracuda_kernels_where_u8cond_u32_strided_can_implement
Pre-launch check companion.
baracuda_kernels_where_u8cond_u32_strided_run
where(cond, a, b), u8 cond + u32 values, strided / broadcast.
baracuda_kernels_where_u32cond_bf16_can_implement
Pre-launch check for where_u32cond_bf16.
baracuda_kernels_where_u32cond_bf16_run
where(cond, a, b), u32 cond + bf16 values, contig fast path.
baracuda_kernels_where_u32cond_bf16_strided_can_implement
Pre-launch check companion.
baracuda_kernels_where_u32cond_bf16_strided_run
where(cond, a, b), u32 cond + bf16 values, strided / broadcast.
baracuda_kernels_where_u32cond_f16_can_implement
Pre-launch check for where_u32cond_f16.
baracuda_kernels_where_u32cond_f16_run
where(cond, a, b), u32 cond + f16 values, contig fast path.
baracuda_kernels_where_u32cond_f16_strided_can_implement
Pre-launch check companion.
baracuda_kernels_where_u32cond_f16_strided_run
where(cond, a, b), u32 cond + f16 values, strided / broadcast.
baracuda_kernels_where_u32cond_f32_can_implement
Pre-launch check for where_u32cond_f32.
baracuda_kernels_where_u32cond_f32_run
where(cond, a, b), u32 cond + f32 values, contig fast path.
baracuda_kernels_where_u32cond_f32_strided_can_implement
Pre-launch check companion.
baracuda_kernels_where_u32cond_f32_strided_run
where(cond, a, b), u32 cond + f32 values, strided / broadcast path. Each operand carries its own stride array.
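As a rough CPU reference for what a strided / broadcast where computes, the sketch below walks a row-major multi-index and reads each operand through its own stride array. It is illustrative only: the name `where_strided_ref`, element-unit strides, the stride-0 broadcast convention, the contiguous output, and the rule that a non-zero cond selects `a` are all assumptions, not this crate's documented ABI.

```rust
/// Hypothetical CPU reference for a strided / broadcast `where(cond, a, b)`.
/// Assumptions: strides are in elements, a stride of 0 broadcasts the operand
/// along that axis, a non-zero `cond` picks `a`, and the output is contiguous
/// in row-major order.
fn where_strided_ref(
    shape: &[usize],
    cond: &[u32],
    cond_strides: &[usize],
    a: &[f32],
    a_strides: &[usize],
    b: &[f32],
    b_strides: &[usize],
) -> Vec<f32> {
    let rank = shape.len();
    let total: usize = shape.iter().product();
    let mut out = Vec::with_capacity(total);
    let mut idx = vec![0usize; rank];
    for _ in 0..total {
        // Dot product of the current multi-index with an operand's strides.
        let offset = |strides: &[usize]| -> usize {
            idx.iter().zip(strides).map(|(i, s)| i * s).sum()
        };
        let picked = if cond[offset(cond_strides)] != 0 {
            a[offset(a_strides)]
        } else {
            b[offset(b_strides)]
        };
        out.push(picked);
        // Advance the multi-index, last axis fastest.
        for axis in (0..rank).rev() {
            idx[axis] += 1;
            if idx[axis] < shape[axis] {
                break;
            }
            idx[axis] = 0;
        }
    }
    out
}
```

With shape `[2, 3]`, a cond of shape `[2, 1]` is broadcast along the last axis by passing strides `[1, 0]`, and a scalar `b` is broadcast everywhere with strides `[0, 0]`.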
baracuda_kernels_where_u32cond_f64_can_implement
Pre-launch check for where_u32cond_f64.
baracuda_kernels_where_u32cond_f64_run
where(cond, a, b), u32 cond + f64 values, contig fast path.
baracuda_kernels_where_u32cond_f64_strided_can_implement
Pre-launch check companion.
baracuda_kernels_where_u32cond_f64_strided_run
where(cond, a, b), u32 cond + f64 values, strided / broadcast.
baracuda_kernels_where_u32cond_fp8e4m3_can_implement
Pre-launch check for where_u32cond_fp8e4m3.
baracuda_kernels_where_u32cond_fp8e4m3_run
where(cond, a, b), u32 cond + Fp8E4M3 values, contig fast path.
baracuda_kernels_where_u32cond_fp8e4m3_strided_can_implement
Pre-launch check companion.
baracuda_kernels_where_u32cond_fp8e4m3_strided_run
where(cond, a, b), u32 cond + Fp8E4M3 values, strided / broadcast.
baracuda_kernels_where_u32cond_i8_can_implement
Pre-launch check for where_u32cond_i8.
baracuda_kernels_where_u32cond_i8_run
where(cond, a, b), u32 cond + i8 values, contig fast path.
baracuda_kernels_where_u32cond_i8_strided_can_implement
Pre-launch check companion.
baracuda_kernels_where_u32cond_i8_strided_run
where(cond, a, b), u32 cond + i8 values, strided / broadcast.
baracuda_kernels_where_u32cond_i16_can_implement
Pre-launch check for where_u32cond_i16.
baracuda_kernels_where_u32cond_i16_run
where(cond, a, b), u32 cond + i16 values, contig fast path.
baracuda_kernels_where_u32cond_i16_strided_can_implement
Pre-launch check companion.
baracuda_kernels_where_u32cond_i16_strided_run
where(cond, a, b), u32 cond + i16 values, strided / broadcast.
baracuda_kernels_where_u32cond_i32_can_implement
Pre-launch check for where_u32cond_i32.
baracuda_kernels_where_u32cond_i32_run
where(cond, a, b), u32 cond + i32 values, contig fast path.
baracuda_kernels_where_u32cond_i32_strided_can_implement
Pre-launch check companion.
baracuda_kernels_where_u32cond_i32_strided_run
where(cond, a, b), u32 cond + i32 values, strided / broadcast.
baracuda_kernels_where_u32cond_i64_can_implement
Pre-launch check for where_u32cond_i64.
baracuda_kernels_where_u32cond_i64_run
where(cond, a, b), u32 cond + i64 values, contig fast path.
baracuda_kernels_where_u32cond_i64_strided_can_implement
Pre-launch check companion.
baracuda_kernels_where_u32cond_i64_strided_run
where(cond, a, b), u32 cond + i64 values, strided / broadcast.
baracuda_kernels_where_u32cond_u8_can_implement
Pre-launch check for where_u32cond_u8.
baracuda_kernels_where_u32cond_u8_run
where(cond, a, b), u32 cond + u8 values, contig fast path.
baracuda_kernels_where_u32cond_u8_strided_can_implement
Pre-launch check companion.
baracuda_kernels_where_u32cond_u8_strided_run
where(cond, a, b), u32 cond + u8 values, strided / broadcast.
baracuda_kernels_where_u32cond_u32_can_implement
Pre-launch check for where_u32cond_u32.
baracuda_kernels_where_u32cond_u32_run
where(cond, a, b), u32 cond + u32 values, contig fast path.
baracuda_kernels_where_u32cond_u32_strided_can_implement
Pre-launch check companion.
baracuda_kernels_where_u32cond_u32_strided_run
where(cond, a, b), u32 cond + u32 values, strided / broadcast.
baracuda_kernels_write_slice_b1_can_implement
Implementability check for baracuda_kernels_write_slice_b1. Host-side only.
baracuda_kernels_write_slice_b1_run
WriteSlice, 1-byte element (i8 / u8 / S8 / U8 / Bool / Fp8E4M3 / Fp8E5M2). Generic per-slab-element memcpy kernel.
baracuda_kernels_write_slice_b2_can_implement
Implementability check for baracuda_kernels_write_slice_b2. Host-side only.
baracuda_kernels_write_slice_b2_run
WriteSlice, 2-byte element (f16 / bf16). See b1 variant for the contract.
baracuda_kernels_write_slice_b4_can_implement
Implementability check for baracuda_kernels_write_slice_b4. Host-side only.
baracuda_kernels_write_slice_b4_run
WriteSlice, 4-byte element (f32 / F32Strict / i32).
baracuda_kernels_write_slice_b8_can_implement
Implementability check for baracuda_kernels_write_slice_b8. Host-side only.
baracuda_kernels_write_slice_b8_run
WriteSlice, 8-byte element (f64 / i64 / Complex32).
baracuda_kernels_write_slice_b16_can_implement
Implementability check for baracuda_kernels_write_slice_b16. Host-side only.
baracuda_kernels_write_slice_b16_run
WriteSlice, 16-byte element (Complex64).
baracuda_kernels_write_slice_nibble_can_implement
Implementability check for baracuda_kernels_write_slice_nibble. Host-side only.
baracuda_kernels_write_slice_nibble_run
WriteSlice, nibble-packed (S4 / U4 — two elements per byte). Constraint: range_start[rank-1] and range_end[rank-1] must both be even so no read-modify-write straddles a byte boundary. Shape / range_start arrays passed in are byte-counted on the innermost axis (Rust side halves before calling).
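The even-bounds constraint and the halving step on the Rust side can be sketched as a small host-side helper. The name `nibble_range_to_bytes` and its `Option` return are hypothetical; the real wrapper in baracuda-kernels may validate and report errors differently.

```rust
/// Hypothetical helper: convert an innermost-axis range counted in nibbles
/// into a range counted in bytes (two 4-bit elements per byte).
/// Returns `None` when a bound is odd, since such a write would touch only
/// half of a byte and require a read-modify-write.
fn nibble_range_to_bytes(start: usize, end: usize) -> Option<(usize, usize)> {
    if start > end || start % 2 != 0 || end % 2 != 0 {
        return None;
    }
    Some((start / 2, end / 2))
}
```

For example, nibble range `4..10` maps to byte range `2..5`, while `3..8` is rejected because it starts in the middle of a byte.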
cublasCgemmStridedBatched
cublasCgemmStridedBatched — single-precision complex strided-batched matrix-matrix multiply. Complex32 (== cuComplex == cuFloatComplex) analogue of cublasSgemmStridedBatched. Used by the WY-blocked batched-unmqr plan (crate’s BatchedOrmqrWyPlan<Complex32>) — transa = CUBLAS_OP_C selects V^H for the first GEMM and T^H for the second GEMM when applying Q^H.
cublasCgeqrfBatched
cublasCgeqrfBatched. Complex32 analogue. tau_array[b] is cuComplex (NOT real-typed even though tau is real-magnitude for real Householder — cuBLAS uses complex tau across the complex family so the same apply routines can dispatch uniformly).
cublasCreate_v2
cublasCreate_v2 — create a cuBLAS handle.
cublasDestroy_v2
cublasDestroy_v2 — destroy a cuBLAS handle.
cublasDgemmStridedBatched
cublasDgemmStridedBatched — double-precision strided-batched matrix-matrix multiply. f64 analogue of cublasSgemmStridedBatched.
cublasDgeqrfBatched
cublasDgeqrfBatched. f64 analogue.
cublasDtrsm
cublasDtrsm — double-precision triangular solve. f64 analogue of cublasStrsm.
cublasGemmEx
cublasGemmEx — mixed-precision GEMM with explicit dtype tags.
cublasGemmStridedBatchedEx
cublasGemmStridedBatchedEx — mixed-precision strided-batched GEMM with explicit dtype tags (Phase 74). The Ex sibling of cublasSgemmStridedBatched: each batch slot i computes C[i] := α · op(A[i]) · op(B[i]) + β · C[i] where the slot-i operand is reached by adding i * stride_* (in elements) to the base pointer. stride_a / stride_b may be 0 to broadcast one matrix across all slots; stride_c must step disjoint output regions.
cublasSetStream_v2
cublasSetStream_v2 — bind a CUDA stream to the cuBLAS handle.
cublasSgemmStridedBatched
cublasSgemmStridedBatched — single-precision strided-batched matrix-matrix multiply. Each slot computes C[i] := α · op(A[i]) · op(B[i]) + β · C[i] where A[i], B[i], C[i] are reached by stepping stride{A,B,C} element counts from the respective base pointers.
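The per-slot formula can be checked against a plain CPU loop. The sketch below is a hypothetical reference (`sgemm_strided_batched_ref` is not part of this crate), restricted to the non-transposed case and cuBLAS's column-major layout, with strides counted in elements. It also reads C when β is 0, which a real GEMM need not do.

```rust
/// Hypothetical CPU reference for C[i] := alpha * A[i] * B[i] + beta * C[i]
/// over `batch` slots. Column-major storage, no transposes; slot `i` of each
/// operand starts `i * stride_*` elements past its base.
fn sgemm_strided_batched_ref(
    m: usize,
    n: usize,
    k: usize,
    alpha: f32,
    a: &[f32],
    lda: usize,
    stride_a: usize,
    b: &[f32],
    ldb: usize,
    stride_b: usize,
    beta: f32,
    c: &mut [f32],
    ldc: usize,
    stride_c: usize,
    batch: usize,
) {
    for i in 0..batch {
        let a_i = &a[i * stride_a..];
        let b_i = &b[i * stride_b..];
        let c_i = &mut c[i * stride_c..];
        for col in 0..n {
            for row in 0..m {
                let mut acc = 0.0f32;
                for p in 0..k {
                    acc += a_i[row + p * lda] * b_i[p + col * ldb];
                }
                let dst = &mut c_i[row + col * ldc];
                *dst = alpha * acc + beta * *dst;
            }
        }
    }
}
```

Passing `stride_a = 0` reuses the same A for every slot, matching the broadcast behaviour described for the strided-batched GEMM family, while `stride_c` must keep the output slots disjoint.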
cublasSgeqrfBatched
cublasSgeqrfBatched — batched QR factorization (single precision). Each Aarray[b] is overwritten in place with the geqrf-packed R (upper) + Householder reflectors (strict lower); TauArray[b] receives the Householder scalars.
cublasStrsm
cublasStrsm — single-precision triangular solve.
cublasZgemmStridedBatched
cublasZgemmStridedBatched — double-precision complex strided-batched matrix-matrix multiply. Complex64 analogue of cublasCgemmStridedBatched.
cublasZgeqrfBatched
cublasZgeqrfBatched. Complex64 analogue.
cufftDestroy
cufftDestroy(plan). Frees the plan’s internal workspace.
cufftExecC2C
cufftExecC2C(plan, idata, odata, direction) — complex-to-complex single-precision exec. direction is CUFFT_FORWARD or CUFFT_INVERSE. Inverse is unnormalized.
cufftExecC2R
cufftExecC2R(plan, idata, odata) — complex-to-real single precision. Input length is nx/2 + 1, output length is nx. Unnormalized — caller must scale by 1/nx.
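The length and scaling rules for the real transforms reduce to two lines of host-side arithmetic. The helper names below are hypothetical, and the scaling is shown on a host slice purely for illustration. On device it would be a separate kernel or folded into a later step.

```rust
/// Number of complex elements in the Hermitian-half spectrum of a
/// length-`nx` real signal.
fn hermitian_half_len(nx: usize) -> usize {
    nx / 2 + 1
}

/// Apply the 1/nx factor that the unnormalized inverse transform leaves out.
fn normalize_inverse(samples: &mut [f32], nx: usize) {
    let scale = 1.0 / nx as f32;
    for s in samples.iter_mut() {
        *s *= scale;
    }
}
```

A forward R2C followed by C2R on `nx = 8` samples therefore goes through 5 complex values and returns the input multiplied by 8 until `normalize_inverse` is applied.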
cufftExecD2Z
cufftExecD2Z(plan, idata, odata) — real-to-complex double precision. Same semantics as cufftExecR2C.
cufftExecR2C
cufftExecR2C(plan, idata, odata) — real-to-complex single precision. Input length is nx, output length is nx/2 + 1 (Hermitian-half).
cufftExecZ2D
cufftExecZ2D(plan, idata, odata) — complex-to-real double precision. Unnormalized.
cufftExecZ2Z
cufftExecZ2Z(plan, idata, odata, direction) — complex-to-complex double precision. Same semantics as cufftExecC2C.
cufftPlan1d
cufftPlan1d(plan, nx, type, batch). Allocates a 1-D plan (single FFT of length nx, or batch independent FFTs each of length nx laid out contiguously). cuFFT’s plan struct owns its own workspace internally — no caller-supplied workspace is required for the basic 1-D APIs.
cufftPlanMany
cufftPlanMany(plan, rank, n, inembed, istride, idist, onembed, ostride, odist, type, batch).
cufftSetStream
cufftSetStream(plan, stream). Binds subsequent exec calls on this plan to the given CUDA stream. Returns 0 on success.
curandCreateGenerator
curandCreateGenerator(generator, rng_type). Returns 0 on success.
curandDestroyGenerator
curandDestroyGenerator(generator). Returns 0 on success.
curandGenerateNormal
curandGenerateNormal(generator, ptr, n, mean, stddev) — writes n normally-distributed float samples to ptr. Note: cuRAND requires n be even for the Box-Muller pair generator. Returns 0 on success.
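One way a caller might satisfy the parity requirement is to round the request up to the next even count, size the buffer accordingly, and ignore the extra sample. This is a sketch of that idea only. `even_sample_count` is a hypothetical helper, and the safe layer may handle odd counts differently.

```rust
/// Hypothetical helper: smallest even count that is >= `n`.
/// The caller would allocate this many elements and use only the first `n`.
fn even_sample_count(n: usize) -> usize {
    n + (n % 2)
}
```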
curandGenerateNormalDouble
curandGenerateNormalDouble(generator, ptr, n, mean, stddev). Same parity contract as curandGenerateNormal. Returns 0 on success.
curandGenerateUniform
curandGenerateUniform(generator, ptr, n) — writes n float samples in (0, 1] to ptr. Returns 0 on success.
curandGenerateUniformDouble
curandGenerateUniformDouble(generator, ptr, n) — writes n double samples in (0, 1] to ptr. Returns 0 on success.
curandSetPseudoRandomGeneratorSeed
curandSetPseudoRandomGeneratorSeed(generator, seed). Returns 0 on success.
curandSetStream
curandSetStream(generator, stream). Binds subsequent generator calls to the given CUDA stream. Returns 0 on success.
cusolverDnCgeqrf
cusolverDnCgeqrf — single-precision complex QR factorization, in place. The packed output uses the same convention as the real variant: strict lower triangle + tau encode the Householder reflectors; the upper triangle holds R.
cusolverDnCgeqrf_bufferSize
cusolverDnCgeqrf_bufferSize — workspace query for single-precision complex QR factorization. Mirrors cusolverDnSgeqrf_bufferSize.
cusolverDnCheevd
cusolverDnCheevd — complex-Hermitian eigh (Complex32). A is overwritten in place with the eigenvectors (column-major); W receives the n real eigenvalues sorted ascending.
cusolverDnCheevd_bufferSize
cusolverDnCheevd_bufferSize — complex-Hermitian divide-and-conquer eigh, single precision (Complex32). Eigenvalues are real-valued float.
cusolverDnCreate
cusolverDnCreate(handle). Returns 0 on success.
cusolverDnCreateGesvdjInfo
cusolverDnCreateGesvdjInfo — allocate a Jacobi-SVD params object with cuSOLVER’s defaults (tol = 1e-7 for f32 / 1e-12 for f64, max_sweeps = 100, sort_eig = 1).
cusolverDnCreateParams
cusolverDnCreateParams — allocate the opaque params struct used by all 64-bit cuSOLVER APIs. Plan layer creates one lazily on first run (mirroring the handle lifecycle).
cusolverDnCunmqr
cusolverDnCunmqr — apply Q, Q^T, or Q^H from a complex geqrf factorization to a complex C in place.
cusolverDnCunmqr_bufferSize
cusolverDnCunmqr_bufferSize.
cusolverDnDDgels
cusolverDnDDgels. f64 analogue.
cusolverDnDDgels_bufferSize
cusolverDnDDgels_bufferSize. f64 analogue.
cusolverDnDestroy
cusolverDnDestroy(handle). Returns 0 on success.
cusolverDnDestroyGesvdjInfo
cusolverDnDestroyGesvdjInfo. Returns 0 on success.
cusolverDnDestroyParams
cusolverDnDestroyParams. Returns 0 on success.
cusolverDnDgeqrf
cusolverDnDgeqrf. f64 analogue.
cusolverDnDgeqrf_bufferSize
cusolverDnDgeqrf_bufferSize. f64 analogue.
cusolverDnDgesvd
cusolverDnDgesvd. f64 analogue.
cusolverDnDgesvd_bufferSize
cusolverDnDgesvd_bufferSize. f64 analogue.
cusolverDnDgesvdaStridedBatched
cusolverDnDgesvdaStridedBatched. f64 analogue.
cusolverDnDgesvdaStridedBatched_bufferSize
cusolverDnDgesvdaStridedBatched_bufferSize. f64 analogue.
cusolverDnDgesvdjBatched
cusolverDnDgesvdjBatched. f64 analogue.
cusolverDnDgesvdjBatched_bufferSize
cusolverDnDgesvdjBatched_bufferSize. f64 analogue.
cusolverDnDgetrf
cusolverDnDgetrf. f64 analogue.
cusolverDnDgetrf_bufferSize
cusolverDnDgetrf_bufferSize. f64 analogue.
cusolverDnDgetrs
cusolverDnDgetrs. f64 analogue.
cusolverDnDormqr
cusolverDnDormqr. f64 analogue.
cusolverDnDormqr_bufferSize
cusolverDnDormqr_bufferSize. f64 analogue.
cusolverDnDpotrf
cusolverDnDpotrf. f64 analogue.
cusolverDnDpotrfBatched
cusolverDnDpotrfBatched. f64 analogue.
cusolverDnDpotrf_bufferSize
cusolverDnDpotrf_bufferSize. f64 analogue.
cusolverDnDsyevd
cusolverDnDsyevd. f64 analogue.
cusolverDnDsyevd_bufferSize
cusolverDnDsyevd_bufferSize. f64 analogue.
cusolverDnSSgels
cusolverDnSSgels — least-squares solve min ||A·x - b||² for m ≥ n full-rank A. Iterative refinement; returns niters ≥ 0 on convergence, -N on fallback-needed. Single precision.
cusolverDnSSgels_bufferSize
cusolverDnSSgels_bufferSize — query the workspace size in bytes. Unlike the legacy *_bufferSize entries, which report a typed element count, this routine’s workspace is supplied as a raw byte buffer.
cusolverDnSetStream
cusolverDnSetStream(handle, stream). Binds subsequent cuSOLVER calls to the given CUDA stream. Returns 0 on success.
cusolverDnSgeqrf
cusolverDnSgeqrf — QR factorization in place. A is overwritten: upper triangle = R, strict lower triangle + tau = Householder reflectors that encode Q. To materialize Q as a dense matrix, follow with ormqr against an identity.
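To make the packed layout concrete, the sketch below copies R out of a geqrf-packed, column-major buffer on the host and zeroes everything below the diagonal. `extract_r` is a hypothetical helper for illustration. It assumes the buffer has already been copied back from the device.

```rust
/// Hypothetical helper: extract the k x n upper-triangular factor R
/// (k = min(m, n)) from a geqrf-packed m x n column-major matrix with
/// leading dimension `lda`. The result is column-major with leading
/// dimension k; entries below the diagonal are zero.
fn extract_r(packed: &[f32], m: usize, n: usize, lda: usize) -> Vec<f32> {
    let k = m.min(n);
    let mut r = vec![0.0f32; k * n];
    for col in 0..n {
        // Rows 0..=col hold R; rows below hold Householder vectors.
        for row in 0..k.min(col + 1) {
            r[row + col * k] = packed[row + col * lda];
        }
    }
    r
}
```

The strict lower triangle skipped here is exactly the reflector data that ormqr consumes together with tau.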
cusolverDnSgeqrf_bufferSize
cusolverDnSgeqrf_bufferSize.
cusolverDnSgesvd
cusolverDnSgesvd — SVD: A = U · diag(S) · V^T. The jobu / jobv characters are ASCII bytes: 'A' (full U/V^T), 'S' (thin U/V^T), 'O' (overwrite A — disallowed at plan layer), 'N' (skip).
cusolverDnSgesvd_bufferSize
cusolverDnSgesvd_bufferSize.
cusolverDnSgesvdaStridedBatched
cusolverDnSgesvdaStridedBatched — f32 rectangular-batched approximate-SVD. Each batch slot factors a [m, n] matrix into U: [m, rank], S: [rank], V: [n, rank] (column-major; cuSOLVER returns V, not V^T). The host array h_R_nrmF (size batch_size) receives per-slot residual Frobenius norms.
cusolverDnSgesvdaStridedBatched_bufferSize
cusolverDnSgesvdaStridedBatched_bufferSize — query the device workspace size (in elements, multiply by sizeof(f32) for bytes) for the f32 rectangular-batched approximate-SVD.
cusolverDnSgesvdjBatched
cusolverDnSgesvdjBatched — batched Jacobi SVD A = U · diag(S) · V^T (single precision). Each matrix is square [m, m] (cuSOLVER’s Jacobi-batched API requires square input; thin rectangular is achievable via gesvdaStridedBatched — deferred). The plan surfaces V (not V^T); callers apply the transpose if needed.
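When a caller needs V^T rather than V, the transpose of a column-major matrix is a simple index swap. The host-side sketch below uses a hypothetical `transpose_col_major` helper. On device this would normally be done by a kernel or avoided by passing a transpose flag to the consuming GEMM.

```rust
/// Hypothetical helper: transpose a rows x cols column-major matrix into a
/// cols x rows column-major matrix.
fn transpose_col_major(src: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    let mut dst = vec![0.0f32; rows * cols];
    for c in 0..cols {
        for r in 0..rows {
            // Element (r, c) of src becomes element (c, r) of dst.
            dst[c + r * cols] = src[r + c * rows];
        }
    }
    dst
}
```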
cusolverDnSgesvdjBatched_bufferSize
cusolverDnSgesvdjBatched_bufferSize. jobz is 0 (no vectors) or 1 (compute U / V). For batched, each matrix in A is independently SVD’d; outputs are packed [batch * m * m] etc.
cusolverDnSgetrf
cusolverDnSgetrf — LU factorization with partial pivoting in place. A := L · U (with L unit-diagonal, stored in the strict lower triangle; U in the upper triangle). ipiv[i] is the row swap performed at step i (1-based per LAPACK convention).
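The 1-based swap sequence in ipiv can be turned into an explicit row permutation by replaying the swaps in order. The sketch below is a host-side illustration with a hypothetical name. It assumes every entry is in `1..=m`, which holds for a successful factorization.

```rust
/// Hypothetical helper: replay LAPACK-style pivots (1-based, "row i was
/// swapped with row ipiv[i]") to obtain the permutation `perm`, where
/// `perm[r]` is the original row index that ends up in row `r` of P * A.
fn ipiv_to_permutation(ipiv: &[i32], m: usize) -> Vec<usize> {
    let mut perm: Vec<usize> = (0..m).collect();
    for (i, &p) in ipiv.iter().enumerate() {
        // Convert the 1-based pivot to a 0-based row index.
        perm.swap(i, (p - 1) as usize);
    }
    perm
}
```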
cusolverDnSgetrf_bufferSize
cusolverDnSgetrf_bufferSize — query workspace element count.
cusolverDnSgetrs
cusolverDnSgetrs — solve op(A) · X = B using the packed LU factors and pivots produced by cusolverDnSgetrf.
cusolverDnSormqr
cusolverDnSormqr — apply Q (or Q^T) from geqrf output to a matrix C in place. With C = I this materializes Q as a dense matrix for the “thin” or “full” QR.
cusolverDnSormqr_bufferSize
cusolverDnSormqr_bufferSize. trans selects Q vs Q^T; side selects left vs right multiply.
cusolverDnSpotrf
cusolverDnSpotrf — Cholesky factorization in place (A := L or A := U). Leaves the unused triangle untouched. dev_info returns 0 on success, k > 0 if the leading minor of order k is not positive definite (factorization halted at step k).
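After copying dev_info back to the host, a caller might classify it along these lines. The enum and function names are hypothetical. The negative branch follows the usual LAPACK convention (a value of -i flags the i-th argument as invalid), which is an assumption not stated on this page.

```rust
/// Hypothetical classification of the `dev_info` value written by potrf.
#[derive(Debug, PartialEq)]
enum PotrfInfo {
    Success,
    /// The leading minor of this order is not positive definite.
    NotPositiveDefinite { order: i32 },
    /// LAPACK convention (assumed): argument number `index` was invalid.
    InvalidParameter { index: i32 },
}

fn classify_potrf_info(dev_info: i32) -> PotrfInfo {
    if dev_info == 0 {
        PotrfInfo::Success
    } else if dev_info > 0 {
        PotrfInfo::NotPositiveDefinite { order: dev_info }
    } else {
        PotrfInfo::InvalidParameter { index: -dev_info }
    }
}
```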
cusolverDnSpotrfBatched
cusolverDnSpotrfBatched(handle, uplo, n, Aarray, lda, infoArray, batchSize). Each matrix in Aarray[batch_size] is factored independently in-place. Returns 0 on success; per-matrix factor info lands in infoArray[i].
cusolverDnSpotrf_bufferSize
cusolverDnSpotrf_bufferSize — query the workspace size as an element count (multiply by sizeof(T) to get the byte count for cudaMalloc).
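The element-count to byte-count conversion is easy to get wrong when the query result is a signed 32-bit integer. A minimal sketch, assuming the count arrives as an `i32` named `lwork` (the helper name `workspace_bytes` is hypothetical):

```rust
use std::convert::TryFrom;
use std::mem::size_of;

/// Hypothetical helper: convert a legacy `*_bufferSize` element count into
/// the number of bytes to allocate for elements of type `T`.
/// Returns `None` for a negative count or on overflow.
fn workspace_bytes<T>(lwork: i32) -> Option<usize> {
    usize::try_from(lwork).ok()?.checked_mul(size_of::<T>())
}
```

For example, `workspace_bytes::<f32>(256)` yields `Some(1024)`, the size to pass to the device allocator.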
cusolverDnSsyevd
cusolverDnSsyevd — real-symmetric eigh, f32. A is overwritten in place with the eigenvectors (column-major) when jobz == VECTOR. W receives the n eigenvalues sorted ascending.
cusolverDnSsyevd_bufferSize
cusolverDnSsyevd_bufferSize — query workspace element count for real-symmetric divide-and-conquer eigh, f32.
cusolverDnXgeev
cusolverDnXgeev — general (non-symmetric) eigendecomposition. A is destroyed in place (used as scratch by the LAPACK-equivalent algorithm). W receives the n complex eigenvalues; VL / VR (when requested) receive the column-major left / right complex eigenvectors. For non-Hermitian input the eigenvalues can be complex even when the input is real, hence the always-complex W storage.
cusolverDnXgeev_bufferSize
cusolverDnXgeev_bufferSize — query the host + device byte counts for cusolverDnXgeev at the given problem size and element types. The two output pointers receive byte counts (NOT element counts — different from the legacy _bufferSize APIs).
cusolverDnZgeqrf
cusolverDnZgeqrf — double-precision complex QR factorization.
cusolverDnZgeqrf_bufferSize
cusolverDnZgeqrf_bufferSize. f64-complex analogue of the C variant.
cusolverDnZheevd
cusolverDnZheevd. Complex64 analogue.
cusolverDnZheevd_bufferSize
cusolverDnZheevd_bufferSize. Complex64 analogue.
cusolverDnZunmqr
cusolverDnZunmqr. f64-complex analogue.
cusolverDnZunmqr_bufferSize
cusolverDnZunmqr_bufferSize. f64-complex analogue.

Type Aliases§

cuFloatComplex
cuFloatComplex is the canonical CUDA name for the single-precision complex struct — an alias for cuComplex. Surfaced so cuSOLVER’s complex APIs (cusolverDn{C,Z}unmqr, …) can spell their signatures in the same vocabulary as the NVIDIA headers.
cublasDiagType_t
cuBLAS diag-type tag for triangular solves (trsm). CUBLAS_DIAG_NON_UNIT = 0, CUBLAS_DIAG_UNIT = 1.
cublasFillMode_t
cuBLAS fill-mode tag re-used by cuSOLVER for triangular factorizations. CUBLAS_FILL_MODE_LOWER = 0, CUBLAS_FILL_MODE_UPPER = 1.
cublasHandle_t
Opaque cuBLAS handle. Used by cublas*geqrfBatched (which lives in cuBLAS, not cuSOLVER) and any future cuBLAS-routed linalg paths.
cudaDataType
cudaDataType tag used by the 64-bit cuSOLVER APIs (Xgeev, Xgesvd, …) to identify tensor element types. These constants originate in <library_types.h> and are stable across CUDA versions.
cufftHandle
Opaque cuFFT plan handle. Unusually for CUDA libraries this is an integer ID (int), not a pointer. A value of -1 is reserved at the safe-plan layer as the “not yet created” sentinel — cuFFT itself returns small non-negative integers for live handles.
cufftResult
cuFFT result code type. CUFFT_SUCCESS = 0. Any non-zero return is mapped to a negative status at the safe-plan layer for distinct error reporting.
curandGenerator_t
Opaque cuRAND generator handle. Treated as a stateful object owned by safe Rust at the plan layer — never inspect its internals here.
cusolverDnHandle_t
Opaque cuSOLVER dense handle. Stateful object; the plan layer creates one lazily on first run and reuses across launches.
cusolverDnParams_t
Opaque parameter struct used by the 64-bit cuSOLVER APIs (Xgeev, Xpotrf, …). The struct holds advanced configuration (algorithm choice, precision modes) — for the trailblazer the plan layer leaves it at defaults. Created via cusolverDnCreateParams and destroyed via cusolverDnDestroyParams.
cusolverEigMode_t
cuSOLVER eig-mode enum tag (used by syevd / heevd / Xgeev). 0 = NOVECTOR (compute eigenvalues only), 1 = VECTOR (eigenvalues + eigenvectors). Routed through as an i32 for the legacy syevd / heevd APIs. The CUSOLVER_EIG_MODE_NOVECTOR / _VECTOR constants live further down (originally introduced for gesvdjBatched’s jobz argument; reused verbatim here for the eig family).
gesvdjInfo_t
Opaque cuSOLVER Jacobi-SVD parameter object. Stateful; created once per plan, reused across launches, destroyed on plan drop. Used by cusolverDn*gesvdjBatched for the batched-SVD path.