Crate baracuda_kernels_sys

§baracuda-kernels-sys

Raw extern "C" entry points for compiled bespoke kernels. You almost certainly want baracuda-kernels instead — that crate wraps these unsafe calls with typed plans, lifetime-checked device buffers, and a proper Rust API.

Functions in this crate take raw void* pointers, integer dimensions, and a cudaStream_t cast as *mut c_void. They are unsafe because:

They dereference the pointer arguments without bounds-checking.
They assume the pointers are valid device addresses.
They assume the workspace pointer (when non-null) points to at least workspace_bytes of writable device memory.
They assume the stream is a valid CUDA stream owned by the calling thread’s current context.
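
Only a small part of this contract can be checked on the host before a call. The following is a minimal sketch, assuming a hypothetical `LaunchArgs` bundle and `precheck` helper (neither is part of this crate); it cannot prove that a pointer is a valid device address or that the stream belongs to the current context.

```rust
use std::ffi::c_void;

/// Hypothetical bundle of raw launch arguments; not a type from this crate.
pub struct LaunchArgs {
    pub input: *const c_void,
    pub output: *mut c_void,
    pub workspace: *mut c_void,
    pub workspace_bytes: usize,
    /// A cudaStream_t cast to *mut c_void.
    pub stream: *mut c_void,
}

/// Host-side sanity checks that mirror the documented assumptions.
/// Passing this check does NOT make the subsequent unsafe call sound.
pub fn precheck(args: &LaunchArgs) -> Result<(), &'static str> {
    if args.input.is_null() || args.output.is_null() {
        return Err("operand pointer is null");
    }
    // A null workspace is only acceptable when no workspace is requested.
    if args.workspace.is_null() && args.workspace_bytes != 0 {
        return Err("workspace is null but workspace_bytes is non-zero");
    }
    // A null stream is not rejected: in CUDA it denotes the default stream.
    Ok(())
}
```

Checks like these catch obvious host-side mistakes early; everything else in the list above remains the caller's responsibility.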

§Status codes

All *_run and *_can_implement functions return an i32 status:

0: success.
1: misaligned operand.
2: invalid problem (e.g. M, N, or K is non-positive).
3: not supported (this kernel doesn’t implement the requested shape).
4: workspace too small or null when required.
5: internal kernel error (typically a launch failure).
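
The status table above can be turned into a typed result on the caller's side. This is a sketch only: `KernelStatus`, `decode_status`, and `check` are illustrative names, not items exported by this crate, and values outside 0..=5 are treated as unknown because the table does not define them.

```rust
/// Decoded form of the i32 status returned by *_run and *_can_implement.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelStatus {
    Success,
    MisalignedOperand,
    InvalidProblem,
    NotSupported,
    WorkspaceTooSmall,
    InternalError,
    /// Any value the status table does not list.
    Unknown(i32),
}

/// Map a raw status code onto the documented meanings.
pub fn decode_status(code: i32) -> KernelStatus {
    match code {
        0 => KernelStatus::Success,
        1 => KernelStatus::MisalignedOperand,
        2 => KernelStatus::InvalidProblem,
        3 => KernelStatus::NotSupported,
        4 => KernelStatus::WorkspaceTooSmall,
        5 => KernelStatus::InternalError,
        other => KernelStatus::Unknown(other),
    }
}

/// Convert a raw status into a Result so callers can use `?`.
pub fn check(code: i32) -> Result<(), KernelStatus> {
    match decode_status(code) {
        KernelStatus::Success => Ok(()),
        other => Err(other),
    }
}
```

A caller would typically run `check` on the `*_can_implement` result first and only then launch the matching `*_run` function.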

Structs§

cuComplex: ABI-compatible single-precision complex struct, matching cuComplex from <cuComplex.h> (interleaved real/imag f32). Identical layout to crate::cufftComplex and to the safe-side Complex32 from baracuda-kernels-types, so a DeviceBuffer<Complex32> can be cast to a *mut cuComplex for the cuSOLVER complex APIs without a copy.
cuDoubleComplex: ABI-compatible double-precision complex struct, matching cuDoubleComplex from <cuComplex.h>. Sibling to cuComplex.
cufftComplex: Single-precision complex element layout. Interleaved real/imag pairs — #[repr(C)] matches NVIDIA’s cufftComplex struct exactly (which is itself an alias for float2 in <vector_types.h>). The plan layer pairs this with the crate-level Complex32 newtype.
cufftDoubleComplex: Double-precision complex element layout. ABI-compatible with cuFFT’s cufftDoubleComplex (alias for double2).
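
The "identical layout" claim is what makes a copy-free cast possible. The sketch below illustrates it with two stand-in types: `RawComplex32` (field names `x`/`y` follow CUDA's float2 and are an assumption about this crate) and a local `Complex32` (a stand-in for the safe-side newtype). It checks size and field order only.

```rust
use std::mem::{align_of, size_of};

/// Stand-in for a float2-style complex struct (interleaved real/imag f32).
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RawComplex32 {
    pub x: f32, // real part
    pub y: f32, // imaginary part
}

/// Stand-in for the safe-side complex newtype.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Complex32 {
    pub re: f32,
    pub im: f32,
}

/// Reinterpret a slice of Complex32 as RawComplex32 without copying.
pub fn as_raw(slice: &[Complex32]) -> &[RawComplex32] {
    assert_eq!(size_of::<Complex32>(), size_of::<RawComplex32>());
    assert_eq!(align_of::<Complex32>(), align_of::<RawComplex32>());
    // SAFETY: both types are #[repr(C)] structs of two f32 fields in the
    // same order, so size, alignment, and field offsets agree.
    unsafe { std::slice::from_raw_parts(slice.as_ptr() as *const RawComplex32, slice.len()) }
}
```

The same reasoning applies to the double-precision pair, with f64 fields and a 16-byte element.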

Constants§

CUBLAS_COMPUTE_32F: CUBLAS_COMPUTE_32F — fp32 accumulator.
CUBLAS_COMPUTE_64F: CUBLAS_COMPUTE_64F — fp64 accumulator.
CUBLAS_DIAG_NON_UNIT: CUBLAS_DIAG_NON_UNIT — trsm reads the actual diagonal of A. Used by the LstSq QR-fallback path for the back-substitution R · X = Q^T · B, where R’s diagonal is the meaningful pivots.
CUBLAS_DIAG_UNIT: CUBLAS_DIAG_UNIT — trsm treats the diagonal as all-1s (unit-triangular). Not used by the current plan layer; surfaced for completeness.
CUBLAS_FILL_MODE_LOWER: CUBLAS_FILL_MODE_LOWER, passed to potrf to request the lower-triangular Cholesky factor.
CUBLAS_FILL_MODE_UPPER: CUBLAS_FILL_MODE_UPPER, passed to potrf to request the upper-triangular Cholesky factor.
CUBLAS_GEMM_DEFAULT: CUBLAS_GEMM_DEFAULT — let cuBLAS pick the algorithm.
CUBLAS_OP_C: CUBLAS_OP_C — conjugate transpose (only meaningful for complex dtypes). Used by cusolverDn{C,Z}unmqr to apply Q^H.
CUBLAS_OP_N: CUBLAS_OP_N — no transpose. Used by ormqr to control whether to apply Q or Q^T.
CUBLAS_OP_T: CUBLAS_OP_T — transpose.
CUBLAS_SIDE_LEFT: CUBLAS_SIDE_LEFT — Q is applied from the left in ormqr (C := Q · C or C := Q^T · C).
CUBLAS_SIDE_RIGHT: CUBLAS_SIDE_RIGHT — Q is applied from the right.
CUDA_C_32F: CUDA_C_32F — complex f32 (interleaved real/imag).
CUDA_C_64F: CUDA_C_64F — complex f64 (interleaved real/imag).
CUDA_R_16BF: CUDA_R_16BF — bfloat16 (real). Storage tag for __nv_bfloat16.
CUDA_R_16F: CUDA_R_16F — real f16.
CUDA_R_32F: CUDA_R_32F — real f32.
CUDA_R_64F: CUDA_R_64F — real f64.
CUFFT_C2C: cuFFT plan type: complex-to-complex (single precision). Direction is supplied to cufftExecC2C.
CUFFT_C2R: cuFFT plan type: complex-to-real (single precision). Input is N/2 + 1 complex cells (Hermitian-half), output is N real cells.
CUFFT_D2Z: cuFFT plan type: double-precision real-to-complex.
CUFFT_FORWARD: Forward FFT direction tag for cufftExecC2C / cufftExecZ2Z. cuFFT’s forward transform is unnormalized.
CUFFT_INVERSE: Inverse FFT direction tag for cufftExecC2C / cufftExecZ2Z. cuFFT’s inverse transform is also unnormalized — the safe-plan layer multiplies the output by 1/N after exec to match PyTorch’s norm="backward" (forward unnormalized, inverse normalized by N) convention.
CUFFT_R2C: cuFFT plan type: real-to-complex (single precision). Output buffer size is N/2 + 1 complex cells for an N-long real input (Hermitian symmetry).
CUFFT_SUCCESS: CUFFT_SUCCESS — the only success code.
CUFFT_Z2D: cuFFT plan type: double-precision complex-to-real.
CUFFT_Z2Z: cuFFT plan type: double-precision complex-to-complex.
CURAND_RNG_PSEUDO_DEFAULT: CURAND_RNG_PSEUDO_DEFAULT — XORWOW pseudo-random generator. Adequate for the dropout / sampling use cases this milestone targets; future QRNG / Philox / MT19937 work can extend the descriptor surface.
CURAND_STATUS_SUCCESS: CURAND_STATUS_SUCCESS — only success code. Any non-zero return from the cuRAND host API is mapped to status 5 (“internal kernel error”) at the safe-plan layer.
CUSOLVER_EIG_MODE_NOVECTOR: CUSOLVER_EIG_MODE_NOVECTOR — gesvdjBatched jobz value for computing singular values only (skip U / V).
CUSOLVER_EIG_MODE_VECTOR: CUSOLVER_EIG_MODE_VECTOR — gesvdjBatched jobz value for computing both singular values and singular vectors.
CUSOLVER_STATUS_SUCCESS: CUSOLVER_STATUS_SUCCESS — the only success code. Any non-zero return from a cuSOLVER routine is mapped to a negative status at the safe-plan layer for distinct error reporting.
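
Two pieces of arithmetic from the cuFFT entries above are easy to get wrong when sizing buffers: the Hermitian-half length N/2 + 1 used by CUFFT_R2C / CUFFT_C2R, and the 1/N scale applied after an unnormalized inverse transform. A minimal host-side sketch, with helper names that are illustrative and not part of this crate:

```rust
/// Number of complex cells in the Hermitian-half spectrum of an N-long
/// real signal (R2C output, C2R input).
pub fn hermitian_half_len(n: usize) -> usize {
    n / 2 + 1
}

/// Scale that turns an unnormalized inverse transform into the
/// norm="backward" convention (inverse normalized by N).
pub fn inverse_scale(n: usize) -> f64 {
    1.0 / n as f64
}

/// Apply the 1/N scale in place, as a CPU stand-in for the post-exec step.
pub fn normalize_inverse(data: &mut [f32], n: usize) {
    let s = inverse_scale(n) as f32;
    for v in data.iter_mut() {
        *v *= s;
    }
}
```

For N = 8 the real-to-complex output holds 5 complex cells, and the inverse result is multiplied by 0.125.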

Functions§

baracuda_kernels_adaptive_avg_pool_bf16_bw_can_implement^⚠: Implementability check for adaptive_avg_pool_bf16_bw.
baracuda_kernels_adaptive_avg_pool_bf16_bw_run^⚠: Adaptive AvgPool BW, bf16.
baracuda_kernels_adaptive_avg_pool_bf16_fw_can_implement^⚠: Implementability check for adaptive_avg_pool_bf16_fw.
baracuda_kernels_adaptive_avg_pool_bf16_fw_run^⚠: Adaptive AvgPool FW, bf16.
baracuda_kernels_adaptive_avg_pool_f16_bw_can_implement^⚠: Implementability check for adaptive_avg_pool_f16_bw.
baracuda_kernels_adaptive_avg_pool_f16_bw_run^⚠: Adaptive AvgPool BW, f16. Zeros dx internally, then atomic-scatters.
baracuda_kernels_adaptive_avg_pool_f16_fw_can_implement^⚠: Implementability check for adaptive_avg_pool_f16_fw.
baracuda_kernels_adaptive_avg_pool_f16_fw_run^⚠: Adaptive AvgPool FW, f16. Rank-agnostic (spatial_rank ∈ {1,2,3}).
baracuda_kernels_adaptive_avg_pool_f32_bw_can_implement^⚠: Implementability check for adaptive_avg_pool_f32_bw.
baracuda_kernels_adaptive_avg_pool_f32_bw_run^⚠: Adaptive AvgPool BW, f32.
baracuda_kernels_adaptive_avg_pool_f32_fw_can_implement^⚠: Implementability check for adaptive_avg_pool_f32_fw.
baracuda_kernels_adaptive_avg_pool_f32_fw_run^⚠: Adaptive AvgPool FW, f32.
baracuda_kernels_adaptive_avg_pool_f64_bw_can_implement^⚠: Implementability check for adaptive_avg_pool_f64_bw.
baracuda_kernels_adaptive_avg_pool_f64_bw_run^⚠: Adaptive AvgPool BW, f64.
baracuda_kernels_adaptive_avg_pool_f64_fw_can_implement^⚠: Implementability check for adaptive_avg_pool_f64_fw.
baracuda_kernels_adaptive_avg_pool_f64_fw_run^⚠: Adaptive AvgPool FW, f64.
baracuda_kernels_adaptive_max_pool_bf16_bw_can_implement^⚠: Implementability check for adaptive_max_pool_bf16_bw.
baracuda_kernels_adaptive_max_pool_bf16_bw_run^⚠: Adaptive MaxPool BW, bf16.
baracuda_kernels_adaptive_max_pool_bf16_fw_can_implement^⚠: Implementability check for adaptive_max_pool_bf16_fw.
baracuda_kernels_adaptive_max_pool_bf16_fw_run^⚠: Adaptive MaxPool FW, bf16.
baracuda_kernels_adaptive_max_pool_f16_bw_can_implement^⚠: Implementability check for adaptive_max_pool_f16_bw.
baracuda_kernels_adaptive_max_pool_f16_bw_run^⚠: Adaptive MaxPool BW, f16. Recomputes the per-window argmax from the saved x, zeros dx internally, then atomic-scatters dy into the argmax positions.
baracuda_kernels_adaptive_max_pool_f16_fw_can_implement^⚠: Implementability check for adaptive_max_pool_f16_fw.
baracuda_kernels_adaptive_max_pool_f16_fw_run^⚠: Adaptive MaxPool FW, f16. Writes y only. The matching BW recomputes the argmax internally from the saved x (keeps the Phase 11.8 args shape; no separate indices tensor).
baracuda_kernels_adaptive_max_pool_f32_bw_can_implement^⚠: Implementability check for adaptive_max_pool_f32_bw.
baracuda_kernels_adaptive_max_pool_f32_bw_run^⚠: Adaptive MaxPool BW, f32.
baracuda_kernels_adaptive_max_pool_f32_fw_can_implement^⚠: Implementability check for adaptive_max_pool_f32_fw.
baracuda_kernels_adaptive_max_pool_f32_fw_run^⚠: Adaptive MaxPool FW, f32.
baracuda_kernels_adaptive_max_pool_f64_bw_can_implement^⚠: Implementability check for adaptive_max_pool_f64_bw.
baracuda_kernels_adaptive_max_pool_f64_bw_run^⚠: Adaptive MaxPool BW, f64.
baracuda_kernels_adaptive_max_pool_f64_fw_can_implement^⚠: Implementability check for adaptive_max_pool_f64_fw.
baracuda_kernels_adaptive_max_pool_f64_fw_run^⚠: Adaptive MaxPool FW, f64.
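
The adaptive MaxPool entries describe a backward pass that recomputes the argmax from the saved x, zeros dx, and scatters dy. A 1-D CPU reference of that scheme might look like the sketch below. The window bounds use the PyTorch-style floor/ceil rule and ties go to the first maximum; both are assumptions, since this page does not state them.

```rust
/// Half-open input window [start, end) for output index `o`.
/// PyTorch-style bounds; an assumption, not taken from this crate.
fn window(o: usize, in_len: usize, out_len: usize) -> (usize, usize) {
    let start = o * in_len / out_len;
    let end = ((o + 1) * in_len + out_len - 1) / out_len; // ceiling division
    (start, end)
}

/// Forward: y[o] is the maximum over window o. Writes y only.
pub fn adaptive_max_pool_1d_fw(x: &[f32], out_len: usize) -> Vec<f32> {
    (0..out_len)
        .map(|o| {
            let (s, e) = window(o, x.len(), out_len);
            x[s..e].iter().copied().fold(f32::NEG_INFINITY, f32::max)
        })
        .collect()
}

/// Backward: recompute each window's argmax from x, then add dy[o] there.
pub fn adaptive_max_pool_1d_bw(x: &[f32], dy: &[f32]) -> Vec<f32> {
    let mut dx = vec![0.0f32; x.len()]; // dx is zeroed before scattering
    for (o, &g) in dy.iter().enumerate() {
        let (s, e) = window(o, x.len(), dy.len());
        let mut best = s;
        for i in s + 1..e {
            if x[i] > x[best] {
                best = i;
            }
        }
        // Windows can overlap, so contributions accumulate; a GPU kernel
        // needs atomics for this add.
        dx[best] += g;
    }
    dx
}
```

With x = [1, 2, 9, 3, 1] and two outputs, both windows select index 2, which shows why the scatter has to accumulate.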
baracuda_kernels_affine_bf16_can_implement^⚠: Implementability check for affine_bf16. Host-side only.
baracuda_kernels_affine_bf16_run^⚠: Affine y = a*x + b, bf16 storage / f32 compute. a / b arrive as f32.
baracuda_kernels_affine_bf16_strided_can_implement^⚠: Implementability check for affine_bf16_strided.
baracuda_kernels_affine_bf16_strided_run^⚠: Strided affine y = a*x + b, bf16 storage / f32 compute. a / b arrive as f32.
baracuda_kernels_affine_f16_can_implement^⚠: Implementability check for affine_f16. Host-side only.
baracuda_kernels_affine_f16_run^⚠: Affine y = a*x + b, f16 storage / f32 compute. a / b arrive as f32.
baracuda_kernels_affine_f16_strided_can_implement^⚠: Implementability check for affine_f16_strided.
baracuda_kernels_affine_f16_strided_run^⚠: Strided affine y = a*x + b, f16 storage / f32 compute. a / b arrive as f32.
baracuda_kernels_affine_f32_can_implement^⚠: Implementability check for affine_f32. Host-side only.
baracuda_kernels_affine_f32_run^⚠: Affine y = a*x + b, f32 dtype.
baracuda_kernels_affine_f32_strided_can_implement^⚠: Implementability check for affine_f32_strided.
baracuda_kernels_affine_f32_strided_run^⚠: Strided affine y = a*x + b, f32 dtype.
baracuda_kernels_affine_f64_can_implement^⚠: Implementability check for affine_f64. Host-side only.
baracuda_kernels_affine_f64_run^⚠: Affine y = a*x + b, f64 dtype.
baracuda_kernels_affine_f64_strided_can_implement^⚠: Implementability check for affine_f64_strided.
baracuda_kernels_affine_f64_strided_run^⚠: Strided affine y = a*x + b, f64 dtype.
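
"bf16 storage / f32 compute" means each element is widened to f32, the affine is evaluated in f32, and the result is rounded back to bf16. A CPU sketch of that pattern follows, operating on raw bf16 bit patterns (u16). The round-to-nearest-even conversion is an assumption about the kernel's rounding mode, and the function names are illustrative.

```rust
/// Round an f32 to bf16 bits using round-to-nearest-even.
pub fn f32_to_bf16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // Keep the value a NaN after truncation by forcing a mantissa bit.
        return ((bits >> 16) as u16) | 0x0040;
    }
    // Add 0x7FFF plus the lowest kept bit so that ties round to even.
    let round = 0x7FFF + ((bits >> 16) & 1);
    (bits.wrapping_add(round) >> 16) as u16
}

/// Widen bf16 bits to f32 (exact: bf16 is the top half of an f32).
pub fn bf16_bits_to_f32(b: u16) -> f32 {
    f32::from_bits((b as u32) << 16)
}

/// y = a*x + b with bf16 storage and f32 compute; a and b arrive as f32.
pub fn affine_bf16(x: &[u16], a: f32, b: f32, y: &mut [u16]) {
    assert_eq!(x.len(), y.len());
    for (yi, &xi) in y.iter_mut().zip(x) {
        *yi = f32_to_bf16_bits(a * bf16_bits_to_f32(xi) + b);
    }
}
```

Keeping a and b in f32 avoids rounding the scalars to bf16 before the multiply-add, so only the final store loses precision.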
baracuda_kernels_affine_grid_2d_f32_can_implement^⚠: Implementability check for affine_grid_2d_f32.
baracuda_kernels_affine_grid_2d_f32_run^⚠: affine_grid(theta, size): produces a [N, OH, OW, 2] grid from theta of shape [N, 2, 3]. f32. Safety: as above.
baracuda_kernels_affine_grid_2d_f64_can_implement^⚠: Implementability check for affine_grid_2d_f64.
baracuda_kernels_affine_grid_2d_f64_run^⚠: affine_grid_2d, f64. Safety: as f32.
baracuda_kernels_affine_i8_can_implement^⚠: Implementability check for affine_i8. Host-side only.
baracuda_kernels_affine_i8_run^⚠: Affine y = a*x + b, i8 dtype.
baracuda_kernels_affine_i32_can_implement^⚠: Implementability check for affine_i32. Host-side only.
baracuda_kernels_affine_i32_run^⚠: Affine y = a*x + b, i32 dtype.
baracuda_kernels_affine_i32_strided_can_implement^⚠: Implementability check for affine_i32_strided.
baracuda_kernels_affine_i32_strided_run^⚠: Strided affine y = a*x + b, i32 dtype.
baracuda_kernels_affine_i64_can_implement^⚠: Implementability check for affine_i64. Host-side only.
baracuda_kernels_affine_i64_run^⚠: Affine y = a*x + b, i64 dtype.
baracuda_kernels_affine_i64_strided_can_implement^⚠: Implementability check for affine_i64_strided.
baracuda_kernels_affine_i64_strided_run^⚠: Strided affine y = a*x + b, i64 dtype.
baracuda_kernels_affine_inplace_bf16_can_implement^⚠: Implementability check for baracuda_kernels_affine_inplace_bf16. Host-side only.
baracuda_kernels_affine_inplace_bf16_run^⚠: In-place affine y = scale * y + offset (bf16). Phase 61 — added for Fuel’s INPLACE_AFFINE op family completion (bf16/f16 weight-decay scaling, Op::AddScalar / Op::MulScalar on bf16 model weights).
baracuda_kernels_affine_inplace_bf16_strided_can_implement^⚠: Implementability check for baracuda_kernels_affine_inplace_bf16_strided. Host-side only.
baracuda_kernels_affine_inplace_bf16_strided_run^⚠: Strided in-place affine (bf16; f32 scalars). Phase 62.
baracuda_kernels_affine_inplace_f16_can_implement^⚠: Implementability check for baracuda_kernels_affine_inplace_f16. Host-side only.
baracuda_kernels_affine_inplace_f16_run^⚠: In-place affine y = scale * y + offset (f16). Phase 61 — added for Fuel’s INPLACE_AFFINE op family completion.
baracuda_kernels_affine_inplace_f16_strided_can_implement^⚠: Implementability check for baracuda_kernels_affine_inplace_f16_strided. Host-side only.
baracuda_kernels_affine_inplace_f16_strided_run^⚠: Strided in-place affine (f16; f32 scalars). Phase 62.
baracuda_kernels_affine_inplace_f32_can_implement^⚠: Implementability check for baracuda_kernels_affine_inplace_f32. Host-side only.
baracuda_kernels_affine_inplace_f32_run^⚠: In-place affine y = scale * y + offset (f32). Used by the safe-plan layer to remap a cuRAND uniform-(0, 1] buffer into Uniform(low, high].
baracuda_kernels_affine_inplace_f32_strided_can_implement^⚠: Implementability check for baracuda_kernels_affine_inplace_f32_strided. Host-side only.
baracuda_kernels_affine_inplace_f32_strided_run^⚠: In-place affine y[off] = scale * y[off] + offset over a strided view (f32). Phase 62.
baracuda_kernels_affine_inplace_f64_can_implement^⚠: Implementability check for baracuda_kernels_affine_inplace_f64. Host-side only.
baracuda_kernels_affine_inplace_f64_run^⚠: In-place affine y = scale * y + offset (f64).
baracuda_kernels_affine_inplace_f64_strided_can_implement^⚠: Implementability check for baracuda_kernels_affine_inplace_f64_strided. Host-side only.
baracuda_kernels_affine_inplace_f64_strided_run^⚠: Strided in-place affine (f64). Phase 62.
baracuda_kernels_affine_inplace_i8_can_implement^⚠: Implementability check for baracuda_kernels_affine_inplace_i8. Host-side only.
baracuda_kernels_affine_inplace_i8_run^⚠: In-place affine y = scale * y + offset (i8). Phase 62.
baracuda_kernels_affine_inplace_i32_can_implement^⚠: Implementability check for baracuda_kernels_affine_inplace_i32. Host-side only.
baracuda_kernels_affine_inplace_i32_run^⚠: In-place affine y = scale * y + offset (i32). Phase 62.
baracuda_kernels_affine_inplace_i32_strided_can_implement^⚠: Implementability check for baracuda_kernels_affine_inplace_i32_strided. Host-side only.
baracuda_kernels_affine_inplace_i32_strided_run^⚠: Strided in-place affine (i32). Phase 62.
baracuda_kernels_affine_inplace_i64_can_implement^⚠: Implementability check for baracuda_kernels_affine_inplace_i64. Host-side only.
baracuda_kernels_affine_inplace_i64_run^⚠: In-place affine y = scale * y + offset (i64). Phase 62.
baracuda_kernels_affine_inplace_i64_strided_can_implement^⚠: Implementability check for baracuda_kernels_affine_inplace_i64_strided. Host-side only.
baracuda_kernels_affine_inplace_i64_strided_run^⚠: Strided in-place affine (i64). Phase 62.
baracuda_kernels_affine_inplace_u8_can_implement^⚠: Implementability check for baracuda_kernels_affine_inplace_u8. Host-side only.
baracuda_kernels_affine_inplace_u8_run^⚠: In-place affine y = scale * y + offset (u8). Phase 62.
baracuda_kernels_affine_inplace_u8_strided_can_implement^⚠: Implementability check for baracuda_kernels_affine_inplace_u8_strided. Host-side only.
baracuda_kernels_affine_inplace_u8_strided_run^⚠: Strided in-place affine (u8). Phase 62.
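
The f32 in-place affine is described as the step that remaps a cuRAND uniform-(0, 1] buffer into Uniform(low, high]. One natural choice of scalars, assumed here rather than taken from the safe-plan source, is scale = high - low and offset = low. A CPU sketch:

```rust
/// In-place affine y = scale * y + offset.
pub fn affine_inplace(y: &mut [f32], scale: f32, offset: f32) {
    for v in y.iter_mut() {
        *v = scale * *v + offset;
    }
}

/// Remap samples from (0, 1] into (low, high].
/// Assumes scale = high - low and offset = low.
pub fn remap_uniform(y: &mut [f32], low: f32, high: f32) {
    affine_inplace(y, high - low, low);
}
```

With this choice a sample of exactly 1.0 maps to high, while low is only approached, matching the half-open interval.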
baracuda_kernels_affine_u8_can_implement^⚠: Implementability check for affine_u8. Host-side only.
baracuda_kernels_affine_u8_run^⚠: Affine y = a*x + b, u8 dtype.
baracuda_kernels_affine_u8_strided_can_implement^⚠: Implementability check for affine_u8_strided.
baracuda_kernels_affine_u8_strided_run^⚠: Strided affine y = a*x + b, u8 dtype.
baracuda_kernels_alibi_backward_bf16_can_implement^⚠: Implementability check for alibi_backward_bf16. Host-side only.
baracuda_kernels_alibi_backward_bf16_run^⚠: ALiBi BW, bf16.
baracuda_kernels_alibi_backward_f16_can_implement^⚠: Implementability check for alibi_backward_f16. Host-side only.
baracuda_kernels_alibi_backward_f16_run^⚠: ALiBi BW, f16.
baracuda_kernels_alibi_backward_f32_can_implement^⚠: Implementability check for alibi_backward_f32. Host-side only.
baracuda_kernels_alibi_backward_f32_run^⚠: ALiBi BW, f32. da[b, h, i, j] = dy[b, h, i, j] (pass-through); dslope[h] = Σ_{b, i, j} dy[b, h, i, j] · (j - i). Either da or dslope may be null to skip; both null is rejected.
baracuda_kernels_alibi_backward_f64_can_implement^⚠: Implementability check for alibi_backward_f64. Host-side only.
baracuda_kernels_alibi_backward_f64_run^⚠: ALiBi BW, f64.
baracuda_kernels_alibi_bf16_can_implement^⚠: Implementability check for alibi_bf16. Host-side only.
baracuda_kernels_alibi_bf16_run^⚠: ALiBi FW, bf16.
baracuda_kernels_alibi_f16_can_implement^⚠: Implementability check for alibi_f16. Host-side only.
baracuda_kernels_alibi_f16_run^⚠: ALiBi FW, f16.
baracuda_kernels_alibi_f32_can_implement^⚠: Implementability check for alibi_f32. Host-side only.
baracuda_kernels_alibi_f32_run^⚠: ALiBi FW, f32. y[b, h, i, j] = scores[b, h, i, j] + slopes[h] · (j - i).
baracuda_kernels_alibi_f64_can_implement^⚠: Implementability check for alibi_f64. Host-side only.
baracuda_kernels_alibi_f64_run^⚠: ALiBi FW, f64.
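
The ALiBi formulas given for the f32 variants can be checked against a small CPU reference. The sketch below assumes contiguous row-major [B, H, Sq, Sk] tensors (the memory layout is an assumption) and uses illustrative function names.

```rust
/// Forward: y[b, h, i, j] = scores[b, h, i, j] + slopes[h] * (j - i).
pub fn alibi_fw(scores: &[f32], slopes: &[f32], b: usize, h: usize, sq: usize, sk: usize) -> Vec<f32> {
    assert_eq!(scores.len(), b * h * sq * sk);
    assert_eq!(slopes.len(), h);
    let mut y = vec![0.0f32; scores.len()];
    for bi in 0..b {
        for hi in 0..h {
            for i in 0..sq {
                for j in 0..sk {
                    let idx = ((bi * h + hi) * sq + i) * sk + j;
                    y[idx] = scores[idx] + slopes[hi] * (j as f32 - i as f32);
                }
            }
        }
    }
    y
}

/// Backward for the slopes: dslope[h] = sum over b, i, j of dy * (j - i).
/// The gradient for the scores is dy itself (pass-through), so it is omitted.
pub fn alibi_bw_dslope(dy: &[f32], b: usize, h: usize, sq: usize, sk: usize) -> Vec<f32> {
    assert_eq!(dy.len(), b * h * sq * sk);
    let mut dslope = vec![0.0f32; h];
    for bi in 0..b {
        for hi in 0..h {
            for i in 0..sq {
                for j in 0..sk {
                    let idx = ((bi * h + hi) * sq + i) * sk + j;
                    dslope[hi] += dy[idx] * (j as f32 - i as f32);
                }
            }
        }
    }
    dslope
}
```

Entries on the diagonal (i = j) receive no bias and contribute nothing to dslope, which makes a handy sanity check.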
baracuda_kernels_apply_token_penalty_f32_can_implement^⚠: Implementability check for apply_token_penalty_f32.
baracuda_kernels_apply_token_penalty_f32_run^⚠: Apply token penalty, f32.
baracuda_kernels_arg_reduce_argmax_bf16_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_bf16.
baracuda_kernels_arg_reduce_argmax_bf16_i32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_bf16_i32.
baracuda_kernels_arg_reduce_argmax_bf16_i32_run^⚠: argmax(x, axis=k), bf16 input, i32 output.
baracuda_kernels_arg_reduce_argmax_bf16_run^⚠: argmax(x, axis=k), bf16 input, i64 output.
baracuda_kernels_arg_reduce_argmax_bf16_u32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_bf16_u32.
baracuda_kernels_arg_reduce_argmax_bf16_u32_run^⚠: argmax(x, axis=k), bf16 input, u32 output.
baracuda_kernels_arg_reduce_argmax_f16_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_f16.
baracuda_kernels_arg_reduce_argmax_f16_i32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_f16_i32.
baracuda_kernels_arg_reduce_argmax_f16_i32_run^⚠: argmax(x, axis=k), f16 input, i32 output.
baracuda_kernels_arg_reduce_argmax_f16_run^⚠: argmax(x, axis=k), f16 input, i64 output.
baracuda_kernels_arg_reduce_argmax_f16_u32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_f16_u32.
baracuda_kernels_arg_reduce_argmax_f16_u32_run^⚠: argmax(x, axis=k), f16 input, u32 output.
baracuda_kernels_arg_reduce_argmax_f32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_f32.
baracuda_kernels_arg_reduce_argmax_f32_i32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_f32_i32.
baracuda_kernels_arg_reduce_argmax_f32_i32_run^⚠: argmax(x, axis=k), f32 input, i32 output.
baracuda_kernels_arg_reduce_argmax_f32_run^⚠: argmax(x, axis=k), f32 input, i64 output. Ties broken by first occurrence (smallest index wins).
baracuda_kernels_arg_reduce_argmax_f32_u32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_f32_u32.
baracuda_kernels_arg_reduce_argmax_f32_u32_run^⚠: argmax(x, axis=k), f32 input, u32 output.
baracuda_kernels_arg_reduce_argmax_f64_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_f64.
baracuda_kernels_arg_reduce_argmax_f64_i32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_f64_i32.
baracuda_kernels_arg_reduce_argmax_f64_i32_run^⚠: argmax(x, axis=k), f64 input, i32 output.
baracuda_kernels_arg_reduce_argmax_f64_run^⚠: argmax(x, axis=k), f64 input, i64 output.
baracuda_kernels_arg_reduce_argmax_f64_u32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_f64_u32.
baracuda_kernels_arg_reduce_argmax_f64_u32_run^⚠: argmax(x, axis=k), f64 input, u32 output.
baracuda_kernels_arg_reduce_argmax_i8_i32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_i8_i32.
baracuda_kernels_arg_reduce_argmax_i8_i32_run^⚠: argmax(x, axis=k) i8 input, i32 idx output.
baracuda_kernels_arg_reduce_argmax_i8_i64_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_i8_i64.
baracuda_kernels_arg_reduce_argmax_i8_i64_run^⚠: argmax(x, axis=k) i8 input, i64 idx output.
baracuda_kernels_arg_reduce_argmax_i16_i32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_i16_i32.
baracuda_kernels_arg_reduce_argmax_i16_i32_run^⚠: argmax(x, axis=k) i16 input, i32 idx output.
baracuda_kernels_arg_reduce_argmax_i16_i64_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_i16_i64.
baracuda_kernels_arg_reduce_argmax_i16_i64_run^⚠: argmax(x, axis=k) i16 input, i64 idx output.
baracuda_kernels_arg_reduce_argmax_i32_i32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_i32_i32.
baracuda_kernels_arg_reduce_argmax_i32_i32_run^⚠: argmax(x, axis=k) i32 input, i32 idx output.
baracuda_kernels_arg_reduce_argmax_i32_i64_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_i32_i64.
baracuda_kernels_arg_reduce_argmax_i32_i64_run^⚠: argmax(x, axis=k) i32 input, i64 idx output.
baracuda_kernels_arg_reduce_argmax_i64_i32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_i64_i32.
baracuda_kernels_arg_reduce_argmax_i64_i32_run^⚠: argmax(x, axis=k) i64 input, i32 idx output.
baracuda_kernels_arg_reduce_argmax_i64_i64_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_i64_i64.
baracuda_kernels_arg_reduce_argmax_i64_i64_run^⚠: argmax(x, axis=k) i64 input, i64 idx output.
baracuda_kernels_arg_reduce_argmax_u8_i32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_u8_i32.
baracuda_kernels_arg_reduce_argmax_u8_i32_run^⚠: argmax(x, axis=k) u8 input, i32 idx output.
baracuda_kernels_arg_reduce_argmax_u8_i64_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_u8_i64.
baracuda_kernels_arg_reduce_argmax_u8_i64_run^⚠: argmax(x, axis=k) u8 input, i64 idx output.
baracuda_kernels_arg_reduce_argmax_u32_i32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_u32_i32.
baracuda_kernels_arg_reduce_argmax_u32_i32_run^⚠: argmax(x, axis=k) u32 input, i32 idx output.
baracuda_kernels_arg_reduce_argmax_u32_i64_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmax_u32_i64.
baracuda_kernels_arg_reduce_argmax_u32_i64_run^⚠: argmax(x, axis=k) u32 input, i64 idx output.
baracuda_kernels_arg_reduce_argmin_bf16_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_bf16.
baracuda_kernels_arg_reduce_argmin_bf16_i32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_bf16_i32.
baracuda_kernels_arg_reduce_argmin_bf16_i32_run^⚠: argmin(x, axis=k), bf16 input, i32 output.
baracuda_kernels_arg_reduce_argmin_bf16_run^⚠: argmin(x, axis=k), bf16 input, i64 output.
baracuda_kernels_arg_reduce_argmin_bf16_u32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_bf16_u32.
baracuda_kernels_arg_reduce_argmin_bf16_u32_run^⚠: argmin(x, axis=k), bf16 input, u32 output.
baracuda_kernels_arg_reduce_argmin_f16_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_f16.
baracuda_kernels_arg_reduce_argmin_f16_i32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_f16_i32.
baracuda_kernels_arg_reduce_argmin_f16_i32_run^⚠: argmin(x, axis=k), f16 input, i32 output.
baracuda_kernels_arg_reduce_argmin_f16_run^⚠: argmin(x, axis=k), f16 input, i64 output.
baracuda_kernels_arg_reduce_argmin_f16_u32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_f16_u32.
baracuda_kernels_arg_reduce_argmin_f16_u32_run^⚠: argmin(x, axis=k), f16 input, u32 output.
baracuda_kernels_arg_reduce_argmin_f32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_f32.
baracuda_kernels_arg_reduce_argmin_f32_i32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_f32_i32.
baracuda_kernels_arg_reduce_argmin_f32_i32_run^⚠: argmin(x, axis=k), f32 input, i32 output.
baracuda_kernels_arg_reduce_argmin_f32_run^⚠: argmin(x, axis=k), f32 input, i64 output.
baracuda_kernels_arg_reduce_argmin_f32_u32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_f32_u32.
baracuda_kernels_arg_reduce_argmin_f32_u32_run^⚠: argmin(x, axis=k), f32 input, u32 output.
baracuda_kernels_arg_reduce_argmin_f64_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_f64.
baracuda_kernels_arg_reduce_argmin_f64_i32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_f64_i32.
baracuda_kernels_arg_reduce_argmin_f64_i32_run^⚠: argmin(x, axis=k), f64 input, i32 output.
baracuda_kernels_arg_reduce_argmin_f64_run^⚠: argmin(x, axis=k), f64 input, i64 output.
baracuda_kernels_arg_reduce_argmin_f64_u32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_f64_u32.
baracuda_kernels_arg_reduce_argmin_f64_u32_run^⚠: argmin(x, axis=k), f64 input, u32 output.
baracuda_kernels_arg_reduce_argmin_i8_i32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_i8_i32.
baracuda_kernels_arg_reduce_argmin_i8_i32_run^⚠: argmin(x, axis=k) i8 input, i32 idx output.
baracuda_kernels_arg_reduce_argmin_i8_i64_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_i8_i64.
baracuda_kernels_arg_reduce_argmin_i8_i64_run^⚠: argmin(x, axis=k) i8 input, i64 idx output.
baracuda_kernels_arg_reduce_argmin_i16_i32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_i16_i32.
baracuda_kernels_arg_reduce_argmin_i16_i32_run^⚠: argmin(x, axis=k) i16 input, i32 idx output.
baracuda_kernels_arg_reduce_argmin_i16_i64_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_i16_i64.
baracuda_kernels_arg_reduce_argmin_i16_i64_run^⚠: argmin(x, axis=k) i16 input, i64 idx output.
baracuda_kernels_arg_reduce_argmin_i32_i32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_i32_i32.
baracuda_kernels_arg_reduce_argmin_i32_i32_run^⚠: argmin(x, axis=k) i32 input, i32 idx output.
baracuda_kernels_arg_reduce_argmin_i32_i64_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_i32_i64.
baracuda_kernels_arg_reduce_argmin_i32_i64_run^⚠: argmin(x, axis=k) i32 input, i64 idx output.
baracuda_kernels_arg_reduce_argmin_i64_i32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_i64_i32.
baracuda_kernels_arg_reduce_argmin_i64_i32_run^⚠: argmin(x, axis=k) i64 input, i32 idx output.
baracuda_kernels_arg_reduce_argmin_i64_i64_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_i64_i64.
baracuda_kernels_arg_reduce_argmin_i64_i64_run^⚠: argmin(x, axis=k) i64 input, i64 idx output.
baracuda_kernels_arg_reduce_argmin_u8_i32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_u8_i32.
baracuda_kernels_arg_reduce_argmin_u8_i32_run^⚠: argmin(x, axis=k) u8 input, i32 idx output.
baracuda_kernels_arg_reduce_argmin_u8_i64_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_u8_i64.
baracuda_kernels_arg_reduce_argmin_u8_i64_run^⚠: argmin(x, axis=k) u8 input, i64 idx output.
baracuda_kernels_arg_reduce_argmin_u32_i32_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_u32_i32.
baracuda_kernels_arg_reduce_argmin_u32_i32_run^⚠: argmin(x, axis=k) u32 input, i32 idx output.
baracuda_kernels_arg_reduce_argmin_u32_i64_can_implement^⚠: Pre-launch implementability check for arg_reduce_argmin_u32_i64.
baracuda_kernels_arg_reduce_argmin_u32_i64_run^⚠: argmin(x, axis=k) u32 input, i64 idx output.
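
The f32 argmax entry states that ties are broken by first occurrence. A CPU reference for that rule over the last axis is sketched below; the names are illustrative, and NaN ordering is not specified on this page, so the sketch simply never lets a NaN replace the current best.

```rust
/// Index of the maximum of one row; the smallest index wins on ties.
pub fn argmax_first(row: &[f32]) -> Option<i64> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in row.iter().enumerate() {
        match best {
            // Strict comparison: an equal value never displaces an earlier one.
            Some((_, bv)) if !(v > bv) => {}
            _ => best = Some((i, v)),
        }
    }
    best.map(|(i, _)| i as i64)
}

/// Row-wise argmax over a contiguous [rows, row_len] buffer, i64 indices.
pub fn argmax_rows(x: &[f32], row_len: usize) -> Vec<i64> {
    assert!(row_len > 0 && x.len() % row_len == 0);
    x.chunks_exact(row_len)
        .filter_map(|row| argmax_first(row))
        .collect()
}
```

The i32 and u32 output variants differ only in the index type written, so the same comparison logic would apply with a narrower cast.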
baracuda_kernels_argsort_bf16_can_implement^⚠: baracuda_kernels_argsort_bf16_can_implement (baracuda kernels argsort bf16 can implement).
baracuda_kernels_argsort_bf16_run^⚠: Block-bitonic argsort, bf16. Comparator uses native __nv_bfloat16 operator< (CUDA device-side intrinsics).
baracuda_kernels_argsort_f16_can_implement^⚠: baracuda_kernels_argsort_f16_can_implement (baracuda kernels argsort f16 can implement).
baracuda_kernels_argsort_f16_run^⚠: Block-bitonic argsort, f16. Comparator uses native __half operator<.
baracuda_kernels_argsort_f32_big_can_implement^⚠: baracuda_kernels_argsort_f32_big_can_implement (baracuda kernels argsort f32 big can implement).
baracuda_kernels_argsort_f32_big_run^⚠: Multi-block radix argsort, f32, for row_len > 1024.
baracuda_kernels_argsort_f32_big_workspace_size^⚠: baracuda_kernels_argsort_f32_big_workspace_size (baracuda kernels argsort f32 big workspace size).
baracuda_kernels_argsort_f32_can_implement^⚠: baracuda_kernels_argsort_f32_can_implement (baracuda kernels argsort f32 can implement).
baracuda_kernels_argsort_f32_run^⚠: Block-bitonic argsort, f32. Returns indices; values not written.
baracuda_kernels_argsort_f64_big_can_implement^⚠: baracuda_kernels_argsort_f64_big_can_implement (baracuda kernels argsort f64 big can implement).
baracuda_kernels_argsort_f64_big_run^⚠: Multi-block radix argsort, f64.
baracuda_kernels_argsort_f64_big_workspace_size^⚠: baracuda_kernels_argsort_f64_big_workspace_size (baracuda kernels argsort f64 big workspace size).
baracuda_kernels_argsort_f64_can_implement^⚠: baracuda_kernels_argsort_f64_can_implement (baracuda kernels argsort f64 can implement).
baracuda_kernels_argsort_f64_run^⚠: Block-bitonic argsort, f64.
baracuda_kernels_argsort_fp8e4m3_can_implement^⚠: baracuda_kernels_argsort_fp8e4m3_can_implement (baracuda kernels argsort fp8e4m3 can implement).
baracuda_kernels_argsort_fp8e4m3_run^⚠: Block-bitonic argsort, FP8 E4M3. Storage is byte-identical to raw u8; the kernel wraps it in an Fp8E4M3Sort struct that decodes to float in the comparator. Raw-byte buffer in, i32 index buffer out.
baracuda_kernels_argsort_i8_can_implement^⚠: baracuda_kernels_argsort_i8_can_implement (baracuda kernels argsort i8 can implement).
baracuda_kernels_argsort_i8_run^⚠: Block-bitonic argsort, i8.
baracuda_kernels_argsort_i16_can_implement^⚠: baracuda_kernels_argsort_i16_can_implement (baracuda kernels argsort i16 can implement).
baracuda_kernels_argsort_i16_run^⚠: Block-bitonic argsort, i16.
baracuda_kernels_argsort_i32_big_can_implement^⚠: Pre-launch implementability check for argsort_i32_big.
baracuda_kernels_argsort_i32_big_run^⚠: Multi-block radix argsort, i32.
baracuda_kernels_argsort_i32_big_workspace_size^⚠: Workspace-size query for argsort_i32_big.
baracuda_kernels_argsort_i32_can_implement^⚠: Pre-launch implementability check for argsort_i32.
baracuda_kernels_argsort_i32_run^⚠: Block-bitonic argsort, i32.
baracuda_kernels_argsort_i64_big_can_implement^⚠: Pre-launch implementability check for argsort_i64_big.
baracuda_kernels_argsort_i64_big_run^⚠: Multi-block radix argsort, i64.
baracuda_kernels_argsort_i64_big_workspace_size^⚠: Workspace-size query for argsort_i64_big.
baracuda_kernels_argsort_i64_can_implement^⚠: Pre-launch implementability check for argsort_i64.
baracuda_kernels_argsort_i64_run^⚠: Block-bitonic argsort, i64.
baracuda_kernels_argsort_u8_can_implement^⚠: Pre-launch implementability check for argsort_u8.
baracuda_kernels_argsort_u8_run^⚠: Block-bitonic argsort, u8.
baracuda_kernels_argsort_u32_can_implement^⚠: Pre-launch implementability check for argsort_u32.
baracuda_kernels_argsort_u32_run^⚠: Block-bitonic argsort, u32.
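The argsort launchers above return indices only and leave the values untouched. A host-side reference can be useful for checking the index buffer they produce. The sketch below is a minimal one under stated assumptions: the function name argsort_rows_f32 is hypothetical, and ascending order, tightly packed row-major rows, and f32::total_cmp as the comparator are all assumptions that the listing does not confirm.

```rust
/// CPU reference for a row-wise argsort that returns i32 indices only.
/// Assumptions (not confirmed by the listing): ascending order, tightly
/// packed rows of length `row_len`, and `f32::total_cmp` as the comparator.
pub fn argsort_rows_f32(values: &[f32], row_len: usize) -> Vec<i32> {
    assert!(row_len > 0 && values.len() % row_len == 0);
    let mut out = Vec::with_capacity(values.len());
    for row in values.chunks(row_len) {
        // Start from the identity permutation and sort it by the row values.
        let mut idx: Vec<i32> = (0..row_len as i32).collect();
        idx.sort_by(|&i, &j| row[i as usize].total_cmp(&row[j as usize]));
        out.extend(idx);
    }
    out
}
```

The host sort here is stable, whereas a bitonic network generally is not, so rows containing equal keys may legitimately differ in the order of the tied indices.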
baracuda_kernels_batch_norm_backward_bf16_can_implement^⚠: Pre-launch implementability check for batch_norm_backward_bf16.
baracuda_kernels_batch_norm_backward_bf16_run^⚠: BatchNorm BW, bf16.
baracuda_kernels_batch_norm_backward_f16_can_implement^⚠: Pre-launch implementability check for batch_norm_backward_f16.
baracuda_kernels_batch_norm_backward_f16_run^⚠: BatchNorm BW, f16.
baracuda_kernels_batch_norm_backward_f32_can_implement^⚠: Pre-launch implementability check for batch_norm_backward_f32.
baracuda_kernels_batch_norm_backward_f32_run^⚠: BatchNorm BW, f32. Three-stage: per-group sum_dxh / sum_dxhxh, per-cell dx, per-channel dgamma / dbeta. Requires workspace of 2 * group_count * sizeof(float) bytes for the stage-1 partial sums (group_count = c_extent for BN).
baracuda_kernels_batch_norm_backward_f64_can_implement^⚠: Pre-launch implementability check for batch_norm_backward_f64.
baracuda_kernels_batch_norm_backward_f64_run^⚠: BatchNorm BW, f64.
baracuda_kernels_batch_norm_bf16_can_implement^⚠: Pre-launch implementability check for batch_norm_bf16.
baracuda_kernels_batch_norm_bf16_run^⚠: BatchNorm FW, bf16.
baracuda_kernels_batch_norm_f16_can_implement^⚠: Pre-launch implementability check for batch_norm_f16.
baracuda_kernels_batch_norm_f16_run^⚠: BatchNorm FW, f16.
baracuda_kernels_batch_norm_f32_can_implement^⚠: Pre-launch implementability check for batch_norm_f32.
baracuda_kernels_batch_norm_f32_run^⚠: BatchNorm FW, f32. Training mode: computes per-channel (mean, inv_std) from the batch + spatial cells, writes them to saved_mean / saved_rstd for BW. gamma / beta optional (both supplied together per PyTorch convention).
baracuda_kernels_batch_norm_f64_can_implement^⚠: Pre-launch implementability check for batch_norm_f64.
baracuda_kernels_batch_norm_f64_run^⚠: BatchNorm FW, f64.
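The batch_norm_backward_f32_run entry above states its workspace requirement as 2 * group_count * sizeof(float) bytes, with group_count = c_extent. A caller allocating that buffer might compute the size as in the sketch below. The helper name is hypothetical and not part of this crate; only the formula comes from the entry.

```rust
use std::mem::size_of;

/// Stage-1 workspace size for the f32 BatchNorm backward launcher, following
/// the formula in the listing: 2 * group_count * sizeof(float), where
/// group_count = c_extent. Returns None if the product overflows usize.
/// The function name is hypothetical.
pub fn batch_norm_backward_f32_workspace_bytes(c_extent: usize) -> Option<usize> {
    c_extent.checked_mul(2)?.checked_mul(size_of::<f32>())
}
```

For 64 channels this gives 512 bytes. Checked arithmetic is a design choice for the sketch: a silently wrapped size would be passed straight to an unsafe launcher.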
baracuda_kernels_batched_ormqr_complex32_can_implement^⚠: Pre-launch implementability check for batched_ormqr_complex32.
baracuda_kernels_batched_ormqr_complex32_run^⚠: Batched-unmqr, Complex32. Same shape/contract as the f32 variant but with cuFloatComplex storage. op = 2 (C — conjugate transpose) is supported; op = 1 (T — plain transpose) is rejected by the Rust safe layer for complex (mathematically unusual for Householder).
baracuda_kernels_batched_ormqr_complex64_can_implement^⚠: Pre-launch implementability check for batched_ormqr_complex64.
baracuda_kernels_batched_ormqr_complex64_run^⚠: Batched-unmqr, Complex64. Same as the complex32 variant with cuDoubleComplex storage.
baracuda_kernels_batched_ormqr_f32_can_implement^⚠: Pre-launch implementability check for batched_ormqr_f32.
baracuda_kernels_batched_ormqr_f32_run^⚠: Batched-ormqr, f32. Applies the implicit Q (or Q^T) from a BatchedQrPlan packed output (A_packed [B, M, K] column-major
baracuda_kernels_batched_ormqr_f64_can_implement^⚠: Pre-launch implementability check for batched_ormqr_f64.
baracuda_kernels_batched_ormqr_f64_run^⚠: Batched-ormqr, f64. Same contract as the f32 variant.
baracuda_kernels_batched_ormqr_wy_build_t_complex32_can_implement^⚠: Pre-launch implementability check for batched_ormqr_wy_build_t_complex32.
baracuda_kernels_batched_ormqr_wy_build_t_complex32_run^⚠: WY block T-build, Complex32. f32-complex analogue of the f32 variant. Storage is cuFloatComplex (== Complex32, ABI-compatible).
baracuda_kernels_batched_ormqr_wy_build_t_complex64_can_implement^⚠: Pre-launch implementability check for batched_ormqr_wy_build_t_complex64.
baracuda_kernels_batched_ormqr_wy_build_t_complex64_run^⚠: WY block T-build, Complex64. f64-complex analogue.
baracuda_kernels_batched_ormqr_wy_build_t_f32_can_implement^⚠: Pre-launch implementability check for batched_ormqr_wy_build_t_f32.
baracuda_kernels_batched_ormqr_wy_build_t_f32_run^⚠: WY block T-build, f32. For each (batch_slot, block_index), builds the [nb, nb] upper-triangular block-reflector matrix T such that H_0 · ... · H_{nb-1} = I - V·T·V^T. One CUDA block per (batch_slot, block_index) cell, i.e. a batch × num_blocks grid. Status codes: 0 success, 2 invalid problem, 5 launch failure.
baracuda_kernels_batched_ormqr_wy_build_t_f64_can_implement^⚠: Pre-launch implementability check for batched_ormqr_wy_build_t_f64.
baracuda_kernels_batched_ormqr_wy_build_t_f64_run^⚠: WY block T-build, f64 analogue.
baracuda_kernels_batched_ormqr_wy_extract_v_complex32_can_implement^⚠: Pre-launch implementability check for batched_ormqr_wy_extract_v_complex32.
baracuda_kernels_batched_ormqr_wy_extract_v_complex32_run^⚠: WY V-extraction, Complex32. f32-complex analogue. Pure copy kernel — sets the implicit-1 (as (1, 0)), zeroes above the diagonal (as (0, 0)), copies the strict lower below.
baracuda_kernels_batched_ormqr_wy_extract_v_complex64_can_implement^⚠: Pre-launch implementability check for batched_ormqr_wy_extract_v_complex64.
baracuda_kernels_batched_ormqr_wy_extract_v_complex64_run^⚠: WY V-extraction, Complex64. f64-complex analogue.
baracuda_kernels_batched_ormqr_wy_extract_v_f32_can_implement^⚠: Pre-launch implementability check for batched_ormqr_wy_extract_v_f32.
baracuda_kernels_batched_ormqr_wy_extract_v_f32_run^⚠: WY V-extraction, f32. Materializes the dense V [B, M, nb] panel for one block of reflectors (starting at block_start, with block_k = min(nb, K - block_start) columns) into a contiguous workspace buffer. Sets the implicit-1 at each reflector’s diagonal, copies the packed-A strict lower below, zeros above the diagonal, and zeros entire columns past block_k (handles the partial-last-block case).
baracuda_kernels_batched_ormqr_wy_extract_v_f64_can_implement^⚠: Pre-launch implementability check for batched_ormqr_wy_extract_v_f64.
baracuda_kernels_batched_ormqr_wy_extract_v_f64_run^⚠: WY V-extraction, f64 analogue.
baracuda_kernels_batched_qr_materialize_identity_f32_can_implement^⚠: Pre-launch implementability check for batched_qr_materialize_identity_f32.
baracuda_kernels_batched_qr_materialize_identity_f32_run^⚠: Stage a column-major identity Q [B, M, M] (one identity per batch slot) into a freshly allocated buffer. Caller then chains baracuda_kernels_batched_ormqr_*_run with op = 0 (N) to overwrite Q in place with the dense Q matrix from the geqrf-packed input. f32.
baracuda_kernels_batched_qr_materialize_identity_f64_can_implement^⚠: Pre-launch implementability check for batched_qr_materialize_identity_f64.
baracuda_kernels_batched_qr_materialize_identity_f64_run^⚠: Stage identity, f64 analogue.
baracuda_kernels_batched_qr_materialize_r_f32_can_implement^⚠: Pre-launch implementability check for batched_qr_materialize_r_f32.
baracuda_kernels_batched_qr_materialize_r_f32_run^⚠: Materialize dense R [B, K, N] from a geqrf-packed A [B, M, N] (column-major). K = min(M, N). Cell R[b, i, j] = A[b, i, j] if i ≤ j, else 0. One CUDA block per (batch_slot, column). f32.
baracuda_kernels_batched_qr_materialize_r_f64_can_implement^⚠: Pre-launch implementability check for batched_qr_materialize_r_f64.
baracuda_kernels_batched_qr_materialize_r_f64_run^⚠: Materialize dense R, f64 analogue.
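The batched_qr_materialize_r_f32_run entry above gives the rule R[b, i, j] = A[b, i, j] if i ≤ j, else 0, with K = min(M, N) and column-major storage. The host-side sketch below applies that rule on the CPU for reference. The function name materialize_r is hypothetical, and the tight packing (leading dimension M for A and K for R, batch strides M*N and K*N) is an assumption that the listing does not state.

```rust
/// CPU reference: dense R [B, K, N] from geqrf-packed A [B, M, N],
/// both column-major, with K = min(M, N).
/// Assumption: tightly packed buffers, so element (i, j) of batch b lives at
/// b*M*N + j*M + i in A and at b*K*N + j*K + i in R.
pub fn materialize_r(a: &[f32], batch: usize, m: usize, n: usize) -> Vec<f32> {
    let k = m.min(n);
    assert_eq!(a.len(), batch * m * n);
    // Everything below the diagonal stays at the initial zero.
    let mut r = vec![0.0f32; batch * k * n];
    for b in 0..batch {
        for j in 0..n {
            // Copy rows with i <= j, limited to the K rows that R has.
            for i in 0..k.min(j + 1) {
                r[b * k * n + j * k + i] = a[b * m * n + j * m + i];
            }
        }
    }
    r
}
```

For a single 3 × 2 matrix with columns [1, 2, 3] and [4, 5, 6], the result is the 2 × 2 upper triangle stored as [1, 0, 4, 5].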
baracuda_kernels_bernoulli_can_implement^⚠: Pre-launch implementability check for bernoulli.
baracuda_kernels_bernoulli_run^⚠: Bernoulli sampling over a float uniform-random buffer.
baracuda_kernels_binary_add_backward_bf16_can_implement^⚠: Pre-launch implementability check for binary_add_backward_bf16.
baracuda_kernels_binary_add_backward_bf16_run^⚠: Add backward, bf16.
baracuda_kernels_binary_add_backward_f16_can_implement^⚠: Pre-launch implementability check for binary_add_backward_f16.
baracuda_kernels_binary_add_backward_f16_run^⚠: Add backward, f16.
baracuda_kernels_binary_add_backward_f32_can_implement^⚠: Pre-launch implementability check for binary_add_backward_f32.
baracuda_kernels_binary_add_backward_f32_run^⚠: Add backward, f32. Writes da = dy and db = dy.
baracuda_kernels_binary_add_backward_f64_can_implement^⚠: Pre-launch implementability check for binary_add_backward_f64.
baracuda_kernels_binary_add_backward_f64_run^⚠: Add backward, f64.
baracuda_kernels_binary_add_bf16_can_implement^⚠: Pre-launch implementability check for binary_add_bf16.
baracuda_kernels_binary_add_bf16_run^⚠: Binary elementwise add, bf16 dtype, contiguous fast path.
baracuda_kernels_binary_add_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_add_bf16_strided.
baracuda_kernels_binary_add_bf16_strided_run^⚠: Binary elementwise add, bf16 dtype, strided / broadcast path.
baracuda_kernels_binary_add_f16_can_implement^⚠: Pre-launch implementability check for binary_add_f16.
baracuda_kernels_binary_add_f16_run^⚠: Binary elementwise add, f16 dtype, contiguous fast path.
baracuda_kernels_binary_add_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_add_f16_strided.
baracuda_kernels_binary_add_f16_strided_run^⚠: Binary elementwise add, f16 dtype, strided / broadcast path.
baracuda_kernels_binary_add_f32_can_implement^⚠: Pre-launch implementability check for binary_add_f32. Validates the problem size without launching a kernel. Returns the standard status code mapping.
baracuda_kernels_binary_add_f32_run^⚠: Binary elementwise add, f32 dtype, contiguous fast path. This is the binary-pointwise trailblazer — its safety contract carries over to every other binary contig launcher (add, sub, mul, div, min, max, pow, comparison ops, etc.) across all dtypes.
baracuda_kernels_binary_add_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_add_f32_strided.
baracuda_kernels_binary_add_f32_strided_run^⚠: Binary elementwise add, f32 dtype, strided / broadcast path. This is the binary-strided trailblazer — its safety contract (including aliasing) carries over to every other binary strided launcher across all dtypes.
baracuda_kernels_binary_add_f64_can_implement^⚠: Pre-launch implementability check for binary_add_f64.
baracuda_kernels_binary_add_f64_run^⚠: Binary elementwise add, f64 dtype, contiguous fast path.
baracuda_kernels_binary_add_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_add_f64_strided.
baracuda_kernels_binary_add_f64_strided_run^⚠: Binary elementwise add, f64 dtype, strided / broadcast path.
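The can_implement entries above refer to "the standard status code mapping", which the crate-level docs list as 0 through 5. A caller working directly against this crate might decode the raw i32 as in the sketch below. The KernelStatus enum and its method names are hypothetical and are not part of this crate; only the numeric codes and their meanings come from the crate docs.

```rust
/// Hypothetical Rust-side view of the i32 status codes from the crate docs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelStatus {
    Success,           // 0
    MisalignedOperand, // 1
    InvalidProblem,    // 2
    NotSupported,      // 3
    WorkspaceTooSmall, // 4: too small, or null when required
    InternalError,     // 5: typically a launch failure
    Unknown(i32),      // anything outside the documented range
}

impl KernelStatus {
    /// Decode a raw status returned by a *_run or *_can_implement call.
    pub fn from_raw(code: i32) -> Self {
        match code {
            0 => Self::Success,
            1 => Self::MisalignedOperand,
            2 => Self::InvalidProblem,
            3 => Self::NotSupported,
            4 => Self::WorkspaceTooSmall,
            5 => Self::InternalError,
            other => Self::Unknown(other),
        }
    }

    /// Treat every non-zero status as an error.
    pub fn into_result(self) -> Result<(), KernelStatus> {
        match self {
            Self::Success => Ok(()),
            err => Err(err),
        }
    }
}
```

Keeping an Unknown variant avoids panicking on a code that a future kernel might add.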
baracuda_kernels_binary_atan2_backward_bf16_can_implement^⚠: Pre-launch implementability check for binary_atan2_backward_bf16.
baracuda_kernels_binary_atan2_backward_bf16_run^⚠: Atan2 backward, bf16.
baracuda_kernels_binary_atan2_backward_f16_can_implement^⚠: Pre-launch implementability check for binary_atan2_backward_f16.
baracuda_kernels_binary_atan2_backward_f16_run^⚠: Atan2 backward, f16.
baracuda_kernels_binary_atan2_backward_f32_can_implement^⚠: Pre-launch implementability check for binary_atan2_backward_f32.
baracuda_kernels_binary_atan2_backward_f32_run^⚠: Atan2 backward, f32. denom = a²+b², da = dy*b/denom, db = -dy*a/denom. Caller responsible for guarding against a == 0 && b == 0 (denom == 0).
baracuda_kernels_binary_atan2_backward_f64_can_implement^⚠: Pre-launch implementability check for binary_atan2_backward_f64.
baracuda_kernels_binary_atan2_backward_f64_run^⚠: Atan2 backward, f64.
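The atan2_backward_f32_run entry above gives the gradient as denom = a²+b², da = dy*b/denom, db = -dy*a/denom, and leaves the denom == 0 case to the caller. The scalar sketch below restates that formula on the host, for example to build test expectations. The function name is hypothetical, and returning None at the origin is one possible guard chosen for the sketch, not the kernel's behavior.

```rust
/// Scalar reference for the atan2 backward formula in the listing:
/// denom = a^2 + b^2, da = dy * b / denom, db = -dy * a / denom.
/// Returns None when a == 0 and b == 0; this guard is an assumption of the
/// sketch, since the listing leaves that case to the caller.
pub fn atan2_backward_ref(a: f32, b: f32, dy: f32) -> Option<(f32, f32)> {
    let denom = a * a + b * b;
    if denom == 0.0 {
        return None;
    }
    Some((dy * b / denom, -dy * a / denom))
}
```

With a = 1, b = 1 and dy = 2, denom is 2, so the gradients are da = 1 and db = -1.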
baracuda_kernels_binary_atan2_bf16_can_implement^⚠: Binary atan2, bf16, can-implement.
baracuda_kernels_binary_atan2_bf16_run^⚠: Binary atan2, bf16, contig.
baracuda_kernels_binary_atan2_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_atan2_bf16_strided.
baracuda_kernels_binary_atan2_bf16_strided_run^⚠: Binary atan2, bf16, strided.
baracuda_kernels_binary_atan2_f16_can_implement^⚠: Binary atan2, f16, can-implement.
baracuda_kernels_binary_atan2_f16_run^⚠: Binary atan2, f16, contig.
baracuda_kernels_binary_atan2_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_atan2_f16_strided.
baracuda_kernels_binary_atan2_f16_strided_run^⚠: Binary atan2, f16, strided.
baracuda_kernels_binary_atan2_f32_can_implement^⚠: Binary atan2, f32, can-implement.
baracuda_kernels_binary_atan2_f32_run^⚠: Binary atan2, f32, contig.
baracuda_kernels_binary_atan2_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_atan2_f32_strided.
baracuda_kernels_binary_atan2_f32_strided_run^⚠: Binary atan2, f32, strided.
baracuda_kernels_binary_atan2_f64_can_implement^⚠: Binary atan2, f64, can-implement.
baracuda_kernels_binary_atan2_f64_run^⚠: Binary atan2, f64, contig.
baracuda_kernels_binary_atan2_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_atan2_f64_strided.
baracuda_kernels_binary_atan2_f64_strided_run^⚠: Binary atan2, f64, strided.
baracuda_kernels_binary_bitwise_and_i32_can_implement^⚠: Binary bitwise and, i32 dtype, can-implement.
baracuda_kernels_binary_bitwise_and_i32_run^⚠: Binary bitwise and, i32 dtype, contig.
baracuda_kernels_binary_bitwise_and_i64_can_implement^⚠: Binary bitwise and, i64 dtype, can-implement.
baracuda_kernels_binary_bitwise_and_i64_run^⚠: Binary bitwise and, i64 dtype, contig.
baracuda_kernels_binary_bitwise_left_shift_i32_can_implement^⚠: Binary bitwise left_shift, i32 dtype, can-implement.
baracuda_kernels_binary_bitwise_left_shift_i32_run^⚠: Binary bitwise left_shift, i32 dtype, contig.
baracuda_kernels_binary_bitwise_left_shift_i64_can_implement^⚠: Binary bitwise left_shift, i64 dtype, can-implement.
baracuda_kernels_binary_bitwise_left_shift_i64_run^⚠: Binary bitwise left_shift, i64 dtype, contig.
baracuda_kernels_binary_bitwise_or_i32_can_implement^⚠: Binary bitwise or, i32 dtype, can-implement.
baracuda_kernels_binary_bitwise_or_i32_run^⚠: Binary bitwise or, i32 dtype, contig.
baracuda_kernels_binary_bitwise_or_i64_can_implement^⚠: Binary bitwise or, i64 dtype, can-implement.
baracuda_kernels_binary_bitwise_or_i64_run^⚠: Binary bitwise or, i64 dtype, contig.
baracuda_kernels_binary_bitwise_right_shift_i32_can_implement^⚠: Binary bitwise right_shift, i32 dtype, can-implement.
baracuda_kernels_binary_bitwise_right_shift_i32_run^⚠: Binary bitwise right_shift, i32 dtype, contig. Arithmetic shift (sign-extending), matching PyTorch.
baracuda_kernels_binary_bitwise_right_shift_i64_can_implement^⚠: Binary bitwise right_shift, i64 dtype, can-implement.
baracuda_kernels_binary_bitwise_right_shift_i64_run^⚠: Binary bitwise right_shift, i64 dtype, contig. Arithmetic shift (sign-extending), matching PyTorch.
baracuda_kernels_binary_bitwise_xor_i32_can_implement^⚠: Binary bitwise xor, i32 dtype, can-implement.
baracuda_kernels_binary_bitwise_xor_i32_run^⚠: Binary bitwise xor, i32 dtype, contig.
baracuda_kernels_binary_bitwise_xor_i64_can_implement^⚠: Binary bitwise xor, i64 dtype, can-implement.
baracuda_kernels_binary_bitwise_xor_i64_run^⚠: Binary bitwise xor, i64 dtype, contig.
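The right_shift entries above specify an arithmetic (sign-extending) shift. In Rust, >> on a signed integer is already arithmetic, while a logical shift requires going through the unsigned type. The sketch below contrasts the two for i32. Both function names are hypothetical, and returning None for shift counts of 32 or more is a choice made for the sketch; the listing does not say how the kernels treat out-of-range counts.

```rust
/// Arithmetic (sign-extending) right shift, the behavior the listing
/// describes. Returns None for shift counts >= 32 (an assumption of the
/// sketch; kernel behavior for such counts is not documented here).
pub fn arithmetic_shr_i32(x: i32, shift: u32) -> Option<i32> {
    x.checked_shr(shift)
}

/// Logical (zero-filling) right shift, shown only for contrast.
pub fn logical_shr_i32(x: i32, shift: u32) -> Option<i32> {
    (x as u32).checked_shr(shift).map(|v| v as i32)
}
```

For x = -8 and a shift of 1, the arithmetic form yields -4, while the logical form yields 0x7FFF_FFFC because the sign bit is replaced by zero.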
baracuda_kernels_binary_cmp_eq_bf16_can_implement^⚠: Pre-launch implementability check for binary_cmp_eq_bf16.
baracuda_kernels_binary_cmp_eq_bf16_run^⚠: Binary elementwise eq, bf16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_eq_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_eq_bf16_strided.
baracuda_kernels_binary_cmp_eq_bf16_strided_run^⚠: Binary elementwise eq, bf16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_eq_f16_can_implement^⚠: Pre-launch implementability check for binary_cmp_eq_f16.
baracuda_kernels_binary_cmp_eq_f16_run^⚠: Binary elementwise eq, f16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_eq_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_eq_f16_strided.
baracuda_kernels_binary_cmp_eq_f16_strided_run^⚠: Binary elementwise eq, f16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_eq_f32_can_implement^⚠: Pre-launch implementability check for binary_cmp_eq_f32.
baracuda_kernels_binary_cmp_eq_f32_run^⚠: Binary elementwise eq, f32 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_eq_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_eq_f32_strided.
baracuda_kernels_binary_cmp_eq_f32_strided_run^⚠: Binary elementwise eq, f32 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_eq_f64_can_implement^⚠: Pre-launch implementability check for binary_cmp_eq_f64.
baracuda_kernels_binary_cmp_eq_f64_run^⚠: Binary elementwise eq, f64 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_eq_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_eq_f64_strided.
baracuda_kernels_binary_cmp_eq_f64_strided_run^⚠: Binary elementwise eq, f64 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_ge_bf16_can_implement^⚠: Pre-launch implementability check for binary_cmp_ge_bf16.
baracuda_kernels_binary_cmp_ge_bf16_run^⚠: Binary elementwise ge, bf16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_ge_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_ge_bf16_strided.
baracuda_kernels_binary_cmp_ge_bf16_strided_run^⚠: Binary elementwise ge, bf16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_ge_f16_can_implement^⚠: Pre-launch implementability check for binary_cmp_ge_f16.
baracuda_kernels_binary_cmp_ge_f16_run^⚠: Binary elementwise ge, f16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_ge_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_ge_f16_strided.
baracuda_kernels_binary_cmp_ge_f16_strided_run^⚠: Binary elementwise ge, f16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_ge_f32_can_implement^⚠: Pre-launch implementability check for binary_cmp_ge_f32.
baracuda_kernels_binary_cmp_ge_f32_run^⚠: Binary elementwise ge (a >= b), f32 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_ge_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_ge_f32_strided.
baracuda_kernels_binary_cmp_ge_f32_strided_run^⚠: Binary elementwise ge, f32 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_ge_f64_can_implement^⚠: Pre-launch implementability check for binary_cmp_ge_f64.
baracuda_kernels_binary_cmp_ge_f64_run^⚠: Binary elementwise ge, f64 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_ge_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_ge_f64_strided.
baracuda_kernels_binary_cmp_ge_f64_strided_run^⚠: Binary elementwise ge, f64 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_gt_bf16_can_implement^⚠: Pre-launch implementability check for binary_cmp_gt_bf16.
baracuda_kernels_binary_cmp_gt_bf16_run^⚠: Binary elementwise gt, bf16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_gt_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_gt_bf16_strided.
baracuda_kernels_binary_cmp_gt_bf16_strided_run^⚠: Binary elementwise gt, bf16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_gt_f16_can_implement^⚠: Pre-launch implementability check for binary_cmp_gt_f16.
baracuda_kernels_binary_cmp_gt_f16_run^⚠: Binary elementwise gt, f16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_gt_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_gt_f16_strided.
baracuda_kernels_binary_cmp_gt_f16_strided_run^⚠: Binary elementwise gt, f16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_gt_f32_can_implement^⚠: Pre-launch implementability check for binary_cmp_gt_f32.
baracuda_kernels_binary_cmp_gt_f32_run^⚠: Binary elementwise gt (a > b), f32 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_gt_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_gt_f32_strided.
baracuda_kernels_binary_cmp_gt_f32_strided_run^⚠: Binary elementwise gt, f32 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_gt_f64_can_implement^⚠: Pre-launch implementability check for binary_cmp_gt_f64.
baracuda_kernels_binary_cmp_gt_f64_run^⚠: Binary elementwise gt, f64 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_gt_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_gt_f64_strided.
baracuda_kernels_binary_cmp_gt_f64_strided_run^⚠: Binary elementwise gt, f64 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_le_bf16_can_implement^⚠: Pre-launch implementability check for binary_cmp_le_bf16.
baracuda_kernels_binary_cmp_le_bf16_run^⚠: Binary elementwise le, bf16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_le_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_le_bf16_strided.
baracuda_kernels_binary_cmp_le_bf16_strided_run^⚠: Binary elementwise le, bf16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_le_f16_can_implement^⚠: Pre-launch implementability check for binary_cmp_le_f16.
baracuda_kernels_binary_cmp_le_f16_run^⚠: Binary elementwise le, f16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_le_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_le_f16_strided.
baracuda_kernels_binary_cmp_le_f16_strided_run^⚠: Binary elementwise le, f16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_le_f32_can_implement^⚠: Pre-launch implementability check for binary_cmp_le_f32.
baracuda_kernels_binary_cmp_le_f32_run^⚠: Binary elementwise le (a <= b), f32 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_le_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_le_f32_strided.
baracuda_kernels_binary_cmp_le_f32_strided_run^⚠: Binary elementwise le, f32 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_le_f64_can_implement^⚠: Pre-launch implementability check for binary_cmp_le_f64.
baracuda_kernels_binary_cmp_le_f64_run^⚠: Binary elementwise le, f64 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_le_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_le_f64_strided.
baracuda_kernels_binary_cmp_le_f64_strided_run^⚠: Binary elementwise le, f64 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_lt_bf16_can_implement^⚠: Pre-launch implementability check for binary_cmp_lt_bf16.
baracuda_kernels_binary_cmp_lt_bf16_run^⚠: Binary elementwise lt, bf16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_lt_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_lt_bf16_strided.
baracuda_kernels_binary_cmp_lt_bf16_strided_run^⚠: Binary elementwise lt, bf16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_lt_f16_can_implement^⚠: Pre-launch implementability check for binary_cmp_lt_f16.
baracuda_kernels_binary_cmp_lt_f16_run^⚠: Binary elementwise lt, f16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_lt_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_lt_f16_strided.
baracuda_kernels_binary_cmp_lt_f16_strided_run^⚠: Binary elementwise lt, f16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_lt_f32_can_implement^⚠: Pre-launch implementability check for binary_cmp_lt_f32.
baracuda_kernels_binary_cmp_lt_f32_run^⚠: Binary elementwise lt (a < b), f32 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_lt_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_lt_f32_strided.
baracuda_kernels_binary_cmp_lt_f32_strided_run^⚠: Binary elementwise lt, f32 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_lt_f64_can_implement^⚠: Pre-launch implementability check for binary_cmp_lt_f64.
baracuda_kernels_binary_cmp_lt_f64_run^⚠: Binary elementwise lt, f64 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_lt_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_lt_f64_strided.
baracuda_kernels_binary_cmp_lt_f64_strided_run^⚠: Binary elementwise lt, f64 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_ne_bf16_can_implement^⚠: Pre-launch implementability check for binary_cmp_ne_bf16.
baracuda_kernels_binary_cmp_ne_bf16_run^⚠: Binary elementwise ne, bf16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_ne_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_ne_bf16_strided.
baracuda_kernels_binary_cmp_ne_bf16_strided_run^⚠: Binary elementwise ne, bf16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_ne_f16_can_implement^⚠: Pre-launch implementability check for binary_cmp_ne_f16.
baracuda_kernels_binary_cmp_ne_f16_run^⚠: Binary elementwise ne, f16 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_ne_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_ne_f16_strided.
baracuda_kernels_binary_cmp_ne_f16_strided_run^⚠: Binary elementwise ne, f16 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_ne_f32_can_implement^⚠: Pre-launch implementability check for binary_cmp_ne_f32.
baracuda_kernels_binary_cmp_ne_f32_run^⚠: Binary elementwise ne, f32 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_ne_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_ne_f32_strided.
baracuda_kernels_binary_cmp_ne_f32_strided_run^⚠: Binary elementwise ne, f32 inputs, u8 output, strided path.
baracuda_kernels_binary_cmp_ne_f64_can_implement^⚠: Pre-launch implementability check for binary_cmp_ne_f64.
baracuda_kernels_binary_cmp_ne_f64_run^⚠: Binary elementwise ne, f64 inputs, u8 output, contig fast path.
baracuda_kernels_binary_cmp_ne_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_cmp_ne_f64_strided.
baracuda_kernels_binary_cmp_ne_f64_strided_run^⚠: Binary elementwise ne, f64 inputs, u8 output, strided path.
baracuda_kernels_binary_copysign_bf16_can_implement^⚠: Binary copysign, bf16, can-implement.
baracuda_kernels_binary_copysign_bf16_run^⚠: Binary copysign, bf16, contig.
baracuda_kernels_binary_copysign_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_copysign_bf16_strided.
baracuda_kernels_binary_copysign_bf16_strided_run^⚠: Binary copysign, bf16, strided.
baracuda_kernels_binary_copysign_f16_can_implement^⚠: Binary copysign, f16, can-implement.
baracuda_kernels_binary_copysign_f16_run^⚠: Binary copysign, f16, contig.
baracuda_kernels_binary_copysign_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_copysign_f16_strided.
baracuda_kernels_binary_copysign_f16_strided_run^⚠: Binary copysign, f16, strided.
baracuda_kernels_binary_copysign_f32_can_implement^⚠: Binary copysign, f32, can-implement.
baracuda_kernels_binary_copysign_f32_run^⚠: Binary copysign, f32, contig.
baracuda_kernels_binary_copysign_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_copysign_f32_strided.
baracuda_kernels_binary_copysign_f32_strided_run^⚠: Binary copysign, f32, strided.
baracuda_kernels_binary_copysign_f64_can_implement^⚠: Binary copysign, f64, can-implement.
baracuda_kernels_binary_copysign_f64_run^⚠: Binary copysign, f64, contig.
baracuda_kernels_binary_copysign_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_copysign_f64_strided.
baracuda_kernels_binary_copysign_f64_strided_run^⚠: Binary copysign, f64, strided.
baracuda_kernels_binary_div_backward_bf16_can_implement^⚠: Pre-launch implementability check for binary_div_backward_bf16.
baracuda_kernels_binary_div_backward_bf16_run^⚠: Div backward, bf16.
baracuda_kernels_binary_div_backward_f16_can_implement^⚠: Pre-launch implementability check for binary_div_backward_f16.
baracuda_kernels_binary_div_backward_f16_run^⚠: Div backward, f16.
baracuda_kernels_binary_div_backward_f32_can_implement^⚠: Pre-launch implementability check for binary_div_backward_f32.
baracuda_kernels_binary_div_backward_f32_run^⚠: Div backward, f32. Writes da = dy / b and db = -dy * a / b². Both saved tensors a and b must be non-null; callers must also ensure b[i] != 0 for every cell.
baracuda_kernels_binary_div_backward_f64_can_implement^⚠: Pre-launch implementability check for binary_div_backward_f64.
baracuda_kernels_binary_div_backward_f64_run^⚠: Div backward, f64.
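The div_backward_f32_run entry above writes da = dy / b and db = -dy * a / b², and requires b[i] != 0 for every cell. The scalar sketch below restates that formula on the host, for instance to generate expected values for a test. The function name is hypothetical; like the kernel as documented, it does not guard against b == 0.

```rust
/// Scalar reference for the div backward formula in the listing:
/// da = dy / b, db = -dy * a / b^2.
/// As with the kernel, the caller must ensure b != 0; no guard is applied.
pub fn div_backward_ref(a: f32, b: f32, dy: f32) -> (f32, f32) {
    (dy / b, -dy * a / (b * b))
}
```

With a = 6, b = 2 and dy = 4, this gives da = 2 and db = -6.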
baracuda_kernels_binary_div_bf16_can_implement^⚠: Pre-launch implementability check for binary_div_bf16.
baracuda_kernels_binary_div_bf16_run^⚠: Binary elementwise div, bf16 dtype, contiguous fast path.
baracuda_kernels_binary_div_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_div_bf16_strided.
baracuda_kernels_binary_div_bf16_strided_run^⚠: Binary elementwise div, bf16 dtype, strided / broadcast path.
baracuda_kernels_binary_div_f16_can_implement^⚠: Pre-launch implementability check for binary_div_f16.
baracuda_kernels_binary_div_f16_run^⚠: Binary elementwise div, f16 dtype, contiguous fast path.
baracuda_kernels_binary_div_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_div_f16_strided.
baracuda_kernels_binary_div_f16_strided_run^⚠: Binary elementwise div, f16 dtype, strided / broadcast path.
baracuda_kernels_binary_div_f32_can_implement^⚠: Pre-launch implementability check for binary_div_f32. Validates the problem size without launching a kernel. Returns the standard status code mapping.
baracuda_kernels_binary_div_f32_run^⚠: Binary elementwise div, f32 dtype, contiguous fast path.
baracuda_kernels_binary_div_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_div_f32_strided.
baracuda_kernels_binary_div_f32_strided_run^⚠: Binary elementwise div, f32 dtype, strided / broadcast path.
baracuda_kernels_binary_div_f64_can_implement^⚠: Pre-launch implementability check for binary_div_f64.
baracuda_kernels_binary_div_f64_run^⚠: Binary elementwise div, f64 dtype, contiguous fast path.
baracuda_kernels_binary_div_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_div_f64_strided.
baracuda_kernels_binary_div_f64_strided_run^⚠: Binary elementwise div, f64 dtype, strided / broadcast path.
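The "standard status code mapping" returned by the `*_can_implement` and `*_run` functions is the crate-level list of `i32` codes (0 through 5). A minimal sketch of decoding it on the caller side follows; the `KernelStatus` enum and `decode_status` function are illustrative names, not types exported by this crate.

```rust
/// Caller-side view of the documented i32 status codes.
/// Hypothetical type; baracuda-kernels-sys itself returns a bare i32.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum KernelStatus {
    Success,           // 0
    MisalignedOperand, // 1
    InvalidProblem,    // 2
    NotSupported,      // 3
    WorkspaceTooSmall, // 4: too small, or null when required
    InternalError,     // 5: typically a launch failure
    Unknown(i32),      // anything outside the documented range
}

fn decode_status(code: i32) -> KernelStatus {
    match code {
        0 => KernelStatus::Success,
        1 => KernelStatus::MisalignedOperand,
        2 => KernelStatus::InvalidProblem,
        3 => KernelStatus::NotSupported,
        4 => KernelStatus::WorkspaceTooSmall,
        5 => KernelStatus::InternalError,
        other => KernelStatus::Unknown(other),
    }
}
```

Keeping an `Unknown` variant avoids panicking if a future kernel returns a code outside the documented set.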
baracuda_kernels_binary_floor_divide_bf16_can_implement^⚠: Binary floor_divide, bf16, can-implement.
baracuda_kernels_binary_floor_divide_bf16_run^⚠: Binary floor_divide, bf16, contig.
baracuda_kernels_binary_floor_divide_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_floor_divide_bf16_strided.
baracuda_kernels_binary_floor_divide_bf16_strided_run^⚠: Binary floor_divide, bf16, strided.
baracuda_kernels_binary_floor_divide_f16_can_implement^⚠: Binary floor_divide, f16, can-implement.
baracuda_kernels_binary_floor_divide_f16_run^⚠: Binary floor_divide, f16, contig.
baracuda_kernels_binary_floor_divide_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_floor_divide_f16_strided.
baracuda_kernels_binary_floor_divide_f16_strided_run^⚠: Binary floor_divide, f16, strided.
baracuda_kernels_binary_floor_divide_f32_can_implement^⚠: Binary floor_divide, f32, can-implement.
baracuda_kernels_binary_floor_divide_f32_run^⚠: Binary floor_divide, f32, contig.
baracuda_kernels_binary_floor_divide_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_floor_divide_f32_strided.
baracuda_kernels_binary_floor_divide_f32_strided_run^⚠: Binary floor_divide, f32, strided.
baracuda_kernels_binary_floor_divide_f64_can_implement^⚠: Binary floor_divide, f64, can-implement.
baracuda_kernels_binary_floor_divide_f64_run^⚠: Binary floor_divide, f64, contig.
baracuda_kernels_binary_floor_divide_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_floor_divide_f64_strided.
baracuda_kernels_binary_floor_divide_f64_strided_run^⚠: Binary floor_divide, f64, strided.
baracuda_kernels_binary_fmax_bf16_can_implement^⚠: Binary fmax, bf16, can-implement.
baracuda_kernels_binary_fmax_bf16_run^⚠: Binary fmax, bf16, contig.
baracuda_kernels_binary_fmax_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_fmax_bf16_strided.
baracuda_kernels_binary_fmax_bf16_strided_run^⚠: Binary fmax, bf16, strided.
baracuda_kernels_binary_fmax_f16_can_implement^⚠: Binary fmax, f16, can-implement.
baracuda_kernels_binary_fmax_f16_run^⚠: Binary fmax, f16, contig.
baracuda_kernels_binary_fmax_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_fmax_f16_strided.
baracuda_kernels_binary_fmax_f16_strided_run^⚠: Binary fmax, f16, strided.
baracuda_kernels_binary_fmax_f32_can_implement^⚠: Binary fmax, f32, can-implement.
baracuda_kernels_binary_fmax_f32_run^⚠: Binary fmax, f32, contig.
baracuda_kernels_binary_fmax_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_fmax_f32_strided.
baracuda_kernels_binary_fmax_f32_strided_run^⚠: Binary fmax, f32, strided.
baracuda_kernels_binary_fmax_f64_can_implement^⚠: Binary fmax, f64, can-implement.
baracuda_kernels_binary_fmax_f64_run^⚠: Binary fmax, f64, contig.
baracuda_kernels_binary_fmax_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_fmax_f64_strided.
baracuda_kernels_binary_fmax_f64_strided_run^⚠: Binary fmax, f64, strided.
baracuda_kernels_binary_fmin_bf16_can_implement^⚠: Binary fmin, bf16, can-implement.
baracuda_kernels_binary_fmin_bf16_run^⚠: Binary fmin, bf16, contig.
baracuda_kernels_binary_fmin_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_fmin_bf16_strided.
baracuda_kernels_binary_fmin_bf16_strided_run^⚠: Binary fmin, bf16, strided.
baracuda_kernels_binary_fmin_f16_can_implement^⚠: Binary fmin, f16, can-implement.
baracuda_kernels_binary_fmin_f16_run^⚠: Binary fmin, f16, contig.
baracuda_kernels_binary_fmin_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_fmin_f16_strided.
baracuda_kernels_binary_fmin_f16_strided_run^⚠: Binary fmin, f16, strided.
baracuda_kernels_binary_fmin_f32_can_implement^⚠: Binary fmin, f32, can-implement.
baracuda_kernels_binary_fmin_f32_run^⚠: Binary fmin, f32, contig.
baracuda_kernels_binary_fmin_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_fmin_f32_strided.
baracuda_kernels_binary_fmin_f32_strided_run^⚠: Binary fmin, f32, strided.
baracuda_kernels_binary_fmin_f64_can_implement^⚠: Binary fmin, f64, can-implement.
baracuda_kernels_binary_fmin_f64_run^⚠: Binary fmin, f64, contig.
baracuda_kernels_binary_fmin_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_fmin_f64_strided.
baracuda_kernels_binary_fmin_f64_strided_run^⚠: Binary fmin, f64, strided.
baracuda_kernels_binary_hypot_backward_bf16_can_implement^⚠: Pre-launch implementability check for binary_hypot_backward_bf16.
baracuda_kernels_binary_hypot_backward_bf16_run^⚠: Hypot backward, bf16.
baracuda_kernels_binary_hypot_backward_f16_can_implement^⚠: Pre-launch implementability check for binary_hypot_backward_f16.
baracuda_kernels_binary_hypot_backward_f16_run^⚠: Hypot backward, f16.
baracuda_kernels_binary_hypot_backward_f32_can_implement^⚠: Pre-launch implementability check for binary_hypot_backward_f32.
baracuda_kernels_binary_hypot_backward_f32_run^⚠: Hypot backward, f32. y = sqrt(a²+b²) is reconstructed inside the kernel from saved a and b (no saved-y slot in BinaryBackwardArgs); da = dy*a/y, db = dy*b/y. Caller responsible for guarding against a == 0 && b == 0 (y == 0).
baracuda_kernels_binary_hypot_backward_f64_can_implement^⚠: Pre-launch implementability check for binary_hypot_backward_f64.
baracuda_kernels_binary_hypot_backward_f64_run^⚠: Hypot backward, f64.
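The hypot backward rule (y = sqrt(a²+b²), da = dy*a/y, db = dy*b/y) and the reason for the a == 0 && b == 0 guard can be illustrated on the host. This is a scalar sketch only; `hypot_backward_ref` is a hypothetical name and says nothing about the kernel's argument layout.

```rust
/// Scalar CPU reference for hypot backward.
/// y is recomputed from a and b, mirroring the "no saved-y slot" note.
/// Hypothetical helper; not an API of baracuda-kernels-sys.
fn hypot_backward_ref(dy: f64, a: f64, b: f64) -> (f64, f64) {
    let y = a.hypot(b); // sqrt(a^2 + b^2)
    // When a == 0 and b == 0, y == 0 and both gradients become 0/0 = NaN.
    (dy * a / y, dy * b / y)
}
```

Calling it with a = 0 and b = 0 produces NaN in both outputs, which is the case the caller is asked to guard against.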
baracuda_kernels_binary_hypot_bf16_can_implement^⚠: Binary hypot, bf16, can-implement.
baracuda_kernels_binary_hypot_bf16_run^⚠: Binary hypot, bf16, contig.
baracuda_kernels_binary_hypot_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_hypot_bf16_strided.
baracuda_kernels_binary_hypot_bf16_strided_run^⚠: Binary hypot, bf16, strided.
baracuda_kernels_binary_hypot_f16_can_implement^⚠: Binary hypot, f16, can-implement.
baracuda_kernels_binary_hypot_f16_run^⚠: Binary hypot, f16, contig.
baracuda_kernels_binary_hypot_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_hypot_f16_strided.
baracuda_kernels_binary_hypot_f16_strided_run^⚠: Binary hypot, f16, strided.
baracuda_kernels_binary_hypot_f32_can_implement^⚠: Binary hypot, f32, can-implement.
baracuda_kernels_binary_hypot_f32_run^⚠: Binary hypot, f32, contig.
baracuda_kernels_binary_hypot_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_hypot_f32_strided.
baracuda_kernels_binary_hypot_f32_strided_run^⚠: Binary hypot, f32, strided.
baracuda_kernels_binary_hypot_f64_can_implement^⚠: Binary hypot, f64, can-implement.
baracuda_kernels_binary_hypot_f64_run^⚠: Binary hypot, f64, contig.
baracuda_kernels_binary_hypot_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_hypot_f64_strided.
baracuda_kernels_binary_hypot_f64_strided_run^⚠: Binary hypot, f64, strided.
baracuda_kernels_binary_lerp_backward_bf16_can_implement^⚠: Pre-launch implementability check for binary_lerp_backward_bf16.
baracuda_kernels_binary_lerp_backward_bf16_run^⚠: lerp backward, bf16.
baracuda_kernels_binary_lerp_backward_f16_can_implement^⚠: Pre-launch implementability check for binary_lerp_backward_f16.
baracuda_kernels_binary_lerp_backward_f16_run^⚠: lerp backward, f16.
baracuda_kernels_binary_lerp_backward_f32_can_implement^⚠: Pre-launch implementability check for binary_lerp_backward_f32.
baracuda_kernels_binary_lerp_backward_f32_run^⚠: lerp backward: da = (1 - weight)·dy, db = weight·dy, f32. No saves.
baracuda_kernels_binary_lerp_backward_f64_can_implement^⚠: Pre-launch implementability check for binary_lerp_backward_f64.
baracuda_kernels_binary_lerp_backward_f64_run^⚠: lerp backward, f64.
baracuda_kernels_binary_lerp_bf16_can_implement^⚠: Pre-launch implementability check for binary_lerp_bf16.
baracuda_kernels_binary_lerp_bf16_run^⚠: lerp forward, bf16.
baracuda_kernels_binary_lerp_f16_can_implement^⚠: Pre-launch implementability check for binary_lerp_f16.
baracuda_kernels_binary_lerp_f16_run^⚠: lerp forward, f16.
baracuda_kernels_binary_lerp_f32_can_implement^⚠: Pre-launch implementability check for binary_lerp_f32.
baracuda_kernels_binary_lerp_f32_run^⚠: Binary elementwise lerp(a, b; weight) = a + weight·(b - a), f32, contig.
baracuda_kernels_binary_lerp_f64_can_implement^⚠: Pre-launch implementability check for binary_lerp_f64.
baracuda_kernels_binary_lerp_f64_run^⚠: lerp forward, f64. The f32 weight widens to f64 losslessly.
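The lerp forward and backward formulas fit in a few lines of host code. The sketch below is a scalar reference for lerp(a, b; weight) = a + weight·(b - a) and its gradients; both function names are hypothetical and exist only for illustration.

```rust
/// Forward: lerp(a, b; weight) = a + weight * (b - a).
/// Hypothetical CPU reference; not an API of baracuda-kernels-sys.
fn lerp_ref(a: f32, b: f32, weight: f32) -> f32 {
    a + weight * (b - a)
}

/// Backward: da = (1 - weight) * dy, db = weight * dy.
/// Neither a nor b is needed, which matches the "No saves" note.
fn lerp_backward_ref(dy: f32, weight: f32) -> (f32, f32) {
    ((1.0 - weight) * dy, weight * dy)
}
```

Because the gradients depend only on dy and the scalar weight, the backward pass needs no saved forward tensors.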
baracuda_kernels_binary_logical_and_bool_can_implement^⚠: Binary logical and, Bool dtype, can-implement.
baracuda_kernels_binary_logical_and_bool_run^⚠: Binary logical and, Bool dtype (1-byte storage), contig.
baracuda_kernels_binary_logical_or_bool_can_implement^⚠: Binary logical or, Bool dtype, can-implement.
baracuda_kernels_binary_logical_or_bool_run^⚠: Binary logical or, Bool dtype, contig.
baracuda_kernels_binary_logical_xor_bool_can_implement^⚠: Binary logical xor, Bool dtype, can-implement.
baracuda_kernels_binary_logical_xor_bool_run^⚠: Binary logical xor, Bool dtype, contig.
baracuda_kernels_binary_maximum_backward_bf16_can_implement^⚠: Pre-launch implementability check for binary_maximum_backward_bf16.
baracuda_kernels_binary_maximum_backward_bf16_run^⚠: Maximum backward, bf16.
baracuda_kernels_binary_maximum_backward_f16_can_implement^⚠: Pre-launch implementability check for binary_maximum_backward_f16.
baracuda_kernels_binary_maximum_backward_f16_run^⚠: Maximum backward, f16.
baracuda_kernels_binary_maximum_backward_f32_can_implement^⚠: Pre-launch implementability check for binary_maximum_backward_f32.
baracuda_kernels_binary_maximum_backward_f32_run^⚠: Maximum backward, f32. Tie-break: split dy evenly on a == b; NaN inputs propagate dy to both. Saved a and b are used purely as references for the comparison.
baracuda_kernels_binary_maximum_backward_f64_can_implement^⚠: Pre-launch implementability check for binary_maximum_backward_f64.
baracuda_kernels_binary_maximum_backward_f64_run^⚠: Maximum backward, f64.
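The tie-break and NaN rules for maximum backward can be written out as a scalar reference. The documented parts are the even split on a == b and the propagation of dy to both sides on NaN; routing the full dy to the larger operand otherwise is the usual convention and is an assumption here, as is the name `maximum_backward_ref`.

```rust
/// Scalar CPU reference for maximum backward.
/// Hypothetical helper; not an API of baracuda-kernels-sys.
fn maximum_backward_ref(dy: f32, a: f32, b: f32) -> (f32, f32) {
    if a.is_nan() || b.is_nan() {
        // Documented: NaN inputs propagate dy to both gradients.
        (dy, dy)
    } else if a == b {
        // Documented: split dy evenly on ties.
        (0.5 * dy, 0.5 * dy)
    } else if a > b {
        // Assumption: the winning operand receives the whole gradient.
        (dy, 0.0)
    } else {
        (0.0, dy)
    }
}
```

A minimum variant would follow the same shape with the comparison flipped.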
baracuda_kernels_binary_maximum_bf16_can_implement^⚠: Binary maximum, bf16, can-implement.
baracuda_kernels_binary_maximum_bf16_run^⚠: Binary maximum, bf16, contig.
baracuda_kernels_binary_maximum_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_maximum_bf16_strided.
baracuda_kernels_binary_maximum_bf16_strided_run^⚠: Binary maximum, bf16, strided.
baracuda_kernels_binary_maximum_f16_can_implement^⚠: Binary maximum, f16, can-implement.
baracuda_kernels_binary_maximum_f16_run^⚠: Binary maximum, f16, contig.
baracuda_kernels_binary_maximum_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_maximum_f16_strided.
baracuda_kernels_binary_maximum_f16_strided_run^⚠: Binary maximum, f16, strided.
baracuda_kernels_binary_maximum_f32_can_implement^⚠: Binary maximum, f32, can-implement.
baracuda_kernels_binary_maximum_f32_run^⚠: Binary maximum, f32, contig.
baracuda_kernels_binary_maximum_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_maximum_f32_strided.
baracuda_kernels_binary_maximum_f32_strided_run^⚠: Binary maximum, f32, strided.
baracuda_kernels_binary_maximum_f64_can_implement^⚠: Binary maximum, f64, can-implement.
baracuda_kernels_binary_maximum_f64_run^⚠: Binary maximum, f64, contig.
baracuda_kernels_binary_maximum_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_maximum_f64_strided.
baracuda_kernels_binary_maximum_f64_strided_run^⚠: Binary maximum, f64, strided.
baracuda_kernels_binary_minimum_backward_bf16_can_implement^⚠: Pre-launch implementability check for binary_minimum_backward_bf16.
baracuda_kernels_binary_minimum_backward_bf16_run^⚠: Minimum backward, bf16.
baracuda_kernels_binary_minimum_backward_f16_can_implement^⚠: Pre-launch implementability check for binary_minimum_backward_f16.
baracuda_kernels_binary_minimum_backward_f16_run^⚠: Minimum backward, f16.
baracuda_kernels_binary_minimum_backward_f32_can_implement^⚠: Pre-launch implementability check for binary_minimum_backward_f32.
baracuda_kernels_binary_minimum_backward_f32_run^⚠: Minimum backward, f32. Tie-break: split dy evenly on a == b; NaN inputs propagate dy to both. Saved a and b are used purely as references for the comparison.
baracuda_kernels_binary_minimum_backward_f64_can_implement^⚠: Pre-launch implementability check for binary_minimum_backward_f64.
baracuda_kernels_binary_minimum_backward_f64_run^⚠: Minimum backward, f64.
baracuda_kernels_binary_minimum_bf16_can_implement^⚠: Binary minimum, bf16, can-implement.
baracuda_kernels_binary_minimum_bf16_run^⚠: Binary minimum, bf16, contig.
baracuda_kernels_binary_minimum_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_minimum_bf16_strided.
baracuda_kernels_binary_minimum_bf16_strided_run^⚠: Binary minimum, bf16, strided.
baracuda_kernels_binary_minimum_f16_can_implement^⚠: Binary minimum, f16, can-implement.
baracuda_kernels_binary_minimum_f16_run^⚠: Binary minimum, f16, contig.
baracuda_kernels_binary_minimum_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_minimum_f16_strided.
baracuda_kernels_binary_minimum_f16_strided_run^⚠: Binary minimum, f16, strided.
baracuda_kernels_binary_minimum_f32_can_implement^⚠: Binary minimum, f32, can-implement.
baracuda_kernels_binary_minimum_f32_run^⚠: Binary minimum, f32, contig.
baracuda_kernels_binary_minimum_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_minimum_f32_strided.
baracuda_kernels_binary_minimum_f32_strided_run^⚠: Binary minimum, f32, strided.
baracuda_kernels_binary_minimum_f64_can_implement^⚠: Binary minimum, f64, can-implement.
baracuda_kernels_binary_minimum_f64_run^⚠: Binary minimum, f64, contig.
baracuda_kernels_binary_minimum_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_minimum_f64_strided.
baracuda_kernels_binary_minimum_f64_strided_run^⚠: Binary minimum, f64, strided.
baracuda_kernels_binary_mod_bf16_can_implement^⚠: Binary mod, bf16, can-implement.
baracuda_kernels_binary_mod_bf16_run^⚠: Binary mod, bf16, contig.
baracuda_kernels_binary_mod_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_mod_bf16_strided.
baracuda_kernels_binary_mod_bf16_strided_run^⚠: Binary mod, bf16, strided.
baracuda_kernels_binary_mod_f16_can_implement^⚠: Binary mod, f16, can-implement.
baracuda_kernels_binary_mod_f16_run^⚠: Binary mod, f16, contig.
baracuda_kernels_binary_mod_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_mod_f16_strided.
baracuda_kernels_binary_mod_f16_strided_run^⚠: Binary mod, f16, strided.
baracuda_kernels_binary_mod_f32_can_implement^⚠: Binary mod, f32, can-implement.
baracuda_kernels_binary_mod_f32_run^⚠: Binary mod, f32, contig.
baracuda_kernels_binary_mod_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_mod_f32_strided.
baracuda_kernels_binary_mod_f32_strided_run^⚠: Binary mod, f32, strided.
baracuda_kernels_binary_mod_f64_can_implement^⚠: Binary mod, f64, can-implement.
baracuda_kernels_binary_mod_f64_run^⚠: Binary mod, f64, contig.
baracuda_kernels_binary_mod_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_mod_f64_strided.
baracuda_kernels_binary_mod_f64_strided_run^⚠: Binary mod, f64, strided.
baracuda_kernels_binary_mul_backward_bf16_can_implement^⚠: Pre-launch implementability check for binary_mul_backward_bf16.
baracuda_kernels_binary_mul_backward_bf16_run^⚠: Mul backward, bf16.
baracuda_kernels_binary_mul_backward_f16_can_implement^⚠: Pre-launch implementability check for binary_mul_backward_f16.
baracuda_kernels_binary_mul_backward_f16_run^⚠: Mul backward, f16.
baracuda_kernels_binary_mul_backward_f32_can_implement^⚠: Pre-launch implementability check for binary_mul_backward_f32.
baracuda_kernels_binary_mul_backward_f32_run^⚠: Mul backward, f32. Writes da = dy * b and db = dy * a. Both saved tensors a and b must be non-null.
baracuda_kernels_binary_mul_backward_f64_can_implement^⚠: Pre-launch implementability check for binary_mul_backward_f64.
baracuda_kernels_binary_mul_backward_f64_run^⚠: Mul backward, f64.
baracuda_kernels_binary_mul_bf16_can_implement^⚠: Pre-launch implementability check for binary_mul_bf16.
baracuda_kernels_binary_mul_bf16_run^⚠: Binary elementwise mul, bf16 dtype, contiguous fast path.
baracuda_kernels_binary_mul_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_mul_bf16_strided.
baracuda_kernels_binary_mul_bf16_strided_run^⚠: Binary elementwise mul, bf16 dtype, strided / broadcast path.
baracuda_kernels_binary_mul_f16_can_implement^⚠: Pre-launch implementability check for binary_mul_f16.
baracuda_kernels_binary_mul_f16_run^⚠: Binary elementwise mul, f16 dtype, contiguous fast path.
baracuda_kernels_binary_mul_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_mul_f16_strided.
baracuda_kernels_binary_mul_f16_strided_run^⚠: Binary elementwise mul, f16 dtype, strided / broadcast path.
baracuda_kernels_binary_mul_f32_can_implement^⚠: Pre-launch implementability check for binary_mul_f32. Validates the problem size without launching a kernel. Returns the standard status code mapping.
baracuda_kernels_binary_mul_f32_run^⚠: Binary elementwise mul, f32 dtype, contiguous fast path.
baracuda_kernels_binary_mul_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_mul_f32_strided.
baracuda_kernels_binary_mul_f32_strided_run^⚠: Binary elementwise mul, f32 dtype, strided / broadcast path.
baracuda_kernels_binary_mul_f64_can_implement^⚠: Pre-launch implementability check for binary_mul_f64.
baracuda_kernels_binary_mul_f64_run^⚠: Binary elementwise mul, f64 dtype, contiguous fast path.
baracuda_kernels_binary_mul_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_mul_f64_strided.
baracuda_kernels_binary_mul_f64_strided_run^⚠: Binary elementwise mul, f64 dtype, strided / broadcast path.
baracuda_kernels_binary_nextafter_bf16_can_implement^⚠: Binary nextafter, bf16, can-implement.
baracuda_kernels_binary_nextafter_bf16_run^⚠: Binary nextafter, bf16, contig.
baracuda_kernels_binary_nextafter_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_nextafter_bf16_strided.
baracuda_kernels_binary_nextafter_bf16_strided_run^⚠: Binary nextafter, bf16, strided.
baracuda_kernels_binary_nextafter_f16_can_implement^⚠: Binary nextafter, f16, can-implement.
baracuda_kernels_binary_nextafter_f16_run^⚠: Binary nextafter, f16, contig.
baracuda_kernels_binary_nextafter_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_nextafter_f16_strided.
baracuda_kernels_binary_nextafter_f16_strided_run^⚠: Binary nextafter, f16, strided.
baracuda_kernels_binary_nextafter_f32_can_implement^⚠: Binary nextafter, f32, can-implement.
baracuda_kernels_binary_nextafter_f32_run^⚠: Binary nextafter, f32, contig.
baracuda_kernels_binary_nextafter_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_nextafter_f32_strided.
baracuda_kernels_binary_nextafter_f32_strided_run^⚠: Binary nextafter, f32, strided.
baracuda_kernels_binary_nextafter_f64_can_implement^⚠: Binary nextafter, f64, can-implement.
baracuda_kernels_binary_nextafter_f64_run^⚠: Binary nextafter, f64, contig.
baracuda_kernels_binary_nextafter_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_nextafter_f64_strided.
baracuda_kernels_binary_nextafter_f64_strided_run^⚠: Binary nextafter, f64, strided.
baracuda_kernels_binary_pow_backward_bf16_can_implement^⚠: Pre-launch implementability check for binary_pow_backward_bf16.
baracuda_kernels_binary_pow_backward_bf16_run^⚠: Pow backward, bf16.
baracuda_kernels_binary_pow_backward_f16_can_implement^⚠: Pre-launch implementability check for binary_pow_backward_f16.
baracuda_kernels_binary_pow_backward_f16_run^⚠: Pow backward, f16.
baracuda_kernels_binary_pow_backward_f32_can_implement^⚠: Pre-launch implementability check for binary_pow_backward_f32.
baracuda_kernels_binary_pow_backward_f32_run^⚠: Pow backward, f32. da = dy * b * a^(b-1), db = dy * a^b * ln(a). The caller is responsible for guarding against undefined regions (a < 0 with non-integer b, or a == 0 with b < 1).
baracuda_kernels_binary_pow_backward_f64_can_implement^⚠: Pre-launch implementability check for binary_pow_backward_f64.
baracuda_kernels_binary_pow_backward_f64_run^⚠: Pow backward, f64.
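The pow backward formulas (da = dy * b * a^(b-1), db = dy * a^b * ln(a)) can be cross-checked with a scalar host reference. This is a sketch for validation only; `pow_backward_ref` is a hypothetical name, and the NaN behaviour shown is that of Rust's `f64::powf` and `f64::ln`, not a statement about the kernel.

```rust
/// Scalar CPU reference for pow backward.
/// Hypothetical helper; not an API of baracuda-kernels-sys.
fn pow_backward_ref(dy: f64, a: f64, b: f64) -> (f64, f64) {
    // d/da a^b = b * a^(b-1)
    let da = dy * b * a.powf(b - 1.0);
    // d/db a^b = a^b * ln(a)
    let db = dy * a.powf(b) * a.ln();
    (da, db)
}
```

With a = -1 and b = 0.5 the reference returns NaN, which illustrates one of the undefined regions the caller is expected to exclude.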
baracuda_kernels_binary_pow_bf16_can_implement^⚠: Binary pow, bf16, can-implement.
baracuda_kernels_binary_pow_bf16_run^⚠: Binary pow, bf16, contig.
baracuda_kernels_binary_pow_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_pow_bf16_strided.
baracuda_kernels_binary_pow_bf16_strided_run^⚠: Binary pow, bf16, strided.
baracuda_kernels_binary_pow_f16_can_implement^⚠: Binary pow, f16, can-implement.
baracuda_kernels_binary_pow_f16_run^⚠: Binary pow, f16, contig.
baracuda_kernels_binary_pow_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_pow_f16_strided.
baracuda_kernels_binary_pow_f16_strided_run^⚠: Binary pow, f16, strided.
baracuda_kernels_binary_pow_f32_can_implement^⚠: Binary pow, f32, can-implement.
baracuda_kernels_binary_pow_f32_run^⚠: Binary pow, f32, contig.
baracuda_kernels_binary_pow_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_pow_f32_strided.
baracuda_kernels_binary_pow_f32_strided_run^⚠: Binary pow, f32, strided.
baracuda_kernels_binary_pow_f64_can_implement^⚠: Binary pow, f64, can-implement.
baracuda_kernels_binary_pow_f64_run^⚠: Binary pow, f64, contig.
baracuda_kernels_binary_pow_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_pow_f64_strided.
baracuda_kernels_binary_pow_f64_strided_run^⚠: Binary pow, f64, strided.
baracuda_kernels_binary_remainder_bf16_can_implement^⚠: Binary remainder, bf16, can-implement.
baracuda_kernels_binary_remainder_bf16_run^⚠: Binary remainder, bf16, contig.
baracuda_kernels_binary_remainder_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_remainder_bf16_strided.
baracuda_kernels_binary_remainder_bf16_strided_run^⚠: Binary remainder, bf16, strided.
baracuda_kernels_binary_remainder_f16_can_implement^⚠: Binary remainder, f16, can-implement.
baracuda_kernels_binary_remainder_f16_run^⚠: Binary remainder, f16, contig.
baracuda_kernels_binary_remainder_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_remainder_f16_strided.
baracuda_kernels_binary_remainder_f16_strided_run^⚠: Binary remainder, f16, strided.
baracuda_kernels_binary_remainder_f32_can_implement^⚠: Binary remainder, f32, can-implement.
baracuda_kernels_binary_remainder_f32_run^⚠: Binary remainder, f32, contig.
baracuda_kernels_binary_remainder_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_remainder_f32_strided.
baracuda_kernels_binary_remainder_f32_strided_run^⚠: Binary remainder, f32, strided.
baracuda_kernels_binary_remainder_f64_can_implement^⚠: Binary remainder, f64, can-implement.
baracuda_kernels_binary_remainder_f64_run^⚠: Binary remainder, f64, contig.
baracuda_kernels_binary_remainder_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_remainder_f64_strided.
baracuda_kernels_binary_remainder_f64_strided_run^⚠: Binary remainder, f64, strided.
baracuda_kernels_binary_sub_backward_bf16_can_implement^⚠: Pre-launch implementability check for binary_sub_backward_bf16.
baracuda_kernels_binary_sub_backward_bf16_run^⚠: Sub backward, bf16.
baracuda_kernels_binary_sub_backward_f16_can_implement^⚠: Pre-launch implementability check for binary_sub_backward_f16.
baracuda_kernels_binary_sub_backward_f16_run^⚠: Sub backward, f16.
baracuda_kernels_binary_sub_backward_f32_can_implement^⚠: Pre-launch implementability check for binary_sub_backward_f32.
baracuda_kernels_binary_sub_backward_f32_run^⚠: Sub backward, f32. Writes da = dy and db = -dy.
baracuda_kernels_binary_sub_backward_f64_can_implement^⚠: Pre-launch implementability check for binary_sub_backward_f64.
baracuda_kernels_binary_sub_backward_f64_run^⚠: Sub backward, f64.
baracuda_kernels_binary_sub_bf16_can_implement^⚠: Pre-launch implementability check for binary_sub_bf16.
baracuda_kernels_binary_sub_bf16_run^⚠: Binary elementwise sub, bf16 dtype, contiguous fast path.
baracuda_kernels_binary_sub_bf16_strided_can_implement^⚠: Pre-launch implementability check for binary_sub_bf16_strided.
baracuda_kernels_binary_sub_bf16_strided_run^⚠: Binary elementwise sub, bf16 dtype, strided / broadcast path.
baracuda_kernels_binary_sub_f16_can_implement^⚠: Pre-launch implementability check for binary_sub_f16.
baracuda_kernels_binary_sub_f16_run^⚠: Binary elementwise sub, f16 dtype, contiguous fast path.
baracuda_kernels_binary_sub_f16_strided_can_implement^⚠: Pre-launch implementability check for binary_sub_f16_strided.
baracuda_kernels_binary_sub_f16_strided_run^⚠: Binary elementwise sub, f16 dtype, strided / broadcast path.
baracuda_kernels_binary_sub_f32_can_implement^⚠: Pre-launch implementability check for binary_sub_f32. Validates the problem size without launching a kernel. Returns the standard status code mapping.
baracuda_kernels_binary_sub_f32_run^⚠: Binary elementwise sub, f32 dtype, contiguous fast path.
baracuda_kernels_binary_sub_f32_strided_can_implement^⚠: Pre-launch implementability check for binary_sub_f32_strided.
baracuda_kernels_binary_sub_f32_strided_run^⚠: Binary elementwise sub, f32 dtype, strided / broadcast path.
baracuda_kernels_binary_sub_f64_can_implement^⚠: Pre-launch implementability check for binary_sub_f64.
baracuda_kernels_binary_sub_f64_run^⚠: Binary elementwise sub, f64 dtype, contiguous fast path.
baracuda_kernels_binary_sub_f64_strided_can_implement^⚠: Pre-launch implementability check for binary_sub_f64_strided.
baracuda_kernels_binary_sub_f64_strided_run^⚠: Binary elementwise sub, f64 dtype, strided / broadcast path.
baracuda_kernels_bincount_i32_can_implement^⚠: Pre-launch implementability check for bincount_i32.
baracuda_kernels_bincount_i32_run^⚠: bincount, i32 input. Out-of-range values (< 0 or >= num_bins) are silently dropped.
baracuda_kernels_bincount_i64_can_implement^⚠: Pre-launch implementability check for bincount_i64.
baracuda_kernels_bincount_i64_run^⚠: bincount, i64 input.
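The out-of-range rule for bincount (values < 0 or >= num_bins are dropped without error) is easy to mirror on the host. In the sketch below, `bincount_ref` is a hypothetical helper and the `i64` count type is an assumption; the kernel's actual output dtype is not stated in this listing.

```rust
/// CPU reference for bincount over i32 input.
/// Hypothetical helper; the i64 output type is an assumption.
fn bincount_ref(input: &[i32], num_bins: usize) -> Vec<i64> {
    let mut counts = vec![0i64; num_bins];
    for &v in input {
        // Negative values and values >= num_bins are silently skipped.
        if v >= 0 && (v as usize) < num_bins {
            counts[v as usize] += 1;
        }
    }
    counts
}
```

For example, with 4 bins the input `[0, 1, 1, 3, -1, 7]` counts only the first four values; `-1` and `7` contribute nothing.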
baracuda_kernels_cast_bf16_bf16_can_implement^⚠: Implementability check for cast_bf16_bf16.
baracuda_kernels_cast_bf16_bf16_run^⚠: Cast bf16 -> bf16.
baracuda_kernels_cast_bf16_bool_can_implement^⚠: Implementability check for cast_bf16_bool.
baracuda_kernels_cast_bf16_bool_run^⚠: Cast bf16 -> Bool.
baracuda_kernels_cast_bf16_f16_can_implement^⚠: Implementability check for cast_bf16_f16.
baracuda_kernels_cast_bf16_f16_run^⚠: Cast bf16 -> f16.
baracuda_kernels_cast_bf16_f32_can_implement^⚠: Implementability check for cast_bf16_f32.
baracuda_kernels_cast_bf16_f32_run^⚠: Cast bf16 -> f32.
baracuda_kernels_cast_bf16_f64_can_implement^⚠: Implementability check for cast_bf16_f64.
baracuda_kernels_cast_bf16_f64_run^⚠: Cast bf16 -> f64.
baracuda_kernels_cast_bf16_fp8e4m3_can_implement^⚠: Implementability check for cast_bf16_fp8e4m3.
baracuda_kernels_cast_bf16_fp8e4m3_run^⚠: Cast bf16 -> Fp8E4M3.
baracuda_kernels_cast_bf16_fp8e5m2_can_implement^⚠: Implementability check for cast_bf16_fp8e5m2.
baracuda_kernels_cast_bf16_fp8e5m2_run^⚠: Cast bf16 -> Fp8E5M2.
baracuda_kernels_cast_bf16_i8_can_implement^⚠: Implementability check for cast_bf16_i8.
baracuda_kernels_cast_bf16_i8_run^⚠: Cast bf16 -> i8.
baracuda_kernels_cast_bf16_i16_can_implement^⚠: Implementability check for cast_bf16_i16.
baracuda_kernels_cast_bf16_i16_run^⚠: Cast bf16 -> i16. Phase 31.
baracuda_kernels_cast_bf16_i32_can_implement^⚠: Implementability check for cast_bf16_i32.
baracuda_kernels_cast_bf16_i32_run^⚠: Cast bf16 -> i32.
baracuda_kernels_cast_bf16_i64_can_implement^⚠: Implementability check for cast_bf16_i64.
baracuda_kernels_cast_bf16_i64_run^⚠: Cast bf16 -> i64.
baracuda_kernels_cast_bf16_u8_can_implement^⚠: Implementability check for cast_bf16_u8.
baracuda_kernels_cast_bf16_u8_run^⚠: Cast bf16 -> u8.
baracuda_kernels_cast_bf16_u32_can_implement^⚠: Implementability check for cast_bf16_u32.
baracuda_kernels_cast_bf16_u32_run^⚠: Cast bf16 -> u32. Phase 31.
baracuda_kernels_cast_bool_bf16_can_implement^⚠: Implementability check for cast_bool_bf16.
baracuda_kernels_cast_bool_bf16_run^⚠: Cast Bool -> bf16.
baracuda_kernels_cast_bool_f16_can_implement^⚠: Implementability check for cast_bool_f16.
baracuda_kernels_cast_bool_f16_run^⚠: Cast Bool -> f16.
baracuda_kernels_cast_bool_f32_can_implement^⚠: Implementability check for cast_bool_f32.
baracuda_kernels_cast_bool_f32_run^⚠: Cast Bool -> f32.
baracuda_kernels_cast_bool_i32_can_implement^⚠: Implementability check for cast_bool_i32.
baracuda_kernels_cast_bool_i32_run^⚠: Cast Bool -> i32. x != 0 → 1.
baracuda_kernels_cast_bool_i64_can_implement^⚠: Implementability check for cast_bool_i64.
baracuda_kernels_cast_bool_i64_run^⚠: Cast Bool -> i64.
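The Bool -> integer rule (x != 0 → 1) normalises any non-zero byte to 1. A host-side sketch follows, assuming Bool uses 1-byte storage as noted for the logical kernels; `cast_bool_to_i32_ref` is a hypothetical name.

```rust
/// CPU reference for Cast Bool -> i32 over 1-byte Bool storage.
/// Hypothetical helper; not an API of baracuda-kernels-sys.
fn cast_bool_to_i32_ref(src: &[u8]) -> Vec<i32> {
    // Any non-zero byte maps to 1, zero maps to 0.
    src.iter().map(|&x| (x != 0) as i32).collect()
}
```

This means a byte such as 2 or 255 is treated as true rather than being reinterpreted as its integer value.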
baracuda_kernels_cast_f16_bf16_can_implement^⚠: Implementability check for cast_f16_bf16.
baracuda_kernels_cast_f16_bf16_run^⚠: Cast f16 -> bf16.
baracuda_kernels_cast_f16_bool_can_implement^⚠: Implementability check for cast_f16_bool.
baracuda_kernels_cast_f16_bool_run^⚠: Cast f16 -> Bool.
baracuda_kernels_cast_f16_f16_can_implement^⚠: Implementability check for cast_f16_f16.
baracuda_kernels_cast_f16_f16_run^⚠: Cast f16 -> f16.
baracuda_kernels_cast_f16_f32_can_implement^⚠: Implementability check for cast_f16_f32.
baracuda_kernels_cast_f16_f32_run^⚠: Cast f16 -> f32.
baracuda_kernels_cast_f16_f64_can_implement^⚠: Implementability check for cast_f16_f64.
baracuda_kernels_cast_f16_f64_run^⚠: Cast f16 -> f64.
baracuda_kernels_cast_f16_fp8e4m3_can_implement^⚠: Implementability check for cast_f16_fp8e4m3.
baracuda_kernels_cast_f16_fp8e4m3_run^⚠: Cast f16 -> Fp8E4M3.
baracuda_kernels_cast_f16_fp8e5m2_can_implement^⚠: Implementability check for cast_f16_fp8e5m2.
baracuda_kernels_cast_f16_fp8e5m2_run^⚠: Cast f16 -> Fp8E5M2.
baracuda_kernels_cast_f16_i8_can_implement^⚠: Implementability check for cast_f16_i8.
baracuda_kernels_cast_f16_i8_run^⚠: Cast f16 -> i8.
baracuda_kernels_cast_f16_i16_can_implement^⚠: Implementability check for cast_f16_i16.
baracuda_kernels_cast_f16_i16_run^⚠: Cast f16 -> i16. Phase 31.
baracuda_kernels_cast_f16_i32_can_implement^⚠: Implementability check for cast_f16_i32.
baracuda_kernels_cast_f16_i32_run^⚠: Cast f16 -> i32.
baracuda_kernels_cast_f16_i64_can_implement^⚠: Implementability check for cast_f16_i64.
baracuda_kernels_cast_f16_i64_run^⚠: Cast f16 -> i64.
baracuda_kernels_cast_f16_u8_can_implement^⚠: Implementability check for cast_f16_u8.
baracuda_kernels_cast_f16_u8_run^⚠: Cast f16 -> u8.
baracuda_kernels_cast_f16_u32_can_implement^⚠: Implementability check for cast_f16_u32.
baracuda_kernels_cast_f16_u32_run^⚠: Cast f16 -> u32. Phase 31.
baracuda_kernels_cast_f32_bf16_can_implement^⚠: Implementability check for cast_f32_bf16.
baracuda_kernels_cast_f32_bf16_run^⚠: Cast f32 -> bf16.
baracuda_kernels_cast_f32_bool_can_implement^⚠: Implementability check for cast_f32_bool.
baracuda_kernels_cast_f32_bool_run^⚠: Cast f32 -> Bool.
baracuda_kernels_cast_f32_f16_can_implement^⚠: Implementability check for cast_f32_f16.
baracuda_kernels_cast_f32_f16_run^⚠: Cast f32 -> f16.
baracuda_kernels_cast_f32_f32_can_implement^⚠: Implementability check for cast_f32_f32.
baracuda_kernels_cast_f32_f32_run^⚠: Cast f32 -> f32. See LICENSE-thirdparty.md.
baracuda_kernels_cast_f32_f64_can_implement^⚠: Implementability check for cast_f32_f64.
baracuda_kernels_cast_f32_f64_run^⚠: Cast f32 -> f64.
baracuda_kernels_cast_f32_fp8e4m3_can_implement^⚠: Implementability check for cast_f32_fp8e4m3.
baracuda_kernels_cast_f32_fp8e4m3_run^⚠: Cast f32 -> Fp8E4M3 (saturates to ±448).
baracuda_kernels_cast_f32_fp8e5m2_can_implement^⚠: Implementability check for cast_f32_fp8e5m2.
baracuda_kernels_cast_f32_fp8e5m2_run^⚠: Cast f32 -> Fp8E5M2 (saturates to ±57344).
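The saturation bounds quoted for the two f32 -> FP8 casts can be modelled on the host as a plain clamp. This is a minimal sketch for reasoning about expected outputs, not the kernel's code: the helper name `saturate` and the constants are hypothetical, and the treatment of NaN and infinities shown here is an assumption rather than documented behaviour.

```rust
/// Largest finite magnitudes quoted in the two entries above.
const FP8_E4M3_MAX: f32 = 448.0;
const FP8_E5M2_MAX: f32 = 57344.0;

/// Hypothetical host-side model of the saturation step: clamp the input
/// into [-max, max] before it would be encoded into the 8-bit format.
/// `f32::clamp` returns NaN for a NaN input, so NaN passes through here;
/// what the device kernel does with NaN is not stated in the source.
fn saturate(x: f32, max: f32) -> f32 {
    x.clamp(-max, max)
}
```

For example, `saturate(1000.0, FP8_E4M3_MAX)` yields `448.0`, while values already inside the range are returned unchanged.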
baracuda_kernels_cast_f32_i8_can_implement^⚠: Implementability check for cast_f32_i8.
baracuda_kernels_cast_f32_i8_run^⚠: Cast f32 -> i8.
baracuda_kernels_cast_f32_i16_can_implement^⚠: Implementability check for cast_f32_i16.
baracuda_kernels_cast_f32_i16_run^⚠: Cast f32 -> i16. Phase 31.
baracuda_kernels_cast_f32_i32_can_implement^⚠: Implementability check for cast_f32_i32.
baracuda_kernels_cast_f32_i32_run^⚠: Cast f32 -> i32.
baracuda_kernels_cast_f32_i64_can_implement^⚠: Implementability check for cast_f32_i64.
baracuda_kernels_cast_f32_i64_run^⚠: Cast f32 -> i64.
baracuda_kernels_cast_f32_s4_can_implement^⚠: Implementability check for cast_f32_s4.
baracuda_kernels_cast_f32_s4_run^⚠: Cast f32 -> S4 (round-to-nearest then saturate).
baracuda_kernels_cast_f32_u4_can_implement^⚠: Implementability check for cast_f32_u4.
baracuda_kernels_cast_f32_u4_run^⚠: Cast f32 -> U4 (round-to-nearest then saturate).
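The "round-to-nearest then saturate" rule for the f32 -> S4 / U4 casts can be sketched on the host as below. The function names are hypothetical, and the tie-breaking rule is an assumption: Rust's `f32::round` rounds halves away from zero, whereas the kernel may round ties to even, so only non-tie inputs are safe to compare.

```rust
/// Hypothetical reference for f32 -> S4: round, then clamp to [-8, 7].
/// Assumption: ties round away from zero (Rust's `round`); the kernel's
/// tie rule is not stated in the source.
fn f32_to_s4_value(x: f32) -> i8 {
    x.round().clamp(-8.0, 7.0) as i8
}

/// Hypothetical reference for f32 -> U4: round, then clamp to [0, 15].
fn f32_to_u4_value(x: f32) -> u8 {
    x.round().clamp(0.0, 15.0) as u8
}
```

These return the logical 4-bit value in a full byte; packing two values per byte is a separate step.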
baracuda_kernels_cast_f32_u8_can_implement^⚠: Implementability check for cast_f32_u8.
baracuda_kernels_cast_f32_u8_run^⚠: Cast f32 -> u8.
baracuda_kernels_cast_f32_u32_can_implement^⚠: Implementability check for cast_f32_u32.
baracuda_kernels_cast_f32_u32_run^⚠: Cast f32 -> u32. Negative inputs are undefined per C++ rules (typical NVCC behaviour: saturates toward 0). Phase 31.
baracuda_kernels_cast_f64_bf16_can_implement^⚠: Implementability check for cast_f64_bf16.
baracuda_kernels_cast_f64_bf16_run^⚠: Cast f64 -> bf16.
baracuda_kernels_cast_f64_f16_can_implement^⚠: Implementability check for cast_f64_f16.
baracuda_kernels_cast_f64_f16_run^⚠: Cast f64 -> f16.
baracuda_kernels_cast_f64_f32_can_implement^⚠: Implementability check for cast_f64_f32.
baracuda_kernels_cast_f64_f32_run^⚠: Cast f64 -> f32.
baracuda_kernels_cast_f64_f64_can_implement^⚠: Implementability check for cast_f64_f64.
baracuda_kernels_cast_f64_f64_run^⚠: Cast f64 -> f64.
baracuda_kernels_cast_f64_i8_can_implement^⚠: Implementability check for cast_f64_i8.
baracuda_kernels_cast_f64_i8_run^⚠: Cast f64 -> i8.
baracuda_kernels_cast_f64_i16_can_implement^⚠: Implementability check for cast_f64_i16.
baracuda_kernels_cast_f64_i16_run^⚠: Cast f64 -> i16. Phase 31.
baracuda_kernels_cast_f64_i32_can_implement^⚠: Implementability check for cast_f64_i32.
baracuda_kernels_cast_f64_i32_run^⚠: Cast f64 -> i32.
baracuda_kernels_cast_f64_i64_can_implement^⚠: Implementability check for cast_f64_i64.
baracuda_kernels_cast_f64_i64_run^⚠: Cast f64 -> i64.
baracuda_kernels_cast_f64_u8_can_implement^⚠: Implementability check for cast_f64_u8.
baracuda_kernels_cast_f64_u8_run^⚠: Cast f64 -> u8.
baracuda_kernels_cast_f64_u32_can_implement^⚠: Implementability check for cast_f64_u32.
baracuda_kernels_cast_f64_u32_run^⚠: Cast f64 -> u32. Phase 31.
baracuda_kernels_cast_fp8e4m3_bf16_can_implement^⚠: Implementability check for cast_fp8e4m3_bf16.
baracuda_kernels_cast_fp8e4m3_bf16_run^⚠: Cast Fp8E4M3 -> bf16.
baracuda_kernels_cast_fp8e4m3_f16_can_implement^⚠: Implementability check for cast_fp8e4m3_f16.
baracuda_kernels_cast_fp8e4m3_f16_run^⚠: Cast Fp8E4M3 -> f16.
baracuda_kernels_cast_fp8e4m3_f32_can_implement^⚠: Implementability check for cast_fp8e4m3_f32.
baracuda_kernels_cast_fp8e4m3_f32_run^⚠: Cast Fp8E4M3 -> f32.
baracuda_kernels_cast_fp8e5m2_bf16_can_implement^⚠: Implementability check for cast_fp8e5m2_bf16.
baracuda_kernels_cast_fp8e5m2_bf16_run^⚠: Cast Fp8E5M2 -> bf16.
baracuda_kernels_cast_fp8e5m2_f16_can_implement^⚠: Implementability check for cast_fp8e5m2_f16.
baracuda_kernels_cast_fp8e5m2_f16_run^⚠: Cast Fp8E5M2 -> f16.
baracuda_kernels_cast_fp8e5m2_f32_can_implement^⚠: Implementability check for cast_fp8e5m2_f32.
baracuda_kernels_cast_fp8e5m2_f32_run^⚠: Cast Fp8E5M2 -> f32.
baracuda_kernels_cast_i8_bf16_can_implement^⚠: Implementability check for cast_i8_bf16.
baracuda_kernels_cast_i8_bf16_run^⚠: Cast i8 -> bf16.
baracuda_kernels_cast_i8_f16_can_implement^⚠: Implementability check for cast_i8_f16.
baracuda_kernels_cast_i8_f16_run^⚠: Cast i8 -> f16.
baracuda_kernels_cast_i8_f32_can_implement^⚠: Implementability check for cast_i8_f32.
baracuda_kernels_cast_i8_f32_run^⚠: Cast i8 -> f32.
baracuda_kernels_cast_i8_f64_can_implement^⚠: Implementability check for cast_i8_f64.
baracuda_kernels_cast_i8_f64_run^⚠: Cast i8 -> f64.
baracuda_kernels_cast_i8_i8_can_implement^⚠: Implementability check for cast_i8_i8.
baracuda_kernels_cast_i8_i8_run^⚠: Cast i8 -> i8.
baracuda_kernels_cast_i8_i16_can_implement^⚠: Implementability check for cast_i8_i16.
baracuda_kernels_cast_i8_i16_run^⚠: Cast i8 -> i16. Sign-extends. Phase 31.
baracuda_kernels_cast_i8_i32_can_implement^⚠: Implementability check for cast_i8_i32.
baracuda_kernels_cast_i8_i32_run^⚠: Cast i8 -> i32.
baracuda_kernels_cast_i8_i64_can_implement^⚠: Implementability check for cast_i8_i64.
baracuda_kernels_cast_i8_i64_run^⚠: Cast i8 -> i64.
baracuda_kernels_cast_i8_u8_can_implement^⚠: Implementability check for cast_i8_u8.
baracuda_kernels_cast_i8_u8_run^⚠: Cast i8 -> u8.
baracuda_kernels_cast_i8_u32_can_implement^⚠: Implementability check for cast_i8_u32.
baracuda_kernels_cast_i8_u32_run^⚠: Cast i8 -> u32. Sign-extends then reinterprets. Phase 31.
baracuda_kernels_cast_i16_bf16_can_implement^⚠: Implementability check for cast_i16_bf16.
baracuda_kernels_cast_i16_bf16_run^⚠: Cast i16 -> bf16. Phase 31.
baracuda_kernels_cast_i16_f16_can_implement^⚠: Implementability check for cast_i16_f16.
baracuda_kernels_cast_i16_f16_run^⚠: Cast i16 -> f16. Phase 31.
baracuda_kernels_cast_i16_f32_can_implement^⚠: Implementability check for cast_i16_f32.
baracuda_kernels_cast_i16_f32_run^⚠: Cast i16 -> f32. Phase 31.
baracuda_kernels_cast_i16_f64_can_implement^⚠: Implementability check for cast_i16_f64.
baracuda_kernels_cast_i16_f64_run^⚠: Cast i16 -> f64. Phase 31.
baracuda_kernels_cast_i16_i8_can_implement^⚠: Implementability check for cast_i16_i8.
baracuda_kernels_cast_i16_i8_run^⚠: Cast i16 -> i8. Truncates to low byte. Phase 31.
baracuda_kernels_cast_i16_i16_can_implement^⚠: Implementability check for cast_i16_i16.
baracuda_kernels_cast_i16_i16_run^⚠: Cast i16 -> i16 (identity). Phase 31.
baracuda_kernels_cast_i16_i32_can_implement^⚠: Implementability check for cast_i16_i32.
baracuda_kernels_cast_i16_i32_run^⚠: Cast i16 -> i32. Sign-extends. Phase 31.
baracuda_kernels_cast_i16_i64_can_implement^⚠: Implementability check for cast_i16_i64.
baracuda_kernels_cast_i16_i64_run^⚠: Cast i16 -> i64. Sign-extends. Phase 31.
baracuda_kernels_cast_i16_u8_can_implement^⚠: Implementability check for cast_i16_u8.
baracuda_kernels_cast_i16_u8_run^⚠: Cast i16 -> u8. Truncates to low byte then reinterprets. Phase 31.
baracuda_kernels_cast_i16_u32_can_implement^⚠: Implementability check for cast_i16_u32.
baracuda_kernels_cast_i16_u32_run^⚠: Cast i16 -> u32. Sign-extends to i32 then reinterprets. Phase 31.
baracuda_kernels_cast_i32_bf16_can_implement^⚠: Implementability check for cast_i32_bf16.
baracuda_kernels_cast_i32_bf16_run^⚠: Cast i32 -> bf16.
baracuda_kernels_cast_i32_bool_can_implement^⚠: Implementability check for cast_i32_bool.
baracuda_kernels_cast_i32_bool_run^⚠: Cast i32 -> Bool. x != 0 → 1.
baracuda_kernels_cast_i32_f16_can_implement^⚠: Implementability check for cast_i32_f16.
baracuda_kernels_cast_i32_f16_run^⚠: Cast i32 -> f16.
baracuda_kernels_cast_i32_f32_can_implement^⚠: Implementability check for cast_i32_f32.
baracuda_kernels_cast_i32_f32_run^⚠: Cast i32 -> f32.
baracuda_kernels_cast_i32_f64_can_implement^⚠: Implementability check for cast_i32_f64.
baracuda_kernels_cast_i32_f64_run^⚠: Cast i32 -> f64.
baracuda_kernels_cast_i32_i8_can_implement^⚠: Implementability check for cast_i32_i8.
baracuda_kernels_cast_i32_i8_run^⚠: Cast i32 -> i8.
baracuda_kernels_cast_i32_i16_can_implement^⚠: Implementability check for cast_i32_i16.
baracuda_kernels_cast_i32_i16_run^⚠: Cast i32 -> i16. Truncates to low 16 bits. Phase 31.
baracuda_kernels_cast_i32_i32_can_implement^⚠: Implementability check for cast_i32_i32.
baracuda_kernels_cast_i32_i32_run^⚠: Cast i32 -> i32.
baracuda_kernels_cast_i32_i64_can_implement^⚠: Implementability check for cast_i32_i64.
baracuda_kernels_cast_i32_i64_run^⚠: Cast i32 -> i64.
baracuda_kernels_cast_i32_s4_can_implement^⚠: Implementability check for cast_i32_s4.
baracuda_kernels_cast_i32_s4_run^⚠: Cast i32 -> S4 (pack: saturate to [-8, +7] then nibble-mask).
baracuda_kernels_cast_i32_u4_can_implement^⚠: Implementability check for cast_i32_u4.
baracuda_kernels_cast_i32_u4_run^⚠: Cast i32 -> U4 (pack: saturate to [0, 15] then nibble-mask).
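The "saturate then nibble-mask" packing described for i32 -> S4 / U4 can be illustrated with a small host-side sketch. All function names are hypothetical. In particular, `pack_pair` assumes the first element occupies the low nibble of each byte, which the source does not state.

```rust
/// Hypothetical reference for i32 -> S4: saturate to [-8, 7], then keep
/// the low four bits of the two's-complement representation.
fn pack_s4(x: i32) -> u8 {
    (x.clamp(-8, 7) as u8) & 0x0F
}

/// Hypothetical reference for i32 -> U4: saturate to [0, 15], then mask.
fn pack_u4(x: i32) -> u8 {
    (x.clamp(0, 15) as u8) & 0x0F
}

/// Combine two nibbles into one byte.
/// Assumption: the first element lands in the low nibble.
fn pack_pair(lo: u8, hi: u8) -> u8 {
    (hi << 4) | (lo & 0x0F)
}
```

With this model, `pack_s4(-1)` is `0x0F` and `pack_s4(-100)` saturates to `0x08`, the nibble encoding of -8.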
baracuda_kernels_cast_i32_u8_can_implement^⚠: Implementability check for cast_i32_u8.
baracuda_kernels_cast_i32_u8_run^⚠: Cast i32 -> u8.
baracuda_kernels_cast_i32_u32_can_implement^⚠: Implementability check for cast_i32_u32.
baracuda_kernels_cast_i32_u32_run^⚠: Cast i32 -> u32. Bitwise reinterpret: values with x >= 0 are preserved; negative values wrap modulo 2^32 (two’s complement). Phase 31.
baracuda_kernels_cast_i64_bf16_can_implement^⚠: Implementability check for cast_i64_bf16.
baracuda_kernels_cast_i64_bf16_run^⚠: Cast i64 -> bf16.
baracuda_kernels_cast_i64_bool_can_implement^⚠: Implementability check for cast_i64_bool.
baracuda_kernels_cast_i64_bool_run^⚠: Cast i64 -> Bool.
baracuda_kernels_cast_i64_f16_can_implement^⚠: Implementability check for cast_i64_f16.
baracuda_kernels_cast_i64_f16_run^⚠: Cast i64 -> f16.
baracuda_kernels_cast_i64_f32_can_implement^⚠: Implementability check for cast_i64_f32.
baracuda_kernels_cast_i64_f32_run^⚠: Cast i64 -> f32.
baracuda_kernels_cast_i64_f64_can_implement^⚠: Implementability check for cast_i64_f64.
baracuda_kernels_cast_i64_f64_run^⚠: Cast i64 -> f64.
baracuda_kernels_cast_i64_i8_can_implement^⚠: Implementability check for cast_i64_i8.
baracuda_kernels_cast_i64_i8_run^⚠: Cast i64 -> i8.
baracuda_kernels_cast_i64_i16_can_implement^⚠: Implementability check for cast_i64_i16.
baracuda_kernels_cast_i64_i16_run^⚠: Cast i64 -> i16. Truncates to low 16 bits. Phase 31.
baracuda_kernels_cast_i64_i32_can_implement^⚠: Implementability check for cast_i64_i32.
baracuda_kernels_cast_i64_i32_run^⚠: Cast i64 -> i32.
baracuda_kernels_cast_i64_i64_can_implement^⚠: Implementability check for cast_i64_i64.
baracuda_kernels_cast_i64_i64_run^⚠: Cast i64 -> i64.
baracuda_kernels_cast_i64_s4_can_implement^⚠: Implementability check for cast_i64_s4.
baracuda_kernels_cast_i64_s4_run^⚠: Cast i64 -> S4.
baracuda_kernels_cast_i64_u4_can_implement^⚠: Implementability check for cast_i64_u4.
baracuda_kernels_cast_i64_u4_run^⚠: Cast i64 -> U4.
baracuda_kernels_cast_i64_u8_can_implement^⚠: Implementability check for cast_i64_u8.
baracuda_kernels_cast_i64_u8_run^⚠: Cast i64 -> u8.
baracuda_kernels_cast_i64_u32_can_implement^⚠: Implementability check for cast_i64_u32.
baracuda_kernels_cast_i64_u32_run^⚠: Cast i64 -> u32. Discards the top 32 bits, keeping the low 32. Phase 31.
baracuda_kernels_cast_s4_f32_can_implement^⚠: Implementability check for cast_s4_f32.
baracuda_kernels_cast_s4_f32_run^⚠: Cast S4 -> f32.
baracuda_kernels_cast_s4_i32_can_implement^⚠: Implementability check for cast_s4_i32.
baracuda_kernels_cast_s4_i32_run^⚠: Cast S4 -> i32 (unpack: sign-extend nibble to int32).
baracuda_kernels_cast_s4_i64_can_implement^⚠: Implementability check for cast_s4_i64.
baracuda_kernels_cast_s4_i64_run^⚠: Cast S4 -> i64.
baracuda_kernels_cast_u4_f32_can_implement^⚠: Implementability check for cast_u4_f32.
baracuda_kernels_cast_u4_f32_run^⚠: Cast U4 -> f32.
baracuda_kernels_cast_u4_i32_can_implement^⚠: Implementability check for cast_u4_i32.
baracuda_kernels_cast_u4_i32_run^⚠: Cast U4 -> i32 (unpack: zero-extend nibble to int32).
baracuda_kernels_cast_u4_i64_can_implement^⚠: Implementability check for cast_u4_i64.
baracuda_kernels_cast_u4_i64_run^⚠: Cast U4 -> i64.
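The unpack direction for S4 and U4 (sign-extend versus zero-extend a nibble) can be sketched as follows. The function names are hypothetical, and the input is taken to be a single nibble already extracted into the low four bits of a byte; how the kernels locate each nibble inside a packed byte is not covered here.

```rust
/// Hypothetical reference for S4 -> i32: move the nibble into the top
/// half of an i8, then arithmetic-shift back so bit 3 becomes the sign.
fn unpack_s4(nibble: u8) -> i32 {
    (((nibble << 4) as i8) >> 4) as i32
}

/// Hypothetical reference for U4 -> i32: mask and zero-extend.
fn unpack_u4(nibble: u8) -> i32 {
    (nibble & 0x0F) as i32
}
```

The same bit pattern `0x0F` therefore decodes to -1 as S4 and to 15 as U4.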
baracuda_kernels_cast_u8_bf16_can_implement^⚠: Implementability check for cast_u8_bf16.
baracuda_kernels_cast_u8_bf16_run^⚠: Cast u8 -> bf16.
baracuda_kernels_cast_u8_f16_can_implement^⚠: Implementability check for cast_u8_f16.
baracuda_kernels_cast_u8_f16_run^⚠: Cast u8 -> f16.
baracuda_kernels_cast_u8_f32_can_implement^⚠: Implementability check for cast_u8_f32.
baracuda_kernels_cast_u8_f32_run^⚠: Cast u8 -> f32.
baracuda_kernels_cast_u8_f64_can_implement^⚠: Implementability check for cast_u8_f64.
baracuda_kernels_cast_u8_f64_run^⚠: Cast u8 -> f64.
baracuda_kernels_cast_u8_i8_can_implement^⚠: Implementability check for cast_u8_i8.
baracuda_kernels_cast_u8_i8_run^⚠: Cast u8 -> i8.
baracuda_kernels_cast_u8_i16_can_implement^⚠: Implementability check for cast_u8_i16.
baracuda_kernels_cast_u8_i16_run^⚠: Cast u8 -> i16. Zero-extends. Phase 31.
baracuda_kernels_cast_u8_i32_can_implement^⚠: Implementability check for cast_u8_i32.
baracuda_kernels_cast_u8_i32_run^⚠: Cast u8 -> i32.
baracuda_kernels_cast_u8_i64_can_implement^⚠: Implementability check for cast_u8_i64.
baracuda_kernels_cast_u8_i64_run^⚠: Cast u8 -> i64.
baracuda_kernels_cast_u8_u8_can_implement^⚠: Implementability check for cast_u8_u8.
baracuda_kernels_cast_u8_u8_run^⚠: Cast u8 -> u8.
baracuda_kernels_cast_u8_u32_can_implement^⚠: Implementability check for cast_u8_u32.
baracuda_kernels_cast_u8_u32_run^⚠: Cast u8 -> u32. Zero-extends. Phase 31.
baracuda_kernels_cast_u32_bf16_can_implement^⚠: Implementability check for cast_u32_bf16.
baracuda_kernels_cast_u32_bf16_run^⚠: Cast u32 -> bf16. Phase 31.
baracuda_kernels_cast_u32_f16_can_implement^⚠: Implementability check for cast_u32_f16.
baracuda_kernels_cast_u32_f16_run^⚠: Cast u32 -> f16. Phase 31.
baracuda_kernels_cast_u32_f32_can_implement^⚠: Implementability check for cast_u32_f32.
baracuda_kernels_cast_u32_f32_run^⚠: Cast u32 -> f32. Phase 31.
baracuda_kernels_cast_u32_f64_can_implement^⚠: Implementability check for cast_u32_f64.
baracuda_kernels_cast_u32_f64_run^⚠: Cast u32 -> f64. Phase 31.
baracuda_kernels_cast_u32_i8_can_implement^⚠: Implementability check for cast_u32_i8.
baracuda_kernels_cast_u32_i8_run^⚠: Cast u32 -> i8. Truncates to low byte then reinterprets. Phase 31.
baracuda_kernels_cast_u32_i16_can_implement^⚠: Implementability check for cast_u32_i16.
baracuda_kernels_cast_u32_i16_run^⚠: Cast u32 -> i16. Truncates to low 16 bits then reinterprets. Phase 31.
baracuda_kernels_cast_u32_i32_can_implement^⚠: Implementability check for cast_u32_i32.
baracuda_kernels_cast_u32_i32_run^⚠: Cast u32 -> i32. Bitwise reinterpret. Phase 31.
baracuda_kernels_cast_u32_i64_can_implement^⚠: Implementability check for cast_u32_i64.
baracuda_kernels_cast_u32_i64_run^⚠: Cast u32 -> i64. Zero-extends. Phase 31.
baracuda_kernels_cast_u32_u8_can_implement^⚠: Implementability check for cast_u32_u8.
baracuda_kernels_cast_u32_u8_run^⚠: Cast u32 -> u8. Truncates to low byte. Phase 31.
baracuda_kernels_cast_u32_u32_can_implement^⚠: Implementability check for cast_u32_u32.
baracuda_kernels_cast_u32_u32_run^⚠: Cast u32 -> u32 (identity). Phase 31.
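The integer-to-integer rules quoted in the cast entries (sign-extend, zero-extend, truncate, bitwise reinterpret) match what Rust's `as` operator does between integer types, which gives a convenient host-side oracle when testing. The wrapper names below are hypothetical. That the kernels agree with `as` for every input is inferred from their one-line descriptions, not something the source confirms.

```rust
/// i8 -> u32: sign-extends to 32 bits, then reinterprets as unsigned.
fn i8_to_u32(x: i8) -> u32 { x as u32 }

/// i16 -> u8: keeps the low byte.
fn i16_to_u8(x: i16) -> u8 { x as u8 }

/// i32 -> i16: keeps the low 16 bits.
fn i32_to_i16(x: i32) -> i16 { x as i16 }

/// i64 -> u32: keeps the low 32 bits.
fn i64_to_u32(x: i64) -> u32 { x as u32 }

/// u32 -> i32: same bits, signed interpretation.
fn u32_to_i32(x: u32) -> i32 { x as i32 }

/// u8 -> i16: zero-extends.
fn u8_to_i16(x: u8) -> i16 { x as i16 }
```

For instance, `i8_to_u32(-1)` is `0xFFFF_FFFF` and `i64_to_u32(0x1_0000_0005)` is `5`.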
baracuda_kernels_cholesky_batched_f32_run^⚠: Cholesky factorization (batched). Each a_array[b] is overwritten with the requested triangular factor. cuSOLVER’s potrfBatched is workspace-free internally but needs a device-resident array of device pointers — caller responsibility.
baracuda_kernels_cholesky_batched_f64_run^⚠: Cholesky factorization (batched). Each a_array[b] is overwritten with the requested triangular factor. cuSOLVER’s potrfBatched is workspace-free internally but needs a device-resident array of device pointers — caller responsibility.
baracuda_kernels_cholesky_f32_run^⚠: Cholesky factorization (non-batched). Overwrites a_inout in place with the requested triangular factor. uplo is 0 (lower, CUBLAS_FILL_MODE_LOWER) or 1 (upper, CUBLAS_FILL_MODE_UPPER).
baracuda_kernels_cholesky_f32_workspace_size^⚠: Cholesky factorization workspace size in bytes for the non-batched potrf path. Returns 0 on success and writes the byte count to *out_bytes; non-zero status on cuSOLVER failure (handle allocation / bufferSize query). Batched potrfBatched is workspace-free and has no equivalent query.
baracuda_kernels_cholesky_f64_run^⚠: Cholesky factorization (non-batched). Overwrites a_inout in place with the requested triangular factor. uplo is 0 (lower, CUBLAS_FILL_MODE_LOWER) or 1 (upper, CUBLAS_FILL_MODE_UPPER).
baracuda_kernels_cholesky_f64_workspace_size^⚠: Cholesky factorization workspace size in bytes for the non-batched potrf path. Returns 0 on success and writes the byte count to *out_bytes; non-zero status on cuSOLVER failure (handle allocation / bufferSize query). Batched potrfBatched is workspace-free and has no equivalent query.
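To make "overwrites a_inout in place with the requested triangular factor" concrete, here is a minimal CPU reference for the lower case (uplo = 0). It is a sketch for checking results on small matrices, not the cuSOLVER algorithm: the function name is hypothetical, and it assumes column-major storage with leading dimension n, following cuSOLVER's convention.

```rust
/// Hypothetical CPU reference: in-place lower Cholesky (A = L * L^T) of an
/// n x n symmetric matrix stored column-major with leading dimension n.
/// Only the lower triangle is read and written; the strictly upper part is
/// left untouched. Returns false if a non-positive pivot is encountered.
fn cholesky_lower_in_place(a: &mut [f64], n: usize) -> bool {
    for j in 0..n {
        // Diagonal entry: subtract squares of the already-computed row j of L.
        let mut d = a[j + j * n];
        for k in 0..j {
            d -= a[j + k * n] * a[j + k * n];
        }
        if d <= 0.0 {
            return false; // not positive definite
        }
        let d = d.sqrt();
        a[j + j * n] = d;
        // Column j below the diagonal.
        for i in (j + 1)..n {
            let mut s = a[i + j * n];
            for k in 0..j {
                s -= a[i + k * n] * a[j + k * n];
            }
            a[i + j * n] = s / d;
        }
    }
    true
}
```

Applied to the 2 x 2 matrix with rows (4, 2) and (2, 3), the lower triangle becomes 2, 1, and sqrt(2).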
baracuda_kernels_col2im_1d_bf16_can_implement^⚠: Implementability check for col2im_1d_bf16.
baracuda_kernels_col2im_1d_bf16_run^⚠: col2im 1-D, bf16. Caller must zero output first.
baracuda_kernels_col2im_1d_f16_can_implement^⚠: Implementability check for col2im_1d_f16.
baracuda_kernels_col2im_1d_f16_run^⚠: col2im 1-D, f16. Caller must zero output first.
baracuda_kernels_col2im_1d_f32_can_implement^⚠: Implementability check for col2im_1d_f32.
baracuda_kernels_col2im_1d_f32_run^⚠: col2im 1-D, f32. Caller must zero output first.
baracuda_kernels_col2im_1d_f64_can_implement^⚠: Implementability check for col2im_1d_f64.
baracuda_kernels_col2im_1d_f64_run^⚠: col2im 1-D, f64. Caller must zero output first.
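The "caller must zero output first" requirement is presumably there because col2im accumulates: overlapping windows add into the same output cell. The sketch below shows that behaviour on the CPU. The function name, the `[kernel, n_windows]` row-major column layout, and the absence of padding and dilation are all assumptions made for illustration; the real kernel's parameters are not given in the source.

```rust
/// Hypothetical CPU reference for 1-D col2im without padding or dilation.
/// `col` is laid out as [kernel, n_windows] (row-major, an assumption).
/// `out` is accumulated into with `+=`, so it must be zeroed beforehand.
fn col2im_1d(col: &[f32], kernel: usize, stride: usize, out: &mut [f32]) {
    let n_windows = col.len() / kernel;
    for k in 0..kernel {
        for w in 0..n_windows {
            // Window w covers output positions w*stride .. w*stride + kernel.
            out[w * stride + k] += col[k * n_windows + w];
        }
    }
}
```

With kernel 2, stride 1, and three windows of ones, a zeroed output of length 4 becomes `[1, 2, 2, 1]`; running it again without re-zeroing doubles every entry.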
baracuda_kernels_concat2_backward_bf16_can_implement^⚠: Implementability check for concat2_backward_bf16.
baracuda_kernels_concat2_backward_bf16_run^⚠: Concat2 backward (slice-split), bf16. See f32 variant.
baracuda_kernels_concat2_backward_f16_can_implement^⚠: Implementability check for concat2_backward_f16.
baracuda_kernels_concat2_backward_f16_run^⚠: Concat2 backward (slice-split), f16. See f32 variant.
baracuda_kernels_concat2_backward_f32_can_implement^⚠: Implementability check for concat2_backward_f32.
baracuda_kernels_concat2_backward_f32_run^⚠: Concat2 backward (slice-split), f32. Bit-exact, no arithmetic.
baracuda_kernels_concat2_backward_f64_can_implement^⚠: Implementability check for concat2_backward_f64.
baracuda_kernels_concat2_backward_f64_run^⚠: Concat2 backward (slice-split), f64. See f32 variant.
baracuda_kernels_concat2_bf16_can_implement^⚠: Implementability check for concat2_bf16.
baracuda_kernels_concat2_bf16_run^⚠: cat(a, b, dim), bf16, contiguous output. See f32 variant.
baracuda_kernels_concat2_f16_can_implement^⚠: Implementability check for concat2_f16.
baracuda_kernels_concat2_f16_run^⚠: cat(a, b, dim), f16, contiguous output. See f32 variant.
baracuda_kernels_concat2_f32_can_implement^⚠: Implementability check for concat2_f32.
baracuda_kernels_concat2_f32_run^⚠: cat(a, b, dim), f32, contiguous output.
baracuda_kernels_concat2_f64_can_implement^⚠: Implementability check for concat2_f64.
baracuda_kernels_concat2_f64_run^⚠: cat(a, b, dim), f64, contiguous output. See f32 variant.
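The forward cat(a, b, dim) and its slice-split backward can be sketched on the CPU by viewing each tensor as `[outer, dim_len, inner]`. The function names and this three-factor flattening are assumptions for illustration. The backward pass only copies, which is consistent with the "bit-exact, no arithmetic" note on the f32 entry.

```rust
/// Hypothetical CPU reference for cat(a, b, dim) with contiguous inputs.
/// `a` is viewed as [outer, da, inner] and `b` as [outer, db, inner].
fn concat2(a: &[f32], b: &[f32], outer: usize, da: usize, db: usize, inner: usize) -> Vec<f32> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    for o in 0..outer {
        out.extend_from_slice(&a[o * da * inner..(o + 1) * da * inner]);
        out.extend_from_slice(&b[o * db * inner..(o + 1) * db * inner]);
    }
    out
}

/// Hypothetical backward: split the upstream gradient back into the two
/// input-shaped pieces. Pure copies, no arithmetic.
fn concat2_backward(dy: &[f32], outer: usize, da: usize, db: usize, inner: usize) -> (Vec<f32>, Vec<f32>) {
    let (mut ga, mut gb) = (Vec::new(), Vec::new());
    let row = (da + db) * inner;
    for o in 0..outer {
        let r = &dy[o * row..(o + 1) * row];
        ga.extend_from_slice(&r[..da * inner]);
        gb.extend_from_slice(&r[da * inner..]);
    }
    (ga, gb)
}
```

Feeding the forward output into the backward function returns the original `a` and `b` unchanged.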
baracuda_kernels_contiguize_b1_can_implement^⚠: Implementability check for contiguize_b1.
baracuda_kernels_contiguize_b1_run^⚠: Contiguize, 1-byte element (Bool, S8, U8, Fp8E4M3, Fp8E5M2).
baracuda_kernels_contiguize_b2_can_implement^⚠: Implementability check for contiguize_b2.
baracuda_kernels_contiguize_b2_run^⚠: Contiguize, 2-byte element (f16, bf16).
baracuda_kernels_contiguize_b4_can_implement^⚠: Implementability check for contiguize_b4.
baracuda_kernels_contiguize_b4_run^⚠: Contiguize, 4-byte element (f32, F32Strict, i32).
baracuda_kernels_contiguize_b8_can_implement^⚠: Implementability check for contiguize_b8.
baracuda_kernels_contiguize_b8_run^⚠: Contiguize, 8-byte element (f64, i64, Complex32).
baracuda_kernels_contiguize_b16_can_implement^⚠: Implementability check for contiguize_b16.
baracuda_kernels_contiguize_b16_run^⚠: Contiguize, 16-byte element (Complex64).
baracuda_kernels_contiguize_nibble_can_implement^⚠: Implementability check for contiguize_nibble.
baracuda_kernels_contiguize_nibble_run^⚠: Contiguize, nibble-packed (S4 / U4). Returns status 3 (Unsupported) when the source’s innermost stride is not one of {1, -1, 2} — i.e. when the source layout breaks nibble alignment.
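Contiguizing amounts to a strided gather into a dense, row-major buffer. The sketch below models that on the CPU together with the nibble stride rule quoted above. Both function names are hypothetical, strides are measured in elements (an assumption), and the real kernels operate on raw device pointers rather than slices.

```rust
/// Hypothetical CPU reference: gather a strided view into a contiguous,
/// row-major Vec. `strides` and `offset` are in elements (an assumption).
fn contiguize<T: Copy>(src: &[T], shape: &[usize], strides: &[isize], offset: isize) -> Vec<T> {
    let numel: usize = shape.iter().product();
    let mut out = Vec::with_capacity(numel);
    let mut idx = vec![0usize; shape.len()];
    for _ in 0..numel {
        let pos = offset
            + idx.iter().zip(strides).map(|(&i, &s)| i as isize * s).sum::<isize>();
        out.push(src[pos as usize]);
        // Advance the multi-index, last dimension fastest.
        for d in (0..shape.len()).rev() {
            idx[d] += 1;
            if idx[d] < shape[d] {
                break;
            }
            idx[d] = 0;
        }
    }
    out
}

/// Stride rule from the nibble entry: any other innermost stride would be
/// reported as status 3 (Unsupported).
fn nibble_stride_supported(innermost_stride: isize) -> bool {
    matches!(innermost_stride, 1 | -1 | 2)
}
```

For a 2 x 3 row-major buffer, the transposed view (shape `[3, 2]`, strides `[1, 3]`) gathers to `[0, 3, 1, 4, 2, 5]`.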
baracuda_kernels_curand_normal_f32_run^⚠: Sample numel f32 cells from Normal(mean, stddev).
baracuda_kernels_curand_normal_f32_workspace_size^⚠: Normal-sampler workspace size in bytes for f32 — always 0.
baracuda_kernels_curand_normal_f64_run^⚠: Sample numel f64 cells from Normal(mean, stddev).
baracuda_kernels_curand_normal_f64_workspace_size^⚠: Normal-sampler workspace size in bytes for f64 — always 0.
baracuda_kernels_curand_uniform_f32_run^⚠: Sample numel f32 cells from Uniform(low, high].
baracuda_kernels_curand_uniform_f32_workspace_size^⚠: Uniform-sampler workspace size in bytes for f32 — always 0.
baracuda_kernels_curand_uniform_f64_run^⚠: Sample numel f64 cells from Uniform(low, high].
baracuda_kernels_curand_uniform_f64_workspace_size^⚠: Uniform-sampler workspace size in bytes for f64 — always 0.
baracuda_kernels_dequantize_per_channel_backward_bf16_can_implement^⚠: Implementability check for dequantize_per_channel_backward_bf16.
baracuda_kernels_dequantize_per_channel_backward_bf16_run^⚠: dequantize_per_channel_backward — bf16.
baracuda_kernels_dequantize_per_channel_backward_f16_can_implement^⚠: Implementability check for dequantize_per_channel_backward_f16.
baracuda_kernels_dequantize_per_channel_backward_f16_run^⚠: dequantize_per_channel_backward — f16.
baracuda_kernels_dequantize_per_channel_backward_f32_can_implement^⚠: Implementability check for dequantize_per_channel_backward_f32.
baracuda_kernels_dequantize_per_channel_backward_f32_run^⚠: dq[i] = dy[i] * scale[c]. f32.
baracuda_kernels_dequantize_per_channel_backward_f64_can_implement^⚠: Implementability check for dequantize_per_channel_backward_f64.
baracuda_kernels_dequantize_per_channel_backward_f64_run^⚠: dequantize_per_channel_backward — f64.
baracuda_kernels_dequantize_per_channel_bf16_s8_can_implement^⚠: Implementability check for dequantize_per_channel_bf16_s8.
baracuda_kernels_dequantize_per_channel_bf16_s8_run^⚠: dequantize_per_channel — s8 → bf16.
baracuda_kernels_dequantize_per_channel_bf16_u8_can_implement^⚠: Implementability check for dequantize_per_channel_bf16_u8.
baracuda_kernels_dequantize_per_channel_bf16_u8_run^⚠: dequantize_per_channel — u8 → bf16.
baracuda_kernels_dequantize_per_channel_f16_s8_can_implement^⚠: Implementability check for dequantize_per_channel_f16_s8.
baracuda_kernels_dequantize_per_channel_f16_s8_run^⚠: dequantize_per_channel — s8 → f16.
baracuda_kernels_dequantize_per_channel_f16_u8_can_implement^⚠: Implementability check for dequantize_per_channel_f16_u8.
baracuda_kernels_dequantize_per_channel_f16_u8_run^⚠: dequantize_per_channel — u8 → f16.
baracuda_kernels_dequantize_per_channel_f32_s8_can_implement^⚠: Implementability check for dequantize_per_channel_f32_s8.
baracuda_kernels_dequantize_per_channel_f32_s8_run^⚠: x[i] = scale[c] * (q[i] - zp[c]). s8 → f32.
baracuda_kernels_dequantize_per_channel_f32_u8_can_implement^⚠: Implementability check for dequantize_per_channel_f32_u8.
baracuda_kernels_dequantize_per_channel_f32_u8_run^⚠: dequantize_per_channel — u8 → f32.
baracuda_kernels_dequantize_per_channel_f64_s8_can_implement^⚠: Implementability check for dequantize_per_channel_f64_s8.
baracuda_kernels_dequantize_per_channel_f64_s8_run^⚠: dequantize_per_channel — s8 → f64.
baracuda_kernels_dequantize_per_channel_f64_u8_can_implement^⚠: Implementability check for dequantize_per_channel_f64_u8.
baracuda_kernels_dequantize_per_channel_f64_u8_run^⚠: dequantize_per_channel — u8 → f64.
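The per-channel formulas quoted above, x[i] = scale[c] * (q[i] - zp[c]) forward and dq[i] = dy[i] * scale[c] backward, can be checked against a small CPU reference. The function names, the `[outer, channels, inner]` layout used to derive c from i, and the `i32` zero-point type are assumptions; the source does not specify them.

```rust
/// Hypothetical CPU reference for per-channel dequantization, s8 -> f32.
/// Assumption: the tensor is contiguous as [outer, channels, inner], so the
/// channel of flat index i is (i / inner) % channels.
fn dequantize_per_channel(q: &[i8], scale: &[f32], zp: &[i32], channels: usize, inner: usize) -> Vec<f32> {
    q.iter()
        .enumerate()
        .map(|(i, &qi)| {
            let c = (i / inner) % channels;
            scale[c] * (qi as i32 - zp[c]) as f32
        })
        .collect()
}

/// Hypothetical backward: dq[i] = dy[i] * scale[c], same channel mapping.
fn dequantize_per_channel_backward(dy: &[f32], scale: &[f32], channels: usize, inner: usize) -> Vec<f32> {
    dy.iter()
        .enumerate()
        .map(|(i, &g)| g * scale[(i / inner) % channels])
        .collect()
}
```

With two channels, `scale = [0.5, 2.0]` and `zp = [0, 10]`, the input `[10, 20, 30, 40]` (inner = 2) dequantizes to `[5, 10, 40, 60]`.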
baracuda_kernels_dequantize_per_group_backward_bf16_can_implement^⚠: Implementability check for dequantize_per_group_backward_bf16.
baracuda_kernels_dequantize_per_group_backward_bf16_run^⚠: Dequant BW — bf16.
baracuda_kernels_dequantize_per_group_backward_f16_can_implement^⚠: Implementability check for dequantize_per_group_backward_f16.
baracuda_kernels_dequantize_per_group_backward_f16_run^⚠: Dequant BW — f16.
baracuda_kernels_dequantize_per_group_backward_f32_can_implement^⚠: Implementability check for dequantize_per_group_backward_f32.
baracuda_kernels_dequantize_per_group_backward_f32_run^⚠: Dequant BW — f32.
baracuda_kernels_dequantize_per_group_backward_f64_can_implement^⚠: Implementability check for dequantize_per_group_backward_f64.
baracuda_kernels_dequantize_per_group_backward_f64_run^⚠: Dequant BW — f64.
baracuda_kernels_dequantize_per_group_bf16_s8_can_implement^⚠: Implementability check for dequantize_per_group_bf16_s8.
baracuda_kernels_dequantize_per_group_bf16_s8_run^⚠: Dequant — bf16, s8.
baracuda_kernels_dequantize_per_group_bf16_u8_can_implement^⚠: Implementability check for dequantize_per_group_bf16_u8.
baracuda_kernels_dequantize_per_group_bf16_u8_run^⚠: Dequant — bf16, u8.
baracuda_kernels_dequantize_per_group_f16_s8_can_implement^⚠: Implementability check for dequantize_per_group_f16_s8.
baracuda_kernels_dequantize_per_group_f16_s8_run^⚠: Dequant — f16, s8.
baracuda_kernels_dequantize_per_group_f16_u8_can_implement^⚠: Implementability check for dequantize_per_group_f16_u8.
baracuda_kernels_dequantize_per_group_f16_u8_run^⚠: Dequant — f16, u8.
baracuda_kernels_dequantize_per_group_f32_s8_can_implement^⚠: Implementability check for dequantize_per_group_f32_s8.
baracuda_kernels_dequantize_per_group_f32_s8_run^⚠: Dequant — f32, s8.
baracuda_kernels_dequantize_per_group_f32_u8_can_implement^⚠: Implementability check for dequantize_per_group_f32_u8.
baracuda_kernels_dequantize_per_group_f32_u8_run^⚠: Dequant — f32, u8.
baracuda_kernels_dequantize_per_group_f64_s8_can_implement^⚠: Implementability check for dequantize_per_group_f64_s8.
baracuda_kernels_dequantize_per_group_f64_s8_run^⚠: Dequant — f64, s8.
baracuda_kernels_dequantize_per_group_f64_u8_can_implement^⚠: Implementability check for dequantize_per_group_f64_u8.
baracuda_kernels_dequantize_per_group_f64_u8_run^⚠: Dequant — f64, u8.
baracuda_kernels_dequantize_per_tensor_backward_bf16_can_implement^⚠: Implementability check for dequantize_per_tensor_backward_bf16.
baracuda_kernels_dequantize_per_tensor_backward_bf16_run^⚠: dequantize_per_tensor_backward — bf16.
baracuda_kernels_dequantize_per_tensor_backward_f16_can_implement^⚠: Implementability check for dequantize_per_tensor_backward_f16.
baracuda_kernels_dequantize_per_tensor_backward_f16_run^⚠: dequantize_per_tensor_backward — f16.
baracuda_kernels_dequantize_per_tensor_backward_f32_can_implement^⚠: Implementability check for dequantize_per_tensor_backward_f32.
baracuda_kernels_dequantize_per_tensor_backward_f32_run^⚠: dq = dy * scale. f32.
baracuda_kernels_dequantize_per_tensor_backward_f64_can_implement^⚠: Implementability check for dequantize_per_tensor_backward_f64.
baracuda_kernels_dequantize_per_tensor_backward_f64_run^⚠: dequantize_per_tensor_backward — f64.
baracuda_kernels_dequantize_per_tensor_bf16_s8_can_implement^⚠: Implementability check for dequantize_per_tensor_bf16_s8.
baracuda_kernels_dequantize_per_tensor_bf16_s8_run^⚠: dequantize_per_tensor — s8 → bf16.
baracuda_kernels_dequantize_per_tensor_bf16_u8_can_implement^⚠: Implementability check for dequantize_per_tensor_bf16_u8.
baracuda_kernels_dequantize_per_tensor_bf16_u8_run^⚠: dequantize_per_tensor — u8 → bf16.
baracuda_kernels_dequantize_per_tensor_f16_s8_can_implement^⚠: Implementability check for dequantize_per_tensor_f16_s8.
baracuda_kernels_dequantize_per_tensor_f16_s8_run^⚠: dequantize_per_tensor — s8 → f16.
baracuda_kernels_dequantize_per_tensor_f16_u8_can_implement^⚠: Implementability check for dequantize_per_tensor_f16_u8.
baracuda_kernels_dequantize_per_tensor_f16_u8_run^⚠: dequantize_per_tensor — u8 → f16.
baracuda_kernels_dequantize_per_tensor_f32_s8_can_implement^⚠: Implementability check for dequantize_per_tensor_f32_s8.
baracuda_kernels_dequantize_per_tensor_f32_s8_run^⚠: x = scale * (q - zp). s8 → f32.
baracuda_kernels_dequantize_per_tensor_f32_u8_can_implement^⚠: Implementability check for dequantize_per_tensor_f32_u8.
baracuda_kernels_dequantize_per_tensor_f32_u8_run^⚠: dequantize_per_tensor — u8 → f32.
baracuda_kernels_dequantize_per_tensor_f64_s8_can_implement^⚠: Implementability check for dequantize_per_tensor_f64_s8.
baracuda_kernels_dequantize_per_tensor_f64_s8_run^⚠: dequantize_per_tensor — s8 → f64.
baracuda_kernels_dequantize_per_tensor_f64_u8_can_implement^⚠: Implementability check for dequantize_per_tensor_f64_u8.
baracuda_kernels_dequantize_per_tensor_f64_u8_run^⚠: dequantize_per_tensor — u8 → f64.
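The per-tensor formula quoted for the s8 → f32 entry, x = scale * (q - zp), can be checked on the host with a small reference. This is a sketch only: the function name is hypothetical, and treating zp as an i32 is an assumption, since this index does not show the FFI signature.

```rust
/// CPU reference for the documented formula x = scale * (q - zp).
/// Hypothetical helper for testing; not part of this crate.
/// Assumption: the zero point is an integer in the quantized domain.
fn dequantize_per_tensor_ref(q: &[i8], scale: f32, zero_point: i32) -> Vec<f32> {
    q.iter()
        // Widen to i32 before subtracting so that q - zp cannot wrap.
        .map(|&v| scale * (i32::from(v) - zero_point) as f32)
        .collect()
}
```

Comparing a kernel's output against a reference like this on a few edge values (qmin, 0, qmax) is a cheap way to confirm the sign and placement of the zero point.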
baracuda_kernels_dequantize_per_token_backward_bf16_can_implement^⚠: Implementability check for dequantize_per_token_backward_bf16.
baracuda_kernels_dequantize_per_token_backward_bf16_run^⚠: Dequant BW — bf16.
baracuda_kernels_dequantize_per_token_backward_f16_can_implement^⚠: Implementability check for dequantize_per_token_backward_f16.
baracuda_kernels_dequantize_per_token_backward_f16_run^⚠: Dequant BW — f16.
baracuda_kernels_dequantize_per_token_backward_f32_can_implement^⚠: Implementability check for dequantize_per_token_backward_f32.
baracuda_kernels_dequantize_per_token_backward_f32_run^⚠: Dequant BW — f32.
baracuda_kernels_dequantize_per_token_backward_f64_can_implement^⚠: Implementability check for dequantize_per_token_backward_f64.
baracuda_kernels_dequantize_per_token_backward_f64_run^⚠: Dequant BW — f64.
baracuda_kernels_dequantize_per_token_bf16_s8_can_implement^⚠: Implementability check for dequantize_per_token_bf16_s8.
baracuda_kernels_dequantize_per_token_bf16_s8_run^⚠: dequantize_per_token — q s8 → y bf16.
baracuda_kernels_dequantize_per_token_bf16_u8_can_implement^⚠: Implementability check for dequantize_per_token_bf16_u8.
baracuda_kernels_dequantize_per_token_bf16_u8_run^⚠: dequantize_per_token — q u8 → y bf16.
baracuda_kernels_dequantize_per_token_f16_s8_can_implement^⚠: Implementability check for dequantize_per_token_f16_s8.
baracuda_kernels_dequantize_per_token_f16_s8_run^⚠: dequantize_per_token — q s8 → y f16.
baracuda_kernels_dequantize_per_token_f16_u8_can_implement^⚠: Implementability check for dequantize_per_token_f16_u8.
baracuda_kernels_dequantize_per_token_f16_u8_run^⚠: dequantize_per_token — q u8 → y f16.
baracuda_kernels_dequantize_per_token_f32_s8_can_implement^⚠: Implementability check for dequantize_per_token_f32_s8.
baracuda_kernels_dequantize_per_token_f32_s8_run^⚠: dequantize_per_token — q s8 → y f32.
baracuda_kernels_dequantize_per_token_f32_u8_can_implement^⚠: Implementability check for dequantize_per_token_f32_u8.
baracuda_kernels_dequantize_per_token_f32_u8_run^⚠: dequantize_per_token — q u8 → y f32.
baracuda_kernels_dequantize_per_token_f64_s8_can_implement^⚠: Implementability check for dequantize_per_token_f64_s8.
baracuda_kernels_dequantize_per_token_f64_s8_run^⚠: dequantize_per_token — q s8 → y f64.
baracuda_kernels_dequantize_per_token_f64_u8_can_implement^⚠: Implementability check for dequantize_per_token_f64_u8.
baracuda_kernels_dequantize_per_token_f64_u8_run^⚠: dequantize_per_token — q u8 → y f64.
baracuda_kernels_dequantize_q2_K_can_implement^⚠: Implementability check for dequantize_q2_K.
baracuda_kernels_dequantize_q2_K_run^⚠: GGUF Q2_K dequantize → f32. numel must be a multiple of 256.
baracuda_kernels_dequantize_q3_K_can_implement^⚠: Implementability check for dequantize_q3_K.
baracuda_kernels_dequantize_q3_K_run^⚠: GGUF Q3_K dequantize → f32. Safety: same contract as Q2_K.
baracuda_kernels_dequantize_q4_0_can_implement^⚠: Implementability check for dequantize_q4_0.
baracuda_kernels_dequantize_q4_0_run^⚠: GGUF Q4_0 block-format dequantize → f32. numel must be a multiple of 32. Safety: x and y must be device-resident; stream must be valid.
baracuda_kernels_dequantize_q4_1_can_implement^⚠: Implementability check for dequantize_q4_1.
baracuda_kernels_dequantize_q4_1_run^⚠: GGUF Q4_1 dequantize → f32. Safety: same contract as Q4_0.
baracuda_kernels_dequantize_q4_K_can_implement^⚠: Implementability check for dequantize_q4_K.
baracuda_kernels_dequantize_q4_K_run^⚠: GGUF Q4_K dequantize → f32. Safety: same contract as Q2_K.
baracuda_kernels_dequantize_q5_0_can_implement^⚠: Implementability check for dequantize_q5_0.
baracuda_kernels_dequantize_q5_0_run^⚠: GGUF Q5_0 dequantize → f32. Safety: same contract as Q4_0.
baracuda_kernels_dequantize_q5_1_can_implement^⚠: Implementability check for dequantize_q5_1.
baracuda_kernels_dequantize_q5_1_run^⚠: GGUF Q5_1 dequantize → f32. Safety: same contract as Q4_0.
baracuda_kernels_dequantize_q5_K_can_implement^⚠: Implementability check for dequantize_q5_K.
baracuda_kernels_dequantize_q5_K_run^⚠: GGUF Q5_K dequantize → f32. Safety: same contract as Q2_K.
baracuda_kernels_dequantize_q6_K_can_implement^⚠: Implementability check for dequantize_q6_K.
baracuda_kernels_dequantize_q6_K_run^⚠: GGUF Q6_K dequantize → f32. Safety: same contract as Q2_K.
baracuda_kernels_dequantize_q8_0_can_implement^⚠: Implementability check for dequantize_q8_0.
baracuda_kernels_dequantize_q8_0_run^⚠: GGUF Q8_0 dequantize → f32. Safety: same contract as Q4_0.
baracuda_kernels_dequantize_q8_K_can_implement^⚠: Implementability check for dequantize_q8_K.
baracuda_kernels_dequantize_q8_K_run^⚠: GGUF Q8_K dequantize → f32. Safety: same contract as Q2_K.
baracuda_kernels_dropout_backward_f32_can_implement^⚠: Implementability check for dropout_backward_f32.
baracuda_kernels_dropout_backward_f32_run^⚠: Dropout backward (f32). Writes dx[i] = dy[i] * mask[i] * scale where scale = 1 / (1 - p).
baracuda_kernels_dropout_backward_f64_can_implement^⚠: Implementability check for dropout_backward_f64.
baracuda_kernels_dropout_backward_f64_run^⚠: Dropout backward (f64).
baracuda_kernels_dropout_f32_can_implement^⚠: Implementability check for dropout_f32.
baracuda_kernels_dropout_f32_run^⚠: Dropout forward (f32). Writes:
baracuda_kernels_dropout_f64_can_implement^⚠: Implementability check for dropout_f64.
baracuda_kernels_dropout_f64_run^⚠: Dropout forward (f64). Same shape as the f32 variant.
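The dropout backward rule stated above, dx[i] = dy[i] * mask[i] * scale with scale = 1 / (1 - p), is simple enough to mirror on the host when validating a kernel run. In the sketch below the function name is hypothetical and the mask is assumed to be one byte per element holding 0 or 1; the actual mask encoding is not shown in this index.

```rust
/// CPU reference for dx[i] = dy[i] * mask[i] * scale, scale = 1 / (1 - p).
/// Hypothetical helper; the u8 0/1 mask layout is an assumption.
fn dropout_backward_ref(dy: &[f32], mask: &[u8], p: f32) -> Vec<f32> {
    assert_eq!(dy.len(), mask.len(), "dy and mask must have the same length");
    assert!((0.0..1.0).contains(&p), "p must lie in [0, 1)");
    let scale = 1.0 / (1.0 - p);
    dy.iter()
        .zip(mask)
        // Dropped elements (mask == 0) receive a zero gradient.
        .map(|(&g, &m)| g * f32::from(m) * scale)
        .collect()
}
```

With p = 0.5 the surviving gradients are doubled, which keeps the expected value of the activation unchanged between training and inference.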
baracuda_kernels_dynamic_range_quantize_per_token_sym_f32_s8_can_implement^⚠: Implementability check for dynamic_range_quantize_per_token_sym_f32_s8.
baracuda_kernels_dynamic_range_quantize_per_token_sym_f32_s8_run^⚠: dynamic_range_quantize_per_token_sym — f32 → s8.
baracuda_kernels_dynamic_range_quantize_per_token_sym_f64_s8_can_implement^⚠: Implementability check for dynamic_range_quantize_per_token_sym_f64_s8.
baracuda_kernels_dynamic_range_quantize_per_token_sym_f64_s8_run^⚠: dynamic_range_quantize_per_token_sym — f64 → s8.
baracuda_kernels_eig_run^⚠: General eigendecomposition via Xgeev. a_inout is destroyed in place. dtype_tag selects between f32 / f64 / Complex32 / Complex64 (matches the input dtype; outputs use the same dtype). For real input, w_out is [2 * n] (packed wr/wi); for complex input, [n]. Workspace is split host + device per cuSOLVER’s 64-bit API convention.
baracuda_kernels_eig_workspace_size^⚠: Eig workspace sizes (Xgeev). Writes two byte counts — device + host. Caller must size both.
baracuda_kernels_eigh_c32_run^⚠: Hermitian eigendecomposition (Complex32). Eigenvalues are real f32 (the Hermitian eigenvalue spectrum is always real); the eigenvalues_out buffer is f32[n], not Complex32[n].
baracuda_kernels_eigh_c32_workspace_size^⚠: Hermitian eigendecomposition workspace size (Complex32).
baracuda_kernels_eigh_c64_run^⚠: Hermitian eigendecomposition (Complex64). Eigenvalues are real f64; eigenvalues_out is f64[n], not Complex64[n].
baracuda_kernels_eigh_c64_workspace_size^⚠: Hermitian eigendecomposition workspace size (Complex64).
baracuda_kernels_eigh_f32_run^⚠: Symmetric eigendecomposition A · v = λ · v. a_inout is overwritten with the eigenvector matrix (column-major); eigenvalues_out receives the n eigenvalues sorted ascending.
baracuda_kernels_eigh_f32_workspace_size^⚠: Eigh workspace size in bytes for the real symmetric syevd path.
baracuda_kernels_eigh_f64_run^⚠: Symmetric eigendecomposition A · v = λ · v. a_inout is overwritten with the eigenvector matrix (column-major); eigenvalues_out receives the n eigenvalues sorted ascending.
baracuda_kernels_eigh_f64_workspace_size^⚠: Eigh workspace size in bytes for the real symmetric syevd path.
baracuda_kernels_embedding_backward_f32_can_implement^⚠: Implementability check for embedding_backward_f32.
baracuda_kernels_embedding_backward_f32_run^⚠: embedding BW — dweight[indices[n], :] += dout[n, :] (atomicAdd), skipping rows where indices[n] == padding_idx. f32.
baracuda_kernels_embedding_backward_f64_can_implement^⚠: Implementability check for embedding_backward_f64.
baracuda_kernels_embedding_backward_f64_run^⚠: embedding BW — f64.
baracuda_kernels_embedding_backward_i64idx_f32_can_implement^⚠: Implementability check for embedding_backward_i64idx_f32.
baracuda_kernels_embedding_backward_i64idx_f32_run^⚠: embedding BW — f32, i64 indices.
baracuda_kernels_embedding_backward_i64idx_f64_can_implement^⚠: Implementability check for embedding_backward_i64idx_f64.
baracuda_kernels_embedding_backward_i64idx_f64_run^⚠: embedding BW — f64, i64 indices.
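The embedding backward entries above accumulate dweight[indices[n], :] += dout[n, :] and skip rows whose index equals padding_idx. A sequential host reference, sketched under the assumption of row-major, densely packed buffers (the function name and argument order are hypothetical), might look like this:

```rust
/// CPU reference for the embedding backward scatter-add.
/// Hypothetical helper; assumes row-major [vocab, dim] dweight and [n, dim] dout.
fn embedding_backward_ref(
    dweight: &mut [f32],
    indices: &[i32],
    dout: &[f32],
    dim: usize,
    padding_idx: i32,
) {
    assert_eq!(dout.len(), indices.len() * dim);
    for (n, &idx) in indices.iter().enumerate() {
        if idx == padding_idx {
            continue; // padding rows contribute no gradient
        }
        assert!(idx >= 0, "negative index that is not padding_idx");
        let row = idx as usize * dim;
        for c in 0..dim {
            // The GPU kernel uses atomicAdd here; a sequential loop needs no atomics.
            dweight[row + c] += dout[n * dim + c];
        }
    }
}
```

Because the device version uses floating-point atomicAdd, the summation order for repeated indices is not fixed, so comparisons against this reference should use a tolerance rather than bit equality.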
baracuda_kernels_embedding_bag_backward_f32_can_implement^⚠: Implementability check for embedding_bag_backward_f32.
baracuda_kernels_embedding_bag_backward_f32_run^⚠: embedding_bag BW — atomicAdd into dweight. f32.
baracuda_kernels_embedding_bag_backward_f64_can_implement^⚠: Implementability check for embedding_bag_backward_f64.
baracuda_kernels_embedding_bag_backward_f64_run^⚠: embedding_bag BW — f64.
baracuda_kernels_embedding_bag_backward_i64idx_f32_can_implement^⚠: Implementability check for embedding_bag_backward_i64idx_f32.
baracuda_kernels_embedding_bag_backward_i64idx_f32_run^⚠: embedding_bag BW — f32, i64 indices.
baracuda_kernels_embedding_bag_backward_i64idx_f64_can_implement^⚠: Implementability check for embedding_bag_backward_i64idx_f64.
baracuda_kernels_embedding_bag_backward_i64idx_f64_run^⚠: embedding_bag BW — f64, i64 indices.
baracuda_kernels_embedding_bag_bf16_can_implement^⚠: Implementability check for embedding_bag_bf16.
baracuda_kernels_embedding_bag_bf16_run^⚠: embedding_bag FW — bf16.
baracuda_kernels_embedding_bag_f16_can_implement^⚠: Implementability check for embedding_bag_f16.
baracuda_kernels_embedding_bag_f16_run^⚠: embedding_bag FW — f16.
baracuda_kernels_embedding_bag_f32_can_implement^⚠: Implementability check for embedding_bag_f32.
baracuda_kernels_embedding_bag_f32_run^⚠: embedding_bag FW — f32.
baracuda_kernels_embedding_bag_f64_can_implement^⚠: Implementability check for embedding_bag_f64.
baracuda_kernels_embedding_bag_f64_run^⚠: embedding_bag FW — f64.
baracuda_kernels_embedding_bag_i64idx_bf16_can_implement^⚠: Implementability check for embedding_bag_i64idx_bf16.
baracuda_kernels_embedding_bag_i64idx_bf16_run^⚠: embedding_bag FW — bf16, i64 indices.
baracuda_kernels_embedding_bag_i64idx_f16_can_implement^⚠: Implementability check for embedding_bag_i64idx_f16.
baracuda_kernels_embedding_bag_i64idx_f16_run^⚠: embedding_bag FW — f16, i64 indices.
baracuda_kernels_embedding_bag_i64idx_f32_can_implement^⚠: Implementability check for embedding_bag_i64idx_f32.
baracuda_kernels_embedding_bag_i64idx_f32_run^⚠: embedding_bag FW — f32, i64 indices.
baracuda_kernels_embedding_bag_i64idx_f64_can_implement^⚠: Implementability check for embedding_bag_i64idx_f64.
baracuda_kernels_embedding_bag_i64idx_f64_run^⚠: embedding_bag FW — f64, i64 indices.
baracuda_kernels_embedding_bag_max_backward_f32_can_implement^⚠: Implementability check for embedding_bag_max_backward_f32.
baracuda_kernels_embedding_bag_max_backward_f32_run^⚠: embedding_bag_max BW — f32. Index dtype is fixed at i32 (set by the FW’s out_index output).
baracuda_kernels_embedding_bag_max_backward_f64_can_implement^⚠: Implementability check for embedding_bag_max_backward_f64.
baracuda_kernels_embedding_bag_max_backward_f64_run^⚠: embedding_bag_max BW — f64.
baracuda_kernels_embedding_bag_max_bf16_can_implement^⚠: Implementability check for embedding_bag_max_bf16.
baracuda_kernels_embedding_bag_max_bf16_run^⚠: embedding_bag_max FW — bf16.
baracuda_kernels_embedding_bag_max_f16_can_implement^⚠: Implementability check for embedding_bag_max_f16.
baracuda_kernels_embedding_bag_max_f16_run^⚠: embedding_bag_max FW — f16.
baracuda_kernels_embedding_bag_max_f32_can_implement^⚠: Implementability check for embedding_bag_max_f32.
baracuda_kernels_embedding_bag_max_f32_run^⚠: embedding_bag Max-mode FW — f32 (i32 indices).
baracuda_kernels_embedding_bag_max_f64_can_implement^⚠: Implementability check for embedding_bag_max_f64.
baracuda_kernels_embedding_bag_max_f64_run^⚠: embedding_bag_max FW — f64.
baracuda_kernels_embedding_bag_max_i64idx_bf16_can_implement^⚠: Implementability check for embedding_bag_max_i64idx_bf16.
baracuda_kernels_embedding_bag_max_i64idx_bf16_run^⚠: embedding_bag_max FW — bf16, i64 indices.
baracuda_kernels_embedding_bag_max_i64idx_f16_can_implement^⚠: Implementability check for embedding_bag_max_i64idx_f16.
baracuda_kernels_embedding_bag_max_i64idx_f16_run^⚠: embedding_bag_max FW — f16, i64 indices.
baracuda_kernels_embedding_bag_max_i64idx_f32_can_implement^⚠: Implementability check for embedding_bag_max_i64idx_f32.
baracuda_kernels_embedding_bag_max_i64idx_f32_run^⚠: embedding_bag_max FW — f32, i64 indices.
baracuda_kernels_embedding_bag_max_i64idx_f64_can_implement^⚠: Implementability check for embedding_bag_max_i64idx_f64.
baracuda_kernels_embedding_bag_max_i64idx_f64_run^⚠: embedding_bag_max FW — f64, i64 indices.
baracuda_kernels_embedding_bf16_can_implement^⚠: Implementability check for embedding_bf16.
baracuda_kernels_embedding_bf16_run^⚠: embedding FW — bf16.
baracuda_kernels_embedding_f16_can_implement^⚠: Implementability check for embedding_f16.
baracuda_kernels_embedding_f16_run^⚠: embedding FW — f16.
baracuda_kernels_embedding_f32_can_implement^⚠: Implementability check for embedding_f32.
baracuda_kernels_embedding_f32_run^⚠: embedding FW — f32 (pure copy).
baracuda_kernels_embedding_f64_can_implement^⚠: Implementability check for embedding_f64.
baracuda_kernels_embedding_f64_run^⚠: embedding FW — f64.
baracuda_kernels_embedding_i64idx_bf16_can_implement^⚠: Implementability check for embedding_i64idx_bf16.
baracuda_kernels_embedding_i64idx_bf16_run^⚠: embedding FW — bf16, i64 indices.
baracuda_kernels_embedding_i64idx_f16_can_implement^⚠: Implementability check for embedding_i64idx_f16.
baracuda_kernels_embedding_i64idx_f16_run^⚠: embedding FW — f16, i64 indices.
baracuda_kernels_embedding_i64idx_f32_can_implement^⚠: Implementability check for embedding_i64idx_f32.
baracuda_kernels_embedding_i64idx_f32_run^⚠: embedding FW — f32, i64 indices.
baracuda_kernels_embedding_i64idx_f64_can_implement^⚠: Implementability check for embedding_i64idx_f64.
baracuda_kernels_embedding_i64idx_f64_run^⚠: embedding FW — f64, i64 indices.
baracuda_kernels_fake_quantize_backward_bf16_can_implement^⚠: Implementability check for fake_quantize_backward_bf16.
baracuda_kernels_fake_quantize_backward_bf16_run^⚠: fake_quantize_backward — bf16.
baracuda_kernels_fake_quantize_backward_f16_can_implement^⚠: Implementability check for fake_quantize_backward_f16.
baracuda_kernels_fake_quantize_backward_f16_run^⚠: fake_quantize_backward — f16.
baracuda_kernels_fake_quantize_backward_f32_can_implement^⚠: Implementability check for fake_quantize_backward_f32.
baracuda_kernels_fake_quantize_backward_f32_run^⚠: dx = dy * in_range_mask(x). STE, no 1/scale factor. f32.
baracuda_kernels_fake_quantize_backward_f64_can_implement^⚠: Implementability check for fake_quantize_backward_f64.
baracuda_kernels_fake_quantize_backward_f64_run^⚠: fake_quantize_backward — f64.
baracuda_kernels_fake_quantize_bf16_can_implement^⚠: Implementability check for fake_quantize_bf16.
baracuda_kernels_fake_quantize_bf16_run^⚠: fake_quantize — bf16.
baracuda_kernels_fake_quantize_f16_can_implement^⚠: Implementability check for fake_quantize_f16.
baracuda_kernels_fake_quantize_f16_run^⚠: fake_quantize — f16.
baracuda_kernels_fake_quantize_f32_can_implement^⚠: Implementability check for fake_quantize_f32.
baracuda_kernels_fake_quantize_f32_run^⚠: y = scale * (clamp(round(x/scale)+zp, qmin, qmax) - zp). f32.
baracuda_kernels_fake_quantize_f64_can_implement^⚠: Implementability check for fake_quantize_f64.
baracuda_kernels_fake_quantize_f64_run^⚠: fake_quantize — f64 (f64 scale).
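The two fake_quantize formulas above (forward y = scale * (clamp(round(x/scale)+zp, qmin, qmax) - zp), backward dx = dy * in_range_mask(x)) can be written out as a scalar host reference. Several details here are assumptions rather than facts from this index: the helper names are hypothetical, round is taken to be Rust's round-half-away-from-zero (a kernel may round half to even instead), and in_range_mask is interpreted as "the unclamped quantized value already lies in [qmin, qmax]".

```rust
/// round(x / scale) + zp, before clamping. Hypothetical helper.
/// Assumption: f32::round (half away from zero) stands in for the kernel's rounding.
fn quantize_index(x: f32, scale: f32, zp: i32) -> i32 {
    ((x / scale).round() as i32).saturating_add(zp)
}

/// Forward: y = scale * (clamp(round(x/scale) + zp, qmin, qmax) - zp).
fn fake_quantize_ref(x: f32, scale: f32, zp: i32, qmin: i32, qmax: i32) -> f32 {
    let q = quantize_index(x, scale, zp).clamp(qmin, qmax);
    scale * (q - zp) as f32
}

/// Backward (straight-through estimator): the gradient passes through unchanged
/// where the value was not clamped and is zero elsewhere. No 1/scale factor.
fn fake_quantize_backward_ref(dy: f32, x: f32, scale: f32, zp: i32, qmin: i32, qmax: i32) -> f32 {
    let q = quantize_index(x, scale, zp);
    if (qmin..=qmax).contains(&q) { dy } else { 0.0 }
}
```

Inputs that land exactly on a rounding tie are where this reference and a device kernel are most likely to disagree, so test vectors should either avoid ties or pin down the rounding mode first.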
baracuda_kernels_fft_1d_c32_run^⚠: 1-D C2C FFT (forward + inverse via flag). Wraps cuFFT’s cufftExecC2C (c32) / cufftExecZ2Z (c64). For inverse, applies 1/n normalization in-place after exec.
baracuda_kernels_fft_1d_c32_workspace_size^⚠: 1-D C2C FFT workspace size in bytes. cuFFT manages its own internal workspace; this entry always writes 0.
baracuda_kernels_fft_1d_c64_run^⚠: 1-D C2C FFT (forward + inverse via flag). Wraps cuFFT’s cufftExecC2C (c32) / cufftExecZ2Z (c64). For inverse, applies 1/n normalization in-place after exec.
baracuda_kernels_fft_1d_c64_workspace_size^⚠: 1-D C2C FFT workspace size in bytes. cuFFT manages its own internal workspace; this entry always writes 0.
baracuda_kernels_fft_nd_c32_run^⚠: ND C2C FFT (forward + inverse via flag).
baracuda_kernels_fft_nd_c32_workspace_size^⚠: ND C2C FFT workspace size in bytes — always 0.
baracuda_kernels_fft_nd_c64_run^⚠: ND C2C FFT (forward + inverse via flag).
baracuda_kernels_fft_nd_c64_workspace_size^⚠: ND C2C FFT workspace size in bytes — always 0.
baracuda_kernels_fftshift_4_can_implement^⚠: Implementability check for fftshift_4.
baracuda_kernels_fftshift_4_run^⚠: fftshift along the last axis of a [batch, n] tensor: y[b, i] = x[b, (i + n/2) % n]. Element-width specialization (4 bytes per element) — used for Bool / f32 / packed-Bool shifts; the same kernel re-instantiated at 8 / 16 bytes covers f64 / Complex32 and Complex64.
baracuda_kernels_fftshift_8_can_implement^⚠: Implementability check for fftshift_8.
baracuda_kernels_fftshift_8_run^⚠: 8-byte-element fftshift (covers f64 and Complex32).
baracuda_kernels_fftshift_16_can_implement^⚠: Implementability check for fftshift_16.
baracuda_kernels_fftshift_16_run^⚠: 16-byte-element fftshift (covers Complex64).
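The last-axis shift documented for the 4-byte entry, y[b, i] = x[b, (i + n/2) % n], translates directly into a host loop. The sketch below implements that formula literally for a densely packed [batch, n] buffer; the function name is hypothetical. For even n this matches the usual fftshift. For odd n, the gather form as written and a roll by n/2 give different results, so it is worth confirming the kernel's behavior on an odd-length row before relying on either.

```rust
/// CPU reference for y[b, i] = x[b, (i + n/2) % n] on a dense [batch, n] buffer.
/// Hypothetical helper; generic over any Copy element, mirroring the
/// element-width specializations (4 / 8 / 16 bytes).
fn fftshift_last_axis_ref<T: Copy>(x: &[T], batch: usize, n: usize) -> Vec<T> {
    assert_eq!(x.len(), batch * n);
    let mut y = Vec::with_capacity(x.len());
    for b in 0..batch {
        for i in 0..n {
            // Gather form: output position i reads from (i + n/2) mod n.
            y.push(x[b * n + (i + n / 2) % n]);
        }
    }
    y
}
```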
baracuda_kernels_fftshift_nd_4_can_implement^⚠: Implementability check for fftshift_nd_4.
baracuda_kernels_fftshift_nd_4_run^⚠: N-D fftshift / ifftshift: a single-pass general-permutation kernel covering tensors up to rank 8. The caller passes a per-axis shape, a per-axis shift_amt (0 for pass-through axes; n/2 for fftshift or n - n/2 for ifftshift on shifted axes), and a per-axis contiguous stride (in elements). The same kernel covers both directions; the direction lives entirely in the shift_amt array.
baracuda_kernels_fftshift_nd_8_can_implement^⚠: Implementability check for fftshift_nd_8.
baracuda_kernels_fftshift_nd_8_run^⚠: 8-byte-cell N-D fftshift (covers f64 and Complex32).
baracuda_kernels_fftshift_nd_16_can_implement^⚠: Implementability check for fftshift_nd_16.
baracuda_kernels_fftshift_nd_16_run^⚠: 16-byte-cell N-D fftshift (covers Complex64).
baracuda_kernels_fill_bf16_can_implement^⚠: Implementability check for fill_bf16. Host-side only.
baracuda_kernels_fill_bf16_run^⚠: Fill y with value, bf16 dtype. value_bits is the raw 16-bit pattern of a bf16 value.
baracuda_kernels_fill_bf16_strided_can_implement^⚠: Implementability check for fill_bf16_strided.
baracuda_kernels_fill_bf16_strided_run^⚠: Strided fill, bf16. value_bits is the raw 16-bit pattern of a bf16 value.
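The bf16 fill entries take value_bits, the raw 16-bit pattern, rather than a float. Since bf16 is the upper half of an IEEE-754 f32, a caller holding an f32 can derive the pattern on the host. The sketch below uses round-to-nearest-even; the helper name is hypothetical, and the choice of rounding mode is an assumption (the kernel only sees the bits it is handed).

```rust
/// Convert an f32 to the raw bf16 bit pattern with round-to-nearest-even.
/// Hypothetical helper for producing a `value_bits` argument.
fn f32_to_bf16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // Keep the sign and exponent, and force a mantissa bit so the result stays NaN.
        return ((bits >> 16) as u16) | 0x0040;
    }
    // Add 0x7FFF plus the lowest kept bit: exact ties round toward an even result.
    let rounding_bias = 0x7FFF + ((bits >> 16) & 1);
    (bits.wrapping_add(rounding_bias) >> 16) as u16
}
```

For example, 1.0f32 maps to 0x3F80. The f16 variants need a different conversion because f16 has a narrower exponent than f32.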
baracuda_kernels_fill_f16_can_implement^⚠: Implementability check for fill_f16. Host-side only.
baracuda_kernels_fill_f16_run^⚠: Fill y with value, f16 dtype. value_bits is the raw 16-bit pattern of an f16 value (transport convention shared with the Pad-constant family).
baracuda_kernels_fill_f16_strided_can_implement^⚠: Implementability check for fill_f16_strided.
baracuda_kernels_fill_f16_strided_run^⚠: Strided fill, f16. value_bits is the raw 16-bit pattern of an f16 value.
baracuda_kernels_fill_f32_can_implement^⚠: Implementability check for fill_f32. Host-side only.
baracuda_kernels_fill_f32_run^⚠: Fill y with value, f32 dtype. This is the fill trailblazer — every fill_<dt>_run (and _strided_run) variant follows the same write-only contract.
baracuda_kernels_fill_f32_strided_can_implement^⚠: Implementability check for fill_f32_strided.
baracuda_kernels_fill_f32_strided_run^⚠: Strided fill, f32.
baracuda_kernels_fill_f64_can_implement^⚠: Implementability check for fill_f64. Host-side only.
baracuda_kernels_fill_f64_run^⚠: Fill y with value, f64 dtype.
baracuda_kernels_fill_f64_strided_can_implement^⚠: Implementability check for fill_f64_strided.
baracuda_kernels_fill_f64_strided_run^⚠: Strided fill, f64.
baracuda_kernels_fill_fp8e4m3_can_implement^⚠: Implementability check for fill_fp8e4m3.
baracuda_kernels_fill_fp8e4m3_run^⚠: Fill y with value, FP8 E4M3 dtype. value is the raw 8-bit E4M3 encoding (storage is byte-identical to u8); callers compute the encoding via the cast family or __nv_cvt_float_to_fp8.
baracuda_kernels_fill_fp8e4m3_strided_can_implement^⚠: Implementability check for fill_fp8e4m3_strided.
baracuda_kernels_fill_fp8e4m3_strided_run^⚠: Strided fill, FP8 E4M3.
baracuda_kernels_fill_i8_can_implement^⚠: Implementability check for fill_i8. Host-side only.
baracuda_kernels_fill_i8_run^⚠: Fill y with value, i8 dtype.
baracuda_kernels_fill_i8_strided_can_implement^⚠: Implementability check for fill_i8_strided.
baracuda_kernels_fill_i8_strided_run^⚠: Strided fill, i8.
baracuda_kernels_fill_i16_can_implement^⚠: Implementability check for fill_i16.
baracuda_kernels_fill_i16_run^⚠: Fill y with value, i16 dtype.
baracuda_kernels_fill_i16_strided_can_implement^⚠: Implementability check for fill_i16_strided.
baracuda_kernels_fill_i16_strided_run^⚠: Strided fill, i16.
baracuda_kernels_fill_i32_can_implement^⚠: Implementability check for fill_i32. Host-side only.
baracuda_kernels_fill_i32_run^⚠: Fill y with value, i32 dtype.
baracuda_kernels_fill_i32_strided_can_implement^⚠: Implementability check for fill_i32_strided.
baracuda_kernels_fill_i32_strided_run^⚠: Strided fill, i32.
baracuda_kernels_fill_i64_can_implement^⚠: Implementability check for fill_i64. Host-side only.
baracuda_kernels_fill_i64_run^⚠: Fill y with value, i64 dtype.
baracuda_kernels_fill_i64_strided_can_implement^⚠: Implementability check for fill_i64_strided.
baracuda_kernels_fill_i64_strided_run^⚠: Strided fill, i64.
baracuda_kernels_fill_u8_can_implement^⚠: Implementability check for fill_u8. Host-side only.
baracuda_kernels_fill_u8_run^⚠: Fill y with value, u8 dtype.
baracuda_kernels_fill_u8_strided_can_implement^⚠: Implementability check for fill_u8_strided.
baracuda_kernels_fill_u8_strided_run^⚠: Strided fill, u8.
baracuda_kernels_fill_u32_can_implement^⚠: Implementability check for fill_u32.
baracuda_kernels_fill_u32_run^⚠: Fill y with value, u32 dtype.
baracuda_kernels_fill_u32_strided_can_implement^⚠: Implementability check for fill_u32_strided.
baracuda_kernels_fill_u32_strided_run^⚠: Strided fill, u32.
baracuda_kernels_flash_decoding_bf16_can_implement^⚠: Implementability check for flash_decoding_bf16. Host-side only.
baracuda_kernels_flash_decoding_bf16_run^⚠: FlashDecoding FW, bf16 (f32 accumulators).
baracuda_kernels_flash_decoding_bf16_workspace_bytes^⚠: Workspace requirement for flash_decoding_bf16 in bytes.
baracuda_kernels_flash_decoding_f16_can_implement^⚠: Implementability check for flash_decoding_f16. Host-side only.
baracuda_kernels_flash_decoding_f16_run^⚠: FlashDecoding FW, f16 (f32 accumulators). seq_q = 1; split-K over chunks of 256 K-rows each, combined via a second kernel.
baracuda_kernels_flash_decoding_f16_workspace_bytes^⚠: Workspace requirement for flash_decoding_f16 in bytes.
baracuda_kernels_flash_sdpa_backward_bf16_can_implement^⚠: Implementability check for flash_sdpa_backward_bf16. Host-side only.
baracuda_kernels_flash_sdpa_backward_bf16_run^⚠: Flash SDPA BW, bf16.
baracuda_kernels_flash_sdpa_backward_f16_can_implement^⚠: Implementability check for flash_sdpa_backward_f16. Host-side only.
baracuda_kernels_flash_sdpa_backward_f16_run^⚠: Flash SDPA BW, f16.
baracuda_kernels_flash_sdpa_backward_f32_can_implement^⚠: Implementability check for flash_sdpa_backward_f32. Host-side only.
baracuda_kernels_flash_sdpa_backward_f32_run^⚠: Flash SDPA BW, f32. Given the FW-saved y, lse, plus upstream dy, computes dQ, dK, dV. The d_ws argument is a caller-allocated [B, H, Q] scratch buffer (overwritten with the per-row D = rowsum(y ⊙ dy) intermediate; element type matches T).
baracuda_kernels_flash_sdpa_backward_f64_can_implement^⚠: Implementability check for flash_sdpa_backward_f64. Host-side only.
baracuda_kernels_flash_sdpa_backward_f64_run^⚠: Flash SDPA BW, f64.
baracuda_kernels_flash_sdpa_bf16_can_implement^⚠: Implementability check for flash_sdpa_bf16. Host-side only.
baracuda_kernels_flash_sdpa_bf16_run^⚠: Flash SDPA FW, bf16 (f32 accumulators).
baracuda_kernels_flash_sdpa_f16_can_implement^⚠: Implementability check for flash_sdpa_f16. Host-side only.
baracuda_kernels_flash_sdpa_f16_run^⚠: Flash SDPA FW, f16 (f32 accumulators).
baracuda_kernels_flash_sdpa_f32_can_implement^⚠: Implementability check for flash_sdpa_f32. Host-side only.
baracuda_kernels_flash_sdpa_f32_run^⚠: Flash SDPA FW, f32. Computes y = softmax(Q·K^T·scale) · V via tiled fused online softmax. Optional upper-triangular causal mask (is_causal = 1); explicit additive mask is not supported in the trailblazer. Writes y: [B, H, Q, D_v] and the saved lse: [B, H, Q] log-sum-exp tensor that BW consumes.
baracuda_kernels_flash_sdpa_f64_can_implement^⚠: Implementability check for flash_sdpa_f64. Host-side only.
baracuda_kernels_flash_sdpa_f64_run^⚠: Flash SDPA FW, f64.
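The forward description above, y = softmax(Q·K^T·scale) · V computed with an online softmax and a saved lse, can be illustrated by a host reference for one (batch, head) slice. This is a sketch under stated assumptions, not the kernel's tiling: the function name is hypothetical, buffers are assumed row-major ([Q, D], [K, D], [K, D_v]), and the causal mask is assumed to hide key j from query i when j > i.

```rust
/// CPU reference for one (batch, head) slice of SDPA with a streaming softmax.
/// Returns (y: [seq_q, d_v], lse: [seq_q]). Hypothetical helper.
fn sdpa_ref(
    q: &[f32], k: &[f32], v: &[f32],
    seq_q: usize, seq_k: usize, d: usize, d_v: usize,
    scale: f32, is_causal: bool,
) -> (Vec<f32>, Vec<f32>) {
    assert!(seq_k > 0);
    assert_eq!(q.len(), seq_q * d);
    assert_eq!(k.len(), seq_k * d);
    assert_eq!(v.len(), seq_k * d_v);
    let mut y = vec![0.0f32; seq_q * d_v];
    let mut lse = vec![0.0f32; seq_q];
    for i in 0..seq_q {
        let mut m = f32::NEG_INFINITY; // running maximum of the scores
        let mut l = 0.0f32;            // running sum of exp(score - m)
        let mut acc = vec![0.0f32; d_v];
        // Assumed causal convention: query i sees keys 0..=i.
        let visible = if is_causal { (i + 1).min(seq_k) } else { seq_k };
        for j in 0..visible {
            let dot: f32 = (0..d).map(|c| q[i * d + c] * k[j * d + c]).sum();
            let s = dot * scale;
            let m_new = m.max(s);
            // Rescale earlier partial sums to the new maximum (exp(-inf) = 0 on the first key).
            let correction = (m - m_new).exp();
            let p = (s - m_new).exp();
            l = l * correction + p;
            for c in 0..d_v {
                acc[c] = acc[c] * correction + p * v[j * d_v + c];
            }
            m = m_new;
        }
        for c in 0..d_v {
            y[i * d_v + c] = acc[c] / l;
        }
        lse[i] = m + l.ln(); // log-sum-exp of the scaled scores in row i
    }
    (y, lse)
}
```

The saved lse lets a backward pass recompute each softmax probability as exp(s - lse) without storing the full [Q, K] attention matrix, which is the point of the flash formulation.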
baracuda_kernels_flip_bf16_can_implement^⚠: Pre-launch implementability check for flip_bf16.
baracuda_kernels_flip_bf16_run^⚠: Flip, bf16. Pure element copy — no math.
baracuda_kernels_flip_bf16_strided_can_implement^⚠: Pre-launch implementability check for flip_bf16_strided.
baracuda_kernels_flip_bf16_strided_run^⚠: Flip strided sibling, bf16.
baracuda_kernels_flip_f16_can_implement^⚠: Pre-launch implementability check for flip_f16.
baracuda_kernels_flip_f16_run^⚠: Flip, f16. Pure element copy — no math.
baracuda_kernels_flip_f16_strided_can_implement^⚠: Pre-launch implementability check for flip_f16_strided.
baracuda_kernels_flip_f16_strided_run^⚠: Flip strided sibling, f16.
baracuda_kernels_flip_f32_can_implement^⚠: Pre-launch implementability check for flip_f32.
baracuda_kernels_flip_f32_run^⚠: Flip (reverse along selected axes), f32. flip_axes[d] is 1 = reverse axis d, 0 = no-op.
baracuda_kernels_flip_f32_strided_can_implement^⚠: Pre-launch implementability check for flip_f32_strided.
baracuda_kernels_flip_f32_strided_run^⚠: Flip strided sibling, f32.
baracuda_kernels_flip_f64_can_implement^⚠: Pre-launch implementability check for flip_f64.
baracuda_kernels_flip_f64_run^⚠: Flip, f64. Pure element copy — no math.
baracuda_kernels_flip_f64_strided_can_implement^⚠: Pre-launch implementability check for flip_f64_strided.
baracuda_kernels_flip_f64_strided_run^⚠: Flip strided sibling, f64.
baracuda_kernels_fractional_max_pool_2d_bw_bf16_can_implement^⚠: Pre-launch implementability check for fractional_max_pool_2d_bw_bf16.
baracuda_kernels_fractional_max_pool_2d_bw_bf16_run^⚠: FractionalMaxPool2d BW, bf16.
baracuda_kernels_fractional_max_pool_2d_bw_f16_can_implement^⚠: Pre-launch implementability check for fractional_max_pool_2d_bw_f16.
baracuda_kernels_fractional_max_pool_2d_bw_f16_run^⚠: FractionalMaxPool2d BW, f16.
baracuda_kernels_fractional_max_pool_2d_bw_f32_can_implement^⚠: Pre-launch implementability check for fractional_max_pool_2d_bw_f32.
baracuda_kernels_fractional_max_pool_2d_bw_f32_run^⚠: FractionalMaxPool2d BW, f32.
baracuda_kernels_fractional_max_pool_2d_bw_f64_can_implement^⚠: Pre-launch implementability check for fractional_max_pool_2d_bw_f64.
baracuda_kernels_fractional_max_pool_2d_bw_f64_run^⚠: FractionalMaxPool2d BW, f64.
baracuda_kernels_fractional_max_pool_2d_fw_bf16_can_implement^⚠: Pre-launch implementability check for fractional_max_pool_2d_fw_bf16.
baracuda_kernels_fractional_max_pool_2d_fw_bf16_run^⚠: FractionalMaxPool2d FW, bf16.
baracuda_kernels_fractional_max_pool_2d_fw_f16_can_implement^⚠: Pre-launch implementability check for fractional_max_pool_2d_fw_f16.
baracuda_kernels_fractional_max_pool_2d_fw_f16_run^⚠: FractionalMaxPool2d FW, f16.
baracuda_kernels_fractional_max_pool_2d_fw_f32_can_implement^⚠: Pre-launch implementability check for fractional_max_pool_2d_fw_f32.
baracuda_kernels_fractional_max_pool_2d_fw_f32_run^⚠: FractionalMaxPool2d FW, f32.
baracuda_kernels_fractional_max_pool_2d_fw_f64_can_implement^⚠: Pre-launch implementability check for fractional_max_pool_2d_fw_f64.
baracuda_kernels_fractional_max_pool_2d_fw_f64_run^⚠: FractionalMaxPool2d FW, f64.
baracuda_kernels_fractional_max_pool_3d_bw_bf16_can_implement^⚠: Pre-launch implementability check for fractional_max_pool_3d_bw_bf16.
baracuda_kernels_fractional_max_pool_3d_bw_bf16_run^⚠: FractionalMaxPool3d BW, bf16.
baracuda_kernels_fractional_max_pool_3d_bw_f16_can_implement^⚠: Pre-launch implementability check for fractional_max_pool_3d_bw_f16.
baracuda_kernels_fractional_max_pool_3d_bw_f16_run^⚠: FractionalMaxPool3d BW, f16.
baracuda_kernels_fractional_max_pool_3d_bw_f32_can_implement^⚠: Pre-launch implementability check for fractional_max_pool_3d_bw_f32.
baracuda_kernels_fractional_max_pool_3d_bw_f32_run^⚠: FractionalMaxPool3d BW, f32.
baracuda_kernels_fractional_max_pool_3d_bw_f64_can_implement^⚠: Pre-launch implementability check for fractional_max_pool_3d_bw_f64.
baracuda_kernels_fractional_max_pool_3d_bw_f64_run^⚠: FractionalMaxPool3d BW, f64.
baracuda_kernels_fractional_max_pool_3d_fw_bf16_can_implement^⚠: Pre-launch implementability check for fractional_max_pool_3d_fw_bf16.
baracuda_kernels_fractional_max_pool_3d_fw_bf16_run^⚠: FractionalMaxPool3d FW, bf16.
baracuda_kernels_fractional_max_pool_3d_fw_f16_can_implement^⚠: Pre-launch implementability check for fractional_max_pool_3d_fw_f16.
baracuda_kernels_fractional_max_pool_3d_fw_f16_run^⚠: FractionalMaxPool3d FW, f16.
baracuda_kernels_fractional_max_pool_3d_fw_f32_can_implement^⚠: Pre-launch implementability check for fractional_max_pool_3d_fw_f32.
baracuda_kernels_fractional_max_pool_3d_fw_f32_run^⚠: FractionalMaxPool3d FW, f32.
baracuda_kernels_fractional_max_pool_3d_fw_f64_can_implement^⚠: Pre-launch implementability check for fractional_max_pool_3d_fw_f64.
baracuda_kernels_fractional_max_pool_3d_fw_f64_run^⚠: FractionalMaxPool3d FW, f64.
baracuda_kernels_gated_geglu_backward_bf16_can_implement^⚠: Pre-launch implementability check for gated_geglu_backward_bf16.
baracuda_kernels_gated_geglu_backward_bf16_run^⚠: GeGLU backward, bf16.
baracuda_kernels_gated_geglu_backward_f16_can_implement^⚠: Pre-launch implementability check for gated_geglu_backward_f16.
baracuda_kernels_gated_geglu_backward_f16_run^⚠: GeGLU backward, f16.
baracuda_kernels_gated_geglu_backward_f32_can_implement^⚠: Pre-launch implementability check for gated_geglu_backward_f32.
baracuda_kernels_gated_geglu_backward_f32_run^⚠: GeGLU backward, f32. da = dy·gelu(b), db = dy·a·gelu'(b).
baracuda_kernels_gated_geglu_backward_f64_can_implement^⚠: Pre-launch implementability check for gated_geglu_backward_f64.
baracuda_kernels_gated_geglu_backward_f64_run^⚠: GeGLU backward, f64.
baracuda_kernels_gated_geglu_bf16_can_implement^⚠: Pre-launch implementability check for gated_geglu_bf16.
baracuda_kernels_gated_geglu_bf16_run^⚠: GeGLU forward, bf16.
baracuda_kernels_gated_geglu_f16_can_implement^⚠: Pre-launch implementability check for gated_geglu_f16.
baracuda_kernels_gated_geglu_f16_run^⚠: GeGLU forward, f16.
baracuda_kernels_gated_geglu_f32_can_implement^⚠: Pre-launch implementability check for gated_geglu_f32.
baracuda_kernels_gated_geglu_f32_run^⚠: GeGLU forward, f32. y = a · gelu(b), exact erf-based.
baracuda_kernels_gated_geglu_f64_can_implement^⚠: Pre-launch implementability check for gated_geglu_f64.
baracuda_kernels_gated_geglu_f64_run^⚠: GeGLU forward, f64.
baracuda_kernels_gated_glu_backward_bf16_can_implement^⚠: Pre-launch implementability check for gated_glu_backward_bf16.
baracuda_kernels_gated_glu_backward_bf16_run^⚠: GLU backward, bf16.
baracuda_kernels_gated_glu_backward_f16_can_implement^⚠: Pre-launch implementability check for gated_glu_backward_f16.
baracuda_kernels_gated_glu_backward_f16_run^⚠: GLU backward, f16.
baracuda_kernels_gated_glu_backward_f32_can_implement^⚠: Pre-launch implementability check for gated_glu_backward_f32.
baracuda_kernels_gated_glu_backward_f32_run^⚠: GLU backward, f32. da = dy·sigmoid(b), db = dy·a·sigmoid(b)·(1-sigmoid(b)).
baracuda_kernels_gated_glu_backward_f64_can_implement^⚠: Pre-launch implementability check for gated_glu_backward_f64.
baracuda_kernels_gated_glu_backward_f64_run^⚠: GLU backward, f64.
baracuda_kernels_gated_glu_bf16_can_implement^⚠: Pre-launch implementability check for gated_glu_bf16.
baracuda_kernels_gated_glu_bf16_run^⚠: GLU forward, bf16.
baracuda_kernels_gated_glu_f16_can_implement^⚠: Pre-launch implementability check for gated_glu_f16.
baracuda_kernels_gated_glu_f16_run^⚠: GLU forward, f16.
baracuda_kernels_gated_glu_f32_can_implement^⚠: Pre-launch implementability check for gated_glu_f32.
baracuda_kernels_gated_glu_f32_run^⚠: GLU forward, f32. y = a · sigmoid(b).
baracuda_kernels_gated_glu_f64_can_implement^⚠: Pre-launch implementability check for gated_glu_f64.
baracuda_kernels_gated_glu_f64_run^⚠: GLU forward, f64.
baracuda_kernels_gated_reglu_backward_bf16_can_implement^⚠: Pre-launch implementability check for gated_reglu_backward_bf16.
baracuda_kernels_gated_reglu_backward_bf16_run^⚠: ReGLU backward, bf16.
baracuda_kernels_gated_reglu_backward_f16_can_implement^⚠: Pre-launch implementability check for gated_reglu_backward_f16.
baracuda_kernels_gated_reglu_backward_f16_run^⚠: ReGLU backward, f16.
baracuda_kernels_gated_reglu_backward_f32_can_implement^⚠: Pre-launch implementability check for gated_reglu_backward_f32.
baracuda_kernels_gated_reglu_backward_f32_run^⚠: ReGLU backward, f32. da = (b>0)?dy·b:0, db = (b>0)?dy·a:0.
baracuda_kernels_gated_reglu_backward_f64_can_implement^⚠: Pre-launch implementability check for gated_reglu_backward_f64.
baracuda_kernels_gated_reglu_backward_f64_run^⚠: ReGLU backward, f64.
baracuda_kernels_gated_reglu_bf16_can_implement^⚠: Pre-launch implementability check for gated_reglu_bf16.
baracuda_kernels_gated_reglu_bf16_run^⚠: ReGLU forward, bf16.
baracuda_kernels_gated_reglu_f16_can_implement^⚠: Pre-launch implementability check for gated_reglu_f16.
baracuda_kernels_gated_reglu_f16_run^⚠: ReGLU forward, f16.
baracuda_kernels_gated_reglu_f32_can_implement^⚠: Pre-launch implementability check for gated_reglu_f32.
baracuda_kernels_gated_reglu_f32_run^⚠: ReGLU forward, f32. y = a · relu(b) = a · max(b, 0).
baracuda_kernels_gated_reglu_f64_can_implement^⚠: Pre-launch implementability check for gated_reglu_f64.
baracuda_kernels_gated_reglu_f64_run^⚠: ReGLU forward, f64.
baracuda_kernels_gated_swiglu_backward_bf16_can_implement^⚠: Pre-launch implementability check for gated_swiglu_backward_bf16.
baracuda_kernels_gated_swiglu_backward_bf16_run^⚠: SwiGLU backward, bf16.
baracuda_kernels_gated_swiglu_backward_f16_can_implement^⚠: Pre-launch implementability check for gated_swiglu_backward_f16.
baracuda_kernels_gated_swiglu_backward_f16_run^⚠: SwiGLU backward, f16.
baracuda_kernels_gated_swiglu_backward_f32_can_implement^⚠: Pre-launch implementability check for gated_swiglu_backward_f32.
baracuda_kernels_gated_swiglu_backward_f32_run^⚠: SwiGLU backward, f32. da = dy·silu(b), db = dy·a·silu'(b).
baracuda_kernels_gated_swiglu_backward_f64_can_implement^⚠: Pre-launch implementability check for gated_swiglu_backward_f64.
baracuda_kernels_gated_swiglu_backward_f64_run^⚠: SwiGLU backward, f64.
baracuda_kernels_gated_swiglu_bf16_can_implement^⚠: Pre-launch implementability check for gated_swiglu_bf16.
baracuda_kernels_gated_swiglu_bf16_run^⚠: SwiGLU forward, bf16.
baracuda_kernels_gated_swiglu_f16_can_implement^⚠: Pre-launch implementability check for gated_swiglu_f16.
baracuda_kernels_gated_swiglu_f16_run^⚠: SwiGLU forward, f16.
baracuda_kernels_gated_swiglu_f32_can_implement^⚠: Pre-launch implementability check for gated_swiglu_f32.
baracuda_kernels_gated_swiglu_f32_run^⚠: SwiGLU forward, f32. y = a · b · sigmoid(b).
baracuda_kernels_gated_swiglu_f64_can_implement^⚠: Pre-launch implementability check for gated_swiglu_f64.
baracuda_kernels_gated_swiglu_f64_run^⚠: SwiGLU forward, f64.
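
The gated-activation formulas listed above (GLU, ReGLU, SwiGLU, forward and backward) can be checked on the host with a scalar reference. The sketch below is only an illustration of the per-element math stated in the f32 entries; the function names (`glu_forward`, `swiglu_backward`, and so on) are hypothetical and are not part of this crate, and nothing here reflects how the device kernels are actually written. GeGLU is left out because the Rust standard library has no stable `erf`.

```rust
/// Logistic sigmoid, the gate used by GLU and (via SiLU) by SwiGLU.
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// GLU forward: y = a * sigmoid(b).
fn glu_forward(a: f32, b: f32) -> f32 {
    a * sigmoid(b)
}

/// GLU backward: returns (da, db) with
/// da = dy * sigmoid(b), db = dy * a * sigmoid(b) * (1 - sigmoid(b)).
fn glu_backward(dy: f32, a: f32, b: f32) -> (f32, f32) {
    let s = sigmoid(b);
    (dy * s, dy * a * s * (1.0 - s))
}

/// ReGLU forward: y = a * max(b, 0).
fn reglu_forward(a: f32, b: f32) -> f32 {
    a * b.max(0.0)
}

/// ReGLU backward: both gradients vanish where b <= 0.
fn reglu_backward(dy: f32, a: f32, b: f32) -> (f32, f32) {
    if b > 0.0 {
        (dy * b, dy * a)
    } else {
        (0.0, 0.0)
    }
}

/// SwiGLU forward: y = a * silu(b) = a * b * sigmoid(b).
fn swiglu_forward(a: f32, b: f32) -> f32 {
    a * b * sigmoid(b)
}

/// SwiGLU backward: da = dy * silu(b), db = dy * a * silu'(b),
/// where silu'(b) = sigmoid(b) * (1 + b * (1 - sigmoid(b))).
fn swiglu_backward(dy: f32, a: f32, b: f32) -> (f32, f32) {
    let s = sigmoid(b);
    (dy * b * s, dy * a * s * (1.0 + b * (1.0 - s)))
}
```

A reference like this is mainly useful for comparing device output against host values within a dtype-appropriate tolerance.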
baracuda_kernels_gather_backward_f32_can_implement^⚠: Implementability check for gather_backward_f32.
baracuda_kernels_gather_backward_f32_run^⚠: dsrc[..., index[..., j, ...], ...] += dout[..., j, ...] along gather_dim. f32 (atomicAdd).
baracuda_kernels_gather_backward_f64_can_implement^⚠: Implementability check for gather_backward_f64.
baracuda_kernels_gather_backward_f64_run^⚠: gather_backward — f64 (atomicAdd).
baracuda_kernels_gather_backward_i64idx_f32_can_implement^⚠: Implementability check for gather_backward_i64idx_f32.
baracuda_kernels_gather_backward_i64idx_f32_run^⚠: gather BW — f32, i64 indices (atomicAdd).
baracuda_kernels_gather_backward_i64idx_f64_can_implement^⚠: Implementability check for gather_backward_i64idx_f64.
baracuda_kernels_gather_backward_i64idx_f64_run^⚠: gather BW — f64, i64 indices (atomicAdd).
baracuda_kernels_gather_f32_can_implement^⚠: Implementability check for gather_f32.
baracuda_kernels_gather_f32_run^⚠: out[..., j, ...] = src[..., index[..., j, ...], ...] along gather_dim. f32.
baracuda_kernels_gather_f64_can_implement^⚠: Implementability check for gather_f64.
baracuda_kernels_gather_f64_run^⚠: gather along gather_dim. f64.
baracuda_kernels_gather_i8_can_implement^⚠: Implementability check for gather_i8.
baracuda_kernels_gather_i8_run^⚠: gather along gather_dim. i8.
baracuda_kernels_gather_i16_can_implement^⚠: Implementability check for gather_i16.
baracuda_kernels_gather_i16_run^⚠: gather along gather_dim. i16.
baracuda_kernels_gather_i32_can_implement^⚠: Implementability check for gather_i32.
baracuda_kernels_gather_i32_run^⚠: gather along gather_dim. i32.
baracuda_kernels_gather_i64_can_implement^⚠: Implementability check for gather_i64.
baracuda_kernels_gather_i64_run^⚠: gather along gather_dim. i64.
baracuda_kernels_gather_i64idx_f32_can_implement^⚠: Implementability check for gather_i64idx_f32.
baracuda_kernels_gather_i64idx_f32_run^⚠: gather FW — f32, i64 indices.
baracuda_kernels_gather_i64idx_f64_can_implement^⚠: Implementability check for gather_i64idx_f64.
baracuda_kernels_gather_i64idx_f64_run^⚠: gather FW — f64, i64 indices.
baracuda_kernels_gather_i64idx_i8_can_implement^⚠: Implementability check for gather_i64idx_i8.
baracuda_kernels_gather_i64idx_i8_run^⚠: gather along gather_dim. i8 values, i64 indices.
baracuda_kernels_gather_i64idx_i16_can_implement^⚠: Implementability check for gather_i64idx_i16.
baracuda_kernels_gather_i64idx_i16_run^⚠: gather along gather_dim. i16 values, i64 indices.
baracuda_kernels_gather_i64idx_i32_can_implement^⚠: Implementability check for gather_i64idx_i32.
baracuda_kernels_gather_i64idx_i32_run^⚠: gather FW — i32 values, i64 indices.
baracuda_kernels_gather_i64idx_i64_can_implement^⚠: Implementability check for gather_i64idx_i64.
baracuda_kernels_gather_i64idx_i64_run^⚠: gather along gather_dim. i64 values, i64 indices.
baracuda_kernels_gather_i64idx_u8_can_implement^⚠: Implementability check for gather_i64idx_u8.
baracuda_kernels_gather_i64idx_u8_run^⚠: gather along gather_dim. u8 values, i64 indices.
baracuda_kernels_gather_i64idx_u16_can_implement^⚠: Implementability check for gather_i64idx_u16.
baracuda_kernels_gather_i64idx_u16_run^⚠: gather along gather_dim. u16 values, i64 indices.
baracuda_kernels_gather_i64idx_u32_can_implement^⚠: Implementability check for gather_i64idx_u32.
baracuda_kernels_gather_i64idx_u32_run^⚠: gather along gather_dim. u32 values, i64 indices.
baracuda_kernels_gather_u8_can_implement^⚠: Implementability check for gather_u8.
baracuda_kernels_gather_u8_run^⚠: gather along gather_dim. u8.
baracuda_kernels_gather_u8idx_f32_can_implement^⚠: Implementability check for gather_u8idx_f32.
baracuda_kernels_gather_u8idx_f32_run^⚠: gather FW — f32, u8 idx.
baracuda_kernels_gather_u8idx_f64_can_implement^⚠: Implementability check for gather_u8idx_f64.
baracuda_kernels_gather_u8idx_f64_run^⚠: gather FW — f64, u8 idx.
baracuda_kernels_gather_u16_can_implement^⚠: Implementability check for gather_u16.
baracuda_kernels_gather_u16_run^⚠: gather along gather_dim. u16.
baracuda_kernels_gather_u32_can_implement^⚠: Implementability check for gather_u32.
baracuda_kernels_gather_u32_run^⚠: gather along gather_dim. u32.
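
The gather entries above describe the indexing rule `out[..., j, ...] = src[..., index[..., j, ...], ...]` and its backward scatter-add. As a rough host-side illustration, the sketch below specializes that rule to a row-major 2-D tensor with `gather_dim = 1`. The function names, the flat-slice layout, and the use of `i32` indices for the non-`i64idx` variants are assumptions made for this example, not the crate's actual signatures.

```rust
/// Forward gather along dim 1 of a row-major [rows, src_cols] tensor:
/// out[i][j] = src[i][index[i][j]], with out shaped [rows, idx_cols].
fn gather_dim1(
    src: &[f32],
    index: &[i32],
    rows: usize,
    src_cols: usize,
    idx_cols: usize,
) -> Vec<f32> {
    let mut out = vec![0.0; rows * idx_cols];
    for i in 0..rows {
        for j in 0..idx_cols {
            let k = index[i * idx_cols + j] as usize;
            // Unlike the raw kernels, this host sketch bounds-checks each index.
            assert!(k < src_cols, "gather index out of range");
            out[i * idx_cols + j] = src[i * src_cols + k];
        }
    }
    out
}

/// Backward gather: dsrc[i][index[i][j]] += dout[i][j].
/// Repeated indices accumulate, which is why the device entries mention atomicAdd.
fn gather_backward_dim1(
    dout: &[f32],
    index: &[i32],
    rows: usize,
    src_cols: usize,
    idx_cols: usize,
) -> Vec<f32> {
    let mut dsrc = vec![0.0; rows * src_cols];
    for i in 0..rows {
        for j in 0..idx_cols {
            let k = index[i * idx_cols + j] as usize;
            assert!(k < src_cols, "gather index out of range");
            dsrc[i * src_cols + k] += dout[i * idx_cols + j];
        }
    }
    dsrc
}
```

On the device, floating-point accumulation order under atomicAdd is not fixed, so a comparison against this sequential reference would normally use a tolerance rather than exact equality.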
baracuda_kernels_gemm_batched_bf16_rcr_sm80_can_implement^⚠: Pre-launch implementability check (batched).
baracuda_kernels_gemm_batched_bf16_rcr_sm80_run^⚠: Launch strided-batched Cutlass GEMM. Batch i operates on A + i * stride_a, B + i * stride_b, etc. (strides in elements, not bytes).
baracuda_kernels_gemm_batched_bf16_rcr_sm80_workspace_size^⚠: Workspace bytes required for a batch_count-deep batched launch.
baracuda_kernels_gemm_batched_f16_rcr_sm80_can_implement^⚠: Pre-launch implementability check (batched).
baracuda_kernels_gemm_batched_f16_rcr_sm80_run^⚠: Launch strided-batched Cutlass GEMM. Batch i operates on A + i * stride_a, B + i * stride_b, etc. (strides in elements, not bytes).
baracuda_kernels_gemm_batched_f16_rcr_sm80_workspace_size^⚠: Workspace bytes required for a batch_count-deep batched launch.
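
The batched entries above state that batch i operates on `A + i * stride_a` with strides counted in elements, not bytes. The sketch below shows that arithmetic on the host with overflow checks; the helper names are hypothetical and are not part of this crate. It may be useful when computing byte offsets for raw `void*` operands, where the element size (2 bytes for f16 and bf16) has to be applied by the caller.

```rust
/// Element offset of batch `i` given a stride expressed in elements.
/// Returns None on arithmetic overflow.
fn batch_elem_offset(i: usize, stride_elems: usize) -> Option<usize> {
    i.checked_mul(stride_elems)
}

/// Byte offset of batch `i`: the element offset scaled by the element size.
fn batch_byte_offset(i: usize, stride_elems: usize, elem_size: usize) -> Option<usize> {
    batch_elem_offset(i, stride_elems)?.checked_mul(elem_size)
}

/// Host-side view of the `len` elements that batch `i` would start at.
/// Returns None if the range overflows or falls outside `buf`.
fn batch_view<T>(buf: &[T], i: usize, stride_elems: usize, len: usize) -> Option<&[T]> {
    let start = batch_elem_offset(i, stride_elems)?;
    let end = start.checked_add(len)?;
    buf.get(start..end)
}
```

For example, with a stride of 128 × 64 elements and 2-byte elements, batch 3 starts 49152 bytes past the base pointer.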
baracuda_kernels_gemm_bf16_rcr_sm80_can_implement^⚠: Pre-launch implementability check for this Cutlass GEMM SKU.
baracuda_kernels_gemm_bf16_rcr_sm80_run^⚠: Launch this Cutlass GEMM SKU on stream.
baracuda_kernels_gemm_bf16_rcr_sm80_workspace_size^⚠: Workspace bytes required by this Cutlass GEMM SKU.
baracuda_kernels_gemm_bf16_rrr_sm80_can_implement^⚠: Pre-launch implementability check for this Cutlass GEMM SKU.
baracuda_kernels_gemm_bf16_rrr_sm80_run^⚠: Launch this Cutlass GEMM SKU on stream.
baracuda_kernels_gemm_bf16_rrr_sm80_workspace_size^⚠: Workspace bytes required by this Cutlass GEMM SKU.
baracuda_kernels_gemm_bias_bf16_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_bf16_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_bf16_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_bf16_rrr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_bf16_rrr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_bf16_rrr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_f16_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_f16_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_f16_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_f16_rrr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_f16_rrr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_f16_rrr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_f32_simt_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_f32_simt_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_f32_simt_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_f32_simt_rrr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_f32_simt_rrr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_f32_simt_rrr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_f32bias_s8_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_f32bias_s8_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_f32bias_s8_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_f32bias_u8_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_f32bias_u8_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_f32bias_u8_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_f64_rcr_sm80_can_implement^⚠: Pre-launch implementability check (DGEMM bias variant).
baracuda_kernels_gemm_bias_f64_rcr_sm80_run^⚠: Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
baracuda_kernels_gemm_bias_f64_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_f64_rrr_sm80_can_implement^⚠: Pre-launch implementability check (DGEMM bias variant).
baracuda_kernels_gemm_bias_f64_rrr_sm80_run^⚠: Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
baracuda_kernels_gemm_bias_f64_rrr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_bf16_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_bf16_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_bf16_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_bf16_rrr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_bf16_rrr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_bf16_rrr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_f16_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_f16_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_f16_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_f16_rrr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_f16_rrr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_f16_rrr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_f32_simt_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_f32_simt_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_f32_simt_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_f32_simt_rrr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_f32_simt_rrr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_f32_simt_rrr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_f32bias_s8_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_f32bias_s8_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_f32bias_s8_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_f32bias_u8_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_f32bias_u8_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_f32bias_u8_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_f64_rcr_sm80_can_implement^⚠: Pre-launch implementability check (DGEMM bias variant).
baracuda_kernels_gemm_bias_gelu_f64_rcr_sm80_run^⚠: Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
baracuda_kernels_gemm_bias_gelu_f64_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_f64_rrr_sm80_can_implement^⚠: Pre-launch implementability check (DGEMM bias variant).
baracuda_kernels_gemm_bias_gelu_f64_rrr_sm80_run^⚠: Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
baracuda_kernels_gemm_bias_gelu_f64_rrr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_i32bias_s8_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_i32bias_s8_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_i32bias_s8_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_i32bias_u8_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_i32bias_u8_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_i32bias_u8_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_tf32_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_tf32_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_tf32_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_gelu_tf32_rrr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_gelu_tf32_rrr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_gelu_tf32_rrr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_i32bias_s8_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_i32bias_s8_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_i32bias_s8_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_i32bias_u8_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_i32bias_u8_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_i32bias_u8_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_relu_bf16_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_bf16_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_bf16_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_relu_bf16_rrr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_bf16_rrr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_bf16_rrr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_relu_f16_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_f16_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_f16_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_relu_f16_rrr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_f16_rrr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_f16_rrr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_relu_f32_simt_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_f32_simt_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_f32_simt_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_relu_f32_simt_rrr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_f32_simt_rrr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_f32_simt_rrr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_relu_f32bias_s8_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_f32bias_s8_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_f32bias_s8_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_relu_f32bias_u8_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_f32bias_u8_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_f32bias_u8_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_relu_f64_rcr_sm80_can_implement^⚠: Pre-launch implementability check (DGEMM bias variant).
baracuda_kernels_gemm_bias_relu_f64_rcr_sm80_run^⚠: Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
baracuda_kernels_gemm_bias_relu_f64_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_relu_f64_rrr_sm80_can_implement^⚠: Pre-launch implementability check (DGEMM bias variant).
baracuda_kernels_gemm_bias_relu_f64_rrr_sm80_run^⚠: Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
baracuda_kernels_gemm_bias_relu_f64_rrr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_relu_i32bias_s8_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_i32bias_s8_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_i32bias_s8_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_relu_i32bias_u8_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_i32bias_u8_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_i32bias_u8_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_relu_tf32_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_tf32_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_tf32_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_relu_tf32_rrr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_relu_tf32_rrr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_relu_tf32_rrr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_silu_bf16_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_bf16_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_bf16_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_silu_bf16_rrr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_bf16_rrr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_bf16_rrr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_silu_f16_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_f16_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_f16_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_silu_f16_rrr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_f16_rrr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_f16_rrr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_silu_f32_simt_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_f32_simt_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_f32_simt_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_silu_f32_simt_rrr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_f32_simt_rrr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_f32_simt_rrr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_silu_f32bias_s8_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_f32bias_s8_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_f32bias_s8_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_silu_f32bias_u8_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_f32bias_u8_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_f32bias_u8_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_silu_f64_rcr_sm80_can_implement^⚠: Pre-launch implementability check (DGEMM bias variant).
baracuda_kernels_gemm_bias_silu_f64_rcr_sm80_run^⚠: Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
baracuda_kernels_gemm_bias_silu_f64_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_silu_f64_rrr_sm80_can_implement^⚠: Pre-launch implementability check (DGEMM bias variant).
baracuda_kernels_gemm_bias_silu_f64_rrr_sm80_run^⚠: Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
baracuda_kernels_gemm_bias_silu_f64_rrr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_silu_i32bias_s8_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_i32bias_s8_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_i32bias_s8_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_silu_i32bias_u8_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_i32bias_u8_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_i32bias_u8_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_silu_tf32_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_tf32_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_tf32_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_silu_tf32_rrr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_silu_tf32_rrr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_silu_tf32_rrr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_tf32_rcr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_tf32_rcr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_tf32_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_bias_tf32_rrr_sm80_can_implement^⚠: Pre-launch implementability check (bias variant).
baracuda_kernels_gemm_bias_tf32_rrr_sm80_run^⚠: Launch bias-fused Cutlass GEMM on stream. bias is an [N] device vector broadcast across rows of D.
baracuda_kernels_gemm_bias_tf32_rrr_sm80_workspace_size^⚠: Workspace bytes required.
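The bias-fused entries above describe bias as an [N] vector broadcast across the rows of D. A minimal host-side sketch of that broadcast, useful as a test oracle, might look like the following. The names `Activation`, `apply_activation`, and `gemm_bias_act_ref` are hypothetical, and the epilogue order (bias added before the activation, beta treated as 0) is an assumption rather than something the entries state.

```rust
/// Epilogue activations that appear in the SKU names (hypothetical enum).
#[derive(Clone, Copy)]
pub enum Activation {
    Identity,
    Relu,
    Silu,
}

fn apply_activation(act: Activation, x: f32) -> f32 {
    match act {
        Activation::Identity => x,
        Activation::Relu => x.max(0.0),
        // SiLU(x) = x * sigmoid(x) = x / (1 + e^-x)
        Activation::Silu => x / (1.0 + (-x).exp()),
    }
}

/// CPU reference: D = act(alpha * A * B + bias), with row-major A [m, k],
/// B [k, n], and D [m, n]. `bias` has length n and is added to every row.
/// ASSUMPTION: bias is applied before the activation and beta is 0.
pub fn gemm_bias_act_ref(
    m: usize,
    n: usize,
    k: usize,
    alpha: f32,
    a: &[f32],
    b: &[f32],
    bias: &[f32],
    act: Activation,
) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(bias.len(), n);
    let mut d = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            // bias[j] depends only on the column, so every row sees the same vector.
            d[i * n + j] = apply_activation(act, alpha * acc + bias[j]);
        }
    }
    d
}
```

Comparing a device result against this oracle only makes sense after matching the layout tag (RRR vs. RCR) and the accumulator precision of the SKU under test.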
baracuda_kernels_gemm_dense_bf16_can_implement: Host-side validity check for baracuda_kernels_gemm_dense_bf16_run. Validates extents, the layout tag, leading-dim minimums, i32-fit of leading dims, and stride_d != 0 at batch > 1. stride_a / stride_b are accepted unconditionally (any value, including 0-broadcast).
baracuda_kernels_gemm_dense_bf16_run^⚠: Dense bf16 GEMM (cuBLAS-backed): D[g] = α · A[g] · B[g] + β · D[g] for g ∈ [0, batch), accumulating in f32. Row-major problem; see the module docs for the layout tag (0 = RRR, 1 = RCR, 2 = CRR), leading-dim minimums, and the batch-stride contract (element strides; stride_a/stride_b may be 0 to broadcast; strides ignored at batch == 1).
baracuda_kernels_gemm_dense_bf16_workspace_size: Workspace query for baracuda_kernels_gemm_dense_bf16_run. Always 0 — cuBLAS allocates its workspace internally per handle.
baracuda_kernels_gemm_dense_f16_can_implement: Host-side validity check for baracuda_kernels_gemm_dense_f16_run. Validates extents, the layout tag, leading-dim minimums, i32-fit of leading dims, and stride_d != 0 at batch > 1. stride_a / stride_b are accepted unconditionally (any value, including 0-broadcast).
baracuda_kernels_gemm_dense_f16_run^⚠: Dense f16 GEMM (cuBLAS-backed): D[g] = α · A[g] · B[g] + β · D[g] for g ∈ [0, batch), accumulating in f32. Row-major problem; see the module docs for the layout tag (0 = RRR, 1 = RCR, 2 = CRR), leading-dim minimums, and the batch-stride contract (element strides; stride_a/stride_b may be 0 to broadcast; strides ignored at batch == 1).
baracuda_kernels_gemm_dense_f16_workspace_size: Workspace query for baracuda_kernels_gemm_dense_f16_run. Always 0 — cuBLAS allocates its workspace internally per handle.
baracuda_kernels_gemm_dense_f32_can_implement: Host-side validity check for baracuda_kernels_gemm_dense_f32_run. Validates extents, the layout tag, leading-dim minimums, i32-fit of leading dims, and stride_d != 0 at batch > 1. stride_a / stride_b are accepted unconditionally (any value, including 0-broadcast).
baracuda_kernels_gemm_dense_f32_run^⚠: Dense f32 GEMM (cuBLAS-backed): D[g] = α · A[g] · B[g] + β · D[g] for g ∈ [0, batch), accumulating in IEEE binary32 (default math mode — NOT TF32). Row-major problem; see the module docs for the layout tag (0 = RRR, 1 = RCR, 2 = CRR), leading-dim minimums, and the batch-stride contract (element strides; stride_a/stride_b may be 0 to broadcast; strides ignored at batch == 1).
baracuda_kernels_gemm_dense_f32_workspace_size: Workspace query for baracuda_kernels_gemm_dense_f32_run. Always 0 — cuBLAS allocates its workspace internally per handle.
baracuda_kernels_gemm_dense_f64_can_implement: Host-side validity check for baracuda_kernels_gemm_dense_f64_run. Validates extents, the layout tag, leading-dim minimums, i32-fit of leading dims, and stride_d != 0 at batch > 1. stride_a / stride_b are accepted unconditionally (any value, including 0-broadcast).
baracuda_kernels_gemm_dense_f64_run^⚠: Dense f64 GEMM (cuBLAS-backed): D[g] = α · A[g] · B[g] + β · D[g] for g ∈ [0, batch), accumulating in f64. Row-major problem; see the module docs for the layout tag (0 = RRR, 1 = RCR, 2 = CRR), leading-dim minimums, and the batch-stride contract (element strides; stride_a/stride_b may be 0 to broadcast; strides ignored at batch == 1).
baracuda_kernels_gemm_dense_f64_workspace_size: Workspace query for baracuda_kernels_gemm_dense_f64_run. Always 0 — cuBLAS allocates its workspace internally per handle.
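The dense entries above state the batched contract D[g] = α · A[g] · B[g] + β · D[g] with element strides, 0-stride broadcast for A and B, and strides ignored at batch == 1. A host-side sketch for the RRR layout only could read as below. The function name `gemm_dense_ref` and the leading-dim minimums used in the assertion are assumptions; the module docs referenced by the entries remain the authority.

```rust
/// CPU reference for D[g] = alpha * A[g] * B[g] + beta * D[g], g in [0, batch),
/// RRR layout only (A, B, and D all row-major). Strides are in elements.
pub fn gemm_dense_ref(
    m: usize,
    n: usize,
    k: usize,
    batch: usize,
    alpha: f32,
    a: &[f32],
    lda: usize,
    stride_a: usize,
    b: &[f32],
    ldb: usize,
    stride_b: usize,
    beta: f32,
    d: &mut [f32],
    ldd: usize,
    stride_d: usize,
) {
    // ASSUMPTION: for RRR the leading-dim minimums are lda >= k, ldb >= n, ldd >= n.
    assert!(lda >= k && ldb >= n && ldd >= n);
    // Documented rule: stride_d must be non-zero when batch > 1.
    assert!(batch <= 1 || stride_d != 0);
    for g in 0..batch {
        // Strides are ignored at batch == 1; a stride of 0 reuses one matrix.
        let (off_a, off_b, off_d) = if batch == 1 {
            (0, 0, 0)
        } else {
            (g * stride_a, g * stride_b, g * stride_d)
        };
        for i in 0..m {
            for j in 0..n {
                let mut acc = 0.0f32;
                for p in 0..k {
                    acc += a[off_a + i * lda + p] * b[off_b + p * ldb + j];
                }
                let at = off_d + i * ldd + j;
                d[at] = alpha * acc + beta * d[at];
            }
        }
    }
}
```

Passing stride_b = 0 with a single B matrix applies the same weights to every batch entry, which is the broadcast case the entries call out.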
baracuda_kernels_gemm_f16_rcr_sm80_can_implement^⚠: Pre-launch implementability check for this Cutlass GEMM SKU.
baracuda_kernels_gemm_f16_rcr_sm80_run^⚠: Launch this Cutlass GEMM SKU on stream.
baracuda_kernels_gemm_f16_rcr_sm80_workspace_size^⚠: Workspace bytes required by this Cutlass GEMM SKU.
baracuda_kernels_gemm_f16_rrr_sm80_can_implement^⚠: Pre-launch implementability check for this Cutlass GEMM SKU.
baracuda_kernels_gemm_f16_rrr_sm80_run^⚠: Launch this Cutlass GEMM SKU on stream.
baracuda_kernels_gemm_f16_rrr_sm80_workspace_size^⚠: Workspace bytes required by this Cutlass GEMM SKU.
baracuda_kernels_gemm_f32_simt_rcr_sm80_can_implement^⚠: Pre-launch implementability check for this Cutlass GEMM SKU.
baracuda_kernels_gemm_f32_simt_rcr_sm80_run^⚠: Launch this Cutlass GEMM SKU on stream.
baracuda_kernels_gemm_f32_simt_rcr_sm80_workspace_size^⚠: Workspace bytes required by this Cutlass GEMM SKU.
baracuda_kernels_gemm_f32_simt_rrr_sm80_can_implement^⚠: Pre-launch implementability check for this Cutlass GEMM SKU.
baracuda_kernels_gemm_f32_simt_rrr_sm80_run^⚠: Launch this Cutlass GEMM SKU on stream.
baracuda_kernels_gemm_f32_simt_rrr_sm80_workspace_size^⚠: Workspace bytes required by this Cutlass GEMM SKU.
baracuda_kernels_gemm_f64_rcr_sm80_can_implement^⚠: Pre-launch implementability check.
baracuda_kernels_gemm_f64_rcr_sm80_run^⚠: Launch DGEMM. f64 alpha/beta.
baracuda_kernels_gemm_f64_rcr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_f64_rrr_sm80_can_implement^⚠: Pre-launch implementability check.
baracuda_kernels_gemm_f64_rrr_sm80_run^⚠: Launch DGEMM. f64 alpha/beta.
baracuda_kernels_gemm_f64_rrr_sm80_workspace_size^⚠: Workspace bytes required.
baracuda_kernels_gemm_s8_rcr_sm80_can_implement^⚠: Pre-launch implementability check for this Cutlass GEMM SKU.
baracuda_kernels_gemm_s8_rcr_sm80_run^⚠: Launch this Cutlass GEMM SKU on stream.
baracuda_kernels_gemm_s8_rcr_sm80_workspace_size^⚠: Workspace bytes required by this Cutlass GEMM SKU.
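Each SKU above exposes the same three calls: an implementability check, a workspace query, and a launch. The sketch below shows one way to sequence them and to decode the i32 status codes listed in the crate-level docs. The closures stand in for the extern functions because their signatures are not shown in this listing; `KernelStatus`, `check`, and `launch_checked` are hypothetical names.

```rust
/// Non-zero status codes from the crate-level "Status codes" list.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelStatus {
    MisalignedOperand, // 1
    InvalidProblem,    // 2
    NotSupported,      // 3
    Workspace,         // 4: too small, or null when required
    Internal,          // 5: typically a launch failure
    Unknown(i32),      // anything outside the documented range
}

/// Turn a raw status into a Result; 0 is success.
pub fn check(code: i32) -> Result<(), KernelStatus> {
    match code {
        0 => Ok(()),
        1 => Err(KernelStatus::MisalignedOperand),
        2 => Err(KernelStatus::InvalidProblem),
        3 => Err(KernelStatus::NotSupported),
        4 => Err(KernelStatus::Workspace),
        5 => Err(KernelStatus::Internal),
        other => Err(KernelStatus::Unknown(other)),
    }
}

/// Call order: can_implement, then workspace_size, then run.
/// ASSUMPTION: the closures wrap the unsafe extern calls for one SKU;
/// `run` receives the workspace byte count that was queried.
pub fn launch_checked<C, W, R>(
    can_implement: C,
    workspace_size: W,
    run: R,
) -> Result<usize, KernelStatus>
where
    C: FnOnce() -> i32,
    W: FnOnce() -> usize,
    R: FnOnce(usize) -> i32,
{
    check(can_implement())?;
    let bytes = workspace_size();
    // A real caller would allocate `bytes` of device memory at this point.
    check(run(bytes))?;
    Ok(bytes)
}
```

Because the check runs first, an unsupported shape returns early and the launch closure is never invoked.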
baracuda_kernels_gemm_s8_rrr_sm80_bias_f32_can_implement^⚠: Pre-launch implementability check for the S8 RRR sm_80 bias SKU, f32 variant.
baracuda_kernels_gemm_s8_rrr_sm80_bias_f32_run^⚠: S8 GEMM, RRR layout, bias epilogue, f32 variant, sm_80.
baracuda_kernels_gemm_s8_rrr_sm80_bias_gelu_f32_can_implement^⚠: Pre-launch implementability check for the S8 RRR sm_80 bias + GELU SKU, f32 variant.
baracuda_kernels_gemm_s8_rrr_sm80_bias_gelu_f32_run^⚠: S8 GEMM, RRR layout, bias + GELU epilogue, f32 variant, sm_80.
baracuda_kernels_gemm_s8_rrr_sm80_bias_gelu_i32_can_implement^⚠: Pre-launch implementability check for the S8 RRR sm_80 bias + GELU SKU, i32 variant.
baracuda_kernels_gemm_s8_rrr_sm80_bias_gelu_i32_run^⚠: S8 GEMM, RRR layout, bias + GELU epilogue, i32 variant, sm_80.
baracuda_kernels_gemm_s8_rrr_sm80_bias_i32_can_implement^⚠: Pre-launch implementability check for the S8 RRR sm_80 bias SKU, i32 variant.
baracuda_kernels_gemm_s8_rrr_sm80_bias_i32_run^⚠: S8 GEMM, RRR layout, bias epilogue, i32 variant, sm_80.
baracuda_kernels_gemm_s8_rrr_sm80_bias_relu_f32_can_implement^⚠: Pre-launch implementability check for the S8 RRR sm_80 bias + ReLU SKU, f32 variant.
baracuda_kernels_gemm_s8_rrr_sm80_bias_relu_f32_run^⚠: S8 GEMM, RRR layout, bias + ReLU epilogue, f32 variant, sm_80.
baracuda_kernels_gemm_s8_rrr_sm80_bias_relu_i32_can_implement^⚠: Pre-launch implementability check for the S8 RRR sm_80 bias + ReLU SKU, i32 variant.
baracuda_kernels_gemm_s8_rrr_sm80_bias_relu_i32_run^⚠: S8 GEMM, RRR layout, bias + ReLU epilogue, i32 variant, sm_80.
baracuda_kernels_gemm_s8_rrr_sm80_bias_silu_f32_can_implement^⚠: Pre-launch implementability check for the S8 RRR sm_80 bias + SiLU SKU, f32 variant.
baracuda_kernels_gemm_s8_rrr_sm80_bias_silu_f32_run^⚠: S8 GEMM, RRR layout, bias + SiLU epilogue, f32 variant, sm_80.
baracuda_kernels_gemm_s8_rrr_sm80_bias_silu_i32_can_implement^⚠: Pre-launch implementability check for the S8 RRR sm_80 bias + SiLU SKU, i32 variant.
baracuda_kernels_gemm_s8_rrr_sm80_bias_silu_i32_run^⚠: S8 GEMM, RRR layout, bias + SiLU epilogue, i32 variant, sm_80.
baracuda_kernels_gemm_s8_rrr_sm80_can_implement^⚠: Pre-launch implementability check for the S8 RRR sm_80 Identity SKU.
baracuda_kernels_gemm_s8_rrr_sm80_run^⚠: S8 GEMM, RRR layout, Identity epilogue, sm_80.
baracuda_kernels_gemm_s8_rrr_sm80_workspace_size^⚠: Workspace size in bytes for the S8 RRR sm_80 Identity SKU at the given problem size. Always returns zero today; reserved for future SKUs that need scratch.
baracuda_kernels_gemm_tf32_rcr_sm80_can_implement^⚠: Pre-launch implementability check for this Cutlass GEMM SKU.
baracuda_kernels_gemm_tf32_rcr_sm80_run^⚠: Launch this Cutlass GEMM SKU on stream.
baracuda_kernels_gemm_tf32_rcr_sm80_workspace_size^⚠: Workspace bytes required by this Cutlass GEMM SKU.
baracuda_kernels_gemm_tf32_rrr_sm80_can_implement^⚠: Pre-launch implementability check for this Cutlass GEMM SKU.
baracuda_kernels_gemm_tf32_rrr_sm80_run^⚠: Launch this Cutlass GEMM SKU on stream.
baracuda_kernels_gemm_tf32_rrr_sm80_workspace_size^⚠: Workspace bytes required by this Cutlass GEMM SKU.
baracuda_kernels_gemm_u8_rcr_sm80_can_implement^⚠: Pre-launch implementability check for this Cutlass GEMM SKU.
baracuda_kernels_gemm_u8_rcr_sm80_run^⚠: Launch this Cutlass GEMM SKU on stream.
baracuda_kernels_gemm_u8_rcr_sm80_workspace_size^⚠: Workspace bytes required by this Cutlass GEMM SKU.
baracuda_kernels_gemm_u8_rrr_sm80_bias_f32_can_implement^⚠: Pre-launch implementability check for the U8 RRR sm_80 bias SKU, f32 variant.
baracuda_kernels_gemm_u8_rrr_sm80_bias_f32_run^⚠: U8 GEMM, RRR layout, bias epilogue, f32 variant, sm_80.
baracuda_kernels_gemm_u8_rrr_sm80_bias_gelu_f32_can_implement^⚠: Pre-launch implementability check for the U8 RRR sm_80 bias + GELU SKU, f32 variant.
baracuda_kernels_gemm_u8_rrr_sm80_bias_gelu_f32_run^⚠: U8 GEMM, RRR layout, bias + GELU epilogue, f32 variant, sm_80.
baracuda_kernels_gemm_u8_rrr_sm80_bias_gelu_i32_can_implement^⚠: Pre-launch implementability check for the U8 RRR sm_80 bias + GELU SKU, i32 variant.
baracuda_kernels_gemm_u8_rrr_sm80_bias_gelu_i32_run^⚠: U8 GEMM, RRR layout, bias + GELU epilogue, i32 variant, sm_80.
baracuda_kernels_gemm_u8_rrr_sm80_bias_i32_can_implement^⚠: Pre-launch implementability check for the U8 RRR sm_80 bias SKU, i32 variant.
baracuda_kernels_gemm_u8_rrr_sm80_bias_i32_run^⚠: U8 GEMM, RRR layout, bias epilogue, i32 variant, sm_80.
baracuda_kernels_gemm_u8_rrr_sm80_bias_relu_f32_can_implement^⚠: Pre-launch implementability check for the U8 RRR sm_80 bias + ReLU SKU, f32 variant.
baracuda_kernels_gemm_u8_rrr_sm80_bias_relu_f32_run^⚠: U8 GEMM, RRR layout, bias + ReLU epilogue, f32 variant, sm_80.
baracuda_kernels_gemm_u8_rrr_sm80_bias_relu_i32_can_implement^⚠: Pre-launch implementability check for the U8 RRR sm_80 bias + ReLU SKU, i32 variant.
baracuda_kernels_gemm_u8_rrr_sm80_bias_relu_i32_run^⚠: U8 GEMM, RRR layout, bias + ReLU epilogue, i32 variant, sm_80.
baracuda_kernels_gemm_u8_rrr_sm80_bias_silu_f32_can_implement^⚠: Pre-launch implementability check for the U8 RRR sm_80 bias + SiLU SKU, f32 variant.
baracuda_kernels_gemm_u8_rrr_sm80_bias_silu_f32_run^⚠: U8 GEMM, RRR layout, bias + SiLU epilogue, f32 variant, sm_80.
baracuda_kernels_gemm_u8_rrr_sm80_bias_silu_i32_can_implement^⚠: Pre-launch implementability check for the U8 RRR sm_80 bias + SiLU SKU, i32 variant.
baracuda_kernels_gemm_u8_rrr_sm80_bias_silu_i32_run^⚠: U8 GEMM, RRR layout, bias + SiLU epilogue, i32 variant, sm_80.
baracuda_kernels_gemm_u8_rrr_sm80_can_implement^⚠: Pre-launch implementability check for the U8 RRR sm_80 Identity SKU.
baracuda_kernels_gemm_u8_rrr_sm80_run^⚠: U8 GEMM, RRR layout, Identity epilogue, sm_80.
baracuda_kernels_gemm_u8_rrr_sm80_workspace_size^⚠: Workspace size in bytes for the U8 RRR sm_80 Identity SKU.
baracuda_kernels_grid_sample_2d_backward_f32_can_implement^⚠: Implementability check for grid_sample_2d_backward_f32.
baracuda_kernels_grid_sample_2d_backward_f32_run^⚠: grid_sample_2d BW, f32. Caller pre-zeros dinput and dgrid. dgrid: [N, OH, OW, 2]. Safety: same requirements as the forward kernel.
baracuda_kernels_grid_sample_2d_backward_f64_can_implement^⚠: Implementability check for grid_sample_2d_backward_f64.
baracuda_kernels_grid_sample_2d_backward_f64_run^⚠: grid_sample_2d BW, f64. Safety: same requirements as the f32 backward kernel.
baracuda_kernels_grid_sample_2d_f32_can_implement^⚠: Implementability check for grid_sample_2d_f32.
baracuda_kernels_grid_sample_2d_f32_run^⚠: grid_sample(input, grid) FW, f32. grid: [N, OH, OW, 2] with (x, y) normalized in [-1, 1]. Safety: same requirements as the interpolate_* kernels.
baracuda_kernels_grid_sample_2d_f64_can_implement^⚠: Implementability check for grid_sample_2d_f64.
baracuda_kernels_grid_sample_2d_f64_run^⚠: grid_sample_2d FW, f64. Safety: same requirements as the f32 forward kernel.
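The grid_sample entries above take (x, y) grid coordinates normalized to [-1, 1]. The sketch below shows how such a coordinate is commonly mapped back to pixel space and used for a bilinear lookup. The names `unnormalize` and `sample_bilinear` are hypothetical. The `align_corners` switch, the bilinear mode, and the zero padding for out-of-range taps follow the usual PyTorch-style conventions and are assumptions here, since the entries do not say which conventions the kernels use.

```rust
/// Map a normalized coordinate in [-1, 1] to a pixel coordinate along an
/// axis that is `size` pixels long.
/// ASSUMPTION: both common conventions are shown; the kernel's choice is not documented here.
pub fn unnormalize(coord: f32, size: usize, align_corners: bool) -> f32 {
    let s = size as f32;
    if align_corners {
        // -1 maps to the center of pixel 0, +1 to the center of pixel size - 1.
        (coord + 1.0) * 0.5 * (s - 1.0)
    } else {
        // -1 and +1 map to the outer edges of the first and last pixels.
        ((coord + 1.0) * s - 1.0) * 0.5
    }
}

/// Bilinear sample of a single-channel, row-major [h, w] image at normalized (x, y).
/// ASSUMPTION: taps that fall outside the image contribute zero.
pub fn sample_bilinear(
    img: &[f32],
    h: usize,
    w: usize,
    x: f32,
    y: f32,
    align_corners: bool,
) -> f32 {
    assert_eq!(img.len(), h * w);
    let px = unnormalize(x, w, align_corners);
    let py = unnormalize(y, h, align_corners);
    let (x0, y0) = (px.floor(), py.floor());
    let (fx, fy) = (px - x0, py - y0);
    let tap = |yy: f32, xx: f32| -> f32 {
        if xx < 0.0 || yy < 0.0 || xx >= (w as f32) || yy >= (h as f32) {
            0.0
        } else {
            img[(yy as usize) * w + (xx as usize)]
        }
    };
    tap(y0, x0) * (1.0 - fx) * (1.0 - fy)
        + tap(y0, x0 + 1.0) * fx * (1.0 - fy)
        + tap(y0 + 1.0, x0) * (1.0 - fx) * fy
        + tap(y0 + 1.0, x0 + 1.0) * fx * fy
}
```

For a 2 x 2 image with `align_corners` set, the normalized point (0, 0) lands midway between all four pixels and returns their average.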
baracuda_kernels_group_norm_backward_bf16_can_implement^⚠: Implementability check for group_norm_backward_bf16.
baracuda_kernels_group_norm_backward_bf16_run^⚠: GroupNorm BW, bf16.
baracuda_kernels_group_norm_backward_f16_can_implement^⚠: Implementability check for group_norm_backward_f16.
baracuda_kernels_group_norm_backward_f16_run^⚠: GroupNorm BW, f16.
baracuda_kernels_group_norm_backward_f32_can_implement^⚠: Implementability check for group_norm_backward_f32.
baracuda_kernels_group_norm_backward_f32_run^⚠: GroupNorm BW, f32. Workspace size: 2 * (n_extent * num_groups) * sizeof(float) bytes for the stage-1 partial sums.
baracuda_kernels_group_norm_backward_f64_can_implement^⚠: Implementability check for group_norm_backward_f64.
baracuda_kernels_group_norm_backward_f64_run^⚠: GroupNorm BW, f64.
baracuda_kernels_group_norm_bf16_can_implement^⚠: Implementability check for group_norm_bf16.
baracuda_kernels_group_norm_bf16_run^⚠: GroupNorm FW, bf16.
baracuda_kernels_group_norm_f16_can_implement^⚠: Implementability check for group_norm_f16.
baracuda_kernels_group_norm_f16_run^⚠: GroupNorm FW, f16.
baracuda_kernels_group_norm_f32_can_implement^⚠: Implementability check for group_norm_f32.
baracuda_kernels_group_norm_f32_run^⚠: GroupNorm FW, f32. Per (sample, group) mean / inv_std, per-channel affine. num_groups must divide c_extent. group_kind = 1 selects the GN dispatch (also used by InstanceNorm with num_groups == c_extent).
baracuda_kernels_group_norm_f64_can_implement^⚠: Implementability check for group_norm_f64.
baracuda_kernels_group_norm_f64_run^⚠: GroupNorm FW, f64.
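The f32 backward entry above gives its workspace size as 2 * (n_extent * num_groups) * sizeof(float) bytes. A small helper that evaluates this formula with overflow checks might look like the following; the function name is hypothetical, and the formula is only stated for the f32 kernel, so the other dtypes may differ.

```rust
/// Workspace bytes for the f32 GroupNorm backward kernel, per the formula
/// 2 * (n_extent * num_groups) * sizeof(float).
/// Returns None if the product overflows usize.
pub fn group_norm_bw_workspace_bytes(n_extent: usize, num_groups: usize) -> Option<usize> {
    n_extent
        .checked_mul(num_groups)?
        .checked_mul(2)?
        .checked_mul(std::mem::size_of::<f32>())
}
```

With n_extent = 8 and num_groups = 32 this evaluates to 2 * 256 * 4 = 2048 bytes.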
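The f32 forward entry above describes a per (sample, group) mean and inv_std followed by a per-channel affine, with num_groups dividing c_extent. A host-side sketch of that computation is given below. The function name `group_norm_ref`, the contiguous [N, C, spatial] layout, the `eps` parameter, and the use of the biased variance are all assumptions made for illustration.

```rust
/// CPU reference for GroupNorm forward over a contiguous [n, c, spatial] tensor.
/// ASSUMPTION: biased variance and inv_std = 1 / sqrt(var + eps).
pub fn group_norm_ref(
    x: &[f32],
    n: usize,
    c: usize,
    spatial: usize,
    num_groups: usize,
    gamma: &[f32],
    beta: &[f32],
    eps: f32,
) -> Vec<f32> {
    assert!(num_groups > 0 && c % num_groups == 0, "num_groups must divide c");
    assert_eq!(x.len(), n * c * spatial);
    assert_eq!(gamma.len(), c);
    assert_eq!(beta.len(), c);
    let channels_per_group = c / num_groups;
    let group_len = channels_per_group * spatial;
    let mut y = vec![0.0f32; x.len()];
    for sample in 0..n {
        for g in 0..num_groups {
            // Channels of one group are adjacent, so the group is one contiguous slice.
            let start = (sample * c + g * channels_per_group) * spatial;
            let group = &x[start..start + group_len];
            let mean = group.iter().sum::<f32>() / group_len as f32;
            let var = group.iter().map(|v| (v - mean) * (v - mean)).sum::<f32>()
                / group_len as f32;
            let inv_std = 1.0 / (var + eps).sqrt();
            for ch in 0..channels_per_group {
                let channel = g * channels_per_group + ch;
                for i in 0..spatial {
                    let at = start + ch * spatial + i;
                    // Normalize with group statistics, then apply the per-channel affine.
                    y[at] = (x[at] - mean) * inv_std * gamma[channel] + beta[channel];
                }
            }
        }
    }
    y
}
```

Setting num_groups equal to c reproduces the InstanceNorm case the entry mentions, since each group then holds exactly one channel.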
baracuda_kernels_gumbel_softmax_bf16_can_implement^⚠: Implementability check for gumbel_softmax_bf16.
baracuda_kernels_gumbel_softmax_bf16_run^⚠: GumbelSoftmax FW, bf16.
baracuda_kernels_gumbel_softmax_f16_can_implement^⚠: Implementability check for gumbel_softmax_f16.
baracuda_kernels_gumbel_softmax_f16_run^⚠: GumbelSoftmax FW, f16.
baracuda_kernels_gumbel_softmax_f32_can_implement^⚠: Implementability check for gumbel_softmax_f32.
baracuda_kernels_gumbel_softmax_f32_run^⚠: GumbelSoftmax FW, f32. y = softmax((x + g) / τ) where g = -log(-log(u)) and u is a caller-supplied cuRAND uniform buffer (one f32 per output cell, dense / contiguous layout). inv_tau = 1/τ. hard != 0 → one-hot at the noisy argmax.
baracuda_kernels_gumbel_softmax_f64_can_implement^⚠: Implementability check for gumbel_softmax_f64.
baracuda_kernels_gumbel_softmax_f64_run^⚠: GumbelSoftmax FW, f64.
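The f32 entry above defines y = softmax((x + g) / τ) with g = -log(-log(u)) and a hard mode that emits a one-hot vector at the noisy argmax. A single-row host sketch of that formula could look like this. The function name `gumbel_softmax_row` is hypothetical, and the tie-breaking rule (first maximum wins) is an assumption.

```rust
/// Gumbel-softmax over one row. `u` holds one uniform sample in (0, 1) per
/// element of `x`; `inv_tau` is 1 / tau.
/// ASSUMPTION: in hard mode, ties resolve to the lowest index.
pub fn gumbel_softmax_row(x: &[f32], u: &[f32], inv_tau: f32, hard: bool) -> Vec<f32> {
    assert!(!x.is_empty());
    assert_eq!(x.len(), u.len());
    // Noisy logits z = (x + g) * inv_tau with g = -ln(-ln(u)).
    let z: Vec<f32> = x
        .iter()
        .zip(u)
        .map(|(&xi, &ui)| (xi - (-(ui.ln())).ln()) * inv_tau)
        .collect();
    if hard {
        let mut best = 0;
        for (i, &v) in z.iter().enumerate() {
            if v > z[best] {
                best = i;
            }
        }
        let mut y = vec![0.0f32; z.len()];
        y[best] = 1.0;
        return y;
    }
    // Numerically stable softmax: subtract the row maximum before exponentiating.
    let max = z.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = z.iter().map(|&v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```

Feeding the same `u` buffer to this function and to the kernel makes the comparison deterministic, which is presumably why the uniform samples are caller-supplied.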
baracuda_kernels_histogram_f32_can_implement^⚠: Implementability check for histogram_f32.
baracuda_kernels_histogram_f32_run^⚠: 1-D histogram, f32 input. lo / hi are passed as double; the kernel casts them to T, which keeps the FFI shape uniform across dtypes.
baracuda_kernels_histogram_f64_can_implement^⚠: Implementability check for histogram_f64.
baracuda_kernels_histogram_f64_run^⚠: 1-D histogram, f64 input.
baracuda_kernels_ifftshift_4_can_implement^⚠: Implementability check for ifftshift_4.
baracuda_kernels_ifftshift_4_run^⚠: Inverse fftshift along the last axis of a [batch, n] tensor: y[b, i] = x[b, (i + (n + 1) / 2) % n]. Differs from fftshift only for odd n; for even n the two are identical (each permutation is self-inverse). 4-byte cells.
baracuda_kernels_ifftshift_8_can_implement^⚠: Implementability check for ifftshift_8.
baracuda_kernels_ifftshift_8_run^⚠: 8-byte-element inverse fftshift.
baracuda_kernels_ifftshift_16_can_implement^⚠: Implementability check for ifftshift_16.
baracuda_kernels_ifftshift_16_run^⚠: 16-byte-element inverse fftshift.
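The 4-byte entry above gives the index formula y[b, i] = x[b, (i + (n + 1) / 2) % n] and notes that the two shifts coincide for even n. The sketch below implements that gather with a configurable offset so the two half-length offsets, (n + 1) / 2 and n / 2, can be compared. All names are hypothetical. As a point of comparison, NumPy's ifftshift corresponds to a gather offset of n / 2, so the odd-n behaviour is worth confirming against the kernel.

```rust
/// Cyclic gather along the last axis of a [batch, n] tensor:
/// y[b, i] = x[b, (i + offset) % n].
pub fn cyclic_gather<T: Copy>(x: &[T], batch: usize, n: usize, offset: usize) -> Vec<T> {
    assert_eq!(x.len(), batch * n);
    let mut y = Vec::with_capacity(x.len());
    for b in 0..batch {
        for i in 0..n {
            y.push(x[b * n + (i + offset) % n]);
        }
    }
    y
}

/// Offset used by the formula in the entry: (n + 1) / 2 with integer division.
pub fn entry_offset(n: usize) -> usize {
    (n + 1) / 2
}

/// The other half-length offset: n / 2 with integer division.
pub fn floor_offset(n: usize) -> usize {
    n / 2
}
```

For even n both offsets equal n / 2 and the permutation is its own inverse. For odd n they differ by one, and applying one after the other restores the original order.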
baracuda_kernels_im2col_1d_bf16_can_implement^⚠: Implementability check for im2col_1d_bf16.
baracuda_kernels_im2col_1d_bf16_run^⚠: im2col 1-D, bf16.
baracuda_kernels_im2col_1d_f16_can_implement^⚠: Implementability check for im2col_1d_f16.
baracuda_kernels_im2col_1d_f16_run^⚠: im2col 1-D, f16.
baracuda_kernels_im2col_1d_f32_can_implement^⚠: Implementability check for im2col_1d_f32.
baracuda_kernels_im2col_1d_f32_run^⚠: im2col 1-D, f32.
baracuda_kernels_im2col_1d_f64_can_implement^⚠: Implementability check for im2col_1d_f64.
baracuda_kernels_im2col_1d_f64_run^⚠: im2col 1-D, f64.
baracuda_kernels_im2col_2d_bf16_can_implement^⚠: Implementability check for im2col_2d_bf16.
baracuda_kernels_im2col_2d_bf16_run^⚠: im2col 2-D, bf16.
baracuda_kernels_im2col_2d_f16_can_implement^⚠: Implementability check for im2col_2d_f16.
baracuda_kernels_im2col_2d_f16_run^⚠: im2col 2-D, f16.
baracuda_kernels_im2col_2d_f32_can_implement^⚠: Implementability check for im2col_2d_f32.
baracuda_kernels_im2col_2d_f32_run^⚠: im2col 2-D, f32.
baracuda_kernels_im2col_2d_f64_can_implement^⚠: Implementability check for im2col_2d_f64.
baracuda_kernels_im2col_2d_f64_run^⚠: im2col 2-D, f64.
baracuda_kernels_index_add_bf16_can_implement^⚠: Implementability check for index_add_bf16.
baracuda_kernels_index_add_bf16_run^⚠: index_add for bf16, i32 idx. Uses the atomicCAS-based baracuda::atomic::add<__nv_bfloat16> (same caveats as f16).
baracuda_kernels_index_add_f16_can_implement^⚠: Implementability check for index_add_f16.
baracuda_kernels_index_add_f16_run^⚠: index_add for f16, i32 idx. Uses the atomicCAS-based baracuda::atomic::add<__half> (deterministic per-thread arithmetic regardless of CUDA toolkit; non-deterministic accumulation order).
baracuda_kernels_index_add_f32_can_implement^⚠: Implementability check for index_add_f32.
baracuda_kernels_index_add_f32_run^⚠: index_add — dst[idx[i], ...] += src[i, ...], f32, i32 idx.
baracuda_kernels_index_add_f64_can_implement^⚠: Implementability check for index_add_f64.
baracuda_kernels_index_add_f64_run^⚠: index_add — f64, i32 idx.
baracuda_kernels_index_add_i32_can_implement^⚠: Implementability check for index_add_i32.
baracuda_kernels_index_add_i32_run^⚠: index_add for i32 values, i32 idx.
baracuda_kernels_index_add_i64_can_implement^⚠: Implementability check for index_add_i64.
baracuda_kernels_index_add_i64_run^⚠: index_add for i64 values, i32 idx.
baracuda_kernels_index_add_i64idx_bf16_can_implement^⚠: Implementability check for index_add_i64idx_bf16.
baracuda_kernels_index_add_i64idx_bf16_run^⚠: index_add — bf16, i64 idx.
baracuda_kernels_index_add_i64idx_f16_can_implement^⚠: Implementability check for index_add_i64idx_f16.
baracuda_kernels_index_add_i64idx_f16_run^⚠: index_add — f16, i64 idx.
baracuda_kernels_index_add_i64idx_f32_can_implement^⚠: Implementability check for index_add_i64idx_f32.
baracuda_kernels_index_add_i64idx_f32_run^⚠: index_add — f32, i64 idx.
baracuda_kernels_index_add_i64idx_f64_can_implement^⚠: Implementability check for index_add_i64idx_f64.
baracuda_kernels_index_add_i64idx_f64_run^⚠: index_add — f64, i64 idx.
baracuda_kernels_index_add_i64idx_i32_can_implement^⚠: Implementability check for index_add_i64idx_i32.
baracuda_kernels_index_add_i64idx_i32_run^⚠: index_add for i32 values, i64 idx.
baracuda_kernels_index_add_i64idx_i64_can_implement^⚠: Implementability check for index_add_i64idx_i64.
baracuda_kernels_index_add_i64idx_i64_run^⚠: index_add for i64 values, i64 idx.
baracuda_kernels_index_add_i64idx_u32_can_implement^⚠: Implementability check for index_add_i64idx_u32.
baracuda_kernels_index_add_i64idx_u32_run^⚠: index_add for u32 values, i64 idx.
baracuda_kernels_index_add_u32_can_implement^⚠: Implementability check for index_add_u32.
baracuda_kernels_index_add_u32_run^⚠: index_add for u32 values, i32 idx.
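The f32 entry above defines index_add as dst[idx[i], ...] += src[i, ...]. A sequential host sketch of that update is shown below. The function name `index_add_ref` is hypothetical, and rejecting out-of-range indices up front is an assumption; the listing does not say how the kernels treat them.

```rust
/// CPU reference for dst[idx[i], ...] += src[i, ...], where each "..." row
/// has `row_len` contiguous elements.
/// Returns Err(position) for the first out-of-range index, before touching dst.
pub fn index_add_ref(
    dst: &mut [f32],
    dst_rows: usize,
    src: &[f32],
    idx: &[i32],
    row_len: usize,
) -> Result<(), usize> {
    assert_eq!(dst.len(), dst_rows * row_len);
    assert_eq!(src.len(), idx.len() * row_len);
    // ASSUMPTION: validate every index first so a bad one leaves dst unchanged.
    if let Some(bad) = idx.iter().position(|&r| r < 0 || (r as usize) >= dst_rows) {
        return Err(bad);
    }
    for (i, &r) in idx.iter().enumerate() {
        let row = r as usize;
        for j in 0..row_len {
            // Duplicate indices accumulate into the same destination row.
            dst[row * row_len + j] += src[i * row_len + j];
        }
    }
    Ok(())
}
```

The loop here adds in a fixed order, whereas the device kernels accumulate through atomics, so floating-point results may differ in the last bits when indices repeat.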
baracuda_kernels_index_select_backward_f32_can_implement^⚠: Implementability check for index_select_backward_f32.
baracuda_kernels_index_select_backward_f32_run^⚠: dsrc[..., idx[j], ...] += dout[..., j, ...] along select_dim. f32 (atomicAdd).
baracuda_kernels_index_select_backward_f64_can_implement^⚠: Implementability check for index_select_backward_f64.
baracuda_kernels_index_select_backward_f64_run^⚠: index_select_backward — f64 (atomicAdd).
baracuda_kernels_index_select_backward_i64idx_f32_can_implement^⚠: Implementability check for index_select_backward_i64idx_f32.
baracuda_kernels_index_select_backward_i64idx_f32_run^⚠: index_select BW — f32, i64 indices.
baracuda_kernels_index_select_backward_i64idx_f64_can_implement^⚠: Implementability check for index_select_backward_i64idx_f64.
baracuda_kernels_index_select_backward_i64idx_f64_run^⚠: index_select BW — f64, i64 indices.
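The backward entries above scatter gradients with dsrc[..., idx[j], ...] += dout[..., j, ...] along select_dim. Treating the dimensions before select_dim as `outer` and those after it as `inner`, a host sketch could read as follows. The function name and the outer / inner decomposition are illustrative assumptions.

```rust
/// CPU reference for the index_select backward scatter.
/// dout has shape [outer, idx.len(), inner]; the result has shape
/// [outer, src_dim, inner] and starts from zero.
pub fn index_select_backward_ref(
    dout: &[f32],
    idx: &[i32],
    outer: usize,
    src_dim: usize,
    inner: usize,
) -> Vec<f32> {
    assert_eq!(dout.len(), outer * idx.len() * inner);
    let mut dsrc = vec![0.0f32; outer * src_dim * inner];
    for o in 0..outer {
        for (j, &r) in idx.iter().enumerate() {
            // ASSUMPTION: indices must lie in [0, src_dim).
            assert!(r >= 0 && (r as usize) < src_dim, "index out of range");
            let row = r as usize;
            for i in 0..inner {
                // Repeated indices sum their gradients into the same source slot.
                dsrc[(o * src_dim + row) * inner + i] += dout[(o * idx.len() + j) * inner + i];
            }
        }
    }
    dsrc
}
```

Source positions that no index selects keep a zero gradient, which matches starting from a zero-filled buffer.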
baracuda_kernels_index_select_f32_can_implement^⚠: Implementability check for index_select_f32.
baracuda_kernels_index_select_f32_run^⚠: out[..., j, ...] = src[..., idx[j], ...] along select_dim. idx is 1-D i32. f32.
baracuda_kernels_index_select_f64_can_implement^⚠: Implementability check for index_select_f64.
baracuda_kernels_index_select_f64_run^⚠: index_select — f64.
baracuda_kernels_index_select_i8_can_implement^⚠: Implementability check for index_select_i8.
baracuda_kernels_index_select_i8_run^⚠: index_select for i8 values, i32 indices.
baracuda_kernels_index_select_i16_can_implement^⚠: Implementability check for index_select_i16.
baracuda_kernels_index_select_i16_run^⚠: index_select for i16 values, i32 indices.
baracuda_kernels_index_select_i32_can_implement^⚠: Implementability check for index_select_i32.
baracuda_kernels_index_select_i32_run^⚠: index_select — i32.
baracuda_kernels_index_select_i64_can_implement^⚠: Implementability check for index_select_i64.
baracuda_kernels_index_select_i64_run^⚠: index_select for i64 values, i32 indices.
baracuda_kernels_index_select_i64idx_f32_can_implement^⚠: Implementability check for index_select_i64idx_f32.
baracuda_kernels_index_select_i64idx_f32_run^⚠: index_select — f32, i64 indices.
baracuda_kernels_index_select_i64idx_f64_can_implement^⚠: Implementability check for index_select_i64idx_f64.
baracuda_kernels_index_select_i64idx_f64_run^⚠: index_select — f64, i64 indices.
baracuda_kernels_index_select_i64idx_i8_can_implement^⚠: Implementability check for index_select_i64idx_i8.
baracuda_kernels_index_select_i64idx_i8_run^⚠: index_select for i8 values, i64 indices.
baracuda_kernels_index_select_i64idx_i16_can_implement^⚠: Implementability check for index_select_i64idx_i16.
baracuda_kernels_index_select_i64idx_i16_run^⚠: index_select for i16 values, i64 indices.
baracuda_kernels_index_select_i64idx_i32_can_implement^⚠: Implementability check for index_select_i64idx_i32.
baracuda_kernels_index_select_i64idx_i32_run^⚠: index_select — i32 values, i64 indices.
baracuda_kernels_index_select_i64idx_i64_can_implement^⚠: Implementability check for index_select_i64idx_i64.
baracuda_kernels_index_select_i64idx_i64_run^⚠: index_select for i64 values, i64 indices.
baracuda_kernels_index_select_i64idx_u8_can_implement^⚠: Implementability check for index_select_i64idx_u8.
baracuda_kernels_index_select_i64idx_u8_run^⚠: index_select for u8 values, i64 indices.
baracuda_kernels_index_select_i64idx_u16_can_implement^⚠: Implementability check for index_select_i64idx_u16.
baracuda_kernels_index_select_i64idx_u16_run^⚠: index_select for u16 values, i64 indices.
baracuda_kernels_index_select_i64idx_u32_can_implement^⚠: Implementability check for index_select_i64idx_u32.
baracuda_kernels_index_select_i64idx_u32_run^⚠: index_select for u32 values, i64 indices.
baracuda_kernels_index_select_u8_can_implement^⚠: Implementability check for index_select_u8.
baracuda_kernels_index_select_u8_run^⚠: index_select for u8 values, i32 indices.
baracuda_kernels_index_select_u16_can_implement^⚠: Implementability check for index_select_u16.
baracuda_kernels_index_select_u16_run^⚠: index_select for u16 values, i32 indices.
baracuda_kernels_index_select_u32_can_implement^⚠: Implementability check for index_select_u32.
baracuda_kernels_index_select_u32_run^⚠: index_select for u32 values, i32 indices.
baracuda_kernels_interpolate_bilinear_2d_backward_bf16_can_implement^⚠: Implementability check for interpolate_bilinear_2d_backward_bf16.
baracuda_kernels_interpolate_bilinear_2d_backward_bf16_run^⚠: interpolate_bilinear_2d BW, bf16. Caller pre-zeros dinput. Uses an atomicCAS-based bf16 atomic add. Safety: same requirements as the f32 BW variant.
baracuda_kernels_interpolate_bilinear_2d_backward_f16_can_implement^⚠: Implementability check for interpolate_bilinear_2d_backward_f16.
baracuda_kernels_interpolate_bilinear_2d_backward_f16_run^⚠: interpolate_bilinear_2d BW, f16. Caller pre-zeros dinput. Uses an atomicCAS-based half atomic add. Safety: same requirements as the f32 BW variant.
baracuda_kernels_interpolate_bilinear_2d_backward_f32_can_implement^⚠: Implementability check for interpolate_bilinear_2d_backward_f32.
baracuda_kernels_interpolate_bilinear_2d_backward_f32_run^⚠: interpolate_bilinear_2d BW, f32. Caller pre-zeros dinput.
baracuda_kernels_interpolate_bilinear_2d_backward_f64_can_implement^⚠: Implementability check for interpolate_bilinear_2d_backward_f64.
baracuda_kernels_interpolate_bilinear_2d_backward_f64_run^⚠: interpolate_bilinear_2d BW, f64. Safety: same requirements as the f32 BW variant.
baracuda_kernels_interpolate_bilinear_2d_bf16_can_implement^⚠: Implementability check for interpolate_bilinear_2d_bf16.
baracuda_kernels_interpolate_bilinear_2d_bf16_run^⚠: interpolate_bilinear_2d FW, bf16. Cast-at-read / f32 accumulator / cast-at-write. Safety: same requirements as the f32 FW variant.
baracuda_kernels_interpolate_bilinear_2d_f16_can_implement^⚠: Implementability check for interpolate_bilinear_2d_f16.
baracuda_kernels_interpolate_bilinear_2d_f16_run^⚠: interpolate_bilinear_2d FW, f16 (half). Cast-at-read / f32 accumulator / cast-at-write. Safety: same requirements as the f32 FW variant.
baracuda_kernels_interpolate_bilinear_2d_f32_can_implement^⚠: Implementability check for interpolate_bilinear_2d_f32.
baracuda_kernels_interpolate_bilinear_2d_f32_run^⚠: interpolate(x, mode='bilinear') FW, f32. input: [N, C, IH, IW]; output: [N, C, OH, OW]. NCHW. align_corners: 0 = false (PyTorch default), nonzero = true. scale_h_factor / scale_w_factor: 0.0 = derive from sizes; nonzero = use as SCALE override.
baracuda_kernels_interpolate_bilinear_2d_f64_can_implement^⚠: Implementability check for interpolate_bilinear_2d_f64.
baracuda_kernels_interpolate_bilinear_2d_f64_run^⚠: interpolate_bilinear_2d FW, f64. Safety: same requirements as the f32 FW variant.
baracuda_kernels_inverse_f32_run^⚠: Matrix inverse via getrf + getrs over caller-staged identity in inv_inout. The caller MUST pre-stage an n × n identity in inv_inout (column-major) before invoking. After the call, inv_inout holds A^{-1} and a_inout holds the packed LU factors.
baracuda_kernels_inverse_f32_workspace_size^⚠: Inverse workspace size (== getrf workspace).
baracuda_kernels_inverse_f64_run^⚠: Matrix inverse via getrf + getrs over caller-staged identity in inv_inout. The caller MUST pre-stage an n × n identity in inv_inout (column-major) before invoking. After the call, inv_inout holds A^{-1} and a_inout holds the packed LU factors.
baracuda_kernels_inverse_f64_workspace_size^⚠: Inverse workspace size (== getrf workspace).
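The inverse entries above require the caller to pre-stage an n × n column-major identity in inv_inout. A minimal host-side sketch of building that identity is shown here; `identity_col_major` is a hypothetical helper, not part of this crate, and it assumes a tightly packed matrix (leading dimension equal to n). Copying the result into device memory is left out because it depends on the caller's allocator.

```rust
/// Hypothetical helper: build an n x n identity matrix on the host in
/// column-major order, where element (row, col) lives at index col * n + row.
/// Assumes a tightly packed layout (leading dimension == n).
fn identity_col_major(n: usize) -> Vec<f32> {
    let mut m = vec![0.0f32; n * n];
    for i in 0..n {
        // Diagonal element (i, i) sits at i * n + i.
        m[i * n + i] = 1.0;
    }
    m
}
```

The buffer would then be uploaded to the device address passed as inv_inout before calling the f32 run function; an f64 variant would differ only in the element type.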
baracuda_kernels_irfft_1d_f32_run^⚠: 1-D C2R FFT (Hermitian-half complex → real). Applies 1/n normalization in-place (PyTorch norm="backward"). n is the real-side output length; complex input shape is [batch, n/2 + 1].
baracuda_kernels_irfft_1d_f32_workspace_size^⚠: 1-D C2R FFT workspace size in bytes — always 0.
baracuda_kernels_irfft_1d_f64_run^⚠: 1-D C2R FFT (Hermitian-half complex → real). Applies 1/n normalization in-place (PyTorch norm="backward"). n is the real-side output length; complex input shape is [batch, n/2 + 1].
baracuda_kernels_irfft_1d_f64_workspace_size^⚠: 1-D C2R FFT workspace size in bytes — always 0.
baracuda_kernels_irfft_nd_f32_run^⚠: ND C2R FFT (Hermitian-half complex → real). Applies 1/product(dims[..rank]) normalization in-place. dims carries the real-side extents.
baracuda_kernels_irfft_nd_f32_workspace_size^⚠: ND C2R FFT workspace size in bytes — always 0.
baracuda_kernels_irfft_nd_f64_run^⚠: ND C2R FFT (Hermitian-half complex → real). Applies 1/product(dims[..rank]) normalization in-place. dims carries the real-side extents.
baracuda_kernels_irfft_nd_f64_workspace_size^⚠: ND C2R FFT workspace size in bytes — always 0.
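The C2R shape rules above can be sketched as plain host-side arithmetic. The helper names below are hypothetical and only restate what the entries say: a 1-D transform with real-side length n reads [batch, n/2 + 1] complex elements, and the ND variant scales by 1/product(dims[..rank]).

```rust
/// Hypothetical helper: number of complex elements per batch row for a
/// 1-D C2R transform whose real-side output length is `n`.
fn hermitian_half_len(n: usize) -> usize {
    n / 2 + 1
}

/// Hypothetical helper: (complex input elements, real output elements)
/// for a batched 1-D C2R transform.
fn irfft_1d_elem_counts(batch: usize, n: usize) -> (usize, usize) {
    (batch * hermitian_half_len(n), batch * n)
}

/// Hypothetical helper: the ND normalization factor
/// 1 / product(dims[..rank]), where `dims` holds the real-side extents.
fn irfft_nd_scale(dims: &[usize], rank: usize) -> f64 {
    let total: usize = dims[..rank].iter().product();
    1.0 / total as f64
}
```

Note that integer division means an odd n = 7 and an even n = 6 both read 4 complex elements per row, so the real-side length cannot be recovered from the complex shape alone.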
baracuda_kernels_kv_cache_append_bf16_can_implement^⚠: Implementability check for kv_cache_append_bf16. Host-side only.
baracuda_kernels_kv_cache_append_bf16_run^⚠: KV-cache append, bf16.
baracuda_kernels_kv_cache_append_f16_can_implement^⚠: Implementability check for kv_cache_append_f16. Host-side only.
baracuda_kernels_kv_cache_append_f16_run^⚠: KV-cache append, f16.
baracuda_kernels_kv_cache_append_f32_can_implement^⚠: Implementability check for kv_cache_append_f32. Host-side only.
baracuda_kernels_kv_cache_append_f32_run^⚠: KV-cache append, f32.
baracuda_kernels_kv_cache_append_f64_can_implement^⚠: Implementability check for kv_cache_append_f64. Host-side only.
baracuda_kernels_kv_cache_append_f64_run^⚠: KV-cache append, f64.
baracuda_kernels_layer_norm_backward_bf16_can_implement^⚠: Implementability check for layer_norm_backward_bf16.
baracuda_kernels_layer_norm_backward_bf16_run^⚠: LayerNorm BW, bf16.
baracuda_kernels_layer_norm_backward_bf16_strided_can_implement^⚠: Implementability check for layer_norm_backward_bf16_strided.
baracuda_kernels_layer_norm_backward_bf16_strided_run^⚠: LayerNorm BW strided sibling, bf16.
baracuda_kernels_layer_norm_backward_f16_can_implement^⚠: Implementability check for layer_norm_backward_f16.
baracuda_kernels_layer_norm_backward_f16_run^⚠: LayerNorm BW, f16.
baracuda_kernels_layer_norm_backward_f16_strided_can_implement^⚠: Implementability check for layer_norm_backward_f16_strided.
baracuda_kernels_layer_norm_backward_f16_strided_run^⚠: LayerNorm BW strided sibling, f16.
baracuda_kernels_layer_norm_backward_f32_can_implement^⚠: Implementability check for layer_norm_backward_f32.
baracuda_kernels_layer_norm_backward_f32_run^⚠: LayerNorm BW, f32. Computes dx and (when non-null) dgamma / dbeta reductions. Caller passes saved mean + inv_std from FW.
baracuda_kernels_layer_norm_backward_f32_strided_can_implement^⚠: Implementability check for layer_norm_backward_f32_strided.
baracuda_kernels_layer_norm_backward_f32_strided_run^⚠: LayerNorm BW strided sibling, f32. Same contract as baracuda_kernels_layer_norm_backward_f32_run; identical launcher.
baracuda_kernels_layer_norm_backward_f64_can_implement^⚠: Implementability check for layer_norm_backward_f64.
baracuda_kernels_layer_norm_backward_f64_run^⚠: LayerNorm BW, f64.
baracuda_kernels_layer_norm_backward_f64_strided_can_implement^⚠: Implementability check for layer_norm_backward_f64_strided.
baracuda_kernels_layer_norm_backward_f64_strided_run^⚠: LayerNorm BW strided sibling, f64.
baracuda_kernels_layer_norm_bf16_can_implement^⚠: Implementability check for layer_norm_bf16.
baracuda_kernels_layer_norm_bf16_run^⚠: LayerNorm FW, bf16. f32 accumulator inside the kernel.
baracuda_kernels_layer_norm_bf16_strided_can_implement^⚠: Implementability check for layer_norm_bf16_strided.
baracuda_kernels_layer_norm_bf16_strided_run^⚠: LayerNorm FW strided sibling, bf16.
baracuda_kernels_layer_norm_f16_can_implement^⚠: Implementability check for layer_norm_f16.
baracuda_kernels_layer_norm_f16_run^⚠: LayerNorm FW, f16. f32 accumulator inside the kernel.
baracuda_kernels_layer_norm_f16_strided_can_implement^⚠: Implementability check for layer_norm_f16_strided.
baracuda_kernels_layer_norm_f16_strided_run^⚠: LayerNorm FW strided sibling, f16.
baracuda_kernels_layer_norm_f32_can_implement^⚠: Implementability check for layer_norm_f32.
baracuda_kernels_layer_norm_f32_run^⚠: LayerNorm FW, f32. y = (x - mean) / sqrt(var + eps) * gamma + beta. gamma / beta independently optional. Biased (“population”) variance. Save buffers mean_out / inv_std_out share stride_save, each shape == input with norm axes collapsed to 1.
baracuda_kernels_layer_norm_f32_strided_can_implement^⚠: Implementability check for layer_norm_f32_strided.
baracuda_kernels_layer_norm_f32_strided_run^⚠: LayerNorm FW strided sibling, f32. Same contract as baracuda_kernels_layer_norm_f32_run; identical underlying launcher.
baracuda_kernels_layer_norm_f64_can_implement^⚠: Implementability check for layer_norm_f64.
baracuda_kernels_layer_norm_f64_run^⚠: LayerNorm FW, f64.
baracuda_kernels_layer_norm_f64_strided_can_implement^⚠: Implementability check for layer_norm_f64_strided.
baracuda_kernels_layer_norm_f64_strided_run^⚠: LayerNorm FW strided sibling, f64.
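The LayerNorm FW formula listed for the f32 entry, y = (x - mean) / sqrt(var + eps) * gamma + beta with biased variance, can be checked against a small CPU reference. The sketch below is an assumption-laden illustration for a single normalized row: `layer_norm_row` is a hypothetical name, it ignores strides, and it returns the mean and inv_std that the kernel would write into its save buffers.

```rust
/// Hypothetical CPU reference for one normalized row.
/// Returns (y, mean, inv_std). `gamma` and `beta` are independently optional,
/// defaulting to 1 and 0 respectively when absent.
fn layer_norm_row(
    x: &[f32],
    gamma: Option<&[f32]>,
    beta: Option<&[f32]>,
    eps: f32,
) -> (Vec<f32>, f32, f32) {
    let n = x.len() as f32;
    let mean = x.iter().sum::<f32>() / n;
    // Biased ("population") variance: divide by n, not n - 1.
    let var = x.iter().map(|v| (v - mean) * (v - mean)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();
    let y = x
        .iter()
        .enumerate()
        .map(|(i, v)| {
            let g = gamma.map_or(1.0, |g| g[i]);
            let b = beta.map_or(0.0, |b| b[i]);
            (v - mean) * inv_std * g + b
        })
        .collect();
    (y, mean, inv_std)
}
```

A reference like this is mainly useful for tolerance-based comparison against device output; the bf16 and f16 kernels accumulate in f32, so their results should be compared after rounding to the storage type.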
baracuda_kernels_log_softmax_backward_bf16_can_implement^⚠: Implementability check for log_softmax_backward_bf16.
baracuda_kernels_log_softmax_backward_bf16_run^⚠: LogSoftmax BW, bf16.
baracuda_kernels_log_softmax_backward_bf16_strided_can_implement^⚠: Implementability check for log_softmax_backward_bf16_strided.
baracuda_kernels_log_softmax_backward_bf16_strided_run^⚠: LogSoftmax BW strided sibling, bf16.
baracuda_kernels_log_softmax_backward_f16_can_implement^⚠: Implementability check for log_softmax_backward_f16.
baracuda_kernels_log_softmax_backward_f16_run^⚠: LogSoftmax BW, f16.
baracuda_kernels_log_softmax_backward_f16_strided_can_implement^⚠: Implementability check for log_softmax_backward_f16_strided.
baracuda_kernels_log_softmax_backward_f16_strided_run^⚠: LogSoftmax BW strided sibling, f16.
baracuda_kernels_log_softmax_backward_f32_can_implement^⚠: Implementability check for log_softmax_backward_f32.
baracuda_kernels_log_softmax_backward_f32_run^⚠: LogSoftmax BW, f32. dx[k] = dy[k] - exp(y[k]) * Σ_j dy[j]. Caller passes the saved forward output y (log-softmax values).
baracuda_kernels_log_softmax_backward_f32_strided_can_implement^⚠: Implementability check for log_softmax_backward_f32_strided.
baracuda_kernels_log_softmax_backward_f32_strided_run^⚠: LogSoftmax BW strided sibling, f32. ABI identical to softmax BW.
baracuda_kernels_log_softmax_backward_f64_can_implement^⚠: Implementability check for log_softmax_backward_f64.
baracuda_kernels_log_softmax_backward_f64_run^⚠: LogSoftmax BW, f64.
baracuda_kernels_log_softmax_backward_f64_strided_can_implement^⚠: Implementability check for log_softmax_backward_f64_strided.
baracuda_kernels_log_softmax_backward_f64_strided_run^⚠: LogSoftmax BW strided sibling, f64.
baracuda_kernels_log_softmax_bf16_can_implement^⚠: Implementability check for log_softmax_bf16.
baracuda_kernels_log_softmax_bf16_run^⚠: LogSoftmax FW, bf16.
baracuda_kernels_log_softmax_bf16_strided_can_implement^⚠: Implementability check for log_softmax_bf16_strided.
baracuda_kernels_log_softmax_bf16_strided_run^⚠: LogSoftmax FW strided sibling, bf16.
baracuda_kernels_log_softmax_f16_can_implement^⚠: Implementability check for log_softmax_f16.
baracuda_kernels_log_softmax_f16_run^⚠: LogSoftmax FW, f16. f32 accumulator inside the kernel.
baracuda_kernels_log_softmax_f16_strided_can_implement^⚠: Implementability check for log_softmax_f16_strided.
baracuda_kernels_log_softmax_f16_strided_run^⚠: LogSoftmax FW strided sibling, f16.
baracuda_kernels_log_softmax_f32_can_implement^⚠: Implementability check for log_softmax_f32.
baracuda_kernels_log_softmax_f32_run^⚠: LogSoftmax FW, f32. y[k] = (x[k] - max) - log(Σ exp(x[j] - max)) along softmax_axis. Numerically stable.
baracuda_kernels_log_softmax_f32_strided_can_implement^⚠: Implementability check for log_softmax_f32_strided.
baracuda_kernels_log_softmax_f32_strided_run^⚠: LogSoftmax FW strided sibling, f32. ABI identical to softmax FW.
baracuda_kernels_log_softmax_f64_can_implement^⚠: Implementability check for log_softmax_f64.
baracuda_kernels_log_softmax_f64_run^⚠: LogSoftmax FW, f64.
baracuda_kernels_log_softmax_f64_strided_can_implement^⚠: Implementability check for log_softmax_f64_strided.
baracuda_kernels_log_softmax_f64_strided_run^⚠: LogSoftmax FW strided sibling, f64.
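The two LogSoftmax formulas listed above (forward: y[k] = (x[k] - max) - log(Σ exp(x[j] - max)); backward: dx[k] = dy[k] - exp(y[k]) * Σ_j dy[j]) can be illustrated with a CPU reference over a single softmax row. This is a sketch for validation only; the function names are hypothetical and strides and axis selection are ignored.

```rust
/// Hypothetical CPU reference: numerically stable log-softmax over one row.
fn log_softmax(x: &[f32]) -> Vec<f32> {
    // Subtracting the row maximum keeps exp() from overflowing.
    let max = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let log_sum = x.iter().map(|v| (v - max).exp()).sum::<f32>().ln();
    x.iter().map(|v| (v - max) - log_sum).collect()
}

/// Hypothetical CPU reference: backward pass given the saved forward
/// output `y` (log-softmax values) and the upstream gradient `dy`.
fn log_softmax_backward(y: &[f32], dy: &[f32]) -> Vec<f32> {
    let dy_sum: f32 = dy.iter().sum();
    y.iter()
        .zip(dy)
        .map(|(yk, dyk)| dyk - yk.exp() * dy_sum)
        .collect()
}
```

The backward function takes the forward output rather than the input, matching the note that the caller passes the saved y.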
baracuda_kernels_loss_bce_backward_bf16_can_implement^⚠: BCE BW _can_implement, bf16.
baracuda_kernels_loss_bce_backward_bf16_run^⚠: BCE BW, bf16.
baracuda_kernels_loss_bce_backward_f16_can_implement^⚠: BCE BW _can_implement, f16.
baracuda_kernels_loss_bce_backward_f16_run^⚠: BCE BW, f16.
baracuda_kernels_loss_bce_backward_f32_can_implement^⚠: BCE BW _can_implement, f32.
baracuda_kernels_loss_bce_backward_f32_run^⚠: BCE BW, f32. dpred = (pred - target) / (pred·(1-pred)) · scale.
baracuda_kernels_loss_bce_backward_f64_can_implement^⚠: BCE BW _can_implement, f64.
baracuda_kernels_loss_bce_backward_f64_run^⚠: BCE BW, f64.
baracuda_kernels_loss_bce_bf16_can_implement^⚠: BCE FW _can_implement, bf16.
baracuda_kernels_loss_bce_bf16_run^⚠: BCE FW, bf16.
baracuda_kernels_loss_bce_f16_can_implement^⚠: BCE FW _can_implement, f16.
baracuda_kernels_loss_bce_f16_run^⚠: BCE FW, f16.
baracuda_kernels_loss_bce_f32_can_implement^⚠: BCE FW _can_implement, f32.
baracuda_kernels_loss_bce_f32_run^⚠: BCE FW, f32. -(t·log(p) + (1-t)·log(1-p)) per-cell, then reduce. Caller ensures pred ∈ (0, 1).
baracuda_kernels_loss_bce_f64_can_implement^⚠: BCE FW _can_implement, f64.
baracuda_kernels_loss_bce_f64_run^⚠: BCE FW, f64.
baracuda_kernels_loss_bce_with_logits_backward_bf16_can_implement^⚠: Implementability check for loss_bce_with_logits_backward_bf16.
baracuda_kernels_loss_bce_with_logits_backward_bf16_run^⚠: BCEWithLogits BW, bf16.
baracuda_kernels_loss_bce_with_logits_backward_f16_can_implement^⚠: Implementability check for loss_bce_with_logits_backward_f16.
baracuda_kernels_loss_bce_with_logits_backward_f16_run^⚠: BCEWithLogits BW, f16.
baracuda_kernels_loss_bce_with_logits_backward_f32_can_implement^⚠: Implementability check for loss_bce_with_logits_backward_f32.
baracuda_kernels_loss_bce_with_logits_backward_f32_run^⚠: BCEWithLogits BW, f32. dlogits = (sigmoid(x) - target) · scale.
baracuda_kernels_loss_bce_with_logits_backward_f64_can_implement^⚠: Implementability check for loss_bce_with_logits_backward_f64.
baracuda_kernels_loss_bce_with_logits_backward_f64_run^⚠: BCEWithLogits BW, f64.
baracuda_kernels_loss_bce_with_logits_bf16_can_implement^⚠: Implementability check for loss_bce_with_logits_bf16.
baracuda_kernels_loss_bce_with_logits_bf16_run^⚠: BCEWithLogits FW, bf16.
baracuda_kernels_loss_bce_with_logits_f16_can_implement^⚠: Implementability check for loss_bce_with_logits_f16.
baracuda_kernels_loss_bce_with_logits_f16_run^⚠: BCEWithLogits FW, f16.
baracuda_kernels_loss_bce_with_logits_f32_can_implement^⚠: Implementability check for loss_bce_with_logits_f32.
baracuda_kernels_loss_bce_with_logits_f32_run^⚠: BCEWithLogits FW, f32. Stable BCE for raw logits.
baracuda_kernels_loss_bce_with_logits_f64_can_implement^⚠: Implementability check for loss_bce_with_logits_f64.
baracuda_kernels_loss_bce_with_logits_f64_run^⚠: BCEWithLogits FW, f64.
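The BCEWithLogits entries above describe the forward pass only as "stable BCE for raw logits" and give the backward as dlogits = (sigmoid(x) - target) · scale. The sketch below uses the commonly used stable formulation max(x, 0) - x·t + log(1 + exp(-|x|)); that the kernel uses exactly this expression is an assumption, and both function names are hypothetical. Reduction is omitted.

```rust
/// Hypothetical per-element reference. The formulation
/// max(x, 0) - x * t + ln(1 + exp(-|x|)) is algebraically equal to
/// -(t * ln(sigmoid(x)) + (1 - t) * ln(1 - sigmoid(x))) but never
/// exponentiates a large positive number.
fn bce_with_logits(x: f32, t: f32) -> f32 {
    x.max(0.0) - x * t + (-x.abs()).exp().ln_1p()
}

/// Hypothetical per-element gradient, following the listed formula
/// dlogits = (sigmoid(x) - target) * scale. `scale` is treated as an
/// opaque multiplier supplied by the caller.
fn bce_with_logits_grad(x: f32, t: f32, scale: f32) -> f32 {
    let sigmoid = 1.0 / (1.0 + (-x).exp());
    (sigmoid - t) * scale
}
```

With this form a logit of 100 and target 1 yields a loss close to 0 instead of overflowing, and a logit of -100 with target 1 yields a loss close to 100.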
baracuda_kernels_loss_cosine_embedding_backward_bf16_can_implement^⚠: Implementability check for loss_cosine_embedding_backward_bf16.
baracuda_kernels_loss_cosine_embedding_backward_bf16_run^⚠: CosineEmbedding BW, bf16.
baracuda_kernels_loss_cosine_embedding_backward_f16_can_implement^⚠: Implementability check for loss_cosine_embedding_backward_f16.
baracuda_kernels_loss_cosine_embedding_backward_f16_run^⚠: CosineEmbedding BW, f16.
baracuda_kernels_loss_cosine_embedding_backward_f32_can_implement^⚠: Implementability check for loss_cosine_embedding_backward_f32.
baracuda_kernels_loss_cosine_embedding_backward_f32_run^⚠: CosineEmbedding BW, f32.
baracuda_kernels_loss_cosine_embedding_backward_f64_can_implement^⚠: Implementability check for loss_cosine_embedding_backward_f64.
baracuda_kernels_loss_cosine_embedding_backward_f64_run^⚠: CosineEmbedding BW, f64.
baracuda_kernels_loss_cosine_embedding_bf16_can_implement^⚠: Implementability check for loss_cosine_embedding_bf16.
baracuda_kernels_loss_cosine_embedding_bf16_run^⚠: CosineEmbedding FW, bf16.
baracuda_kernels_loss_cosine_embedding_f16_can_implement^⚠: Implementability check for loss_cosine_embedding_f16.
baracuda_kernels_loss_cosine_embedding_f16_run^⚠: CosineEmbedding FW, f16.
baracuda_kernels_loss_cosine_embedding_f32_can_implement^⚠: Implementability check for loss_cosine_embedding_f32.
baracuda_kernels_loss_cosine_embedding_f32_run^⚠: CosineEmbedding FW, f32 (per-row). ABI: (n_rows, d_extent, row_stride_x, reduction_mode, margin, x1, x2, t, out, workspace, workspace_bytes, stream).
baracuda_kernels_loss_cosine_embedding_f64_can_implement^⚠: Implementability check for loss_cosine_embedding_f64.
baracuda_kernels_loss_cosine_embedding_f64_run^⚠: CosineEmbedding FW, f64.
baracuda_kernels_loss_cross_entropy_backward_bf16_can_implement^⚠: CrossEntropy BW _can_implement, bf16.
baracuda_kernels_loss_cross_entropy_backward_bf16_run^⚠: CrossEntropy BW, bf16.
baracuda_kernels_loss_cross_entropy_backward_f16_can_implement^⚠: CrossEntropy BW _can_implement, f16.
baracuda_kernels_loss_cross_entropy_backward_f16_run^⚠: CrossEntropy BW, f16.
baracuda_kernels_loss_cross_entropy_backward_f32_can_implement^⚠: CrossEntropy BW _can_implement, f32.
baracuda_kernels_loss_cross_entropy_backward_f32_run^⚠: CrossEntropy BW, f32. dinput[i, c] = (softmax(input)[i, c] - 1{c==t[i]}) · scale.
baracuda_kernels_loss_cross_entropy_backward_f64_can_implement^⚠: CrossEntropy BW _can_implement, f64.
baracuda_kernels_loss_cross_entropy_backward_f64_run^⚠: CrossEntropy BW, f64.
baracuda_kernels_loss_cross_entropy_bf16_can_implement^⚠: CrossEntropy FW _can_implement, bf16.
baracuda_kernels_loss_cross_entropy_bf16_run^⚠: CrossEntropy FW, bf16.
baracuda_kernels_loss_cross_entropy_f16_can_implement^⚠: CrossEntropy FW _can_implement, f16.
baracuda_kernels_loss_cross_entropy_f16_run^⚠: CrossEntropy FW, f16.
baracuda_kernels_loss_cross_entropy_f32_can_implement^⚠: CrossEntropy FW _can_implement, f32.
baracuda_kernels_loss_cross_entropy_f32_run^⚠: CrossEntropy FW, f32. Fused LogSoftmax + NLL. Numerically stable per-row two-pass max subtraction.
baracuda_kernels_loss_cross_entropy_f64_can_implement^⚠: CrossEntropy FW _can_implement, f64.
baracuda_kernels_loss_cross_entropy_f64_run^⚠: CrossEntropy FW, f64.
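The CrossEntropy entries above describe a fused LogSoftmax + NLL forward with per-row max subtraction, and a backward of (softmax(input)[i, c] - 1{c==t[i]}) · scale. A single-row CPU reference under those formulas might look like the following; `cross_entropy_row` is a hypothetical name, `scale` is treated as an opaque caller-supplied multiplier, and reduction across rows and any ignore-index handling are not modeled.

```rust
/// Hypothetical CPU reference for one row of logits with a hard target.
/// Returns (loss, dinput_row).
fn cross_entropy_row(logits: &[f32], target: usize, scale: f32) -> (f32, Vec<f32>) {
    // Pass 1: row maximum, used to shift the logits for stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    // Pass 2: sum of shifted exponentials.
    let sum_exp: f32 = logits.iter().map(|v| (v - max).exp()).sum();
    // loss = -log_softmax(logits)[target]
    let loss = sum_exp.ln() - (logits[target] - max);
    let grad = logits
        .iter()
        .enumerate()
        .map(|(c, v)| {
            let p = (v - max).exp() / sum_exp;
            let one_hot = if c == target { 1.0 } else { 0.0 };
            (p - one_hot) * scale
        })
        .collect();
    (loss, grad)
}
```

Because softmax probabilities sum to 1 and exactly one class carries the one-hot term, the gradient entries of a row sum to zero, which makes a convenient sanity check on device output.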
baracuda_kernels_loss_cross_entropy_soft_backward_bf16_can_implement^⚠: Implementability check for loss_cross_entropy_soft_backward_bf16.
baracuda_kernels_loss_cross_entropy_soft_backward_bf16_run^⚠: Soft-target CrossEntropy BW, bf16.
baracuda_kernels_loss_cross_entropy_soft_backward_f16_can_implement^⚠: Implementability check for loss_cross_entropy_soft_backward_f16.
baracuda_kernels_loss_cross_entropy_soft_backward_f16_run^⚠: Soft-target CrossEntropy BW, f16.
baracuda_kernels_loss_cross_entropy_soft_backward_f32_can_implement^⚠: Implementability check for loss_cross_entropy_soft_backward_f32.
baracuda_kernels_loss_cross_entropy_soft_backward_f32_run^⚠: Soft-target CrossEntropy BW, f32.
baracuda_kernels_loss_cross_entropy_soft_backward_f64_can_implement^⚠: Implementability check for loss_cross_entropy_soft_backward_f64.
baracuda_kernels_loss_cross_entropy_soft_backward_f64_run^⚠: Soft-target CrossEntropy BW, f64.
baracuda_kernels_loss_cross_entropy_soft_bf16_can_implement^⚠: Implementability check for loss_cross_entropy_soft_bf16.
baracuda_kernels_loss_cross_entropy_soft_bf16_run^⚠: Soft-target CrossEntropy FW, bf16.
baracuda_kernels_loss_cross_entropy_soft_f16_can_implement^⚠: Implementability check for loss_cross_entropy_soft_f16.
baracuda_kernels_loss_cross_entropy_soft_f16_run^⚠: Soft-target CrossEntropy FW, f16.
baracuda_kernels_loss_cross_entropy_soft_f32_can_implement^⚠: Implementability check for loss_cross_entropy_soft_f32.
baracuda_kernels_loss_cross_entropy_soft_f32_run^⚠: Soft-target CrossEntropy FW, f32. The target is a probability tensor of element type T.
baracuda_kernels_loss_cross_entropy_soft_f64_can_implement^⚠: Implementability check for loss_cross_entropy_soft_f64.
baracuda_kernels_loss_cross_entropy_soft_f64_run^⚠: Soft-target CrossEntropy FW, f64.
baracuda_kernels_loss_ctc_backward_bf16_can_implement^⚠: CTCLoss BW _can_implement, bf16.
baracuda_kernels_loss_ctc_backward_bf16_run^⚠: CTCLoss BW, bf16.
baracuda_kernels_loss_ctc_backward_f16_can_implement^⚠: CTCLoss BW _can_implement, f16.
baracuda_kernels_loss_ctc_backward_f16_run^⚠: CTCLoss BW, f16.
baracuda_kernels_loss_ctc_backward_f32_can_implement^⚠: CTCLoss BW _can_implement, f32.
baracuda_kernels_loss_ctc_backward_f32_run^⚠: CTCLoss BW, f32.
baracuda_kernels_loss_ctc_backward_f64_can_implement^⚠: CTCLoss BW _can_implement, f64 (F64_ACC).
baracuda_kernels_loss_ctc_backward_f64_run^⚠: CTCLoss BW, f64.
baracuda_kernels_loss_ctc_bf16_can_implement^⚠: CTCLoss FW _can_implement, bf16 (F32_ACC).
baracuda_kernels_loss_ctc_bf16_run^⚠: CTCLoss FW, bf16.
baracuda_kernels_loss_ctc_f16_can_implement^⚠: CTCLoss FW _can_implement, f16 (F32_ACC).
baracuda_kernels_loss_ctc_f16_run^⚠: CTCLoss FW, f16.
baracuda_kernels_loss_ctc_f32_can_implement^⚠: CTCLoss FW _can_implement, f32. Validates num_classes <= 32, max_target_len <= 256, blank ∈ [0, num_classes), reduction_mode ∈ [0,2].
baracuda_kernels_loss_ctc_f32_run^⚠: CTCLoss FW, f32.
baracuda_kernels_loss_ctc_f64_can_implement^⚠: CTCLoss FW _can_implement, f64 (F64_ACC).
baracuda_kernels_loss_ctc_f64_run^⚠: CTCLoss FW, f64.
baracuda_kernels_loss_flce_count_non_ignore_can_implement^⚠: Implementability check for baracuda_kernels_loss_flce_count_non_ignore. Host-side only.
baracuda_kernels_loss_flce_count_non_ignore_run^⚠: FLCE count-non-ignore. Single-block tree reduction; writes the target[i] != ignore_index count into count_out[0] (i64).
baracuda_kernels_loss_flce_inplace_scale_bf16_can_implement^⚠: Implementability check for baracuda_kernels_loss_flce_inplace_scale_bf16. Host-side only.
baracuda_kernels_loss_flce_inplace_scale_bf16_run^⚠: FLCE in-place scale, bf16.
baracuda_kernels_loss_flce_inplace_scale_f16_can_implement^⚠: Implementability check for baracuda_kernels_loss_flce_inplace_scale_f16. Host-side only.
baracuda_kernels_loss_flce_inplace_scale_f16_run^⚠: FLCE in-place scale, f16.
baracuda_kernels_loss_flce_inplace_scale_f32_can_implement^⚠: Implementability check for baracuda_kernels_loss_flce_inplace_scale_f32. Host-side only.
baracuda_kernels_loss_flce_inplace_scale_f32_run^⚠: FLCE in-place scale, f32. Multiplies buf in place by scalar.
baracuda_kernels_loss_flce_inplace_scale_f64_can_implement^⚠: Implementability check for baracuda_kernels_loss_flce_inplace_scale_f64. Host-side only.
baracuda_kernels_loss_flce_inplace_scale_f64_run^⚠: FLCE in-place scale, f64.
baracuda_kernels_loss_flce_per_row_bf16_can_implement^⚠: Implementability check for baracuda_kernels_loss_flce_per_row_bf16. Host-side only.
baracuda_kernels_loss_flce_per_row_bf16_run^⚠: FLCE per-row fused step, bf16.
baracuda_kernels_loss_flce_per_row_cast_bf16_can_implement^⚠: Implementability check for baracuda_kernels_loss_flce_per_row_cast_bf16. Host-side only.
baracuda_kernels_loss_flce_per_row_cast_bf16_run^⚠: FLCE per-row cast, f32 → bf16.
baracuda_kernels_loss_flce_per_row_cast_f16_can_implement^⚠: Implementability check for baracuda_kernels_loss_flce_per_row_cast_f16. Host-side only.
baracuda_kernels_loss_flce_per_row_cast_f16_run^⚠: FLCE per-row cast, f32 → f16.
baracuda_kernels_loss_flce_per_row_cast_f32_can_implement^⚠: Implementability check for baracuda_kernels_loss_flce_per_row_cast_f32. Host-side only.
baracuda_kernels_loss_flce_per_row_cast_f32_run^⚠: FLCE per-row cast (None mode finalizer), f32 → f32.
baracuda_kernels_loss_flce_per_row_cast_f64_can_implement^⚠: Implementability check for baracuda_kernels_loss_flce_per_row_cast_f64. Host-side only.
baracuda_kernels_loss_flce_per_row_cast_f64_run^⚠: FLCE per-row cast, f32 → f64.
baracuda_kernels_loss_flce_per_row_f16_can_implement^⚠: Implementability check for baracuda_kernels_loss_flce_per_row_f16. Host-side only.
baracuda_kernels_loss_flce_per_row_f16_run^⚠: FLCE per-row fused step, f16.
baracuda_kernels_loss_flce_per_row_f32_can_implement^⚠: Implementability check for baracuda_kernels_loss_flce_per_row_f32. Host-side only.
baracuda_kernels_loss_flce_per_row_f32_run^⚠: FLCE per-row fused step, f32. Mutates logits in place to grad_logits = (softmax - one_hot) · scale_per_row; writes per-row -log_softmax[target] into loss_1d (f32 accumulator).
baracuda_kernels_loss_flce_per_row_f64_can_implement^⚠: Implementability check for baracuda_kernels_loss_flce_per_row_f64. Host-side only.
baracuda_kernels_loss_flce_per_row_f64_run^⚠: FLCE per-row fused step, f64.
baracuda_kernels_loss_flce_scalar_finalize_bf16_can_implement^⚠: Implementability check for baracuda_kernels_loss_flce_scalar_finalize_bf16. Host-side only.
baracuda_kernels_loss_flce_scalar_finalize_bf16_run^⚠: FLCE scalar finalize, f32 → bf16.
baracuda_kernels_loss_flce_scalar_finalize_f16_can_implement^⚠: Implementability check for baracuda_kernels_loss_flce_scalar_finalize_f16. Host-side only.
baracuda_kernels_loss_flce_scalar_finalize_f16_run^⚠: FLCE scalar finalize, f32 → f16.
baracuda_kernels_loss_flce_scalar_finalize_f32_can_implement^⚠: Implementability check for baracuda_kernels_loss_flce_scalar_finalize_f32. Host-side only.
baracuda_kernels_loss_flce_scalar_finalize_f32_run^⚠: FLCE scalar finalize (Mean/Sum), f32 → f32.
baracuda_kernels_loss_flce_scalar_finalize_f64_can_implement^⚠: Implementability check for baracuda_kernels_loss_flce_scalar_finalize_f64. Host-side only.
baracuda_kernels_loss_flce_scalar_finalize_f64_run^⚠: FLCE scalar finalize, f32 → f64.
baracuda_kernels_loss_gaussian_nll_backward_bf16_can_implement^⚠: Implementability check for loss_gaussian_nll_backward_bf16.
baracuda_kernels_loss_gaussian_nll_backward_bf16_run^⚠: GaussianNLL BW, bf16.
baracuda_kernels_loss_gaussian_nll_backward_f16_can_implement^⚠: Implementability check for loss_gaussian_nll_backward_f16.
baracuda_kernels_loss_gaussian_nll_backward_f16_run^⚠: GaussianNLL BW, f16.
baracuda_kernels_loss_gaussian_nll_backward_f32_can_implement^⚠: Implementability check for loss_gaussian_nll_backward_f32.
baracuda_kernels_loss_gaussian_nll_backward_f32_run^⚠: GaussianNLL BW, f32.
baracuda_kernels_loss_gaussian_nll_backward_f64_can_implement^⚠: Implementability check for loss_gaussian_nll_backward_f64.
baracuda_kernels_loss_gaussian_nll_backward_f64_run^⚠: GaussianNLL BW, f64.
baracuda_kernels_loss_gaussian_nll_bf16_can_implement^⚠: Implementability check for loss_gaussian_nll_bf16.
baracuda_kernels_loss_gaussian_nll_bf16_run^⚠: GaussianNLL FW, bf16.
baracuda_kernels_loss_gaussian_nll_f16_can_implement^⚠: Implementability check for loss_gaussian_nll_f16.
baracuda_kernels_loss_gaussian_nll_f16_run^⚠: GaussianNLL FW, f16.
baracuda_kernels_loss_gaussian_nll_f32_can_implement^⚠: Implementability check for loss_gaussian_nll_f32.
baracuda_kernels_loss_gaussian_nll_f32_run^⚠: GaussianNLL FW, f32. Takes three input tensors (input, target, var).
baracuda_kernels_loss_gaussian_nll_f64_can_implement^⚠: Implementability check for loss_gaussian_nll_f64.
baracuda_kernels_loss_gaussian_nll_f64_run^⚠: GaussianNLL FW, f64.
baracuda_kernels_loss_hinge_embedding_backward_bf16_can_implement^⚠: Implementability check for loss_hinge_embedding_backward_bf16.
baracuda_kernels_loss_hinge_embedding_backward_bf16_run^⚠: HingeEmbedding BW, bf16.
baracuda_kernels_loss_hinge_embedding_backward_f16_can_implement^⚠: Implementability check for loss_hinge_embedding_backward_f16.
baracuda_kernels_loss_hinge_embedding_backward_f16_run^⚠: HingeEmbedding BW, f16.
baracuda_kernels_loss_hinge_embedding_backward_f32_can_implement^⚠: Implementability check for loss_hinge_embedding_backward_f32.
baracuda_kernels_loss_hinge_embedding_backward_f32_run^⚠: HingeEmbedding BW, f32.
baracuda_kernels_loss_hinge_embedding_backward_f64_can_implement^⚠: Implementability check for loss_hinge_embedding_backward_f64.
baracuda_kernels_loss_hinge_embedding_backward_f64_run^⚠: HingeEmbedding BW, f64.
baracuda_kernels_loss_hinge_embedding_bf16_can_implement^⚠: Implementability check for loss_hinge_embedding_bf16.
baracuda_kernels_loss_hinge_embedding_bf16_run^⚠: HingeEmbedding FW, bf16.
baracuda_kernels_loss_hinge_embedding_f16_can_implement^⚠: Implementability check for loss_hinge_embedding_f16.
baracuda_kernels_loss_hinge_embedding_f16_run^⚠: HingeEmbedding FW, f16.
baracuda_kernels_loss_hinge_embedding_f32_can_implement^⚠: Implementability check for loss_hinge_embedding_f32.
baracuda_kernels_loss_hinge_embedding_f32_run^⚠: HingeEmbedding FW, f32. ABI: (numel, reduction_mode, margin, input, target_i64, out, workspace, workspace_bytes, stream).
baracuda_kernels_loss_hinge_embedding_f64_can_implement^⚠: Implementability check for loss_hinge_embedding_f64.
baracuda_kernels_loss_hinge_embedding_f64_run^⚠: HingeEmbedding FW, f64.
baracuda_kernels_loss_huber_backward_bf16_can_implement^⚠: Huber BW _can_implement, bf16.
baracuda_kernels_loss_huber_backward_bf16_run^⚠: Huber BW, bf16.
baracuda_kernels_loss_huber_backward_f16_can_implement^⚠: Huber BW _can_implement, f16.
baracuda_kernels_loss_huber_backward_f16_run^⚠: Huber BW, f16.
baracuda_kernels_loss_huber_backward_f32_can_implement^⚠: Huber BW _can_implement, f32.
baracuda_kernels_loss_huber_backward_f32_run^⚠: Huber BW, f32.
baracuda_kernels_loss_huber_backward_f64_can_implement^⚠: Huber BW _can_implement, f64.
baracuda_kernels_loss_huber_backward_f64_run^⚠: Huber BW, f64.
baracuda_kernels_loss_huber_bf16_can_implement^⚠: Huber FW _can_implement, bf16.
baracuda_kernels_loss_huber_bf16_run^⚠: Huber FW, bf16.
baracuda_kernels_loss_huber_f16_can_implement^⚠: Huber FW _can_implement, f16.
baracuda_kernels_loss_huber_f16_run^⚠: Huber FW, f16.
baracuda_kernels_loss_huber_f32_can_implement^⚠: Huber FW _can_implement, f32.
baracuda_kernels_loss_huber_f32_run^⚠: Huber FW, f32. param = δ.
baracuda_kernels_loss_huber_f64_can_implement^⚠: Huber FW _can_implement, f64.
baracuda_kernels_loss_huber_f64_run^⚠: Huber FW, f64.
baracuda_kernels_loss_kl_div_backward_bf16_can_implement^⚠: KLDiv BW _can_implement, bf16.
baracuda_kernels_loss_kl_div_backward_bf16_run^⚠: KLDiv BW, bf16.
baracuda_kernels_loss_kl_div_backward_f16_can_implement^⚠: KLDiv BW _can_implement, f16.
baracuda_kernels_loss_kl_div_backward_f16_run^⚠: KLDiv BW, f16.
baracuda_kernels_loss_kl_div_backward_f32_can_implement^⚠: KLDiv BW _can_implement, f32.
baracuda_kernels_loss_kl_div_backward_f32_run^⚠: KLDiv BW, f32. dinput = -target · scale.
baracuda_kernels_loss_kl_div_backward_f64_can_implement^⚠: KLDiv BW _can_implement, f64.
baracuda_kernels_loss_kl_div_backward_f64_run^⚠: KLDiv BW, f64.
baracuda_kernels_loss_kl_div_bf16_can_implement^⚠: KLDiv FW _can_implement, bf16.
baracuda_kernels_loss_kl_div_bf16_run^⚠: KLDiv FW, bf16.
baracuda_kernels_loss_kl_div_f16_can_implement^⚠: KLDiv FW _can_implement, f16.
baracuda_kernels_loss_kl_div_f16_run^⚠: KLDiv FW, f16.
baracuda_kernels_loss_kl_div_f32_can_implement^⚠: KLDiv FW _can_implement, f32.
baracuda_kernels_loss_kl_div_f32_run^⚠: KLDiv FW, f32. target·(log(target) - input) per-cell. PyTorch convention: input is already log-prob.
baracuda_kernels_loss_kl_div_f64_can_implement^⚠: KLDiv FW _can_implement, f64.
baracuda_kernels_loss_kl_div_f64_run^⚠: KLDiv FW, f64.
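As a host-side cross-check for the KLDiv entries above, the per-cell formulas can be sketched in plain Rust. This is an illustrative CPU reference, not the kernel's code: the function names are hypothetical, and treating a cell with target = 0 as contributing 0 is an assumption borrowed from the PyTorch convention rather than something these entries state.

```rust
/// Per-cell KLDiv forward term: target * (ln(target) - input),
/// where `input` is already a log-probability.
/// Assumption: a cell with target <= 0 contributes 0.
fn kl_div_cell(input_log_prob: f32, target: f32) -> f32 {
    if target > 0.0 {
        target * (target.ln() - input_log_prob)
    } else {
        0.0
    }
}

/// Per-cell KLDiv backward term: dinput = -target * scale.
fn kl_div_backward_cell(target: f32, scale: f32) -> f32 {
    -target * scale
}

/// Sum reduction over all cells (hypothetical helper).
fn kl_div_sum(input_log_prob: &[f32], target: &[f32]) -> f32 {
    assert_eq!(input_log_prob.len(), target.len());
    input_log_prob
        .iter()
        .zip(target)
        .map(|(&x, &t)| kl_div_cell(x, t))
        .sum()
}
```

A reference like this is mainly useful in tests that compare device output against host values within a tolerance.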
baracuda_kernels_loss_l1_backward_bf16_can_implement^⚠: L1 BW _can_implement, bf16.
baracuda_kernels_loss_l1_backward_bf16_run^⚠: L1 BW, bf16.
baracuda_kernels_loss_l1_backward_f16_can_implement^⚠: L1 BW _can_implement, f16.
baracuda_kernels_loss_l1_backward_f16_run^⚠: L1 BW, f16.
baracuda_kernels_loss_l1_backward_f32_can_implement^⚠: L1 BW _can_implement, f32.
baracuda_kernels_loss_l1_backward_f32_run^⚠: L1 BW, f32. dpred = sign(pred - target) · scale.
baracuda_kernels_loss_l1_backward_f64_can_implement^⚠: L1 BW _can_implement, f64.
baracuda_kernels_loss_l1_backward_f64_run^⚠: L1 BW, f64.
baracuda_kernels_loss_l1_bf16_can_implement^⚠: L1 FW _can_implement, bf16.
baracuda_kernels_loss_l1_bf16_run^⚠: L1 FW, bf16.
baracuda_kernels_loss_l1_f16_can_implement^⚠: L1 FW _can_implement, f16.
baracuda_kernels_loss_l1_f16_run^⚠: L1 FW, f16.
baracuda_kernels_loss_l1_f32_can_implement^⚠: L1 FW _can_implement, f32.
baracuda_kernels_loss_l1_f32_run^⚠: L1 FW, f32. y = |pred - target| per-cell; mean/sum reduce to scalar.
baracuda_kernels_loss_l1_f64_can_implement^⚠: L1 FW _can_implement, f64.
baracuda_kernels_loss_l1_f64_run^⚠: L1 FW, f64.
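The L1 formulas above (y = |pred - target| forward, dpred = sign(pred - target) · scale backward) can be written as a small CPU reference. This is a sketch with hypothetical function names; returning a zero gradient where pred equals target is an assumption, since the entries do not say how the kernel treats that case.

```rust
/// Per-cell L1 forward: y = |pred - target|.
fn l1_cells(pred: &[f32], target: &[f32]) -> Vec<f32> {
    assert_eq!(pred.len(), target.len());
    pred.iter().zip(target).map(|(&p, &t)| (p - t).abs()).collect()
}

/// Mean reduction of the per-cell values to a scalar.
fn l1_mean(pred: &[f32], target: &[f32]) -> f32 {
    let cells = l1_cells(pred, target);
    cells.iter().sum::<f32>() / cells.len() as f32
}

/// L1 backward: dpred = sign(pred - target) * scale.
/// Assumption: sign(0) = 0. Note that `f32::signum` returns 1.0 for +0.0,
/// so the sign is computed by hand here.
fn l1_backward(pred: &[f32], target: &[f32], scale: f32) -> Vec<f32> {
    assert_eq!(pred.len(), target.len());
    pred.iter()
        .zip(target)
        .map(|(&p, &t)| {
            let d = p - t;
            let sign = if d > 0.0 {
                1.0
            } else if d < 0.0 {
                -1.0
            } else {
                0.0
            };
            sign * scale
        })
        .collect()
}
```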
baracuda_kernels_loss_margin_ranking_backward_bf16_can_implement^⚠: MarginRanking BW _can_implement, bf16.
baracuda_kernels_loss_margin_ranking_backward_bf16_run^⚠: MarginRanking BW, bf16.
baracuda_kernels_loss_margin_ranking_backward_f16_can_implement^⚠: MarginRanking BW _can_implement, f16.
baracuda_kernels_loss_margin_ranking_backward_f16_run^⚠: MarginRanking BW, f16.
baracuda_kernels_loss_margin_ranking_backward_f32_can_implement^⚠: MarginRanking BW _can_implement, f32.
baracuda_kernels_loss_margin_ranking_backward_f32_run^⚠: MarginRanking BW, f32. ABI: (numel, reduction_mode, scale, margin, x1, x2, t, dy, dx1, dx2, workspace, workspace_bytes, stream).
baracuda_kernels_loss_margin_ranking_backward_f64_can_implement^⚠: MarginRanking BW _can_implement, f64.
baracuda_kernels_loss_margin_ranking_backward_f64_run^⚠: MarginRanking BW, f64.
baracuda_kernels_loss_margin_ranking_bf16_can_implement^⚠: MarginRanking FW _can_implement, bf16.
baracuda_kernels_loss_margin_ranking_bf16_run^⚠: MarginRanking FW, bf16.
baracuda_kernels_loss_margin_ranking_f16_can_implement^⚠: MarginRanking FW _can_implement, f16.
baracuda_kernels_loss_margin_ranking_f16_run^⚠: MarginRanking FW, f16.
baracuda_kernels_loss_margin_ranking_f32_can_implement^⚠: MarginRanking FW _can_implement, f32.
baracuda_kernels_loss_margin_ranking_f32_run^⚠: MarginRanking FW, f32. ABI: (numel, reduction_mode, margin, x1, x2, t, out, workspace, workspace_bytes, stream).
baracuda_kernels_loss_margin_ranking_f64_can_implement^⚠: MarginRanking FW _can_implement, f64.
baracuda_kernels_loss_margin_ranking_f64_run^⚠: MarginRanking FW, f64.
baracuda_kernels_loss_mse_backward_bf16_can_implement^⚠: MSE BW _can_implement, bf16.
baracuda_kernels_loss_mse_backward_bf16_run^⚠: MSE BW, bf16.
baracuda_kernels_loss_mse_backward_f16_can_implement^⚠: MSE BW _can_implement, f16.
baracuda_kernels_loss_mse_backward_f16_run^⚠: MSE BW, f16.
baracuda_kernels_loss_mse_backward_f32_can_implement^⚠: MSE BW _can_implement, f32.
baracuda_kernels_loss_mse_backward_f32_run^⚠: MSE BW, f32. dpred = 2·(pred - target) · scale.
baracuda_kernels_loss_mse_backward_f64_can_implement^⚠: MSE BW _can_implement, f64.
baracuda_kernels_loss_mse_backward_f64_run^⚠: MSE BW, f64.
baracuda_kernels_loss_mse_bf16_can_implement^⚠: MSE FW _can_implement, bf16.
baracuda_kernels_loss_mse_bf16_run^⚠: MSE FW, bf16.
baracuda_kernels_loss_mse_f16_can_implement^⚠: MSE FW _can_implement, f16.
baracuda_kernels_loss_mse_f16_run^⚠: MSE FW, f16.
baracuda_kernels_loss_mse_f32_can_implement^⚠: MSE FW _can_implement, f32. Host-side validator (no launch).
baracuda_kernels_loss_mse_f32_run^⚠: MSE FW, f32. (pred - target)² per-cell; mean/sum reduce to scalar. Workspace: numel * sizeof(T) bytes for Mean/Sum; unused for None.
baracuda_kernels_loss_mse_f64_can_implement^⚠: MSE FW _can_implement, f64.
baracuda_kernels_loss_mse_f64_run^⚠: MSE FW, f64.
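The MSE entries above give both the per-cell formulas and the workspace rule (numel * sizeof(T) bytes for Mean/Sum, unused for None). A minimal host-side sketch of those rules follows. The `Reduction` enum and function names are hypothetical; the integer encoding of `reduction_mode` in the real ABI is not shown in these entries, and returning 0 bytes for None is an interpretation of "unused".

```rust
use std::mem::size_of;

/// Hypothetical stand-in for the ABI's integer reduction_mode.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Reduction {
    None,
    Mean,
    Sum,
}

/// Workspace requirement: numel * sizeof(T) for Mean/Sum.
/// Assumption: None needs no workspace at all, so 0 is returned.
fn mse_workspace_bytes<T>(numel: usize, reduction: Reduction) -> usize {
    match reduction {
        Reduction::None => 0,
        Reduction::Mean | Reduction::Sum => numel * size_of::<T>(),
    }
}

/// Per-cell MSE forward: (pred - target)^2.
fn mse_cells(pred: &[f32], target: &[f32]) -> Vec<f32> {
    assert_eq!(pred.len(), target.len());
    pred.iter().zip(target).map(|(&p, &t)| (p - t) * (p - t)).collect()
}

/// MSE backward: dpred = 2 * (pred - target) * scale.
fn mse_backward(pred: &[f32], target: &[f32], scale: f32) -> Vec<f32> {
    assert_eq!(pred.len(), target.len());
    pred.iter().zip(target).map(|(&p, &t)| 2.0 * (p - t) * scale).collect()
}
```

For example, a Mean reduction over 1024 f32 elements would need 4096 bytes of workspace under this rule.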
baracuda_kernels_loss_multi_margin_backward_bf16_can_implement^⚠: MultiMargin BW _can_implement, bf16.
baracuda_kernels_loss_multi_margin_backward_bf16_run^⚠: MultiMargin BW, bf16.
baracuda_kernels_loss_multi_margin_backward_f16_can_implement^⚠: MultiMargin BW _can_implement, f16.
baracuda_kernels_loss_multi_margin_backward_f16_run^⚠: MultiMargin BW, f16.
baracuda_kernels_loss_multi_margin_backward_f32_can_implement^⚠: MultiMargin BW _can_implement, f32.
baracuda_kernels_loss_multi_margin_backward_f32_run^⚠: MultiMargin BW, f32.
baracuda_kernels_loss_multi_margin_backward_f64_can_implement^⚠: MultiMargin BW _can_implement, f64.
baracuda_kernels_loss_multi_margin_backward_f64_run^⚠: MultiMargin BW, f64.
baracuda_kernels_loss_multi_margin_bf16_can_implement^⚠: MultiMargin FW _can_implement, bf16.
baracuda_kernels_loss_multi_margin_bf16_run^⚠: MultiMargin FW, bf16.
baracuda_kernels_loss_multi_margin_f16_can_implement^⚠: MultiMargin FW _can_implement, f16.
baracuda_kernels_loss_multi_margin_f16_run^⚠: MultiMargin FW, f16.
baracuda_kernels_loss_multi_margin_f32_can_implement^⚠: MultiMargin FW _can_implement, f32.
baracuda_kernels_loss_multi_margin_f32_run^⚠: MultiMargin FW, f32 (per-row). ABI: (n_rows, class_extent, row_stride, reduction_mode, margin, p_norm, input, target_i64, out, workspace, workspace_bytes, stream).
baracuda_kernels_loss_multi_margin_f64_can_implement^⚠: MultiMargin FW _can_implement, f64.
baracuda_kernels_loss_multi_margin_f64_run^⚠: MultiMargin FW, f64.
baracuda_kernels_loss_multilabel_margin_backward_bf16_can_implement^⚠: MultilabelMargin BW _can_implement, bf16.
baracuda_kernels_loss_multilabel_margin_backward_bf16_run^⚠: MultilabelMargin BW, bf16.
baracuda_kernels_loss_multilabel_margin_backward_f16_can_implement^⚠: MultilabelMargin BW _can_implement, f16.
baracuda_kernels_loss_multilabel_margin_backward_f16_run^⚠: MultilabelMargin BW, f16.
baracuda_kernels_loss_multilabel_margin_backward_f32_can_implement^⚠: MultilabelMargin BW _can_implement, f32.
baracuda_kernels_loss_multilabel_margin_backward_f32_run^⚠: MultilabelMargin BW, f32.
baracuda_kernels_loss_multilabel_margin_backward_f64_can_implement^⚠: MultilabelMargin BW _can_implement, f64.
baracuda_kernels_loss_multilabel_margin_backward_f64_run^⚠: MultilabelMargin BW, f64.
baracuda_kernels_loss_multilabel_margin_bf16_can_implement^⚠: MultilabelMargin FW _can_implement, bf16.
baracuda_kernels_loss_multilabel_margin_bf16_run^⚠: MultilabelMargin FW, bf16.
baracuda_kernels_loss_multilabel_margin_f16_can_implement^⚠: MultilabelMargin FW _can_implement, f16.
baracuda_kernels_loss_multilabel_margin_f16_run^⚠: MultilabelMargin FW, f16.
baracuda_kernels_loss_multilabel_margin_f32_can_implement^⚠: MultilabelMargin FW _can_implement, f32.
baracuda_kernels_loss_multilabel_margin_f32_run^⚠: MultilabelMargin FW, f32 (per-row). ABI: (n_rows, class_extent, row_stride_in, row_stride_tgt, reduction_mode, input, target_i64, out, workspace, workspace_bytes, stream).
baracuda_kernels_loss_multilabel_margin_f64_can_implement^⚠: MultilabelMargin FW _can_implement, f64.
baracuda_kernels_loss_multilabel_margin_f64_run^⚠: MultilabelMargin FW, f64.
baracuda_kernels_loss_multilabel_soft_margin_backward_bf16_can_implement^⚠: MultilabelSoftMargin BW _can_implement, bf16.
baracuda_kernels_loss_multilabel_soft_margin_backward_bf16_run^⚠: MultilabelSoftMargin BW, bf16.
baracuda_kernels_loss_multilabel_soft_margin_backward_f16_can_implement^⚠: MultilabelSoftMargin BW _can_implement, f16.
baracuda_kernels_loss_multilabel_soft_margin_backward_f16_run^⚠: MultilabelSoftMargin BW, f16.
baracuda_kernels_loss_multilabel_soft_margin_backward_f32_can_implement^⚠: MultilabelSoftMargin BW _can_implement, f32.
baracuda_kernels_loss_multilabel_soft_margin_backward_f32_run^⚠: MultilabelSoftMargin BW, f32.
baracuda_kernels_loss_multilabel_soft_margin_backward_f64_can_implement^⚠: MultilabelSoftMargin BW _can_implement, f64.
baracuda_kernels_loss_multilabel_soft_margin_backward_f64_run^⚠: MultilabelSoftMargin BW, f64.
baracuda_kernels_loss_multilabel_soft_margin_bf16_can_implement^⚠: MultilabelSoftMargin FW _can_implement, bf16.
baracuda_kernels_loss_multilabel_soft_margin_bf16_run^⚠: MultilabelSoftMargin FW, bf16.
baracuda_kernels_loss_multilabel_soft_margin_f16_can_implement^⚠: MultilabelSoftMargin FW _can_implement, f16.
baracuda_kernels_loss_multilabel_soft_margin_f16_run^⚠: MultilabelSoftMargin FW, f16.
baracuda_kernels_loss_multilabel_soft_margin_f32_can_implement^⚠: MultilabelSoftMargin FW _can_implement, f32.
baracuda_kernels_loss_multilabel_soft_margin_f32_run^⚠: MultilabelSoftMargin FW, f32.
baracuda_kernels_loss_multilabel_soft_margin_f64_can_implement^⚠: MultilabelSoftMargin FW _can_implement, f64.
baracuda_kernels_loss_multilabel_soft_margin_f64_run^⚠: MultilabelSoftMargin FW, f64.
baracuda_kernels_loss_nll_backward_bf16_can_implement^⚠: NLL BW _can_implement, bf16.
baracuda_kernels_loss_nll_backward_bf16_run^⚠: NLL BW, bf16.
baracuda_kernels_loss_nll_backward_f16_can_implement^⚠: NLL BW _can_implement, f16.
baracuda_kernels_loss_nll_backward_f16_run^⚠: NLL BW, f16.
baracuda_kernels_loss_nll_backward_f32_can_implement^⚠: NLL BW _can_implement, f32.
baracuda_kernels_loss_nll_backward_f32_run^⚠: NLL BW, f32. Pre-zeros dinput (size dinput_numel · sizeof(T)), then writes dinput[i, target[i]] = -dy_or_scale.
baracuda_kernels_loss_nll_backward_f64_can_implement^⚠: NLL BW _can_implement, f64.
baracuda_kernels_loss_nll_backward_f64_run^⚠: NLL BW, f64.
baracuda_kernels_loss_nll_bf16_can_implement^⚠: NLL FW _can_implement, bf16.
baracuda_kernels_loss_nll_bf16_run^⚠: NLL FW, bf16.
baracuda_kernels_loss_nll_f16_can_implement^⚠: NLL FW _can_implement, f16.
baracuda_kernels_loss_nll_f16_run^⚠: NLL FW, f16.
baracuda_kernels_loss_nll_f32_can_implement^⚠: NLL FW _can_implement, f32.
baracuda_kernels_loss_nll_f32_run^⚠: NLL FW, f32. -input[i, target[i]] per row. Heterogeneous-dtype: input is T, target is i64. row_stride_input is the i64 stride between adjacent rows of input (must equal class_extent for contiguous input).
baracuda_kernels_loss_nll_f64_can_implement^⚠: NLL FW _can_implement, f64.
baracuda_kernels_loss_nll_f64_run^⚠: NLL FW, f64.
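The NLL entries above describe a strided row gather on the forward pass and a zero-then-scatter on the backward pass. The indexing can be sketched on the host as follows. The function names are hypothetical, the strides are taken as `usize` for convenience (the ABI uses i64), and the backward sketch assumes a contiguous `dinput` of `n_rows * class_extent` elements.

```rust
/// NLL forward per row: out[i] = -input[i * row_stride_input + target[i]].
/// `row_stride_input` equals `class_extent` for contiguous input.
fn nll_forward_rows(
    input: &[f32],
    target: &[i64],
    class_extent: usize,
    row_stride_input: usize,
) -> Vec<f32> {
    target
        .iter()
        .enumerate()
        .map(|(i, &t)| {
            assert!(t >= 0 && (t as usize) < class_extent, "target out of range");
            -input[i * row_stride_input + t as usize]
        })
        .collect()
}

/// NLL backward: zero dinput, then write dinput[i, target[i]] = -dy_or_scale.
/// Assumption: dinput is contiguous with row length `class_extent`.
fn nll_backward(target: &[i64], class_extent: usize, dy_or_scale: f32) -> Vec<f32> {
    let mut dinput = vec![0.0f32; target.len() * class_extent];
    for (i, &t) in target.iter().enumerate() {
        assert!(t >= 0 && (t as usize) < class_extent, "target out of range");
        dinput[i * class_extent + t as usize] = -dy_or_scale;
    }
    dinput
}
```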
baracuda_kernels_loss_poisson_nll_backward_bf16_can_implement^⚠: PoissonNLL BW _can_implement, bf16.
baracuda_kernels_loss_poisson_nll_backward_bf16_run^⚠: PoissonNLL BW, bf16.
baracuda_kernels_loss_poisson_nll_backward_f16_can_implement^⚠: PoissonNLL BW _can_implement, f16.
baracuda_kernels_loss_poisson_nll_backward_f16_run^⚠: PoissonNLL BW, f16.
baracuda_kernels_loss_poisson_nll_backward_f32_can_implement^⚠: PoissonNLL BW _can_implement, f32.
baracuda_kernels_loss_poisson_nll_backward_f32_run^⚠: PoissonNLL BW, f32.
baracuda_kernels_loss_poisson_nll_backward_f64_can_implement^⚠: PoissonNLL BW _can_implement, f64.
baracuda_kernels_loss_poisson_nll_backward_f64_run^⚠: PoissonNLL BW, f64.
baracuda_kernels_loss_poisson_nll_bf16_can_implement^⚠: PoissonNLL FW _can_implement, bf16.
baracuda_kernels_loss_poisson_nll_bf16_run^⚠: PoissonNLL FW, bf16.
baracuda_kernels_loss_poisson_nll_f16_can_implement^⚠: PoissonNLL FW _can_implement, f16.
baracuda_kernels_loss_poisson_nll_f16_run^⚠: PoissonNLL FW, f16.
baracuda_kernels_loss_poisson_nll_f32_can_implement^⚠: PoissonNLL FW _can_implement, f32.
baracuda_kernels_loss_poisson_nll_f32_run^⚠: PoissonNLL FW, f32. log_input_flag 0/1.
baracuda_kernels_loss_poisson_nll_f64_can_implement^⚠: PoissonNLL FW _can_implement, f64.
baracuda_kernels_loss_poisson_nll_f64_run^⚠: PoissonNLL FW, f64.
baracuda_kernels_loss_smooth_l1_backward_bf16_can_implement^⚠: SmoothL1 BW _can_implement, bf16.
baracuda_kernels_loss_smooth_l1_backward_bf16_run^⚠: SmoothL1 BW, bf16.
baracuda_kernels_loss_smooth_l1_backward_f16_can_implement^⚠: SmoothL1 BW _can_implement, f16.
baracuda_kernels_loss_smooth_l1_backward_f16_run^⚠: SmoothL1 BW, f16.
baracuda_kernels_loss_smooth_l1_backward_f32_can_implement^⚠: SmoothL1 BW _can_implement, f32.
baracuda_kernels_loss_smooth_l1_backward_f32_run^⚠: SmoothL1 BW, f32.
baracuda_kernels_loss_smooth_l1_backward_f64_can_implement^⚠: SmoothL1 BW _can_implement, f64.
baracuda_kernels_loss_smooth_l1_backward_f64_run^⚠: SmoothL1 BW, f64.
baracuda_kernels_loss_smooth_l1_bf16_can_implement^⚠: SmoothL1 FW _can_implement, bf16.
baracuda_kernels_loss_smooth_l1_bf16_run^⚠: SmoothL1 FW, bf16.
baracuda_kernels_loss_smooth_l1_f16_can_implement^⚠: SmoothL1 FW _can_implement, f16.
baracuda_kernels_loss_smooth_l1_f16_run^⚠: SmoothL1 FW, f16.
baracuda_kernels_loss_smooth_l1_f32_can_implement^⚠: SmoothL1 FW _can_implement, f32.
baracuda_kernels_loss_smooth_l1_f32_run^⚠: SmoothL1 FW, f32. param = β.
baracuda_kernels_loss_smooth_l1_f64_can_implement^⚠: SmoothL1 FW _can_implement, f64.
baracuda_kernels_loss_smooth_l1_f64_run^⚠: SmoothL1 FW, f64.
baracuda_kernels_loss_triplet_margin_backward_bf16_can_implement^⚠: TripletMargin BW _can_implement, bf16.
baracuda_kernels_loss_triplet_margin_backward_bf16_run^⚠: TripletMargin BW, bf16.
baracuda_kernels_loss_triplet_margin_backward_f16_can_implement^⚠: TripletMargin BW _can_implement, f16.
baracuda_kernels_loss_triplet_margin_backward_f16_run^⚠: TripletMargin BW, f16.
baracuda_kernels_loss_triplet_margin_backward_f32_can_implement^⚠: TripletMargin BW _can_implement, f32.
baracuda_kernels_loss_triplet_margin_backward_f32_run^⚠: TripletMargin BW, f32.
baracuda_kernels_loss_triplet_margin_backward_f64_can_implement^⚠: TripletMargin BW _can_implement, f64.
baracuda_kernels_loss_triplet_margin_backward_f64_run^⚠: TripletMargin BW, f64.
baracuda_kernels_loss_triplet_margin_bf16_can_implement^⚠: TripletMargin FW _can_implement, bf16.
baracuda_kernels_loss_triplet_margin_bf16_run^⚠: TripletMargin FW, bf16.
baracuda_kernels_loss_triplet_margin_f16_can_implement^⚠: TripletMargin FW _can_implement, f16.
baracuda_kernels_loss_triplet_margin_f16_run^⚠: TripletMargin FW, f16.
baracuda_kernels_loss_triplet_margin_f32_can_implement^⚠: TripletMargin FW _can_implement, f32.
baracuda_kernels_loss_triplet_margin_f32_run^⚠: TripletMargin FW, f32 (per-row). ABI: (n_rows, d_extent, row_stride, reduction_mode, margin, p_norm, a, p, n, out, workspace, workspace_bytes, stream).
baracuda_kernels_loss_triplet_margin_f64_can_implement^⚠: TripletMargin FW _can_implement, f64.
baracuda_kernels_loss_triplet_margin_f64_run^⚠: TripletMargin FW, f64.
baracuda_kernels_lp_pool_1d_bf16_backward_can_implement^⚠: LpPool 1d BW _can_implement, bf16.
baracuda_kernels_lp_pool_1d_bf16_backward_run^⚠: LpPool 1d BW, bf16.
baracuda_kernels_lp_pool_1d_bf16_can_implement^⚠: LpPool 1d FW _can_implement, bf16.
baracuda_kernels_lp_pool_1d_bf16_run^⚠: LpPool 1d FW, bf16.
baracuda_kernels_lp_pool_1d_f16_backward_can_implement^⚠: LpPool 1d BW _can_implement, f16.
baracuda_kernels_lp_pool_1d_f16_backward_run^⚠: LpPool 1d BW, f16.
baracuda_kernels_lp_pool_1d_f16_can_implement^⚠: LpPool 1d FW _can_implement, f16.
baracuda_kernels_lp_pool_1d_f16_run^⚠: LpPool 1d FW, f16.
baracuda_kernels_lp_pool_1d_f32_backward_can_implement^⚠: LpPool 1d BW _can_implement, f32.
baracuda_kernels_lp_pool_1d_f32_backward_run^⚠: LpPool 1d BW, f32. Caller must zero dx first.
baracuda_kernels_lp_pool_1d_f32_can_implement^⚠: LpPool 1d FW _can_implement, f32.
baracuda_kernels_lp_pool_1d_f32_run^⚠: LpPool 1d FW, f32.
baracuda_kernels_lp_pool_1d_f64_backward_can_implement^⚠: LpPool 1d BW _can_implement, f64.
baracuda_kernels_lp_pool_1d_f64_backward_run^⚠: LpPool 1d BW, f64.
baracuda_kernels_lp_pool_1d_f64_can_implement^⚠: LpPool 1d FW _can_implement, f64.
baracuda_kernels_lp_pool_1d_f64_run^⚠: LpPool 1d FW, f64.
baracuda_kernels_lp_pool_2d_bf16_backward_can_implement^⚠: LpPool 2d BW _can_implement, bf16.
baracuda_kernels_lp_pool_2d_bf16_backward_run^⚠: LpPool 2d BW, bf16.
baracuda_kernels_lp_pool_2d_bf16_can_implement^⚠: LpPool 2d FW _can_implement, bf16.
baracuda_kernels_lp_pool_2d_bf16_run^⚠: LpPool 2d FW, bf16.
baracuda_kernels_lp_pool_2d_f16_backward_can_implement^⚠: LpPool 2d BW _can_implement, f16.
baracuda_kernels_lp_pool_2d_f16_backward_run^⚠: LpPool 2d BW, f16.
baracuda_kernels_lp_pool_2d_f16_can_implement^⚠: LpPool 2d FW _can_implement, f16.
baracuda_kernels_lp_pool_2d_f16_run^⚠: LpPool 2d FW, f16.
baracuda_kernels_lp_pool_2d_f32_backward_can_implement^⚠: LpPool 2d BW _can_implement, f32.
baracuda_kernels_lp_pool_2d_f32_backward_run^⚠: LpPool 2d BW, f32. Caller must zero dx first.
baracuda_kernels_lp_pool_2d_f32_can_implement^⚠: LpPool 2d FW _can_implement, f32.
baracuda_kernels_lp_pool_2d_f32_run^⚠: LpPool 2d FW, f32.
baracuda_kernels_lp_pool_2d_f64_backward_can_implement^⚠: LpPool 2d BW _can_implement, f64.
baracuda_kernels_lp_pool_2d_f64_backward_run^⚠: LpPool 2d BW, f64.
baracuda_kernels_lp_pool_2d_f64_can_implement^⚠: LpPool 2d FW _can_implement, f64.
baracuda_kernels_lp_pool_2d_f64_run^⚠: LpPool 2d FW, f64.
baracuda_kernels_lstsq_f32_run^⚠: Least-squares solve via iterative _gels (no QR fallback). On convergence, niters_out >= 0. On non-convergence, niters_out < 0 and the caller should retry via the Rust plan layer (which holds the QR fallback path).
baracuda_kernels_lstsq_f32_workspace_size^⚠: LstSq workspace size in BYTES (not elements — cuSOLVER’s _gels API differs from the others on this point).
baracuda_kernels_lstsq_f64_run^⚠: Least-squares solve via iterative _gels (no QR fallback). On convergence, niters_out >= 0. On non-convergence, niters_out < 0 and the caller should retry via the Rust plan layer (which holds the QR fallback path).
baracuda_kernels_lstsq_f64_workspace_size^⚠: LstSq workspace size in BYTES (not elements — cuSOLVER’s _gels API differs from the others on this point).
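The convergence contract for the lstsq entries above (status code first, then the sign of niters_out) can be captured in a small host-side helper. This is only a sketch: `LstsqOutcome` and `classify_lstsq` are hypothetical names, and the helper assumes the caller has already copied niters_out back from wherever the kernel wrote it.

```rust
/// Hypothetical summary of one lstsq call.
#[derive(Debug, PartialEq)]
enum LstsqOutcome {
    /// The iterative solve converged after `iterations` steps.
    Converged { iterations: i32 },
    /// niters_out < 0: retry through the plan layer's QR fallback path.
    NeedsFallback,
}

/// Interpret the i32 status (0 = success) together with niters_out.
/// Any non-zero status is passed through unchanged as the error value.
fn classify_lstsq(status: i32, niters_out: i32) -> Result<LstsqOutcome, i32> {
    if status != 0 {
        return Err(status);
    }
    if niters_out >= 0 {
        Ok(LstsqOutcome::Converged { iterations: niters_out })
    } else {
        Ok(LstsqOutcome::NeedsFallback)
    }
}
```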
baracuda_kernels_lu_f32_run^⚠: LU factorization with partial pivoting (non-batched). a_inout is overwritten with the packed LU factors; pivots_out receives the 1-based row swaps (length min(m, n)); info_out is a single i32.
baracuda_kernels_lu_f32_workspace_size^⚠: LU factorization workspace size in bytes for getrf.
baracuda_kernels_lu_f64_run^⚠: LU factorization with partial pivoting (non-batched). a_inout is overwritten with the packed LU factors; pivots_out receives the 1-based row swaps (length min(m, n)); info_out is a single i32.
baracuda_kernels_lu_f64_workspace_size^⚠: LU factorization workspace size in bytes for getrf.
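Because pivots_out holds 1-based row swaps rather than a permutation, callers often need to expand it. The sketch below assumes the usual LAPACK-style getrf convention, in which entry i records that row i was interchanged with row pivots[i]; the function name is hypothetical and the pivots are assumed to have been copied to the host already.

```rust
/// Expand 1-based sequential row swaps (length min(m, n)) into a row
/// permutation of length m. perm[i] is the index of the original row that
/// ends up in row i after all swaps are applied in order.
/// Assumption: LAPACK-style semantics for the pivot array.
fn pivots_to_permutation(pivots: &[i32], m: usize) -> Vec<usize> {
    let mut perm: Vec<usize> = (0..m).collect();
    for (i, &p) in pivots.iter().enumerate() {
        assert!(p >= 1 && (p as usize) <= m, "pivot out of range");
        // Convert the 1-based pivot to a 0-based row index before swapping.
        perm.swap(i, (p - 1) as usize);
    }
    perm
}
```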
baracuda_kernels_masked_fill_backward_bool_can_implement^⚠: Implementability check for masked_fill_backward_bool.
baracuda_kernels_masked_fill_backward_bool_run^⚠: masked_fill_backward — bool (u8 storage).
baracuda_kernels_masked_fill_backward_f32_can_implement^⚠: Implementability check for masked_fill_backward_f32.
baracuda_kernels_masked_fill_backward_f32_run^⚠: dsrc[i] = mask[i] ? 0 : dout[i]. f32.
baracuda_kernels_masked_fill_backward_f64_can_implement^⚠: Implementability check for masked_fill_backward_f64.
baracuda_kernels_masked_fill_backward_f64_run^⚠: masked_fill_backward — f64.
baracuda_kernels_masked_fill_backward_i32_can_implement^⚠: Implementability check for masked_fill_backward_i32.
baracuda_kernels_masked_fill_backward_i32_run^⚠: masked_fill_backward — i32.
baracuda_kernels_masked_fill_bool_can_implement^⚠: Implementability check for masked_fill_bool.
baracuda_kernels_masked_fill_bool_run^⚠: masked_fill — bool (u8 storage).
baracuda_kernels_masked_fill_f32_can_implement^⚠: Implementability check for masked_fill_f32.
baracuda_kernels_masked_fill_f32_run^⚠: out[i] = mask[i] ? fill_value : src[i]. f32 (caller passes fill_value.to_bits() as i64).
baracuda_kernels_masked_fill_f64_can_implement^⚠: Implementability check for masked_fill_f64.
baracuda_kernels_masked_fill_f64_run^⚠: masked_fill — f64.
baracuda_kernels_masked_fill_i32_can_implement^⚠: Implementability check for masked_fill_i32.
baracuda_kernels_masked_fill_i32_run^⚠: masked_fill — i32.
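For the f32 variant above, the fill value travels through the ABI as an i64 holding the float's bit pattern. A minimal sketch of that packing, together with a CPU reference for the forward and backward formulas, is shown below. All function names are hypothetical; the unpacking step mirrors what the kernel presumably does, and a u8 mask with non-zero meaning "masked" is an assumption.

```rust
/// Pack an f32 fill value as the caller would: fill_value.to_bits() as i64.
fn pack_fill_f32(fill_value: f32) -> i64 {
    fill_value.to_bits() as i64
}

/// Inverse of `pack_fill_f32` (assumed device-side behaviour): take the low
/// 32 bits and reinterpret them as an f32.
fn unpack_fill_f32(bits: i64) -> f32 {
    f32::from_bits(bits as u32)
}

/// Forward reference: out[i] = mask[i] ? fill_value : src[i].
fn masked_fill_ref(src: &[f32], mask: &[u8], fill_bits: i64) -> Vec<f32> {
    assert_eq!(src.len(), mask.len());
    let fill = unpack_fill_f32(fill_bits);
    src.iter()
        .zip(mask)
        .map(|(&s, &m)| if m != 0 { fill } else { s })
        .collect()
}

/// Backward reference: dsrc[i] = mask[i] ? 0 : dout[i].
fn masked_fill_backward_ref(dout: &[f32], mask: &[u8]) -> Vec<f32> {
    assert_eq!(dout.len(), mask.len());
    dout.iter()
        .zip(mask)
        .map(|(&d, &m)| if m != 0 { 0.0 } else { d })
        .collect()
}
```

Passing the bit pattern instead of a numeric cast keeps values such as -1.5 or NaN payloads intact across the integer argument.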
baracuda_kernels_mmvq_batched_bf16_can_implement^⚠: Batched MMV (non-quant) _can_implement, bf16.
baracuda_kernels_mmvq_batched_bf16_run^⚠: Batched MMV (non-quant), bf16. Safety: same requirements as the f32 variant.
baracuda_kernels_mmvq_batched_f16_can_implement^⚠: Batched MMV (non-quant) _can_implement, f16.
baracuda_kernels_mmvq_batched_f16_run^⚠: Batched MMV (non-quant), f16. Safety: same requirements as the f32 variant.
baracuda_kernels_mmvq_batched_f32_can_implement^⚠: Batched MMV (non-quant) _can_implement, f32.
baracuda_kernels_mmvq_batched_f32_run^⚠: Batched MMV (non-quant) — f32 weights + activation + output.
baracuda_kernels_mmvq_multim_q2_K_m1_can_implement^⚠: Implementability check for mmvq_multim_q2_K_m1.
baracuda_kernels_mmvq_multim_q2_K_m1_run^⚠: Multi-M MMVQ for Q2_K weights, M=1.
baracuda_kernels_mmvq_multim_q2_K_m2_can_implement^⚠: Implementability check for mmvq_multim_q2_K_m2.
baracuda_kernels_mmvq_multim_q2_K_m2_run^⚠: Multi-M MMVQ for Q2_K weights, M=2.
baracuda_kernels_mmvq_multim_q2_K_m4_can_implement^⚠: Implementability check for mmvq_multim_q2_K_m4.
baracuda_kernels_mmvq_multim_q2_K_m4_run^⚠: Multi-M MMVQ for Q2_K weights, M=4.
baracuda_kernels_mmvq_multim_q2_K_m8_can_implement^⚠: Implementability check for mmvq_multim_q2_K_m8.
baracuda_kernels_mmvq_multim_q2_K_m8_run^⚠: Multi-M MMVQ for Q2_K weights, M=8.
baracuda_kernels_mmvq_multim_q3_K_m1_can_implement^⚠: Implementability check for mmvq_multim_q3_K_m1.
baracuda_kernels_mmvq_multim_q3_K_m1_run^⚠: Multi-M MMVQ for Q3_K weights, M=1.
baracuda_kernels_mmvq_multim_q3_K_m2_can_implement^⚠: Implementability check for mmvq_multim_q3_K_m2.
baracuda_kernels_mmvq_multim_q3_K_m2_run^⚠: Multi-M MMVQ for Q3_K weights, M=2.
baracuda_kernels_mmvq_multim_q3_K_m4_can_implement^⚠: Implementability check for mmvq_multim_q3_K_m4.
baracuda_kernels_mmvq_multim_q3_K_m4_run^⚠: Multi-M MMVQ for Q3_K weights, M=4.
baracuda_kernels_mmvq_multim_q3_K_m8_can_implement^⚠: Implementability check for mmvq_multim_q3_K_m8.
baracuda_kernels_mmvq_multim_q3_K_m8_run^⚠: Multi-M MMVQ for Q3_K weights, M=8.
baracuda_kernels_mmvq_multim_q4_0_m1_can_implement^⚠: Implementability check for mmvq_multim_q4_0_m1.
baracuda_kernels_mmvq_multim_q4_0_m1_run^⚠: Multi-M MMVQ for Q4_0 weights, M=1.
baracuda_kernels_mmvq_multim_q4_0_m2_can_implement^⚠: Implementability check for mmvq_multim_q4_0_m2.
baracuda_kernels_mmvq_multim_q4_0_m2_run^⚠: Multi-M MMVQ for Q4_0 weights, M=2.
baracuda_kernels_mmvq_multim_q4_0_m4_can_implement^⚠: Implementability check for mmvq_multim_q4_0_m4.
baracuda_kernels_mmvq_multim_q4_0_m4_run^⚠: Multi-M MMVQ for Q4_0 weights, M=4.
baracuda_kernels_mmvq_multim_q4_0_m8_can_implement^⚠: Implementability check for mmvq_multim_q4_0_m8.
baracuda_kernels_mmvq_multim_q4_0_m8_run^⚠: Multi-M MMVQ for Q4_0 weights, M=8.
baracuda_kernels_mmvq_multim_q4_1_m1_can_implement^⚠: Implementability check for mmvq_multim_q4_1_m1.
baracuda_kernels_mmvq_multim_q4_1_m1_run^⚠: Multi-M MMVQ for Q4_1 weights, M=1.
baracuda_kernels_mmvq_multim_q4_1_m2_can_implement^⚠: Implementability check for mmvq_multim_q4_1_m2.
baracuda_kernels_mmvq_multim_q4_1_m2_run^⚠: Multi-M MMVQ for Q4_1 weights, M=2.
baracuda_kernels_mmvq_multim_q4_1_m4_can_implement^⚠: Implementability check for mmvq_multim_q4_1_m4.
baracuda_kernels_mmvq_multim_q4_1_m4_run^⚠: Multi-M MMVQ for Q4_1 weights, M=4.
baracuda_kernels_mmvq_multim_q4_1_m8_can_implement^⚠: Implementability check for mmvq_multim_q4_1_m8.
baracuda_kernels_mmvq_multim_q4_1_m8_run^⚠: Multi-M MMVQ for Q4_1 weights, M=8.
baracuda_kernels_mmvq_multim_q4_K_m1_can_implement^⚠: Implementability check for mmvq_multim_q4_K_m1.
baracuda_kernels_mmvq_multim_q4_K_m1_run^⚠: Multi-M MMVQ for Q4_K weights, M=1.
baracuda_kernels_mmvq_multim_q4_K_m2_can_implement^⚠: Implementability check for mmvq_multim_q4_K_m2.
baracuda_kernels_mmvq_multim_q4_K_m2_run^⚠: Multi-M MMVQ for Q4_K weights, M=2.
baracuda_kernels_mmvq_multim_q4_K_m4_can_implement^⚠: Implementability check for mmvq_multim_q4_K_m4.
baracuda_kernels_mmvq_multim_q4_K_m4_run^⚠: Multi-M MMVQ for Q4_K weights, M=4.
baracuda_kernels_mmvq_multim_q4_K_m8_can_implement^⚠: Implementability check for mmvq_multim_q4_K_m8.
baracuda_kernels_mmvq_multim_q4_K_m8_run^⚠: Multi-M MMVQ for Q4_K weights, M=8.
baracuda_kernels_mmvq_multim_q5_0_m1_can_implement^⚠: Implementability check for mmvq_multim_q5_0_m1.
baracuda_kernels_mmvq_multim_q5_0_m1_run^⚠: Multi-M MMVQ for Q5_0 weights, M=1.
baracuda_kernels_mmvq_multim_q5_0_m2_can_implement^⚠: Implementability check for mmvq_multim_q5_0_m2.
baracuda_kernels_mmvq_multim_q5_0_m2_run^⚠: Multi-M MMVQ for Q5_0 weights, M=2.
baracuda_kernels_mmvq_multim_q5_0_m4_can_implement^⚠: Implementability check for mmvq_multim_q5_0_m4.
baracuda_kernels_mmvq_multim_q5_0_m4_run^⚠: Multi-M MMVQ for Q5_0 weights, M=4.
baracuda_kernels_mmvq_multim_q5_0_m8_can_implement^⚠: Implementability check for mmvq_multim_q5_0_m8.
baracuda_kernels_mmvq_multim_q5_0_m8_run^⚠: Multi-M MMVQ for Q5_0 weights, M=8.
baracuda_kernels_mmvq_multim_q5_1_m1_can_implement^⚠: Implementability check for mmvq_multim_q5_1_m1.
baracuda_kernels_mmvq_multim_q5_1_m1_run^⚠: Multi-M MMVQ for Q5_1 weights, M=1.
baracuda_kernels_mmvq_multim_q5_1_m2_can_implement^⚠: Implementability check for mmvq_multim_q5_1_m2.
baracuda_kernels_mmvq_multim_q5_1_m2_run^⚠: Multi-M MMVQ for Q5_1 weights, M=2.
baracuda_kernels_mmvq_multim_q5_1_m4_can_implement^⚠: Implementability check for mmvq_multim_q5_1_m4.
baracuda_kernels_mmvq_multim_q5_1_m4_run^⚠: Multi-M MMVQ for Q5_1 weights, M=4.
baracuda_kernels_mmvq_multim_q5_1_m8_can_implement^⚠: Implementability check for mmvq_multim_q5_1_m8.
baracuda_kernels_mmvq_multim_q5_1_m8_run^⚠: Multi-M MMVQ for Q5_1 weights, M=8.
baracuda_kernels_mmvq_multim_q5_K_m1_can_implement^⚠: Implementability check for mmvq_multim_q5_K_m1.
baracuda_kernels_mmvq_multim_q5_K_m1_run^⚠: Multi-M MMVQ for Q5_K weights, M=1.
baracuda_kernels_mmvq_multim_q5_K_m2_can_implement^⚠: Implementability check for mmvq_multim_q5_K_m2.
baracuda_kernels_mmvq_multim_q5_K_m2_run^⚠: Multi-M MMVQ for Q5_K weights, M=2.
baracuda_kernels_mmvq_multim_q5_K_m4_can_implement^⚠: Implementability check for mmvq_multim_q5_K_m4.
baracuda_kernels_mmvq_multim_q5_K_m4_run^⚠: Multi-M MMVQ for Q5_K weights, M=4.
baracuda_kernels_mmvq_multim_q5_K_m8_can_implement^⚠: Implementability check for mmvq_multim_q5_K_m8.
baracuda_kernels_mmvq_multim_q5_K_m8_run^⚠: Multi-M MMVQ for Q5_K weights, M=8.
baracuda_kernels_mmvq_multim_q6_K_m1_can_implement^⚠: Implementability check for mmvq_multim_q6_K_m1.
baracuda_kernels_mmvq_multim_q6_K_m1_run^⚠: Multi-M MMVQ for Q6_K weights, M=1.
baracuda_kernels_mmvq_multim_q6_K_m2_can_implement^⚠: Implementability check for mmvq_multim_q6_K_m2.
baracuda_kernels_mmvq_multim_q6_K_m2_run^⚠: Multi-M MMVQ for Q6_K weights, M=2.
baracuda_kernels_mmvq_multim_q6_K_m4_can_implement^⚠: Implementability check for mmvq_multim_q6_K_m4.
baracuda_kernels_mmvq_multim_q6_K_m4_run^⚠: Multi-M MMVQ for Q6_K weights, M=4.
baracuda_kernels_mmvq_multim_q6_K_m8_can_implement^⚠: Implementability check for mmvq_multim_q6_K_m8.
baracuda_kernels_mmvq_multim_q6_K_m8_run^⚠: Multi-M MMVQ for Q6_K weights, M=8.
baracuda_kernels_mmvq_multim_q8_0_m1_can_implement^⚠: Implementability check for mmvq_multim_q8_0_m1.
baracuda_kernels_mmvq_multim_q8_0_m1_run^⚠: Multi-M MMVQ for Q8_0 weights, M=1 (decode regime). Computes dst[0, r] = Σ_c W[r, c] * y[0, c] for r ∈ [0, nrows_x).
baracuda_kernels_mmvq_multim_q8_0_m2_can_implement^⚠: Implementability check for mmvq_multim_q8_0_m2.
baracuda_kernels_mmvq_multim_q8_0_m2_run^⚠: Multi-M MMVQ for Q8_0 weights, M=2. # Safety: as M=1.
baracuda_kernels_mmvq_multim_q8_0_m4_can_implement^⚠: Implementability check for mmvq_multim_q8_0_m4.
baracuda_kernels_mmvq_multim_q8_0_m4_run^⚠: Multi-M MMVQ for Q8_0 weights, M=4. # Safety: as M=1.
baracuda_kernels_mmvq_multim_q8_0_m8_can_implement^⚠: Implementability check for mmvq_multim_q8_0_m8.
baracuda_kernels_mmvq_multim_q8_0_m8_run^⚠: Multi-M MMVQ for Q8_0 weights, M=8 (prefill regime, target 3-7× vs the per-token M=1 dispatch). # Safety: as M=1.
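The M=1 formula above, dst[0, r] = Σ_c W[r, c] * y[0, c], can be checked against a plain host loop. The sketch below is a CPU reference under stated assumptions, not the kernel: `multim_matvec_ref` is a hypothetical name, the weights are taken as already dequantized to f32, all arrays are assumed row-major, and the extension to M > 1 rows (dst[m, r] using y[m, c]) is inferred from the Multi-M naming rather than stated in this listing.

```rust
/// CPU reference for dst[m, r] = sum over c of W[r, c] * y[m, c].
/// Assumptions: `w` holds dequantized f32 weights, row-major [nrows_x, ncols];
/// `y` is row-major [m, ncols]; the result is row-major [m, nrows_x].
fn multim_matvec_ref(w: &[f32], y: &[f32], m: usize, nrows_x: usize, ncols: usize) -> Vec<f32> {
    assert_eq!(w.len(), nrows_x * ncols);
    assert_eq!(y.len(), m * ncols);
    let mut dst = vec![0.0f32; m * nrows_x];
    for mi in 0..m {
        let y_row = &y[mi * ncols..(mi + 1) * ncols];
        for r in 0..nrows_x {
            let w_row = &w[r * ncols..(r + 1) * ncols];
            // Dot product of weight row r with activation row mi.
            dst[mi * nrows_x + r] = w_row.iter().zip(y_row).map(|(a, b)| a * b).sum();
        }
    }
    dst
}

fn main() {
    // W = [[1, 2], [3, 4]], y = [[1, 1], [0, 2]]
    let w: [f32; 4] = [1.0, 2.0, 3.0, 4.0];
    let y: [f32; 4] = [1.0, 1.0, 0.0, 2.0];
    let dst = multim_matvec_ref(&w, &y, 2, 2, 2);
    println!("{:?}", dst);
}
```

A reference like this is mainly useful for validating kernel output within a tolerance, since quantized weights will not reproduce the f32 result exactly.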
baracuda_kernels_mmvq_q2_K_actstrided_bf16_can_implement^⚠: Implementability check for mmvq_q2_K_actstrided_bf16.
baracuda_kernels_mmvq_q2_K_actstrided_bf16_run^⚠: Strided MMVQ — Q2_K, bf16. # Safety: as Q2_K strided f16.
baracuda_kernels_mmvq_q2_K_actstrided_can_implement^⚠: Implementability check for mmvq_q2_K_actstrided.
baracuda_kernels_mmvq_q2_K_actstrided_f16_can_implement^⚠: Implementability check for mmvq_q2_K_actstrided_f16.
baracuda_kernels_mmvq_q2_K_actstrided_f16_run^⚠: Strided MMVQ — Q2_K, f16. # Safety: as Q4_0 strided f16; ncols must be a multiple of 256.
baracuda_kernels_mmvq_q2_K_actstrided_run^⚠: Strided MMVQ — GGUF Q2_K. # Safety: as the contig sibling.
baracuda_kernels_mmvq_q2_K_batched_bf16_can_implement^⚠: Implementability check for mmvq_q2_K_batched_bf16.
baracuda_kernels_mmvq_q2_K_batched_bf16_run^⚠: Batched MMVQ — Q2_K, bf16. # Safety: as Q2_K f32.
baracuda_kernels_mmvq_q2_K_batched_can_implement^⚠: Implementability check for mmvq_q2_K_batched.
baracuda_kernels_mmvq_q2_K_batched_f16_can_implement^⚠: Implementability check for mmvq_q2_K_batched_f16.
baracuda_kernels_mmvq_q2_K_batched_f16_run^⚠: Batched MMVQ — Q2_K, f16. # Safety: as Q2_K f32.
baracuda_kernels_mmvq_q2_K_batched_run^⚠: Batched MMVQ — Q2_K, f32. # Safety: as Q4_0; ncols must be a multiple of 256.
baracuda_kernels_mmvq_q2_K_bf16_can_implement^⚠: Implementability check for mmvq_q2_K_bf16.
baracuda_kernels_mmvq_q2_K_bf16_run^⚠: MMVQ — Q2_K, bf16. # Safety: as Q4_0 bf16; ncols must be a multiple of 256.
baracuda_kernels_mmvq_q2_K_can_implement^⚠: Implementability check for mmvq_q2_K.
baracuda_kernels_mmvq_q2_K_f16_can_implement^⚠: Implementability check for mmvq_q2_K_f16.
baracuda_kernels_mmvq_q2_K_f16_run^⚠: MMVQ — Q2_K, f16. # Safety: as Q4_0 f16; ncols must be a multiple of 256.
baracuda_kernels_mmvq_q2_K_run^⚠: GGUF Q2_K MMVQ — FP-activation matrix-vector mul. ncols must be a multiple of 256.
baracuda_kernels_mmvq_q3_K_actstrided_bf16_can_implement^⚠: Implementability check for mmvq_q3_K_actstrided_bf16.
baracuda_kernels_mmvq_q3_K_actstrided_bf16_run^⚠: Strided MMVQ — Q3_K, bf16. # Safety: as Q2_K strided bf16.
baracuda_kernels_mmvq_q3_K_actstrided_can_implement^⚠: Implementability check for mmvq_q3_K_actstrided.
baracuda_kernels_mmvq_q3_K_actstrided_f16_can_implement^⚠: Implementability check for mmvq_q3_K_actstrided_f16.
baracuda_kernels_mmvq_q3_K_actstrided_f16_run^⚠: Strided MMVQ — Q3_K, f16. # Safety: as Q2_K strided f16.
baracuda_kernels_mmvq_q3_K_actstrided_run^⚠: Strided MMVQ — GGUF Q3_K. # Safety: as the contig sibling.
baracuda_kernels_mmvq_q3_K_batched_bf16_can_implement^⚠: Implementability check for mmvq_q3_K_batched_bf16.
baracuda_kernels_mmvq_q3_K_batched_bf16_run^⚠: Batched MMVQ — Q3_K, bf16. # Safety: as Q2_K.
baracuda_kernels_mmvq_q3_K_batched_can_implement^⚠: Implementability check for mmvq_q3_K_batched.
baracuda_kernels_mmvq_q3_K_batched_f16_can_implement^⚠: Implementability check for mmvq_q3_K_batched_f16.
baracuda_kernels_mmvq_q3_K_batched_f16_run^⚠: Batched MMVQ — Q3_K, f16. # Safety: as Q2_K.
baracuda_kernels_mmvq_q3_K_batched_run^⚠: Batched MMVQ — Q3_K, f32. # Safety: as Q2_K.
baracuda_kernels_mmvq_q3_K_bf16_can_implement^⚠: Implementability check for mmvq_q3_K_bf16.
baracuda_kernels_mmvq_q3_K_bf16_run^⚠: MMVQ — Q3_K, bf16. # Safety: as Q2_K bf16.
baracuda_kernels_mmvq_q3_K_can_implement^⚠: Implementability check for mmvq_q3_K.
baracuda_kernels_mmvq_q3_K_f16_can_implement^⚠: Implementability check for mmvq_q3_K_f16.
baracuda_kernels_mmvq_q3_K_f16_run^⚠: MMVQ — Q3_K, f16. # Safety: as Q2_K f16.
baracuda_kernels_mmvq_q3_K_run^⚠: GGUF Q3_K MMVQ. # Safety: as Q2_K.
baracuda_kernels_mmvq_q4_0_actstrided_bf16_can_implement^⚠: Implementability check for mmvq_q4_0_actstrided_bf16.
baracuda_kernels_mmvq_q4_0_actstrided_bf16_run^⚠: Strided MMVQ — Q4_0, bf16. # Safety: as the f32 strided sibling.
baracuda_kernels_mmvq_q4_0_actstrided_can_implement^⚠: Implementability check for mmvq_q4_0_actstrided.
baracuda_kernels_mmvq_q4_0_actstrided_f16_can_implement^⚠: Implementability check for mmvq_q4_0_actstrided_f16.
baracuda_kernels_mmvq_q4_0_actstrided_f16_run^⚠: Strided MMVQ — Q4_0, f16. # Safety: as the f32 strided sibling.
baracuda_kernels_mmvq_q4_0_actstrided_run^⚠: Strided MMVQ — GGUF Q4_0. # Safety: as the contiguous Q4_0 variant; additionally, y[k * stride_y] must be a valid f32 read for every k in 0..ncols.
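The strided safety condition above (every y[k * stride_y] for k in 0..ncols must be readable) implies a minimum extent for the activation buffer. A minimal sketch of that bound follows; `min_strided_len` is a hypothetical helper, and it assumes `stride_y` is a positive stride measured in elements rather than bytes, which this listing does not state.

```rust
/// Minimum number of elements the strided activation buffer must hold so that
/// y[k * stride_y] is in bounds for every k in 0..ncols.
/// Assumption: `stride_y` is a positive stride in elements, not bytes.
/// Returns None when ncols or stride_y is zero (treated as invalid in this
/// sketch) or when the arithmetic would overflow usize.
fn min_strided_len(ncols: usize, stride_y: usize) -> Option<usize> {
    if ncols == 0 || stride_y == 0 {
        return None;
    }
    // The last index touched is (ncols - 1) * stride_y; add one for the length.
    (ncols - 1).checked_mul(stride_y)?.checked_add(1)
}

fn main() {
    println!("{:?}", min_strided_len(4096, 2));
    println!("{:?}", min_strided_len(32, 1));
}
```

A caller could compare this bound against the length of its device allocation before issuing the unsafe call, since the kernel itself performs no bounds checking.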
baracuda_kernels_mmvq_q4_0_batched_bf16_can_implement^⚠: Implementability check for mmvq_q4_0_batched_bf16.
baracuda_kernels_mmvq_q4_0_batched_bf16_run^⚠: Batched MMVQ — Q4_0, bf16. # Safety: as Q4_0 f32.
baracuda_kernels_mmvq_q4_0_batched_can_implement^⚠: Implementability check for mmvq_q4_0_batched.
baracuda_kernels_mmvq_q4_0_batched_f16_can_implement^⚠: Implementability check for mmvq_q4_0_batched_f16.
baracuda_kernels_mmvq_q4_0_batched_f16_run^⚠: Batched MMVQ — Q4_0, f16. # Safety: as Q4_0 f32.
baracuda_kernels_mmvq_q4_0_batched_run^⚠: Batched MMVQ — Q4_0, f32 activation + output. # Safety: device-resident pointers; valid stream; workspace ≥ m_total * 4 bytes.
baracuda_kernels_mmvq_q4_0_bf16_can_implement^⚠: Implementability check for mmvq_q4_0_bf16.
baracuda_kernels_mmvq_q4_0_bf16_run^⚠: MMVQ — Q4_0, bf16 activation + bf16 output. # Safety: as the f32 sibling with y / dst typed __nv_bfloat16 device-resident.
baracuda_kernels_mmvq_q4_0_can_implement^⚠: Implementability check for mmvq_q4_0.
baracuda_kernels_mmvq_q4_0_f16_can_implement^⚠: Implementability check for mmvq_q4_0_f16.
baracuda_kernels_mmvq_q4_0_f16_run^⚠: MMVQ — Q4_0, f16 activation + f16 output. # Safety: as the f32 sibling with y / dst typed __half device-resident.
baracuda_kernels_mmvq_q4_0_run^⚠: GGUF Q4_0 MMVQ — FP-activation matrix-vector mul. ncols must be a multiple of 32.
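The ncols constraints in this listing (a multiple of 32 for Q4_0 and the formats documented "as Q4_0", a multiple of 256 for Q2_K and the formats documented "as Q2_K") can be pre-checked on the host. The sketch below is illustrative only: `QuantFormat`, `block_elems`, and `ncols_is_supported` are hypothetical names, `ncols` is assumed to be an `i32`, and the helper returns a plain `bool` because this listing does not say which status code the real `*_can_implement` functions return for a bad `ncols`.

```rust
/// Hypothetical enum naming the GGUF weight formats that appear in this listing.
#[allow(non_camel_case_types, dead_code)]
#[derive(Clone, Copy, Debug, PartialEq)]
enum QuantFormat {
    Q4_0, Q4_1, Q5_0, Q5_1, Q8_0,
    Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_K,
}

/// Elements per quantization block, following the ncols multiples in the listing.
fn block_elems(fmt: QuantFormat) -> i32 {
    use QuantFormat::*;
    match fmt {
        Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 => 32,
        Q2_K | Q3_K | Q4_K | Q5_K | Q6_K | Q8_K => 256,
    }
}

/// True when `ncols` is positive and a whole number of blocks.
fn ncols_is_supported(fmt: QuantFormat, ncols: i32) -> bool {
    ncols > 0 && ncols % block_elems(fmt) == 0
}

fn main() {
    println!("{}", ncols_is_supported(QuantFormat::Q4_0, 4096));
    println!("{}", ncols_is_supported(QuantFormat::Q2_K, 4128));
}
```

A host-side check like this does not replace the matching `*_can_implement` call, which may reject shapes for other reasons.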
baracuda_kernels_mmvq_q4_1_actstrided_bf16_can_implement^⚠: Implementability check for mmvq_q4_1_actstrided_bf16.
baracuda_kernels_mmvq_q4_1_actstrided_bf16_run^⚠: Strided MMVQ — Q4_1, bf16. # Safety: as Q4_0 strided bf16.
baracuda_kernels_mmvq_q4_1_actstrided_can_implement^⚠: Implementability check for mmvq_q4_1_actstrided.
baracuda_kernels_mmvq_q4_1_actstrided_f16_can_implement^⚠: Implementability check for mmvq_q4_1_actstrided_f16.
baracuda_kernels_mmvq_q4_1_actstrided_f16_run^⚠: Strided MMVQ — Q4_1, f16. # Safety: as Q4_0 strided f16.
baracuda_kernels_mmvq_q4_1_actstrided_run^⚠: Strided MMVQ — GGUF Q4_1. # Safety: as the contig sibling.
baracuda_kernels_mmvq_q4_1_batched_bf16_can_implement^⚠: Implementability check for mmvq_q4_1_batched_bf16.
baracuda_kernels_mmvq_q4_1_batched_bf16_run^⚠: Batched MMVQ — Q4_1, bf16. # Safety: as Q4_0.
baracuda_kernels_mmvq_q4_1_batched_can_implement^⚠: Implementability check for mmvq_q4_1_batched.
baracuda_kernels_mmvq_q4_1_batched_f16_can_implement^⚠: Implementability check for mmvq_q4_1_batched_f16.
baracuda_kernels_mmvq_q4_1_batched_f16_run^⚠: Batched MMVQ — Q4_1, f16. # Safety: as Q4_0.
baracuda_kernels_mmvq_q4_1_batched_run^⚠: Batched MMVQ — Q4_1, f32. # Safety: as Q4_0.
baracuda_kernels_mmvq_q4_1_bf16_can_implement^⚠: Implementability check for mmvq_q4_1_bf16.
baracuda_kernels_mmvq_q4_1_bf16_run^⚠: MMVQ — Q4_1, bf16 activation + bf16 output. # Safety: as Q4_0 bf16.
baracuda_kernels_mmvq_q4_1_can_implement^⚠: Implementability check for mmvq_q4_1.
baracuda_kernels_mmvq_q4_1_f16_can_implement^⚠: Implementability check for mmvq_q4_1_f16.
baracuda_kernels_mmvq_q4_1_f16_run^⚠: MMVQ — Q4_1, f16 activation + f16 output. # Safety: as Q4_0 f16.
baracuda_kernels_mmvq_q4_1_run^⚠: GGUF Q4_1 MMVQ. # Safety: as Q4_0.
baracuda_kernels_mmvq_q4_K_actstrided_bf16_can_implement^⚠: Implementability check for mmvq_q4_K_actstrided_bf16.
baracuda_kernels_mmvq_q4_K_actstrided_bf16_run^⚠: Strided MMVQ — Q4_K, bf16. # Safety: as Q2_K strided bf16.
baracuda_kernels_mmvq_q4_K_actstrided_can_implement^⚠: Implementability check for mmvq_q4_K_actstrided.
baracuda_kernels_mmvq_q4_K_actstrided_f16_can_implement^⚠: Implementability check for mmvq_q4_K_actstrided_f16.
baracuda_kernels_mmvq_q4_K_actstrided_f16_run^⚠: Strided MMVQ — Q4_K, f16. # Safety: as Q2_K strided f16.
baracuda_kernels_mmvq_q4_K_actstrided_run^⚠: Strided MMVQ — GGUF Q4_K. # Safety: as the contig sibling.
baracuda_kernels_mmvq_q4_K_batched_bf16_can_implement^⚠: Implementability check for mmvq_q4_K_batched_bf16.
baracuda_kernels_mmvq_q4_K_batched_bf16_run^⚠: Batched MMVQ — Q4_K, bf16. # Safety: as Q2_K.
baracuda_kernels_mmvq_q4_K_batched_can_implement^⚠: Implementability check for mmvq_q4_K_batched.
baracuda_kernels_mmvq_q4_K_batched_f16_can_implement^⚠: Implementability check for mmvq_q4_K_batched_f16.
baracuda_kernels_mmvq_q4_K_batched_f16_run^⚠: Batched MMVQ — Q4_K, f16. # Safety: as Q2_K.
baracuda_kernels_mmvq_q4_K_batched_run^⚠: Batched MMVQ — Q4_K, f32. # Safety: as Q2_K.
baracuda_kernels_mmvq_q4_K_bf16_can_implement^⚠: Implementability check for mmvq_q4_K_bf16.
baracuda_kernels_mmvq_q4_K_bf16_run^⚠: MMVQ — Q4_K, bf16. # Safety: as Q2_K bf16.
baracuda_kernels_mmvq_q4_K_can_implement^⚠: Implementability check for mmvq_q4_K.
baracuda_kernels_mmvq_q4_K_f16_can_implement^⚠: Implementability check for mmvq_q4_K_f16.
baracuda_kernels_mmvq_q4_K_f16_run^⚠: MMVQ — Q4_K, f16. # Safety: as Q2_K f16.
baracuda_kernels_mmvq_q4_K_run^⚠: GGUF Q4_K MMVQ. # Safety: as Q2_K.
baracuda_kernels_mmvq_q5_0_actstrided_bf16_can_implement^⚠: Implementability check for mmvq_q5_0_actstrided_bf16.
baracuda_kernels_mmvq_q5_0_actstrided_bf16_run^⚠: Strided MMVQ — Q5_0, bf16. # Safety: as Q4_0 strided bf16.
baracuda_kernels_mmvq_q5_0_actstrided_can_implement^⚠: Implementability check for mmvq_q5_0_actstrided.
baracuda_kernels_mmvq_q5_0_actstrided_f16_can_implement^⚠: Implementability check for mmvq_q5_0_actstrided_f16.
baracuda_kernels_mmvq_q5_0_actstrided_f16_run^⚠: Strided MMVQ — Q5_0, f16. # Safety: as Q4_0 strided f16.
baracuda_kernels_mmvq_q5_0_actstrided_run^⚠: Strided MMVQ — GGUF Q5_0. # Safety: as the contig sibling.
baracuda_kernels_mmvq_q5_0_batched_bf16_can_implement^⚠: Implementability check for mmvq_q5_0_batched_bf16.
baracuda_kernels_mmvq_q5_0_batched_bf16_run^⚠: Batched MMVQ — Q5_0, bf16. # Safety: as Q4_0.
baracuda_kernels_mmvq_q5_0_batched_can_implement^⚠: Implementability check for mmvq_q5_0_batched.
baracuda_kernels_mmvq_q5_0_batched_f16_can_implement^⚠: Implementability check for mmvq_q5_0_batched_f16.
baracuda_kernels_mmvq_q5_0_batched_f16_run^⚠: Batched MMVQ — Q5_0, f16. # Safety: as Q4_0.
baracuda_kernels_mmvq_q5_0_batched_run^⚠: Batched MMVQ — Q5_0, f32. # Safety: as Q4_0.
baracuda_kernels_mmvq_q5_0_bf16_can_implement^⚠: Implementability check for mmvq_q5_0_bf16.
baracuda_kernels_mmvq_q5_0_bf16_run^⚠: MMVQ — Q5_0, bf16. # Safety: as Q4_0 bf16.
baracuda_kernels_mmvq_q5_0_can_implement^⚠: Implementability check for mmvq_q5_0.
baracuda_kernels_mmvq_q5_0_f16_can_implement^⚠: Implementability check for mmvq_q5_0_f16.
baracuda_kernels_mmvq_q5_0_f16_run^⚠: MMVQ — Q5_0, f16. # Safety: as Q4_0 f16.
baracuda_kernels_mmvq_q5_0_run^⚠: GGUF Q5_0 MMVQ. # Safety: as Q4_0.
baracuda_kernels_mmvq_q5_1_actstrided_bf16_can_implement^⚠: Implementability check for mmvq_q5_1_actstrided_bf16.
baracuda_kernels_mmvq_q5_1_actstrided_bf16_run^⚠: Strided MMVQ — Q5_1, bf16. # Safety: as Q4_0 strided bf16.
baracuda_kernels_mmvq_q5_1_actstrided_can_implement^⚠: Implementability check for mmvq_q5_1_actstrided.
baracuda_kernels_mmvq_q5_1_actstrided_f16_can_implement^⚠: Implementability check for mmvq_q5_1_actstrided_f16.
baracuda_kernels_mmvq_q5_1_actstrided_f16_run^⚠: Strided MMVQ — Q5_1, f16. # Safety: as Q4_0 strided f16.
baracuda_kernels_mmvq_q5_1_actstrided_run^⚠: Strided MMVQ — GGUF Q5_1. # Safety: as the contig sibling.
baracuda_kernels_mmvq_q5_1_batched_bf16_can_implement^⚠: Implementability check for mmvq_q5_1_batched_bf16.
baracuda_kernels_mmvq_q5_1_batched_bf16_run^⚠: Batched MMVQ — Q5_1, bf16. # Safety: as Q4_0.
baracuda_kernels_mmvq_q5_1_batched_can_implement^⚠: Implementability check for mmvq_q5_1_batched.
baracuda_kernels_mmvq_q5_1_batched_f16_can_implement^⚠: Implementability check for mmvq_q5_1_batched_f16.
baracuda_kernels_mmvq_q5_1_batched_f16_run^⚠: Batched MMVQ — Q5_1, f16. # Safety: as Q4_0.
baracuda_kernels_mmvq_q5_1_batched_run^⚠: Batched MMVQ — Q5_1, f32. # Safety: as Q4_0.
baracuda_kernels_mmvq_q5_1_bf16_can_implement^⚠: Implementability check for mmvq_q5_1_bf16.
baracuda_kernels_mmvq_q5_1_bf16_run^⚠: MMVQ — Q5_1, bf16. # Safety: as Q4_0 bf16.
baracuda_kernels_mmvq_q5_1_can_implement^⚠: Implementability check for mmvq_q5_1.
baracuda_kernels_mmvq_q5_1_f16_can_implement^⚠: Implementability check for mmvq_q5_1_f16.
baracuda_kernels_mmvq_q5_1_f16_run^⚠: MMVQ — Q5_1, f16. # Safety: as Q4_0 f16.
baracuda_kernels_mmvq_q5_1_run^⚠: GGUF Q5_1 MMVQ. # Safety: as Q4_0.
baracuda_kernels_mmvq_q5_K_actstrided_bf16_can_implement^⚠: Implementability check for mmvq_q5_K_actstrided_bf16.
baracuda_kernels_mmvq_q5_K_actstrided_bf16_run^⚠: Strided MMVQ — Q5_K, bf16. # Safety: as Q2_K strided bf16.
baracuda_kernels_mmvq_q5_K_actstrided_can_implement^⚠: Implementability check for mmvq_q5_K_actstrided.
baracuda_kernels_mmvq_q5_K_actstrided_f16_can_implement^⚠: Implementability check for mmvq_q5_K_actstrided_f16.
baracuda_kernels_mmvq_q5_K_actstrided_f16_run^⚠: Strided MMVQ — Q5_K, f16. # Safety: as Q2_K strided f16.
baracuda_kernels_mmvq_q5_K_actstrided_run^⚠: Strided MMVQ — GGUF Q5_K. # Safety: as the contig sibling.
baracuda_kernels_mmvq_q5_K_batched_bf16_can_implement^⚠: Implementability check for mmvq_q5_K_batched_bf16.
baracuda_kernels_mmvq_q5_K_batched_bf16_run^⚠: Batched MMVQ — Q5_K, bf16. # Safety: as Q2_K.
baracuda_kernels_mmvq_q5_K_batched_can_implement^⚠: Implementability check for mmvq_q5_K_batched.
baracuda_kernels_mmvq_q5_K_batched_f16_can_implement^⚠: Implementability check for mmvq_q5_K_batched_f16.
baracuda_kernels_mmvq_q5_K_batched_f16_run^⚠: Batched MMVQ — Q5_K, f16. # Safety: as Q2_K.
baracuda_kernels_mmvq_q5_K_batched_run^⚠: Batched MMVQ — Q5_K, f32. # Safety: as Q2_K.
baracuda_kernels_mmvq_q5_K_bf16_can_implement^⚠: Implementability check for mmvq_q5_K_bf16.
baracuda_kernels_mmvq_q5_K_bf16_run^⚠: MMVQ — Q5_K, bf16. # Safety: as Q2_K bf16.
baracuda_kernels_mmvq_q5_K_can_implement^⚠: Implementability check for mmvq_q5_K.
baracuda_kernels_mmvq_q5_K_f16_can_implement^⚠: Implementability check for mmvq_q5_K_f16.
baracuda_kernels_mmvq_q5_K_f16_run^⚠: MMVQ — Q5_K, f16. # Safety: as Q2_K f16.
baracuda_kernels_mmvq_q5_K_run^⚠: GGUF Q5_K MMVQ. # Safety: as Q2_K.
baracuda_kernels_mmvq_q6_K_actstrided_bf16_can_implement^⚠: Implementability check for mmvq_q6_K_actstrided_bf16.
baracuda_kernels_mmvq_q6_K_actstrided_bf16_run^⚠: Strided MMVQ — Q6_K, bf16. # Safety: as Q2_K strided bf16.
baracuda_kernels_mmvq_q6_K_actstrided_can_implement^⚠: Implementability check for mmvq_q6_K_actstrided.
baracuda_kernels_mmvq_q6_K_actstrided_f16_can_implement^⚠: Implementability check for mmvq_q6_K_actstrided_f16.
baracuda_kernels_mmvq_q6_K_actstrided_f16_run^⚠: Strided MMVQ — Q6_K, f16. # Safety: as Q2_K strided f16.
baracuda_kernels_mmvq_q6_K_actstrided_run^⚠: Strided MMVQ — GGUF Q6_K. # Safety: as the contig sibling.
baracuda_kernels_mmvq_q6_K_batched_bf16_can_implement^⚠: Implementability check for mmvq_q6_K_batched_bf16.
baracuda_kernels_mmvq_q6_K_batched_bf16_run^⚠: Batched MMVQ — Q6_K, bf16. # Safety: as Q2_K.
baracuda_kernels_mmvq_q6_K_batched_can_implement^⚠: Implementability check for mmvq_q6_K_batched.
baracuda_kernels_mmvq_q6_K_batched_f16_can_implement^⚠: Implementability check for mmvq_q6_K_batched_f16.
baracuda_kernels_mmvq_q6_K_batched_f16_run^⚠: Batched MMVQ — Q6_K, f16. # Safety: as Q2_K.
baracuda_kernels_mmvq_q6_K_batched_run^⚠: Batched MMVQ — Q6_K, f32. # Safety: as Q2_K.
baracuda_kernels_mmvq_q6_K_bf16_can_implement^⚠: Implementability check for mmvq_q6_K_bf16.
baracuda_kernels_mmvq_q6_K_bf16_run^⚠: MMVQ — Q6_K, bf16. # Safety: as Q2_K bf16.
baracuda_kernels_mmvq_q6_K_can_implement^⚠: Implementability check for mmvq_q6_K.
baracuda_kernels_mmvq_q6_K_f16_can_implement^⚠: Implementability check for mmvq_q6_K_f16.
baracuda_kernels_mmvq_q6_K_f16_run^⚠: MMVQ — Q6_K, f16. # Safety: as Q2_K f16.
baracuda_kernels_mmvq_q6_K_run^⚠: GGUF Q6_K MMVQ. # Safety: as Q2_K.
baracuda_kernels_mmvq_q8_0_actstrided_bf16_can_implement^⚠: Implementability check for mmvq_q8_0_actstrided_bf16.
baracuda_kernels_mmvq_q8_0_actstrided_bf16_run^⚠: Strided MMVQ — Q8_0, bf16. # Safety: as Q4_0 strided bf16.
baracuda_kernels_mmvq_q8_0_actstrided_can_implement^⚠: Implementability check for mmvq_q8_0_actstrided.
baracuda_kernels_mmvq_q8_0_actstrided_f16_can_implement^⚠: Implementability check for mmvq_q8_0_actstrided_f16.
baracuda_kernels_mmvq_q8_0_actstrided_f16_run^⚠: Strided MMVQ — Q8_0, f16. # Safety: as Q4_0 strided f16.
baracuda_kernels_mmvq_q8_0_actstrided_run^⚠: Strided MMVQ — GGUF Q8_0. # Safety: as the contig sibling.
baracuda_kernels_mmvq_q8_0_batched_bf16_can_implement^⚠: baracuda_kernels_mmvq_q8_0_batched_bf16_can_implement (baracuda kernels mmvq q8 0 batched bf16 can implement).
baracuda_kernels_mmvq_q8_0_batched_bf16_run^⚠: Batched MMVQ — Q8_0, bf16. # Safety: as Q4_0.
baracuda_kernels_mmvq_q8_0_batched_can_implement^⚠: baracuda_kernels_mmvq_q8_0_batched_can_implement (baracuda kernels mmvq q8 0 batched can implement).
baracuda_kernels_mmvq_q8_0_batched_f16_can_implement^⚠: baracuda_kernels_mmvq_q8_0_batched_f16_can_implement (baracuda kernels mmvq q8 0 batched f16 can implement).
baracuda_kernels_mmvq_q8_0_batched_f16_run^⚠: Batched MMVQ — Q8_0, f16. # Safety: as Q4_0.
baracuda_kernels_mmvq_q8_0_batched_run^⚠: Batched MMVQ — Q8_0, f32. # Safety: as Q4_0.
baracuda_kernels_mmvq_q8_0_bf16_can_implement^⚠: baracuda_kernels_mmvq_q8_0_bf16_can_implement (baracuda kernels mmvq q8 0 bf16 can implement).
baracuda_kernels_mmvq_q8_0_bf16_run^⚠: MMVQ — Q8_0, bf16. # Safety: as Q4_0 bf16.
baracuda_kernels_mmvq_q8_0_can_implement^⚠: baracuda_kernels_mmvq_q8_0_can_implement (baracuda kernels mmvq q8 0 can implement).
baracuda_kernels_mmvq_q8_0_f16_can_implement^⚠: baracuda_kernels_mmvq_q8_0_f16_can_implement (baracuda kernels mmvq q8 0 f16 can implement).
baracuda_kernels_mmvq_q8_0_f16_run^⚠: MMVQ — Q8_0, f16. # Safety: as Q4_0 f16.
baracuda_kernels_mmvq_q8_0_run^⚠: GGUF Q8_0 MMVQ. # Safety: as Q4_0.
baracuda_kernels_mmvq_q8_K_actstrided_bf16_can_implement^⚠: Implementability check for mmvq_q8_K_actstrided_bf16.
baracuda_kernels_mmvq_q8_K_actstrided_bf16_run^⚠: Strided MMVQ — Q8_K, bf16 (bespoke). # Safety: as Q2_K strided bf16.
baracuda_kernels_mmvq_q8_K_actstrided_can_implement^⚠: Implementability check for mmvq_q8_K_actstrided.
baracuda_kernels_mmvq_q8_K_actstrided_f16_can_implement^⚠: Implementability check for mmvq_q8_K_actstrided_f16.
baracuda_kernels_mmvq_q8_K_actstrided_f16_run^⚠: Strided MMVQ — Q8_K, f16 (bespoke). # Safety: as Q2_K strided f16.
baracuda_kernels_mmvq_q8_K_actstrided_run^⚠: Strided MMVQ — GGUF Q8_K (bespoke; Phase 11.4 + 14.5).
baracuda_kernels_mmvq_q8_K_batched_bf16_can_implement^⚠: Implementability check for mmvq_q8_K_batched_bf16.
baracuda_kernels_mmvq_q8_K_batched_bf16_run^⚠: Batched MMVQ — Q8_K (bespoke), bf16. # Safety: as Q2_K.
baracuda_kernels_mmvq_q8_K_batched_can_implement^⚠: Implementability check for mmvq_q8_K_batched.
baracuda_kernels_mmvq_q8_K_batched_f16_can_implement^⚠: Implementability check for mmvq_q8_K_batched_f16.
baracuda_kernels_mmvq_q8_K_batched_f16_run^⚠: Batched MMVQ — Q8_K (bespoke), f16. # Safety: as Q2_K.
baracuda_kernels_mmvq_q8_K_batched_run^⚠: Batched MMVQ — Q8_K (bespoke), f32. # Safety: as Q2_K.
baracuda_kernels_mmvq_q8_K_bf16_can_implement^⚠: Implementability check for mmvq_q8_K_bf16.
baracuda_kernels_mmvq_q8_K_bf16_run^⚠: MMVQ — Q8_K, bf16 (bespoke; Phase 11.4 + 18.1). # Safety: as Q2_K bf16.
baracuda_kernels_mmvq_q8_K_can_implement^⚠: Implementability check for mmvq_q8_K.
baracuda_kernels_mmvq_q8_K_f16_can_implement^⚠: Implementability check for mmvq_q8_K_f16.
baracuda_kernels_mmvq_q8_K_f16_run^⚠: MMVQ — Q8_K, f16 (bespoke; Phase 11.4 + 18.1). # Safety: as Q2_K f16.
baracuda_kernels_mmvq_q8_K_run^⚠: GGUF Q8_K MMVQ — Phase 11.4 (bespoke, not vendored from llama.cpp). ncols must be a multiple of 256. # Safety: as Q2_K.
baracuda_kernels_moe_scalar_gguf_can_implement^⚠: Implementability check for moe_scalar_gguf.
baracuda_kernels_moe_scalar_gguf_run^⚠: MoE forward — scalar dispatch path on GGUF-packed expert weights. f32 activations in, f32 output out.
baracuda_kernels_moe_wmma_bf16_can_implement^⚠: Implementability check for moe_wmma_bf16.
baracuda_kernels_moe_wmma_bf16_run^⚠: MoE forward — WMMA FP weights, bf16 activations + weights, bf16 output.
baracuda_kernels_moe_wmma_f16_can_implement^⚠: Implementability check for moe_wmma_f16.
baracuda_kernels_moe_wmma_f16_run^⚠: MoE forward — WMMA FP weights, f16 activations + weights, f16 output. Output buffer must be zero-initialized by the caller when topk_weights == null and topk > 1 (multiple writes per row).
baracuda_kernels_moe_wmma_gguf_bf16_can_implement^⚠: Implementability check for moe_wmma_gguf_bf16.
baracuda_kernels_moe_wmma_gguf_bf16_run^⚠: MoE forward — WMMA + GGUF combined path, bf16 activations.
baracuda_kernels_moe_wmma_gguf_f16_can_implement^⚠: Implementability check for moe_wmma_gguf_f16.
baracuda_kernels_moe_wmma_gguf_f16_run^⚠: MoE forward — WMMA + GGUF combined path. f16 activations, GGUF-packed weights, f32 output.
baracuda_kernels_msort_backward_f32_can_implement^⚠: Implementability check for msort_backward_f32.
baracuda_kernels_msort_backward_f32_run^⚠: Msort BW, f32. Same scatter as sort BW; distinct symbol kept for FFI / telemetry parity.
baracuda_kernels_msort_backward_f64_can_implement^⚠: Implementability check for msort_backward_f64.
baracuda_kernels_msort_backward_f64_run^⚠: Msort BW, f64.
baracuda_kernels_msort_f32_can_implement^⚠: Implementability check for msort_f32.
baracuda_kernels_msort_f32_run^⚠: Stable block-bitonic sort, f32. Tie-break on original index so equal keys preserve input order.
baracuda_kernels_msort_f64_can_implement^⚠: Implementability check for msort_f64.
baracuda_kernels_msort_f64_run^⚠: Stable block-bitonic sort, f64.
baracuda_kernels_msort_i32_can_implement^⚠: Implementability check for msort_i32.
baracuda_kernels_msort_i32_run^⚠: Stable block-bitonic sort, i32.
baracuda_kernels_msort_i64_can_implement^⚠: Implementability check for msort_i64.
baracuda_kernels_msort_i64_run^⚠: Stable block-bitonic sort, i64.
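The tie-break rule described for the msort kernels can be illustrated on the host. The following is a minimal CPU sketch, not the device kernel: `stable_sort_ref` is a hypothetical helper, and ordering floats with `f32::total_cmp` (including where NaN lands) is an assumption rather than documented kernel behavior.

```rust
/// Hypothetical host-side reference: sort keys ascending, breaking ties on the
/// original index so equal keys keep their input order.
/// Returns (sorted keys, original index of each sorted key).
fn stable_sort_ref(keys: &[f32]) -> (Vec<f32>, Vec<usize>) {
    let mut idx: Vec<usize> = (0..keys.len()).collect();
    // Comparing (key, index) pairs makes even an unstable sort produce a
    // stable result, which is the same trick a bitonic network needs.
    idx.sort_unstable_by(|&a, &b| keys[a].total_cmp(&keys[b]).then(a.cmp(&b)));
    let sorted = idx.iter().map(|&i| keys[i]).collect();
    (sorted, idx)
}
```

Because the index is part of the comparison key, no two elements ever compare equal, so the result does not depend on the stability of the underlying sorting network.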
baracuda_kernels_nms_f32_can_implement^⚠: Implementability check for nms_f32.
baracuda_kernels_nms_f32_run^⚠: nms(boxes, iou_thresh). Caller supplies boxes pre-sorted by score, descending. boxes: [num_boxes, 4] (x1, y1, x2, y2). keep_mask: [num_boxes] u8 (0 / 1); count_out: single i32. f32. # Safety: as above.
baracuda_kernels_nms_f64_can_implement^⚠: Implementability check for nms_f64.
baracuda_kernels_nms_f64_run^⚠: nms, f64. # Safety: as f32.
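As a reading aid for the nms entries, here is a minimal CPU sketch of greedy non-maximum suppression over boxes already sorted by descending score. `iou` and `nms_ref` are hypothetical helpers; suppressing on `IoU > iou_thresh` (strictly greater) and treating degenerate boxes as IoU 0 are assumptions, not documented kernel behavior.

```rust
/// Intersection-over-union of two (x1, y1, x2, y2) boxes.
fn iou(a: &[f32; 4], b: &[f32; 4]) -> f32 {
    let iw = (a[2].min(b[2]) - a[0].max(b[0])).max(0.0);
    let ih = (a[3].min(b[3]) - a[1].max(b[1])).max(0.0);
    let inter = iw * ih;
    let union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter;
    if union > 0.0 { inter / union } else { 0.0 }
}

/// Hypothetical reference: returns (keep_mask as 0/1 bytes, number kept).
/// Assumes `boxes` is pre-sorted by score, descending.
fn nms_ref(boxes: &[[f32; 4]], iou_thresh: f32) -> (Vec<u8>, i32) {
    let mut keep = vec![1u8; boxes.len()];
    for i in 0..boxes.len() {
        if keep[i] == 0 {
            continue; // already suppressed by a higher-scoring box
        }
        for j in (i + 1)..boxes.len() {
            if keep[j] == 1 && iou(&boxes[i], &boxes[j]) > iou_thresh {
                keep[j] = 0;
            }
        }
    }
    let count: i32 = keep.iter().map(|&k| k as i32).sum();
    (keep, count)
}
```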
baracuda_kernels_nonzero_bool_can_implement^⚠: Implementability check for nonzero_bool.
baracuda_kernels_nonzero_bool_run^⚠: nonzero — bool (u8) input.
baracuda_kernels_nonzero_f32_can_implement^⚠: Implementability check for nonzero_f32.
baracuda_kernels_nonzero_f32_run^⚠: Coordinates where x[i] != 0. f32 input.
baracuda_kernels_nonzero_f64_can_implement^⚠: Implementability check for nonzero_f64.
baracuda_kernels_nonzero_f64_run^⚠: nonzero — f64 input.
baracuda_kernels_nonzero_i32_can_implement^⚠: Implementability check for nonzero_i32.
baracuda_kernels_nonzero_i32_run^⚠: nonzero — i32 input.
baracuda_kernels_nonzero_i64idx_bool_can_implement^⚠: Implementability check for nonzero_i64idx_bool.
baracuda_kernels_nonzero_i64idx_bool_run^⚠: nonzero — bool input, i64 output coords.
baracuda_kernels_nonzero_i64idx_f32_can_implement^⚠: Implementability check for nonzero_i64idx_f32.
baracuda_kernels_nonzero_i64idx_f32_run^⚠: nonzero — f32 input, i64 output coords.
baracuda_kernels_nonzero_i64idx_f64_can_implement^⚠: Implementability check for nonzero_i64idx_f64.
baracuda_kernels_nonzero_i64idx_f64_run^⚠: nonzero — f64 input, i64 output coords.
baracuda_kernels_nonzero_i64idx_i32_can_implement^⚠: Implementability check for nonzero_i64idx_i32.
baracuda_kernels_nonzero_i64idx_i32_run^⚠: nonzero — i32 input, i64 output coords.
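The nonzero entries return coordinates rather than flat offsets. A minimal CPU sketch of that mapping follows, assuming a contiguous row-major input; `nonzero_ref` is a hypothetical helper, and the output layout used here (one coordinate vector per hit) is chosen for readability and is not claimed to match the device buffer layout.

```rust
/// Hypothetical reference: coordinates of every element with x[i] != 0,
/// for a contiguous row-major tensor of the given shape.
fn nonzero_ref(x: &[f32], shape: &[usize]) -> Vec<Vec<i32>> {
    let mut coords = Vec::new();
    for (flat, &v) in x.iter().enumerate() {
        if v != 0.0 {
            // Unravel the flat offset, last axis fastest.
            let mut rem = flat;
            let mut c = vec![0i32; shape.len()];
            for d in (0..shape.len()).rev() {
                c[d] = (rem % shape[d]) as i32;
                rem /= shape[d];
            }
            coords.push(c);
        }
    }
    coords
}
```

Note that under IEEE comparison a NaN input satisfies `v != 0.0` and is therefore reported by this sketch.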
baracuda_kernels_one_hot_bool_can_implement^⚠: Implementability check for one_hot_bool.
baracuda_kernels_one_hot_bool_run^⚠: one_hot — bool output (u8 storage).
baracuda_kernels_one_hot_f32_can_implement^⚠: Implementability check for one_hot_f32.
baracuda_kernels_one_hot_f32_run^⚠: out[..., c] = 1 if c == src[...] else 0. Output last axis has extent num_classes. Input dtype is always i32; output is f32.
baracuda_kernels_one_hot_f64_can_implement^⚠: Implementability check for one_hot_f64.
baracuda_kernels_one_hot_f64_run^⚠: one_hot — f64 output.
baracuda_kernels_one_hot_i32_can_implement^⚠: Implementability check for one_hot_i32.
baracuda_kernels_one_hot_i32_run^⚠: one_hot — i32 output.
baracuda_kernels_one_hot_i64idx_bool_can_implement^⚠: Implementability check for one_hot_i64idx_bool.
baracuda_kernels_one_hot_i64idx_bool_run^⚠: one_hot — bool output, i64 indices.
baracuda_kernels_one_hot_i64idx_f32_can_implement^⚠: Implementability check for one_hot_i64idx_f32.
baracuda_kernels_one_hot_i64idx_f32_run^⚠: one_hot — f32 output, i64 input class indices.
baracuda_kernels_one_hot_i64idx_f64_can_implement^⚠: Implementability check for one_hot_i64idx_f64.
baracuda_kernels_one_hot_i64idx_f64_run^⚠: one_hot — f64 output, i64 indices.
baracuda_kernels_one_hot_i64idx_i32_can_implement^⚠: Implementability check for one_hot_i64idx_i32.
baracuda_kernels_one_hot_i64idx_i32_run^⚠: one_hot — i32 output, i64 indices.
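The one_hot rule `out[..., c] = 1 if c == src[...] else 0` can be sketched on the CPU as below. `one_hot_ref` is a hypothetical helper for the i32-index, f32-output case; leaving a row all zero when the class index is out of range is an assumption, since the entries above do not say how the kernels treat such input.

```rust
/// Hypothetical reference: expands each class index into a row of
/// `num_classes` values, with a single 1.0 at the class position.
fn one_hot_ref(src: &[i32], num_classes: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; src.len() * num_classes];
    for (row, &class) in src.iter().enumerate() {
        // Assumption: out-of-range indices produce an all-zero row.
        if class >= 0 && (class as usize) < num_classes {
            out[row * num_classes + class as usize] = 1.0;
        }
    }
    out
}
```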
baracuda_kernels_ormqr_f32_run^⚠: Apply Householder-encoded Q (from a prior geqrf) to c_inout. side ∈ {0=Left, 1=Right}; op ∈ {0=N, 1=T, 2=C}. On Left + op=N, computes C := Q · C; pair with a pre-staged identity C to materialize dense Q.
baracuda_kernels_ormqr_f64_run^⚠: Apply Householder-encoded Q (from a prior geqrf) to c_inout. side ∈ {0=Left, 1=Right}; op ∈ {0=N, 1=T, 2=C}. On Left + op=N, computes C := Q · C; pair with a pre-staged identity C to materialize dense Q.
baracuda_kernels_pad_circular_bf16_can_implement^⚠: Implementability check for pad_circular_bf16.
baracuda_kernels_pad_circular_bf16_run^⚠: Pad circular, bf16.
baracuda_kernels_pad_circular_f16_can_implement^⚠: Implementability check for pad_circular_f16.
baracuda_kernels_pad_circular_f16_run^⚠: Pad circular, f16.
baracuda_kernels_pad_circular_f32_can_implement^⚠: Implementability check for pad_circular_f32.
baracuda_kernels_pad_circular_f32_run^⚠: Pad circular, f32. Cyclic wrap from the opposite end of each axis.
baracuda_kernels_pad_circular_f64_can_implement^⚠: Implementability check for pad_circular_f64.
baracuda_kernels_pad_circular_f64_run^⚠: Pad circular, f64.
baracuda_kernels_pad_constant_backward_bf16_can_implement^⚠: Implementability check for pad_constant_backward_bf16.
baracuda_kernels_pad_constant_backward_bf16_run^⚠: Pad-constant backward (slice), bf16.
baracuda_kernels_pad_constant_backward_f16_can_implement^⚠: Implementability check for pad_constant_backward_f16.
baracuda_kernels_pad_constant_backward_f16_run^⚠: Pad-constant backward (slice), f16.
baracuda_kernels_pad_constant_backward_f32_can_implement^⚠: Implementability check for pad_constant_backward_f32.
baracuda_kernels_pad_constant_backward_f32_run^⚠: Pad-constant backward (slice), f32.
baracuda_kernels_pad_constant_backward_f64_can_implement^⚠: Implementability check for pad_constant_backward_f64.
baracuda_kernels_pad_constant_backward_f64_run^⚠: Pad-constant backward (slice), f64.
baracuda_kernels_pad_constant_bf16_can_implement^⚠: Implementability check for pad_constant_bf16.
baracuda_kernels_pad_constant_bf16_run^⚠: Pad with a constant value, bf16, contig output. The value argument carries the __nv_bfloat16 bit pattern as u16 — Rust callers can produce it via half::bf16::to_bits().
baracuda_kernels_pad_constant_f16_can_implement^⚠: Implementability check for pad_constant_f16.
baracuda_kernels_pad_constant_f16_run^⚠: Pad with a constant value, f16, contig output. The value argument carries the __half bit pattern as u16 — Rust callers can produce it via half::f16::to_bits(). ABI-compatible because __half is a 2-byte __CUDA_ALIGN__(2) POD struct passed in the same register slot as unsigned short.
baracuda_kernels_pad_constant_f32_can_implement^⚠: Implementability check for pad_constant_f32.
baracuda_kernels_pad_constant_f32_run^⚠: Pad with a constant value, f32, contig output.
baracuda_kernels_pad_constant_f64_can_implement^⚠: Implementability check for pad_constant_f64.
baracuda_kernels_pad_constant_f64_run^⚠: Pad with a constant value, f64, contig output.
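The bf16 pad_constant entry takes its fill value as a raw u16 bit pattern. The entry names `half::bf16::to_bits()` as the way to obtain it; for readers without the `half` crate, the sketch below shows what such a bit pattern is, using only the standard library. `f32_to_bf16_bits` is a hypothetical helper, and its round-to-nearest-even behavior is this sketch's choice, not a statement about what the kernel or the `half` crate does internally.

```rust
/// Hypothetical helper: converts an f32 to the 16-bit bfloat16 pattern
/// (the upper half of the f32 encoding), rounding to nearest, ties to even.
fn f32_to_bf16_bits(x: f32) -> u16 {
    let bits = x.to_bits();
    if x.is_nan() {
        // Keep the value a (quiet) NaN after truncation.
        return ((bits >> 16) as u16) | 0x0040;
    }
    // Add 0x7FFF plus the lowest kept bit so that exact ties round to even.
    let rounding = 0x7FFF + ((bits >> 16) & 1);
    ((bits + rounding) >> 16) as u16
}
```

The resulting u16 is the kind of value the `value` argument described above expects; the f16 variant needs a genuine half-precision conversion and is not covered by this sketch.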
baracuda_kernels_pad_reflect_bf16_can_implement^⚠: Implementability check for pad_reflect_bf16.
baracuda_kernels_pad_reflect_bf16_run^⚠: Pad reflect, bf16.
baracuda_kernels_pad_reflect_f16_can_implement^⚠: Implementability check for pad_reflect_f16.
baracuda_kernels_pad_reflect_f16_run^⚠: Pad reflect, f16.
baracuda_kernels_pad_reflect_f32_can_implement^⚠: Implementability check for pad_reflect_f32.
baracuda_kernels_pad_reflect_f32_run^⚠: Pad reflect, f32. Mirror input across the boundary (no edge duplication).
baracuda_kernels_pad_reflect_f64_can_implement^⚠: Implementability check for pad_reflect_f64.
baracuda_kernels_pad_reflect_f64_run^⚠: Pad reflect, f64.
baracuda_kernels_pad_replicate_bf16_can_implement^⚠: Implementability check for baracuda_kernels_pad_replicate_bf16. Host-side only.
baracuda_kernels_pad_replicate_bf16_run^⚠: Pad replicate, bf16.
baracuda_kernels_pad_replicate_f16_can_implement^⚠: Implementability check for baracuda_kernels_pad_replicate_f16. Host-side only.
baracuda_kernels_pad_replicate_f16_run^⚠: Pad replicate, f16.
baracuda_kernels_pad_replicate_f32_can_implement^⚠: Implementability check for baracuda_kernels_pad_replicate_f32. Host-side only.
baracuda_kernels_pad_replicate_f32_run^⚠: Pad replicate, f32. Clamp to the edge value of the input.
baracuda_kernels_pad_replicate_f64_can_implement^⚠: Implementability check for baracuda_kernels_pad_replicate_f64. Host-side only.
baracuda_kernels_pad_replicate_f64_run^⚠: Pad replicate, f64.
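The three non-constant padding modes differ only in how an output position maps back to an input position along each axis. The following one-axis sketch illustrates the mappings described above (cyclic wrap, mirror without edge duplication, clamp to edge). `PadMode` and `source_index` are hypothetical names, `len > 0` is assumed, and whether the kernels accept pad widths larger than the input extent is not stated in the entries.

```rust
#[derive(Clone, Copy)]
enum PadMode {
    Circular,
    Reflect,
    Replicate,
}

/// Hypothetical helper: input index read by output position `out_idx`
/// along one axis of extent `len` with `pad_before` leading pad elements.
fn source_index(out_idx: usize, pad_before: usize, len: usize, mode: PadMode) -> usize {
    let pos = out_idx as isize - pad_before as isize; // may be negative or >= len
    let n = len as isize;
    let src = match mode {
        // Wrap around from the opposite end of the axis.
        PadMode::Circular => pos.rem_euclid(n),
        // Clamp to the nearest edge element.
        PadMode::Replicate => pos.clamp(0, n - 1),
        // Mirror across the boundary; the edge element itself is not repeated.
        PadMode::Reflect => {
            if n == 1 {
                0
            } else {
                let period = 2 * (n - 1);
                let m = pos.rem_euclid(period);
                if m < n { m } else { period - m }
            }
        }
    };
    src as usize
}
```

For an axis of extent 4 with two leading pad elements, output position 0 reads input index 2 under circular and reflect padding, and index 0 under replicate padding.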
baracuda_kernels_permute_bf16_can_implement^⚠: Pre-launch implementability check for permute_bf16.
baracuda_kernels_permute_bf16_run^⚠: Materialized permute, bf16. Pure element copy — no math.
baracuda_kernels_permute_bf16_strided_can_implement^⚠: permute_bf16_strided_can_implement companion.
baracuda_kernels_permute_bf16_strided_run^⚠: Permute strided sibling, bf16.
baracuda_kernels_permute_f16_can_implement^⚠: Pre-launch implementability check for permute_f16.
baracuda_kernels_permute_f16_run^⚠: Materialized permute, f16. Pure element copy — no math.
baracuda_kernels_permute_f16_strided_can_implement^⚠: permute_f16_strided_can_implement companion.
baracuda_kernels_permute_f16_strided_run^⚠: Permute strided sibling, f16.
baracuda_kernels_permute_f32_can_implement^⚠: Pre-launch implementability check for permute_f32.
baracuda_kernels_permute_f32_run^⚠: Materialized permute, f32.
baracuda_kernels_permute_f32_strided_can_implement^⚠: permute_f32_strided_can_implement companion.
baracuda_kernels_permute_f32_strided_run^⚠: Permute strided sibling, f32.
baracuda_kernels_permute_f64_can_implement^⚠: Pre-launch implementability check for permute_f64.
baracuda_kernels_permute_f64_run^⚠: Materialized permute, f64. Pure element copy — no math.
baracuda_kernels_permute_f64_strided_can_implement^⚠: permute_f64_strided_can_implement companion.
baracuda_kernels_permute_f64_strided_run^⚠: Permute strided sibling, f64.
baracuda_kernels_pixel_shuffle_bf16_can_implement^⚠: Implementability check for pixel_shuffle_bf16.
baracuda_kernels_pixel_shuffle_bf16_run^⚠: pixel_shuffle, bf16. # Safety: as f32.
baracuda_kernels_pixel_shuffle_f16_can_implement^⚠: Implementability check for pixel_shuffle_f16.
baracuda_kernels_pixel_shuffle_f16_run^⚠: pixel_shuffle, f16. # Safety: as f32.
baracuda_kernels_pixel_shuffle_f32_can_implement^⚠: Implementability check for pixel_shuffle_f32.
baracuda_kernels_pixel_shuffle_f32_run^⚠: pixel_shuffle(x, r) — [N, C·r², H, W] → [N, C, H·r, W·r]. f32. # Safety: as above.
baracuda_kernels_pixel_shuffle_f64_can_implement^⚠: Implementability check for pixel_shuffle_f64.
baracuda_kernels_pixel_shuffle_f64_run^⚠: pixel_shuffle, f64. # Safety: as f32.
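The shape change [N, C·r², H, W] → [N, C, H·r, W·r] fixes the sizes but not which input channel feeds which output pixel. The CPU sketch below assumes the PyTorch convention, `out[n, c, h*r + i, w*r + j] = x[n, c*r*r + i*r + j, h, w]`; that channel ordering is an assumption about these kernels, and `pixel_shuffle_ref` is a hypothetical helper over contiguous NCHW data.

```rust
/// Hypothetical reference for pixel_shuffle on contiguous NCHW data.
/// `x` has shape [n, c*r*r, h, w]; the result has shape [n, c, h*r, w*r].
fn pixel_shuffle_ref(x: &[f32], n: usize, c: usize, h: usize, w: usize, r: usize) -> Vec<f32> {
    let (oh, ow) = (h * r, w * r);
    let mut out = vec![0.0f32; n * c * oh * ow];
    for b in 0..n {
        for ch in 0..c {
            for y in 0..oh {
                for xo in 0..ow {
                    // Split each output coordinate into (input pixel, sub-pixel offset).
                    let (ih, i) = (y / r, y % r);
                    let (iw, j) = (xo / r, xo % r);
                    let ic = ch * r * r + i * r + j; // assumed PyTorch channel order
                    let src = ((b * c * r * r + ic) * h + ih) * w + iw;
                    let dst = ((b * c + ch) * oh + y) * ow + xo;
                    out[dst] = x[src];
                }
            }
        }
    }
    out
}
```

Swapping the roles of `src` and `dst` in the inner assignment gives the corresponding pixel_unshuffle under the same assumption.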
baracuda_kernels_pixel_unshuffle_bf16_can_implement^⚠: Implementability check for pixel_unshuffle_bf16.
baracuda_kernels_pixel_unshuffle_bf16_run^⚠: pixel_unshuffle, bf16. # Safety: as f32.
baracuda_kernels_pixel_unshuffle_f16_can_implement^⚠: Implementability check for pixel_unshuffle_f16.
baracuda_kernels_pixel_unshuffle_f16_run^⚠: pixel_unshuffle, f16. # Safety: as f32.
baracuda_kernels_pixel_unshuffle_f32_can_implement^⚠: Implementability check for pixel_unshuffle_f32.
baracuda_kernels_pixel_unshuffle_f32_run^⚠: pixel_unshuffle(x, r) — [N, C, H·r, W·r] → [N, C·r², H, W]. Inverse of pixel_shuffle (and each is the other’s BW). f32.
baracuda_kernels_pixel_unshuffle_f64_can_implement^⚠: Implementability check for pixel_unshuffle_f64.
baracuda_kernels_pixel_unshuffle_f64_run^⚠: pixel_unshuffle, f64. # Safety: as f32.
baracuda_kernels_prelu_backward_bf16_can_implement^⚠: Implementability check for baracuda_kernels_prelu_backward_bf16. Host-side only.
baracuda_kernels_prelu_backward_bf16_run^⚠: PReLU BW, bf16.
baracuda_kernels_prelu_backward_f16_can_implement^⚠: Implementability check for baracuda_kernels_prelu_backward_f16. Host-side only.
baracuda_kernels_prelu_backward_f16_run^⚠: PReLU BW, f16.
baracuda_kernels_prelu_backward_f32_can_implement^⚠: Implementability check for baracuda_kernels_prelu_backward_f32. Host-side only.
baracuda_kernels_prelu_backward_f32_run^⚠: PReLU BW, f32. ABI: (numel, channel_stride, channel_extent, scalar_weight, dy, x, weight, dx, dweight, workspace, workspace_bytes, stream).
baracuda_kernels_prelu_backward_f64_can_implement^⚠: Implementability check for baracuda_kernels_prelu_backward_f64. Host-side only.
baracuda_kernels_prelu_backward_f64_run^⚠: PReLU BW, f64.
baracuda_kernels_prelu_bf16_can_implement^⚠: Implementability check for prelu_bf16.
baracuda_kernels_prelu_bf16_run^⚠: PReLU FW, bf16.
baracuda_kernels_prelu_f16_can_implement^⚠: Implementability check for prelu_f16.
baracuda_kernels_prelu_f16_run^⚠: PReLU FW, f16.
baracuda_kernels_prelu_f32_can_implement^⚠: Implementability check for prelu_f32.
baracuda_kernels_prelu_f32_run^⚠: PReLU FW, f32. ABI: (numel, channel_stride, channel_extent, scalar_weight, x, weight, y, workspace, workspace_bytes, stream).
baracuda_kernels_prelu_f64_can_implement^⚠: Implementability check for prelu_f64.
baracuda_kernels_prelu_f64_run^⚠: PReLU FW, f64.
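The PReLU forward ABI lists `channel_stride`, `channel_extent`, and `scalar_weight` without spelling out how they select a weight. One plausible reading, sketched here on the CPU, is that element `i` uses channel `(i / channel_stride) % channel_extent`, and that `scalar_weight` means a single shared slope in `weight[0]`. Both interpretations are assumptions made for illustration, and `prelu_ref` is a hypothetical helper, not the crate's API.

```rust
/// Hypothetical reference for PReLU forward: y = x if x > 0, else w * x.
/// Assumption: the channel of flat element i is (i / channel_stride) % channel_extent.
fn prelu_ref(
    x: &[f32],
    weight: &[f32],
    channel_stride: usize,
    channel_extent: usize,
    scalar_weight: bool,
) -> Vec<f32> {
    x.iter()
        .enumerate()
        .map(|(i, &v)| {
            // Assumption: scalar_weight selects one slope shared by all channels.
            let w = if scalar_weight {
                weight[0]
            } else {
                weight[(i / channel_stride) % channel_extent]
            };
            if v > 0.0 { v } else { w * v }
        })
        .collect()
}
```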
baracuda_kernels_qr_f32_run^⚠: QR factorization (packed Householder output, m >= n required). a_inout is overwritten with R (upper triangle) + Householder reflectors (strict lower); tau_out is [min(m, n)].
baracuda_kernels_qr_f32_workspace_size^⚠: QR factorization workspace size in bytes for geqrf.
baracuda_kernels_qr_f64_run^⚠: QR factorization (packed Householder output, m >= n required). a_inout is overwritten with R (upper triangle) + Householder reflectors (strict lower); tau_out is [min(m, n)].
baracuda_kernels_qr_f64_workspace_size^⚠: QR factorization workspace size in bytes for geqrf.
baracuda_kernels_quantize_per_channel_backward_bf16_can_implement^⚠: Implementability check for quantize_per_channel_backward_bf16.
baracuda_kernels_quantize_per_channel_backward_bf16_run^⚠: quantize_per_channel_backward — bf16.
baracuda_kernels_quantize_per_channel_backward_f16_can_implement^⚠: Implementability check for quantize_per_channel_backward_f16.
baracuda_kernels_quantize_per_channel_backward_f16_run^⚠: quantize_per_channel_backward — f16.
baracuda_kernels_quantize_per_channel_backward_f32_can_implement^⚠: Implementability check for quantize_per_channel_backward_f32.
baracuda_kernels_quantize_per_channel_backward_f32_run^⚠: dx[i] = (dy[i] / scale[c]) * in_range_mask(x[i]). f32.
baracuda_kernels_quantize_per_channel_backward_f64_can_implement^⚠: Implementability check for quantize_per_channel_backward_f64.
baracuda_kernels_quantize_per_channel_backward_f64_run^⚠: quantize_per_channel_backward — f64.
baracuda_kernels_quantize_per_channel_bf16_s8_can_implement^⚠: Implementability check for quantize_per_channel_bf16_s8.
baracuda_kernels_quantize_per_channel_bf16_s8_run^⚠: quantize_per_channel — bf16 → s8.
baracuda_kernels_quantize_per_channel_bf16_u8_can_implement^⚠: Implementability check for quantize_per_channel_bf16_u8.
baracuda_kernels_quantize_per_channel_bf16_u8_run^⚠: quantize_per_channel — bf16 → u8.
baracuda_kernels_quantize_per_channel_f16_s8_can_implement^⚠: Implementability check for quantize_per_channel_f16_s8.
baracuda_kernels_quantize_per_channel_f16_s8_run^⚠: quantize_per_channel — f16 → s8.
baracuda_kernels_quantize_per_channel_f16_u8_can_implement^⚠: Implementability check for quantize_per_channel_f16_u8.
baracuda_kernels_quantize_per_channel_f16_u8_run^⚠: quantize_per_channel — f16 → u8.
baracuda_kernels_quantize_per_channel_f32_s8_can_implement^⚠: Implementability check for quantize_per_channel_f32_s8.
baracuda_kernels_quantize_per_channel_f32_s8_run^⚠: q[i] = clamp(round(x[i]/scale[c])+zp[c], qmin, qmax) where c = coord[axis]. f32 → s8.
baracuda_kernels_quantize_per_channel_f32_u8_can_implement^⚠: Implementability check for quantize_per_channel_f32_u8.
baracuda_kernels_quantize_per_channel_f32_u8_run^⚠: quantize_per_channel — f32 → u8.
baracuda_kernels_quantize_per_channel_f64_s8_can_implement^⚠: Implementability check for quantize_per_channel_f64_s8.
baracuda_kernels_quantize_per_channel_f64_s8_run^⚠: quantize_per_channel — f64 → s8.
baracuda_kernels_quantize_per_channel_f64_u8_can_implement^⚠: Implementability check for quantize_per_channel_f64_u8.
baracuda_kernels_quantize_per_channel_f64_u8_run^⚠: quantize_per_channel — f64 → u8.
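The per-channel rule `q[i] = clamp(round(x[i]/scale[c])+zp[c], qmin, qmax)` with `c = coord[axis]` can be checked against a small CPU reference. The sketch assumes a contiguous row-major tensor, so `c = (i / inner) % shape[axis]` where `inner` is the product of the dimensions after `axis`. `quantize_per_channel_ref` is a hypothetical helper, and `f32::round` (ties away from zero) is an assumption: the entry does not say which rounding mode the kernel uses.

```rust
/// Hypothetical reference for f32 -> s8 per-channel quantization.
/// Assumes contiguous row-major `x` and qmin <= qmax within the i8 range.
fn quantize_per_channel_ref(
    x: &[f32],
    shape: &[usize],
    axis: usize,
    scale: &[f32],
    zp: &[i32],
    qmin: i32,
    qmax: i32,
) -> Vec<i8> {
    // Number of elements spanned by one step along `axis`.
    let inner: usize = shape[axis + 1..].iter().product();
    let extent = shape[axis];
    x.iter()
        .enumerate()
        .map(|(i, &v)| {
            let c = (i / inner) % extent; // coordinate along the channel axis
            // Assumption: ties round away from zero (f32::round).
            let q = ((v / scale[c]).round() as i32).saturating_add(zp[c]);
            q.clamp(qmin, qmax) as i8
        })
        .collect()
}
```

The per-tensor formula listed further down is the special case where `scale` and `zp` have a single entry shared by every element.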
baracuda_kernels_quantize_per_group_backward_bf16_can_implement^⚠: Implementability check for quantize_per_group_backward_bf16.
baracuda_kernels_quantize_per_group_backward_bf16_run^⚠: STE BW — bf16.
baracuda_kernels_quantize_per_group_backward_f16_can_implement^⚠: Implementability check for quantize_per_group_backward_f16.
baracuda_kernels_quantize_per_group_backward_f16_run^⚠: STE BW — f16.
baracuda_kernels_quantize_per_group_backward_f32_can_implement^⚠: Implementability check for quantize_per_group_backward_f32.
baracuda_kernels_quantize_per_group_backward_f32_run^⚠: STE BW — f32.
baracuda_kernels_quantize_per_group_backward_f64_can_implement^⚠: Implementability check for quantize_per_group_backward_f64.
baracuda_kernels_quantize_per_group_backward_f64_run^⚠: STE BW — f64.
baracuda_kernels_quantize_per_group_bf16_s8_can_implement^⚠: Implementability check for quantize_per_group_bf16_s8.
baracuda_kernels_quantize_per_group_bf16_s8_run^⚠: quantize_per_group — bf16 → s8.
baracuda_kernels_quantize_per_group_bf16_u8_can_implement^⚠: Implementability check for quantize_per_group_bf16_u8.
baracuda_kernels_quantize_per_group_bf16_u8_run^⚠: quantize_per_group — bf16 → u8.
baracuda_kernels_quantize_per_group_f16_s8_can_implement^⚠: Implementability check for quantize_per_group_f16_s8.
baracuda_kernels_quantize_per_group_f16_s8_run^⚠: quantize_per_group — f16 → s8.
baracuda_kernels_quantize_per_group_f16_u8_can_implement^⚠: Implementability check for quantize_per_group_f16_u8.
baracuda_kernels_quantize_per_group_f16_u8_run^⚠: quantize_per_group — f16 → u8.
baracuda_kernels_quantize_per_group_f32_s8_can_implement^⚠: Implementability check for quantize_per_group_f32_s8.
baracuda_kernels_quantize_per_group_f32_s8_run^⚠: quantize_per_group — f32 → s8.
baracuda_kernels_quantize_per_group_f32_u8_can_implement^⚠: Implementability check for quantize_per_group_f32_u8.
baracuda_kernels_quantize_per_group_f32_u8_run^⚠: quantize_per_group — f32 → u8.
baracuda_kernels_quantize_per_group_f64_s8_can_implement^⚠: Implementability check for quantize_per_group_f64_s8.
baracuda_kernels_quantize_per_group_f64_s8_run^⚠: quantize_per_group — f64 → s8.
baracuda_kernels_quantize_per_group_f64_u8_can_implement^⚠: Implementability check for quantize_per_group_f64_u8.
baracuda_kernels_quantize_per_group_f64_u8_run^⚠: quantize_per_group — f64 → u8.
baracuda_kernels_quantize_per_tensor_backward_bf16_can_implement^⚠: Implementability check for quantize_per_tensor_backward_bf16.
baracuda_kernels_quantize_per_tensor_backward_bf16_run^⚠: quantize_per_tensor_backward — bf16.
baracuda_kernels_quantize_per_tensor_backward_f16_can_implement^⚠: Implementability check for quantize_per_tensor_backward_f16.
baracuda_kernels_quantize_per_tensor_backward_f16_run^⚠: quantize_per_tensor_backward — f16.
baracuda_kernels_quantize_per_tensor_backward_f32_can_implement^⚠: Implementability check for quantize_per_tensor_backward_f32.
baracuda_kernels_quantize_per_tensor_backward_f32_run^⚠: dx = (dy / scale) * in_range_mask(x). f32.
baracuda_kernels_quantize_per_tensor_backward_f64_can_implement^⚠: Implementability check for quantize_per_tensor_backward_f64.
baracuda_kernels_quantize_per_tensor_backward_f64_run^⚠: quantize_per_tensor_backward — f64 (f64 scale).
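The backward rule `dx = (dy / scale) * in_range_mask(x)` is a straight-through estimator: the gradient passes through wherever the forward pass did not clamp. The entries do not define `in_range_mask`, so the CPU sketch below assumes it is 1 when `round(x/scale) + zp` lies in `[qmin, qmax]` and 0 otherwise. `quantize_per_tensor_backward_ref` is a hypothetical helper, and the `zp`, `qmin`, and `qmax` parameters are assumed to mirror the forward formula.

```rust
/// Hypothetical reference for the per-tensor STE backward pass.
/// Assumption: the mask is 1 exactly where the forward value was not clamped.
fn quantize_per_tensor_backward_ref(
    dy: &[f32],
    x: &[f32],
    scale: f32,
    zp: i32,
    qmin: i32,
    qmax: i32,
) -> Vec<f32> {
    dy.iter()
        .zip(x)
        .map(|(&g, &v)| {
            let q = ((v / scale).round() as i32).saturating_add(zp);
            if q >= qmin && q <= qmax { g / scale } else { 0.0 }
        })
        .collect()
}
```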
baracuda_kernels_quantize_per_tensor_bf16_s8_can_implement^⚠: Implementability check for quantize_per_tensor_bf16_s8.
baracuda_kernels_quantize_per_tensor_bf16_s8_run^⚠: quantize_per_tensor — bf16 → s8.
baracuda_kernels_quantize_per_tensor_bf16_u8_can_implement^⚠: Implementability check for quantize_per_tensor_bf16_u8.
baracuda_kernels_quantize_per_tensor_bf16_u8_run^⚠: quantize_per_tensor — bf16 → u8.
baracuda_kernels_quantize_per_tensor_f16_s8_can_implement^⚠: Implementability check for quantize_per_tensor_f16_s8.
baracuda_kernels_quantize_per_tensor_f16_s8_run^⚠: quantize_per_tensor — f16 → s8.
baracuda_kernels_quantize_per_tensor_f16_u8_can_implement^⚠: Implementability check for quantize_per_tensor_f16_u8.
baracuda_kernels_quantize_per_tensor_f16_u8_run^⚠: quantize_per_tensor — f16 → u8.
baracuda_kernels_quantize_per_tensor_f32_s8_can_implement^⚠: Implementability check for quantize_per_tensor_f32_s8.
baracuda_kernels_quantize_per_tensor_f32_s8_run^⚠: q = clamp(round(x/scale)+zp, qmin, qmax). f32 input, s8 output.
baracuda_kernels_quantize_per_tensor_f32_u8_can_implement^⚠: Implementability check for quantize_per_tensor_f32_u8.
baracuda_kernels_quantize_per_tensor_f32_u8_run^⚠: quantize_per_tensor — f32 → u8.
baracuda_kernels_quantize_per_tensor_f64_s8_can_implement^⚠: Implementability check for quantize_per_tensor_f64_s8.
baracuda_kernels_quantize_per_tensor_f64_s8_run^⚠: quantize_per_tensor — f64 → s8 (f64 scale).
baracuda_kernels_quantize_per_tensor_f64_u8_can_implement^⚠: Implementability check for quantize_per_tensor_f64_u8.
baracuda_kernels_quantize_per_tensor_f64_u8_run^⚠: quantize_per_tensor — f64 → u8 (f64 scale).
baracuda_kernels_quantize_per_token_backward_bf16_can_implement^⚠: Implementability check for quantize_per_token_backward_bf16.
baracuda_kernels_quantize_per_token_backward_bf16_run^⚠: STE backward — bf16.
baracuda_kernels_quantize_per_token_backward_f16_can_implement^⚠: Implementability check for quantize_per_token_backward_f16.
baracuda_kernels_quantize_per_token_backward_f16_run^⚠: STE backward — f16.
baracuda_kernels_quantize_per_token_backward_f32_can_implement^⚠: Implementability check for quantize_per_token_backward_f32.
baracuda_kernels_quantize_per_token_backward_f32_run^⚠: STE backward — f32.
baracuda_kernels_quantize_per_token_backward_f64_can_implement^⚠: Implementability check for quantize_per_token_backward_f64.
baracuda_kernels_quantize_per_token_backward_f64_run^⚠: STE backward — f64.
baracuda_kernels_quantize_per_token_bf16_s8_can_implement^⚠: Implementability check for quantize_per_token_bf16_s8.
baracuda_kernels_quantize_per_token_bf16_s8_run^⚠: quantize_per_token — bf16 → s8.
baracuda_kernels_quantize_per_token_bf16_u8_can_implement^⚠: Implementability check for quantize_per_token_bf16_u8.
baracuda_kernels_quantize_per_token_bf16_u8_run^⚠: quantize_per_token — bf16 → u8.
baracuda_kernels_quantize_per_token_f16_s8_can_implement^⚠: Implementability check for quantize_per_token_f16_s8.
baracuda_kernels_quantize_per_token_f16_s8_run^⚠: quantize_per_token — f16 → s8.
baracuda_kernels_quantize_per_token_f16_u8_can_implement^⚠: Implementability check for quantize_per_token_f16_u8.
baracuda_kernels_quantize_per_token_f16_u8_run^⚠: quantize_per_token — f16 → u8.
baracuda_kernels_quantize_per_token_f32_s8_can_implement^⚠: Implementability check for quantize_per_token_f32_s8.
baracuda_kernels_quantize_per_token_f32_s8_run^⚠: quantize_per_token — TIn f32, TOut s8. Status codes as elsewhere.
baracuda_kernels_quantize_per_token_f32_u8_can_implement^⚠: Implementability check for quantize_per_token_f32_u8.
baracuda_kernels_quantize_per_token_f32_u8_run^⚠: quantize_per_token — f32 → u8.
baracuda_kernels_quantize_per_token_f64_s8_can_implement^⚠: Implementability check for quantize_per_token_f64_s8.
baracuda_kernels_quantize_per_token_f64_s8_run^⚠: quantize_per_token — f64 → s8.
baracuda_kernels_quantize_per_token_f64_u8_can_implement^⚠: Implementability check for quantize_per_token_f64_u8.
baracuda_kernels_quantize_per_token_f64_u8_run^⚠: quantize_per_token — f64 → u8.
baracuda_kernels_quantize_q8_1_bf16_can_implement^⚠: Implementability check for quantize_q8_1_bf16.
baracuda_kernels_quantize_q8_1_bf16_run^⚠: Q8_1 activation staging — bf16 source. # Safety: as f32 variant.
baracuda_kernels_quantize_q8_1_f16_can_implement^⚠: Implementability check for quantize_q8_1_f16.
baracuda_kernels_quantize_q8_1_f16_run^⚠: Q8_1 activation staging — f16 source. # Safety: as f32 variant.
baracuda_kernels_quantize_q8_1_f32_can_implement^⚠: Implementability check for quantize_q8_1_f32.
baracuda_kernels_quantize_q8_1_f32_run^⚠: Q8_1 activation staging — f32 source.
baracuda_kernels_quantize_q8_1_workspace_bytes^⚠: Returns the workspace size in bytes needed to stage ny × kx activations into Q8_1: ny * ceil(kx / 32) * 36. Returns 0 on invalid (non-positive) arguments.
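The workspace formula above can be reproduced on the host to sanity-check an allocation before calling into the FFI. The sketch below is a hypothetical pure-Rust mirror, not this crate's API: the function name `q8_1_workspace_bytes` and the `i64` argument types are assumptions, and the constants are simply the 32 and 36 that appear in the documented expression.

```rust
/// Hypothetical host-side mirror of the documented formula
/// `ny * ceil(kx / 32) * 36`. The name and the `i64` argument types are
/// assumptions for illustration; the real FFI signature may differ.
pub fn q8_1_workspace_bytes(ny: i64, kx: i64) -> usize {
    const BLOCK_ELEMS: i64 = 32; // divisor in ceil(kx / 32)
    const BLOCK_BYTES: i64 = 36; // trailing multiplier in the formula

    // Documented behaviour: 0 on invalid (non-positive) arguments.
    if ny <= 0 || kx <= 0 {
        return 0;
    }
    // Integer ceiling division: number of 32-element blocks per row.
    let blocks_per_row = (kx + BLOCK_ELEMS - 1) / BLOCK_ELEMS;
    // Note: this sketch does not guard against i64 overflow for huge shapes.
    (ny * blocks_per_row * BLOCK_BYTES) as usize
}
```

A row of 33 activations needs two blocks, so `q8_1_workspace_bytes(2, 33)` evaluates to 2 * 2 * 36 = 144 bytes.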
baracuda_kernels_quantized_linear_w8a8_f32_can_implement^⚠: Implementability check for quantized_linear_w8a8_f32.
baracuda_kernels_quantized_linear_w8a8_f32_run^⚠: quantized_linear_w8a8 — TIn = f32.
baracuda_kernels_quantized_linear_w8a8_f64_can_implement^⚠: Implementability check for quantized_linear_w8a8_f64.
baracuda_kernels_quantized_linear_w8a8_f64_run^⚠: quantized_linear_w8a8 — TIn = f64.
baracuda_kernels_reduce_all_bf16_can_implement^⚠: Pre-launch implementability check for reduce_all_bf16.
baracuda_kernels_reduce_all_bf16_run^⚠: all(x, axis=k) with bf16 input, uint8_t Bool output.
baracuda_kernels_reduce_all_bool_can_implement^⚠: Pre-launch implementability check for reduce_all_bool.
baracuda_kernels_reduce_all_bool_run^⚠: all(x, axis=k) with Bool (uint8_t) input, uint8_t Bool output.
baracuda_kernels_reduce_all_f16_can_implement^⚠: Pre-launch implementability check for reduce_all_f16.
baracuda_kernels_reduce_all_f16_run^⚠: all(x, axis=k) with f16 input, uint8_t Bool output.
baracuda_kernels_reduce_all_f32_can_implement^⚠: Pre-launch implementability check for reduce_all_f32.
baracuda_kernels_reduce_all_f32_run^⚠: all(x, axis=k) with f32 input, uint8_t Bool output.
baracuda_kernels_reduce_all_f64_can_implement^⚠: Pre-launch implementability check for reduce_all_f64.
baracuda_kernels_reduce_all_f64_run^⚠: all(x, axis=k) with f64 input, uint8_t Bool output.
baracuda_kernels_reduce_all_i32_can_implement^⚠: Pre-launch implementability check for reduce_all_i32.
baracuda_kernels_reduce_all_i32_run^⚠: all(x, axis=k) with i32 input, uint8_t Bool output.
baracuda_kernels_reduce_all_i64_can_implement^⚠: Pre-launch implementability check for reduce_all_i64.
baracuda_kernels_reduce_all_i64_run^⚠: all(x, axis=k) with i64 input, uint8_t Bool output.
baracuda_kernels_reduce_any_bf16_can_implement^⚠: Pre-launch implementability check for reduce_any_bf16.
baracuda_kernels_reduce_any_bf16_run^⚠: any(x, axis=k) with bf16 input, uint8_t Bool output.
baracuda_kernels_reduce_any_bool_can_implement^⚠: Pre-launch implementability check for reduce_any_bool.
baracuda_kernels_reduce_any_bool_run^⚠: any(x, axis=k) with Bool (uint8_t) input, uint8_t Bool output.
baracuda_kernels_reduce_any_f16_can_implement^⚠: Pre-launch implementability check for reduce_any_f16.
baracuda_kernels_reduce_any_f16_run^⚠: any(x, axis=k) with f16 input, uint8_t Bool output.
baracuda_kernels_reduce_any_f32_can_implement^⚠: Pre-launch implementability check for reduce_any_f32.
baracuda_kernels_reduce_any_f32_run^⚠: any(x, axis=k) with f32 input, uint8_t Bool output.
baracuda_kernels_reduce_any_f64_can_implement^⚠: Pre-launch implementability check for reduce_any_f64.
baracuda_kernels_reduce_any_f64_run^⚠: any(x, axis=k) with f64 input, uint8_t Bool output.
baracuda_kernels_reduce_any_i32_can_implement^⚠: Pre-launch implementability check for reduce_any_i32.
baracuda_kernels_reduce_any_i32_run^⚠: any(x, axis=k) with i32 input, uint8_t Bool output.
baracuda_kernels_reduce_any_i64_can_implement^⚠: Pre-launch implementability check for reduce_any_i64.
baracuda_kernels_reduce_any_i64_run^⚠: any(x, axis=k) with i64 input, uint8_t Bool output.
baracuda_kernels_reduce_count_nonzero_bf16_can_implement^⚠: Pre-launch implementability check for reduce_count_nonzero_bf16.
baracuda_kernels_reduce_count_nonzero_bf16_run^⚠: count_nonzero(x, axis=k) with bf16 input, i64 output.
baracuda_kernels_reduce_count_nonzero_bool_can_implement^⚠: Pre-launch implementability check for reduce_count_nonzero_bool.
baracuda_kernels_reduce_count_nonzero_bool_run^⚠: count_nonzero(x, axis=k) with Bool (uint8_t) input, i64 output.
baracuda_kernels_reduce_count_nonzero_f16_can_implement^⚠: Pre-launch implementability check for reduce_count_nonzero_f16.
baracuda_kernels_reduce_count_nonzero_f16_run^⚠: count_nonzero(x, axis=k) with f16 input, i64 output.
baracuda_kernels_reduce_count_nonzero_f32_can_implement^⚠: Pre-launch implementability check for reduce_count_nonzero_f32.
baracuda_kernels_reduce_count_nonzero_f32_run^⚠: count_nonzero(x, axis=k) with f32 input, i64 output.
baracuda_kernels_reduce_count_nonzero_f64_can_implement^⚠: Pre-launch implementability check for reduce_count_nonzero_f64.
baracuda_kernels_reduce_count_nonzero_f64_run^⚠: count_nonzero(x, axis=k) with f64 input, i64 output.
baracuda_kernels_reduce_count_nonzero_i32_can_implement^⚠: Pre-launch implementability check for reduce_count_nonzero_i32.
baracuda_kernels_reduce_count_nonzero_i32_run^⚠: count_nonzero(x, axis=k) with i32 input, i64 output.
baracuda_kernels_reduce_count_nonzero_i64_can_implement^⚠: Pre-launch implementability check for reduce_count_nonzero_i64.
baracuda_kernels_reduce_count_nonzero_i64_run^⚠: count_nonzero(x, axis=k) with i64 input, i64 output.
baracuda_kernels_reduce_logsumexp_backward_bf16_can_implement^⚠: Pre-launch implementability check for reduce_logsumexp_backward_bf16.
baracuda_kernels_reduce_logsumexp_backward_bf16_run^⚠: LogSumExp reduction backward, bf16.
baracuda_kernels_reduce_logsumexp_backward_f16_can_implement^⚠: Pre-launch implementability check for reduce_logsumexp_backward_f16.
baracuda_kernels_reduce_logsumexp_backward_f16_run^⚠: LogSumExp reduction backward, f16.
baracuda_kernels_reduce_logsumexp_backward_f32_can_implement^⚠: Pre-launch implementability check for reduce_logsumexp_backward_f32.
baracuda_kernels_reduce_logsumexp_backward_f32_run^⚠: LogSumExp reduction backward, f32.
baracuda_kernels_reduce_logsumexp_backward_f64_can_implement^⚠: Pre-launch implementability check for reduce_logsumexp_backward_f64.
baracuda_kernels_reduce_logsumexp_backward_f64_run^⚠: LogSumExp reduction backward, f64.
baracuda_kernels_reduce_logsumexp_bf16_can_implement^⚠: Implementability check for baracuda_kernels_reduce_logsumexp_bf16. Host-side only.
baracuda_kernels_reduce_logsumexp_bf16_run^⚠: LogSumExp reduction along one axis, bf16 (f32-detour throughout).
baracuda_kernels_reduce_logsumexp_f16_can_implement^⚠: Implementability check for baracuda_kernels_reduce_logsumexp_f16. Host-side only.
baracuda_kernels_reduce_logsumexp_f16_run^⚠: LogSumExp reduction along one axis, f16 (f32-detour throughout).
baracuda_kernels_reduce_logsumexp_f32_can_implement^⚠: Implementability check for baracuda_kernels_reduce_logsumexp_f32. Host-side only.
baracuda_kernels_reduce_logsumexp_f32_run^⚠: LogSumExp reduction along one axis, f32 — numerically stable two-pass max-then-sum-exp. Shares the simple-reduce parameter shape so the Rust dispatcher can reach it through the same FFI signature; the kernel internally performs two passes over the reduce axis.
baracuda_kernels_reduce_logsumexp_f64_can_implement^⚠: Implementability check for baracuda_kernels_reduce_logsumexp_f64. Host-side only.
baracuda_kernels_reduce_logsumexp_f64_run^⚠: LogSumExp reduction along one axis, f64.
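The f32 entry above describes a numerically stable two-pass max-then-sum-exp scheme. A minimal CPU reference for one reduce-axis slice might look like the following; `logsumexp` is a hypothetical helper, and its handling of empty or non-finite input is an assumption rather than documented kernel behaviour.

```rust
/// Hypothetical CPU reference for one slice along the reduce axis.
/// Pass 1 finds the maximum; pass 2 sums exp(x - max), so the largest
/// exponent is exactly 0 and the sum cannot overflow.
pub fn logsumexp(xs: &[f32]) -> f32 {
    // Pass 1: running maximum, starting from -inf.
    let m = xs.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    // Assumption: an empty slice or an infinite max is returned as-is,
    // which avoids computing inf - inf in the second pass.
    if !m.is_finite() {
        return m;
    }
    // Pass 2: sum of shifted exponentials.
    let s: f32 = xs.iter().map(|&x| (x - m).exp()).sum();
    m + s.ln()
}
```

Without the shift, `exp(1000.0)` overflows f32 to infinity; with it, two inputs of 1000.0 give 1000 + ln 2.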
baracuda_kernels_reduce_max_bf16_can_implement^⚠: Pre-launch implementability check for reduce_max_bf16.
baracuda_kernels_reduce_max_bf16_run^⚠: Max reduction along one axis, bf16 (f32-detour fmaxf).
baracuda_kernels_reduce_max_f16_can_implement^⚠: Pre-launch implementability check for reduce_max_f16.
baracuda_kernels_reduce_max_f16_run^⚠: Max reduction along one axis, f16 (f32-detour fmaxf).
baracuda_kernels_reduce_max_f32_can_implement^⚠: Pre-launch implementability check for reduce_max_f32.
baracuda_kernels_reduce_max_f32_run^⚠: Max reduction along one axis, f32. init = -INFINITY, fmaxf.
baracuda_kernels_reduce_max_f64_can_implement^⚠: Pre-launch implementability check for reduce_max_f64.
baracuda_kernels_reduce_max_f64_run^⚠: Max reduction along one axis, f64.
baracuda_kernels_reduce_max_i8_can_implement^⚠: Pre-launch implementability check for reduce_max_i8.
baracuda_kernels_reduce_max_i8_run^⚠: max(x, axis=k) with i8 input/output (init = INT8_MIN).
baracuda_kernels_reduce_max_i16_can_implement^⚠: Pre-launch implementability check for reduce_max_i16.
baracuda_kernels_reduce_max_i16_run^⚠: max(x, axis=k) with i16 input/output (init = INT16_MIN).
baracuda_kernels_reduce_max_i32_can_implement^⚠: Pre-launch implementability check for reduce_max_i32.
baracuda_kernels_reduce_max_i32_run^⚠: max(x, axis=k) with i32 input/output (init = INT32_MIN).
baracuda_kernels_reduce_max_i64_can_implement^⚠: Pre-launch implementability check for reduce_max_i64.
baracuda_kernels_reduce_max_i64_run^⚠: max(x, axis=k) with i64 input/output (init = INT64_MIN).
baracuda_kernels_reduce_max_min_backward_bf16_can_implement^⚠: Pre-launch implementability check for reduce_max_min_backward_bf16.
baracuda_kernels_reduce_max_min_backward_bf16_run^⚠: Max/Min reduction backward, bf16.
baracuda_kernels_reduce_max_min_backward_f16_can_implement^⚠: Pre-launch implementability check for reduce_max_min_backward_f16.
baracuda_kernels_reduce_max_min_backward_f16_run^⚠: Max/Min reduction backward, f16.
baracuda_kernels_reduce_max_min_backward_f32_can_implement^⚠: Pre-launch implementability check for reduce_max_min_backward_f32.
baracuda_kernels_reduce_max_min_backward_f32_run^⚠: Max/Min reduction backward, f32.
baracuda_kernels_reduce_max_min_backward_f64_can_implement^⚠: Pre-launch implementability check for reduce_max_min_backward_f64.
baracuda_kernels_reduce_max_min_backward_f64_run^⚠: Max/Min reduction backward, f64.
baracuda_kernels_reduce_max_to_bf16_can_implement^⚠: Implementability check for reduce_max_to_bf16.
baracuda_kernels_reduce_max_to_bf16_run^⚠: reduce_max_to, bf16.
baracuda_kernels_reduce_max_to_f16_can_implement^⚠: Implementability check for reduce_max_to_f16.
baracuda_kernels_reduce_max_to_f16_run^⚠: reduce_max_to, f16. Identity is -FLT_MAX in f32 accumulator space, narrowed back to f16 on store.
baracuda_kernels_reduce_max_to_f32_can_implement^⚠: Implementability check for reduce_max_to_f32.
baracuda_kernels_reduce_max_to_f32_run^⚠: reduce_max_to, f32. Identity is -FLT_MAX when the broadcast set is empty.
baracuda_kernels_reduce_max_to_f64_can_implement^⚠: Implementability check for reduce_max_to_f64.
baracuda_kernels_reduce_max_to_f64_run^⚠: reduce_max_to, f64. Identity is -DBL_MAX.
baracuda_kernels_reduce_max_u8_can_implement^⚠: Pre-launch implementability check for reduce_max_u8.
baracuda_kernels_reduce_max_u8_run^⚠: max(x, axis=k) with u8 input/output (init = 0).
baracuda_kernels_reduce_max_u32_can_implement^⚠: Pre-launch implementability check for reduce_max_u32.
baracuda_kernels_reduce_max_u32_run^⚠: max(x, axis=k) with u32 input/output (init = 0).
baracuda_kernels_reduce_mean_backward_bf16_can_implement^⚠: Pre-launch implementability check for reduce_mean_backward_bf16.
baracuda_kernels_reduce_mean_backward_bf16_run^⚠: Mean reduction backward, bf16.
baracuda_kernels_reduce_mean_backward_f16_can_implement^⚠: Pre-launch implementability check for reduce_mean_backward_f16.
baracuda_kernels_reduce_mean_backward_f16_run^⚠: Mean reduction backward, f16.
baracuda_kernels_reduce_mean_backward_f32_can_implement^⚠: Pre-launch implementability check for reduce_mean_backward_f32.
baracuda_kernels_reduce_mean_backward_f32_run^⚠: Mean reduction backward, f32. Same as Sum BW with an extra 1/k scale; inv_extent = 1.0 / reduced_extent is computed in f64 on the host.
baracuda_kernels_reduce_mean_backward_f64_can_implement^⚠: Pre-launch implementability check for reduce_mean_backward_f64.
baracuda_kernels_reduce_mean_backward_f64_run^⚠: Mean reduction backward, f64.
baracuda_kernels_reduce_mean_bf16_can_implement^⚠: Pre-launch implementability check for reduce_mean_bf16.
baracuda_kernels_reduce_mean_bf16_run^⚠: Mean reduction along one axis, bf16 (f32-detour for sum + divide).
baracuda_kernels_reduce_mean_f16_can_implement^⚠: Pre-launch implementability check for reduce_mean_f16.
baracuda_kernels_reduce_mean_f16_run^⚠: Mean reduction along one axis, f16 (f32-detour for sum + divide).
baracuda_kernels_reduce_mean_f32_can_implement^⚠: Pre-launch implementability check for reduce_mean_f32.
baracuda_kernels_reduce_mean_f32_run^⚠: Mean reduction along one axis, f32. Sum then divide by extent.
baracuda_kernels_reduce_mean_f64_can_implement^⚠: Pre-launch implementability check for reduce_mean_f64.
baracuda_kernels_reduce_mean_f64_run^⚠: Mean reduction along one axis, f64.
baracuda_kernels_reduce_min_bf16_can_implement^⚠: Pre-launch implementability check for reduce_min_bf16.
baracuda_kernels_reduce_min_bf16_run^⚠: Min reduction along one axis, bf16 (f32-detour fminf).
baracuda_kernels_reduce_min_f16_can_implement^⚠: Pre-launch implementability check for reduce_min_f16.
baracuda_kernels_reduce_min_f16_run^⚠: Min reduction along one axis, f16 (f32-detour fminf).
baracuda_kernels_reduce_min_f32_can_implement^⚠: Pre-launch implementability check for reduce_min_f32.
baracuda_kernels_reduce_min_f32_run^⚠: Min reduction along one axis, f32. init = +INFINITY, fminf.
baracuda_kernels_reduce_min_f64_can_implement^⚠: Pre-launch implementability check for reduce_min_f64.
baracuda_kernels_reduce_min_f64_run^⚠: Min reduction along one axis, f64.
baracuda_kernels_reduce_min_i8_can_implement^⚠: Pre-launch implementability check for reduce_min_i8.
baracuda_kernels_reduce_min_i8_run^⚠: min(x, axis=k) with i8 input/output (init = INT8_MAX).
baracuda_kernels_reduce_min_i16_can_implement^⚠: Pre-launch implementability check for reduce_min_i16.
baracuda_kernels_reduce_min_i16_run^⚠: min(x, axis=k) with i16 input/output (init = INT16_MAX).
baracuda_kernels_reduce_min_i32_can_implement^⚠: Pre-launch implementability check for reduce_min_i32.
baracuda_kernels_reduce_min_i32_run^⚠: min(x, axis=k) with i32 input/output (init = INT32_MAX).
baracuda_kernels_reduce_min_i64_can_implement^⚠: Pre-launch implementability check for reduce_min_i64.
baracuda_kernels_reduce_min_i64_run^⚠: min(x, axis=k) with i64 input/output (init = INT64_MAX).
baracuda_kernels_reduce_min_to_bf16_can_implement^⚠: Implementability check for reduce_min_to_bf16.
baracuda_kernels_reduce_min_to_bf16_run^⚠: reduce_min_to, bf16.
baracuda_kernels_reduce_min_to_f16_can_implement^⚠: Implementability check for reduce_min_to_f16.
baracuda_kernels_reduce_min_to_f16_run^⚠: reduce_min_to, f16. Accumulator widens to f32; identity is +FLT_MAX in f32 accumulator space, narrowing to +inf on store.
baracuda_kernels_reduce_min_to_f32_can_implement^⚠: Implementability check for reduce_min_to_f32.
baracuda_kernels_reduce_min_to_f32_run^⚠: reduce_min_to, f32. Identity is +FLT_MAX when the broadcast set is empty.
baracuda_kernels_reduce_min_to_f64_can_implement^⚠: Implementability check for reduce_min_to_f64.
baracuda_kernels_reduce_min_to_f64_run^⚠: reduce_min_to, f64. Identity is +DBL_MAX.
baracuda_kernels_reduce_min_u8_can_implement^⚠: Pre-launch implementability check for reduce_min_u8.
baracuda_kernels_reduce_min_u8_run^⚠: min(x, axis=k) with u8 input/output (same-dtype, init = UINT8_MAX).
baracuda_kernels_reduce_min_u32_can_implement^⚠: Pre-launch implementability check for reduce_min_u32.
baracuda_kernels_reduce_min_u32_run^⚠: min(x, axis=k) with u32 input/output (init = UINT32_MAX).
baracuda_kernels_reduce_norm2_backward_bf16_can_implement^⚠: Pre-launch implementability check for reduce_norm2_backward_bf16.
baracuda_kernels_reduce_norm2_backward_bf16_run^⚠: Norm2 reduction backward, bf16.
baracuda_kernels_reduce_norm2_backward_f16_can_implement^⚠: Pre-launch implementability check for reduce_norm2_backward_f16.
baracuda_kernels_reduce_norm2_backward_f16_run^⚠: Norm2 reduction backward, f16.
baracuda_kernels_reduce_norm2_backward_f32_can_implement^⚠: Pre-launch implementability check for reduce_norm2_backward_f32.
baracuda_kernels_reduce_norm2_backward_f32_run^⚠: Norm2 reduction backward, f32.
baracuda_kernels_reduce_norm2_backward_f64_can_implement^⚠: Pre-launch implementability check for reduce_norm2_backward_f64.
baracuda_kernels_reduce_norm2_backward_f64_run^⚠: Norm2 reduction backward, f64.
baracuda_kernels_reduce_norm2_bf16_can_implement^⚠: Pre-launch implementability check for reduce_norm2_bf16.
baracuda_kernels_reduce_norm2_bf16_run^⚠: Norm2 reduction along one axis, bf16 (f32-detour functor + sqrt).
baracuda_kernels_reduce_norm2_f16_can_implement^⚠: Pre-launch implementability check for reduce_norm2_f16.
baracuda_kernels_reduce_norm2_f16_run^⚠: Norm2 reduction along one axis, f16 (f32-detour functor + sqrt).
baracuda_kernels_reduce_norm2_f32_can_implement^⚠: Pre-launch implementability check for reduce_norm2_f32.
baracuda_kernels_reduce_norm2_f32_run^⚠: Norm2 reduction along one axis, f32. y = sqrt(sum(x*x)) — shares the simple-reduce parameter shape.
baracuda_kernels_reduce_norm2_f64_can_implement^⚠: Pre-launch implementability check for reduce_norm2_f64.
baracuda_kernels_reduce_norm2_f64_run^⚠: Norm2 reduction along one axis, f64.
baracuda_kernels_reduce_prod_backward_bf16_can_implement^⚠: Pre-launch implementability check for reduce_prod_backward_bf16.
baracuda_kernels_reduce_prod_backward_bf16_run^⚠: Prod reduction backward, bf16.
baracuda_kernels_reduce_prod_backward_f16_can_implement^⚠: Pre-launch implementability check for reduce_prod_backward_f16.
baracuda_kernels_reduce_prod_backward_f16_run^⚠: Prod reduction backward, f16.
baracuda_kernels_reduce_prod_backward_f32_can_implement^⚠: Pre-launch implementability check for reduce_prod_backward_f32.
baracuda_kernels_reduce_prod_backward_f32_run^⚠: Prod reduction backward, f32.
baracuda_kernels_reduce_prod_backward_f64_can_implement^⚠: Pre-launch implementability check for reduce_prod_backward_f64.
baracuda_kernels_reduce_prod_backward_f64_run^⚠: Prod reduction backward, f64.
baracuda_kernels_reduce_prod_bf16_can_implement^⚠: Pre-launch implementability check for reduce_prod_bf16.
baracuda_kernels_reduce_prod_bf16_run^⚠: Product reduction along one axis, bf16 (f32-detour multiply).
baracuda_kernels_reduce_prod_f16_can_implement^⚠: Pre-launch implementability check for reduce_prod_f16.
baracuda_kernels_reduce_prod_f16_run^⚠: Product reduction along one axis, f16 (f32-detour multiply).
baracuda_kernels_reduce_prod_f32_can_implement^⚠: Pre-launch implementability check for reduce_prod_f32.
baracuda_kernels_reduce_prod_f32_run^⚠: Product reduction along one axis, f32. init = 1, op = *.
baracuda_kernels_reduce_prod_f64_can_implement^⚠: Pre-launch implementability check for reduce_prod_f64.
baracuda_kernels_reduce_prod_f64_run^⚠: Product reduction along one axis, f64.
baracuda_kernels_reduce_prod_i8_can_implement^⚠: Pre-launch implementability check for reduce_prod_i8.
baracuda_kernels_reduce_prod_i8_run^⚠: prod(x, axis=k) with i8 input/output (wider i64 accumulator).
baracuda_kernels_reduce_prod_i16_can_implement^⚠: Pre-launch implementability check for reduce_prod_i16.
baracuda_kernels_reduce_prod_i16_run^⚠: prod(x, axis=k) with i16 input/output (wider i64 accumulator).
baracuda_kernels_reduce_prod_i32_can_implement^⚠: Pre-launch implementability check for reduce_prod_i32.
baracuda_kernels_reduce_prod_i32_run^⚠: prod(x, axis=k) with i32 input/output (wider i64 accumulator).
baracuda_kernels_reduce_prod_i64_can_implement^⚠: Pre-launch implementability check for reduce_prod_i64.
baracuda_kernels_reduce_prod_i64_run^⚠: prod(x, axis=k) with i64 input/output. Modulo-2^64 wrap.
baracuda_kernels_reduce_prod_to_bf16_can_implement^⚠: Implementability check for reduce_prod_to_bf16.
baracuda_kernels_reduce_prod_to_bf16_run^⚠: reduce_prod_to, bf16.
baracuda_kernels_reduce_prod_to_f16_can_implement^⚠: Implementability check for reduce_prod_to_f16.
baracuda_kernels_reduce_prod_to_f16_run^⚠: reduce_prod_to, f16. Cumulative product overflows fast in half-precision; callers should keep values close to 1.
baracuda_kernels_reduce_prod_to_f32_can_implement^⚠: Implementability check for reduce_prod_to_f32.
baracuda_kernels_reduce_prod_to_f32_run^⚠: reduce_prod_to, f32. Identity is 1 (multiplicative). Half dtypes accumulate in f32 then narrow on store.
baracuda_kernels_reduce_prod_to_f64_can_implement^⚠: Implementability check for reduce_prod_to_f64.
baracuda_kernels_reduce_prod_to_f64_run^⚠: reduce_prod_to, f64.
baracuda_kernels_reduce_prod_u8_can_implement^⚠: Pre-launch implementability check for reduce_prod_u8.
baracuda_kernels_reduce_prod_u8_run^⚠: prod(x, axis=k) with u8 input/output (wider u64 accumulator, wrap-on-overflow narrow on store).
baracuda_kernels_reduce_prod_u32_can_implement^⚠: Pre-launch implementability check for reduce_prod_u32.
baracuda_kernels_reduce_prod_u32_run^⚠: prod(x, axis=k) with u32 input/output (wider u64 accumulator).
baracuda_kernels_reduce_std_backward_bf16_can_implement^⚠: Pre-launch implementability check for reduce_std_backward_bf16.
baracuda_kernels_reduce_std_backward_bf16_run^⚠: Std-dev reduction backward, bf16.
baracuda_kernels_reduce_std_backward_f16_can_implement^⚠: Pre-launch implementability check for reduce_std_backward_f16.
baracuda_kernels_reduce_std_backward_f16_run^⚠: Std-dev reduction backward, f16.
baracuda_kernels_reduce_std_backward_f32_can_implement^⚠: Pre-launch implementability check for reduce_std_backward_f32.
baracuda_kernels_reduce_std_backward_f32_run^⚠: Std-dev reduction backward, f32 (Welford BW + sqrt term).
baracuda_kernels_reduce_std_backward_f64_can_implement^⚠: Pre-launch implementability check for reduce_std_backward_f64.
baracuda_kernels_reduce_std_backward_f64_run^⚠: Std-dev reduction backward, f64 (Welford BW in f64 + sqrt term).
baracuda_kernels_reduce_std_bf16_can_implement^⚠: Pre-launch implementability check for reduce_std_bf16.
baracuda_kernels_reduce_std_bf16_run^⚠: Std-dev along one axis, bf16.
baracuda_kernels_reduce_std_f16_can_implement^⚠: Pre-launch implementability check for reduce_std_f16.
baracuda_kernels_reduce_std_f16_run^⚠: Std-dev along one axis, f16.
baracuda_kernels_reduce_std_f32_can_implement^⚠: Pre-launch implementability check for reduce_std_f32.
baracuda_kernels_reduce_std_f32_run^⚠: Std-dev along one axis, f32, Welford + sqrt.
baracuda_kernels_reduce_std_f64_can_implement^⚠: Pre-launch implementability check for reduce_std_f64.
baracuda_kernels_reduce_std_f64_run^⚠: Std-dev along one axis, f64 (Welford in f64 + sqrt).
baracuda_kernels_reduce_sum_backward_bf16_can_implement^⚠: Pre-launch implementability check for reduce_sum_backward_bf16.
baracuda_kernels_reduce_sum_backward_bf16_run^⚠: Sum reduction backward, bf16.
baracuda_kernels_reduce_sum_backward_f16_can_implement^⚠: Pre-launch implementability check for reduce_sum_backward_f16.
baracuda_kernels_reduce_sum_backward_f16_run^⚠: Sum reduction backward, f16.
baracuda_kernels_reduce_sum_backward_f32_can_implement^⚠: Pre-launch implementability check for reduce_sum_backward_f32.
baracuda_kernels_reduce_sum_backward_f32_run^⚠: Sum reduction backward, f32. dx[c] = dy[c_with_reduce_axis_0] realized via stride-0 broadcast on the reduce axis.
baracuda_kernels_reduce_sum_backward_f64_can_implement^⚠: Pre-launch implementability check for reduce_sum_backward_f64.
baracuda_kernels_reduce_sum_backward_f64_run^⚠: Sum reduction backward, f64.
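The f32 entry above describes the sum backward as a stride-0 broadcast of dy along the reduce axis. The following CPU sketch illustrates that indexing under an assumed contiguous `[outer, k, inner]` layout; `reduce_sum_backward` and its parameters are hypothetical names, not the FFI signature.

```rust
/// Hypothetical CPU reference. `dx` has shape [outer, k, inner] and `dy`
/// has shape [outer, 1, inner]; both are assumed row-major and contiguous.
pub fn reduce_sum_backward(dy: &[f32], outer: usize, k: usize, inner: usize) -> Vec<f32> {
    assert_eq!(dy.len(), outer * inner);
    // Strides of dy viewed with the shape of dx: the reduce axis gets
    // stride 0, so every position along it reads the same dy element.
    let dy_strides = [inner, 0, 1];
    let mut dx = vec![0.0f32; outer * k * inner];
    for o in 0..outer {
        for r in 0..k {
            for i in 0..inner {
                let dy_idx = o * dy_strides[0] + r * dy_strides[1] + i * dy_strides[2];
                dx[(o * k + r) * inner + i] = dy[dy_idx];
            }
        }
    }
    dx
}
```

The mean backward described earlier in this listing would multiply each copied value by inv_extent = 1/k on top of this broadcast.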
baracuda_kernels_reduce_sum_bf16_can_implement^⚠: Pre-launch implementability check for reduce_sum_bf16.
baracuda_kernels_reduce_sum_bf16_run^⚠: Sum reduction along one axis, bf16 (f32-detour functor).
baracuda_kernels_reduce_sum_f16_can_implement^⚠: Pre-launch implementability check for reduce_sum_f16.
baracuda_kernels_reduce_sum_f16_run^⚠: Sum reduction along one axis, f16.
baracuda_kernels_reduce_sum_f32_can_implement^⚠: Pre-launch implementability check for reduce_sum_f32.
baracuda_kernels_reduce_sum_f32_run^⚠: Sum reduction along one axis, f32, naive thread-per-output-cell.
baracuda_kernels_reduce_sum_f64_can_implement^⚠: Pre-launch implementability check for reduce_sum_f64.
baracuda_kernels_reduce_sum_f64_run^⚠: Sum reduction along one axis, f64.
baracuda_kernels_reduce_sum_i8_can_implement^⚠: Pre-launch implementability check for reduce_sum_i8.
baracuda_kernels_reduce_sum_i8_run^⚠: sum(x, axis=k) with i8 input/output (wider i64 accumulator).
baracuda_kernels_reduce_sum_i16_can_implement^⚠: Pre-launch implementability check for reduce_sum_i16.
baracuda_kernels_reduce_sum_i16_run^⚠: sum(x, axis=k) with i16 input/output (wider i64 accumulator).
baracuda_kernels_reduce_sum_i32_can_implement^⚠: Pre-launch implementability check for reduce_sum_i32.
baracuda_kernels_reduce_sum_i32_run^⚠: sum(x, axis=k) with i32 input/output (wider i64 accumulator).
baracuda_kernels_reduce_sum_i64_can_implement^⚠: Pre-launch implementability check for reduce_sum_i64.
baracuda_kernels_reduce_sum_i64_run^⚠: sum(x, axis=k) with i64 input/output. Accumulator and output share dtype; modulo-2^64 wrap is the natural device behaviour.
baracuda_kernels_reduce_sum_to_bf16_can_implement^⚠: Implementability check for reduce_sum_to_bf16.
baracuda_kernels_reduce_sum_to_bf16_run^⚠: reduce_sum_to, bf16. Accumulator widens to f32.
baracuda_kernels_reduce_sum_to_f16_can_implement^⚠: Implementability check for reduce_sum_to_f16.
baracuda_kernels_reduce_sum_to_f16_run^⚠: reduce_sum_to, f16. Accumulator widens to f32 per the rest of the family’s convention.
baracuda_kernels_reduce_sum_to_f32_can_implement^⚠: Implementability check for reduce_sum_to_f32.
baracuda_kernels_reduce_sum_to_f32_run^⚠: reduce_sum_to, f32. Broadcast-reverse sum (Σ over the broadcast axes). Phase 31.
baracuda_kernels_reduce_sum_to_f64_can_implement^⚠: Implementability check for reduce_sum_to_f64.
baracuda_kernels_reduce_sum_to_f64_run^⚠: reduce_sum_to, f64.
baracuda_kernels_reduce_sum_u8_can_implement^⚠: Pre-launch implementability check for reduce_sum_u8.
baracuda_kernels_reduce_sum_u8_run^⚠: sum(x, axis=k) with u8 input/output (wider u64 accumulator, wrap-on-overflow narrow on store).
baracuda_kernels_reduce_sum_u32_can_implement^⚠: Pre-launch implementability check for reduce_sum_u32.
baracuda_kernels_reduce_sum_u32_run^⚠: sum(x, axis=k) with u32 input/output (wider u64 accumulator).
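Two overflow behaviours are documented for the integer sums above: u8 accumulates in u64 and wraps when narrowed on store, while i64 wraps modulo 2^64 because accumulator and output share a dtype. A small CPU sketch of both, with hypothetical helper names, assuming the narrowing store is a plain truncation:

```rust
/// Hypothetical reference for the u8 case: accumulate in u64, then
/// truncate on store. `as u8` keeps the low 8 bits (wrap modulo 2^8).
pub fn sum_u8(xs: &[u8]) -> u8 {
    let acc: u64 = xs.iter().map(|&x| u64::from(x)).sum();
    acc as u8
}

/// Hypothetical reference for the i64 case: accumulator and output share
/// a dtype, so overflow wraps (two's complement, modulo 2^64).
pub fn sum_i64(xs: &[i64]) -> i64 {
    xs.iter().fold(0i64, |acc, &x| acc.wrapping_add(x))
}
```

For example, 200 + 100 = 300 stores as 300 mod 256 = 44 in the u8 case, and `i64::MAX + 1` wraps to `i64::MIN`.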
baracuda_kernels_reduce_var_backward_bf16_can_implement^⚠: Pre-launch implementability check for reduce_var_backward_bf16.
baracuda_kernels_reduce_var_backward_bf16_run^⚠: Variance reduction backward, bf16.
baracuda_kernels_reduce_var_backward_f16_can_implement^⚠: Pre-launch implementability check for reduce_var_backward_f16.
baracuda_kernels_reduce_var_backward_f16_run^⚠: Variance reduction backward, f16.
baracuda_kernels_reduce_var_backward_f32_can_implement^⚠: Pre-launch implementability check for reduce_var_backward_f32.
baracuda_kernels_reduce_var_backward_f32_run^⚠: Variance reduction backward, f32 (Welford BW).
baracuda_kernels_reduce_var_backward_f64_can_implement^⚠: Pre-launch implementability check for reduce_var_backward_f64.
baracuda_kernels_reduce_var_backward_f64_run^⚠: Variance reduction backward, f64 (Welford BW in f64).
baracuda_kernels_reduce_var_bf16_can_implement^⚠: Pre-launch implementability check for reduce_var_bf16.
baracuda_kernels_reduce_var_bf16_run^⚠: Variance reduction along one axis, bf16.
baracuda_kernels_reduce_var_f16_can_implement^⚠: Pre-launch implementability check for reduce_var_f16.
baracuda_kernels_reduce_var_f16_run^⚠: Variance reduction along one axis, f16.
baracuda_kernels_reduce_var_f32_can_implement^⚠: Pre-launch implementability check for reduce_var_f32.
baracuda_kernels_reduce_var_f32_run^⚠: Variance reduction along one axis, f32, Welford one-pass. correction = 1 for Bessel-corrected sample variance, 0 for population variance.
baracuda_kernels_reduce_var_f64_can_implement^⚠: Pre-launch implementability check for reduce_var_f64.
baracuda_kernels_reduce_var_f64_run^⚠: Variance reduction along one axis, f64 (Welford in f64).
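The variance entries above mention a Welford one-pass update and a correction parameter (1 for Bessel-corrected sample variance, 0 for population variance). The sketch below shows the textbook Welford recurrence on a single slice; `welford_var` is a hypothetical helper, the `u32` type for `correction` is an assumption, and the kernel's actual parallel formulation may differ.

```rust
/// Hypothetical CPU reference: Welford's one-pass mean / M2 update.
/// Returns M2 / (n - correction). No guard is applied when
/// n <= correction, so the result is then NaN or infinite.
pub fn welford_var(xs: &[f32], correction: u32) -> f32 {
    let mut n = 0.0f32;
    let mut mean = 0.0f32;
    let mut m2 = 0.0f32; // running sum of squared deviations from the mean
    for &x in xs {
        n += 1.0;
        let delta = x - mean; // deviation from the old mean
        mean += delta / n;
        m2 += delta * (x - mean); // old deviation times new deviation
    }
    m2 / (n - correction as f32)
}
```

The standard deviation variants listed earlier correspond to taking the square root of this value.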
baracuda_kernels_repeat_backward_bf16_can_implement^⚠: Implementability check for repeat_backward_bf16.
baracuda_kernels_repeat_backward_bf16_run^⚠: Repeat backward (gather-adjoint sum), bf16. Accumulates in float.
baracuda_kernels_repeat_backward_f16_can_implement^⚠: Implementability check for repeat_backward_f16.
baracuda_kernels_repeat_backward_f16_run^⚠: Repeat backward (gather-adjoint sum), f16. Accumulates in float.
baracuda_kernels_repeat_backward_f32_can_implement^⚠: Implementability check for repeat_backward_f32.
baracuda_kernels_repeat_backward_f32_run^⚠: Repeat backward (gather-adjoint sum), f32.
baracuda_kernels_repeat_backward_f64_can_implement^⚠: Implementability check for repeat_backward_f64.
baracuda_kernels_repeat_backward_f64_run^⚠: Repeat backward (gather-adjoint sum), f64.
baracuda_kernels_repeat_bf16_can_implement^⚠: Pre-launch implementability check for repeat_bf16.
baracuda_kernels_repeat_bf16_run^⚠: Repeat (per-axis tile), bf16.
baracuda_kernels_repeat_f16_can_implement^⚠: Pre-launch implementability check for repeat_f16.
baracuda_kernels_repeat_f16_run^⚠: Repeat (per-axis tile), f16. Same parameter shape as the f32 variant — pure copy, no arithmetic.
baracuda_kernels_repeat_f32_can_implement^⚠: Pre-launch implementability check for repeat_f32.
baracuda_kernels_repeat_f32_run^⚠: Repeat (per-axis tile), f32. output.shape[d] = input.shape[d] * repeats[d]. Kernel computes input_coord[d] = output_coord[d] % input.shape[d].
baracuda_kernels_repeat_f64_can_implement^⚠: Pre-launch implementability check for repeat_f64.
baracuda_kernels_repeat_f64_run^⚠: Repeat (per-axis tile), f64.
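The f32 repeat entry above gives the shape rule output.shape[d] = input.shape[d] * repeats[d] and the coordinate mapping input_coord[d] = output_coord[d] % input.shape[d]. A CPU sketch of that mapping follows, assuming contiguous row-major tensors; `repeat` here is a hypothetical helper, not the FFI entry point.

```rust
/// Hypothetical CPU reference for the per-axis tile. Both tensors are
/// assumed contiguous and row-major. Returns (output data, output shape).
pub fn repeat(input: &[f32], in_shape: &[usize], repeats: &[usize]) -> (Vec<f32>, Vec<usize>) {
    assert_eq!(in_shape.len(), repeats.len());
    assert_eq!(input.len(), in_shape.iter().product::<usize>());
    // output.shape[d] = input.shape[d] * repeats[d]
    let out_shape: Vec<usize> = in_shape.iter().zip(repeats).map(|(&s, &r)| s * r).collect();
    let total: usize = out_shape.iter().product();
    let mut out = Vec::with_capacity(total);
    for flat in 0..total {
        // Peel output coordinates off the flat index, last axis first,
        // and accumulate the matching input offset.
        let mut rem = flat;
        let mut in_off = 0;
        let mut in_stride = 1;
        for d in (0..in_shape.len()).rev() {
            let out_coord = rem % out_shape[d];
            rem /= out_shape[d];
            // input_coord[d] = output_coord[d] % input.shape[d]
            in_off += (out_coord % in_shape[d]) * in_stride;
            in_stride *= in_shape[d];
        }
        out.push(input[in_off]);
    }
    (out, out_shape)
}
```

Tiling a 2×2 input with repeats [1, 2] yields a 2×4 output in which each row is the input row written twice.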
baracuda_kernels_rfft_1d_f32_run^⚠: 1-D R2C FFT (real → Hermitian-half complex). Unnormalized (matches PyTorch’s norm="backward").
baracuda_kernels_rfft_1d_f32_workspace_size^⚠: 1-D R2C FFT workspace size in bytes — always 0.
baracuda_kernels_rfft_1d_f64_run^⚠: 1-D R2C FFT (real → Hermitian-half complex). Unnormalized (matches PyTorch’s norm="backward").
baracuda_kernels_rfft_1d_f64_workspace_size^⚠: 1-D R2C FFT workspace size in bytes — always 0.
baracuda_kernels_rfft_nd_f32_run^⚠: ND R2C FFT (real → Hermitian-half complex). Unnormalized. dims[..rank] are real-side extents; complex output has dims[rank-1] / 2 + 1 on the last transformed axis.
baracuda_kernels_rfft_nd_f32_workspace_size^⚠: ND R2C FFT workspace size in bytes — always 0.
baracuda_kernels_rfft_nd_f64_run^⚠: ND R2C FFT (real → Hermitian-half complex). Unnormalized. dims[..rank] are real-side extents; complex output has dims[rank-1] / 2 + 1 on the last transformed axis.
baracuda_kernels_rfft_nd_f64_workspace_size^⚠: ND R2C FFT workspace size in bytes — always 0.
baracuda_kernels_rms_norm_backward_bf16_can_implement^⚠: baracuda_kernels_rms_norm_backward_bf16_can_implement (baracuda kernels rms norm backward bf16 can implement).
baracuda_kernels_rms_norm_backward_bf16_run^⚠: RMSNorm BW, bf16.
baracuda_kernels_rms_norm_backward_bf16_strided_can_implement^⚠: rms_norm_backward_bf16_strided_can_implement companion.
baracuda_kernels_rms_norm_backward_bf16_strided_run^⚠: RMSNorm BW strided sibling, bf16.
baracuda_kernels_rms_norm_backward_f16_can_implement^⚠: baracuda_kernels_rms_norm_backward_f16_can_implement (baracuda kernels rms norm backward f16 can implement).
baracuda_kernels_rms_norm_backward_f16_run^⚠: RMSNorm BW, f16.
baracuda_kernels_rms_norm_backward_f16_strided_can_implement^⚠: rms_norm_backward_f16_strided_can_implement companion.
baracuda_kernels_rms_norm_backward_f16_strided_run^⚠: RMSNorm BW strided sibling, f16.
baracuda_kernels_rms_norm_backward_f32_can_implement^⚠: Implementability check for rms_norm_backward_f32.
baracuda_kernels_rms_norm_backward_f32_run^⚠: RMSNorm BW, f32. Computes dx and (when dgamma != null) dgamma[i] = Σ over outer cells dy[..., i] · (x[..., i] / rms[..., 0]) where i ranges over the joint normalized region of length norm_total_extent.
baracuda_kernels_rms_norm_backward_f32_strided_can_implement^⚠: rms_norm_backward_f32_strided_can_implement companion.
baracuda_kernels_rms_norm_backward_f32_strided_run^⚠: RMSNorm BW strided sibling, f32. Same contract as baracuda_kernels_rms_norm_backward_f32_run; identical underlying launcher.
baracuda_kernels_rms_norm_backward_f64_can_implement^⚠: Implementability check for rms_norm_backward_f64.
baracuda_kernels_rms_norm_backward_f64_run^⚠: RMSNorm BW, f64.
baracuda_kernels_rms_norm_backward_f64_strided_can_implement^⚠: rms_norm_backward_f64_strided_can_implement companion.
baracuda_kernels_rms_norm_backward_f64_strided_run^⚠: RMSNorm BW strided sibling, f64.
baracuda_kernels_rms_norm_bf16_can_implement^⚠: Implementability check for rms_norm_bf16.
baracuda_kernels_rms_norm_bf16_run^⚠: RMSNorm FW, bf16. f32 accumulator inside the kernel.
baracuda_kernels_rms_norm_bf16_strided_can_implement^⚠: rms_norm_bf16_strided_can_implement companion.
baracuda_kernels_rms_norm_bf16_strided_run^⚠: RMSNorm FW strided sibling, bf16. See rms_norm_f32_strided_run.
baracuda_kernels_rms_norm_f16_can_implement^⚠: Implementability check for rms_norm_f16.
baracuda_kernels_rms_norm_f16_run^⚠: RMSNorm FW, f16. f32 accumulator inside the kernel.
baracuda_kernels_rms_norm_f16_strided_can_implement^⚠: rms_norm_f16_strided_can_implement companion.
baracuda_kernels_rms_norm_f16_strided_run^⚠: RMSNorm FW strided sibling, f16. See rms_norm_f32_strided_run.
baracuda_kernels_rms_norm_f32_can_implement^⚠: Implementability check for rms_norm_f32.
baracuda_kernels_rms_norm_f32_run^⚠: RMSNorm FW, f32. y = x / sqrt(mean(x², over norm_axes) + eps) * gamma. norm_axes_mask is a bitmask over input axes (suffix of [0, rank)); norm_total_extent is the product of those axes’ extents. gamma may be null (treated as 1). rms_out shape equals input shape with norm axes collapsed to 1; only the slot at inner_lin == 0 within each row is written.
baracuda_kernels_rms_norm_f32_strided_can_implement^⚠: rms_norm_f32_strided_can_implement companion.
baracuda_kernels_rms_norm_f32_strided_run^⚠: RMSNorm FW strided sibling, f32. Same contract as baracuda_kernels_rms_norm_f32_run; identical underlying launcher.
baracuda_kernels_rms_norm_f64_can_implement^⚠: Implementability check for rms_norm_f64.
baracuda_kernels_rms_norm_f64_run^⚠: RMSNorm FW, f64.
baracuda_kernels_rms_norm_f64_strided_can_implement^⚠: rms_norm_f64_strided_can_implement companion.
baracuda_kernels_rms_norm_f64_strided_run^⚠: RMSNorm FW strided sibling, f64. See rms_norm_f32_strided_run.
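The forward formula given for rms_norm_f32 can be modeled on the host to validate kernel output. The sketch below assumes the simplest case, a row-major [rows, cols] tensor normalized over its last axis, and assumes the saved value is sqrt(mean(x²) + eps); `rms_norm_ref` is a hypothetical helper name, and the crate's actual argument list (norm_axes_mask, norm_total_extent, and so on) is not reproduced here.

```rust
/// CPU reference: RMSNorm over the last axis of a row-major [rows, cols] tensor.
/// Returns (y, rms) with one rms value per row.
/// Hypothetical helper for illustration; not an entry point of this crate.
fn rms_norm_ref(x: &[f32], cols: usize, gamma: Option<&[f32]>, eps: f32) -> (Vec<f32>, Vec<f32>) {
    assert!(cols > 0 && x.len() % cols == 0);
    let mut y = Vec::with_capacity(x.len());
    let mut rms_out = Vec::with_capacity(x.len() / cols);
    for row in x.chunks(cols) {
        // mean(x^2) over the normalized region, then add eps before the sqrt.
        let mean_sq = row.iter().map(|v| v * v).sum::<f32>() / cols as f32;
        let rms = (mean_sq + eps).sqrt();
        rms_out.push(rms);
        for (i, v) in row.iter().enumerate() {
            // A missing gamma is treated as 1, matching the null-pointer case.
            let g = gamma.map_or(1.0, |g| g[i]);
            y.push(v / rms * g);
        }
    }
    (y, rms_out)
}
```

The per-row rms value returned here is the quantity the backward formula divides by (x / rms), which is presumably why the forward kernel saves it.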
baracuda_kernels_roi_align_backward_f32_can_implement^⚠: Implementability check for roi_align_backward_f32.
baracuda_kernels_roi_align_backward_f32_run^⚠: roi_align BW, f32. The caller must zero dinput before the launch. Safety: same requirements as the FW kernel.
baracuda_kernels_roi_align_backward_f64_can_implement^⚠: Implementability check for roi_align_backward_f64.
baracuda_kernels_roi_align_backward_f64_run^⚠: roi_align BW, f64. Safety: same requirements as the f32 BW kernel.
baracuda_kernels_roi_align_f32_can_implement^⚠: Implementability check for roi_align_f32.
baracuda_kernels_roi_align_f32_run^⚠: roi_align, f32. rois: [num_rois, 5] (batch_idx, x1, y1, x2, y2) in INPUT-pixel coords (scaled by spatial_scale inside the kernel). sampling_ratio == 0 selects adaptive sampling. aligned == 0 is PyTorch’s pre-0.6 convention.
baracuda_kernels_roi_align_f64_can_implement^⚠: Implementability check for roi_align_f64.
baracuda_kernels_roi_align_f64_run^⚠: roi_align, f64. Safety: same requirements as the f32 kernel.
baracuda_kernels_roi_pool_backward_f32_can_implement^⚠: Implementability check for roi_pool_backward_f32.
baracuda_kernels_roi_pool_backward_f32_run^⚠: roi_pool BW, f32. The caller must zero dinput before the launch. Safety: same requirements as the FW kernel.
baracuda_kernels_roi_pool_backward_f64_can_implement^⚠: Implementability check for roi_pool_backward_f64.
baracuda_kernels_roi_pool_backward_f64_run^⚠: roi_pool BW, f64. Safety: same requirements as the f32 BW kernel.
baracuda_kernels_roi_pool_f32_can_implement^⚠: Implementability check for roi_pool_f32.
baracuda_kernels_roi_pool_f32_run^⚠: roi_pool, f32. Writes output AND argmax (i32 linear plane-relative index per output cell; -1 for empty bins).
baracuda_kernels_roi_pool_f64_can_implement^⚠: Implementability check for roi_pool_f64.
baracuda_kernels_roi_pool_f64_run^⚠: roi_pool, f64. Safety: same requirements as the f32 kernel.
baracuda_kernels_roll_bf16_can_implement^⚠: Pre-launch implementability check for roll_bf16.
baracuda_kernels_roll_bf16_run^⚠: Roll, bf16. Pure element copy — no math.
baracuda_kernels_roll_bf16_strided_can_implement^⚠: roll_bf16_strided_can_implement companion.
baracuda_kernels_roll_bf16_strided_run^⚠: Roll strided sibling, bf16.
baracuda_kernels_roll_f16_can_implement^⚠: Pre-launch implementability check for roll_f16.
baracuda_kernels_roll_f16_run^⚠: Roll, f16. Pure element copy — no math.
baracuda_kernels_roll_f16_strided_can_implement^⚠: roll_f16_strided_can_implement companion.
baracuda_kernels_roll_f16_strided_run^⚠: Roll strided sibling, f16.
baracuda_kernels_roll_f32_can_implement^⚠: Pre-launch implementability check for roll_f32.
baracuda_kernels_roll_f32_run^⚠: Roll (cyclic shift along axes), f32. shifts[d] is the shift amount on axis d (positive or negative, mod shape[d]).
baracuda_kernels_roll_f32_strided_can_implement^⚠: roll_f32_strided_can_implement companion.
baracuda_kernels_roll_f32_strided_run^⚠: Roll strided sibling, f32.
baracuda_kernels_roll_f64_can_implement^⚠: Pre-launch implementability check for roll_f64.
baracuda_kernels_roll_f64_run^⚠: Roll, f64. Pure element copy — no math.
baracuda_kernels_roll_f64_strided_can_implement^⚠: roll_f64_strided_can_implement companion.
baracuda_kernels_roll_f64_strided_run^⚠: Roll strided sibling, f64.
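The cyclic shift described for roll_f32 can be sketched on the host for a single axis. The shift direction is an assumption here: the sketch follows the torch.roll convention, out[(i + shift) mod n] = input[i], which the listing above does not confirm. `roll_ref` is a hypothetical helper name and is not part of this crate.

```rust
/// CPU reference for a cyclic shift of a 1-D buffer.
/// Assumes the torch.roll convention: out[(i + shift) mod n] = input[i].
/// Hypothetical helper for illustration; not an entry point of this crate.
fn roll_ref(input: &[f32], shift: i64) -> Vec<f32> {
    let n = input.len();
    let mut out = input.to_vec();
    if n == 0 {
        return out;
    }
    // rem_euclid maps negative shifts into [0, n), so -1 behaves like n - 1.
    let s = shift.rem_euclid(n as i64) as usize;
    for (i, v) in input.iter().enumerate() {
        out[(i + s) % n] = *v;
    }
    out
}
```

Because the shift is reduced modulo the axis length, a shift of 5 on a length-4 buffer gives the same result as a shift of 1.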
baracuda_kernels_rope_apply_backward_bf16_can_implement^⚠: Implementability check for rope_apply_backward_bf16.
baracuda_kernels_rope_apply_backward_bf16_run^⚠: RoPE apply BW, bf16.
baracuda_kernels_rope_apply_backward_f16_can_implement^⚠: Implementability check for rope_apply_backward_f16.
baracuda_kernels_rope_apply_backward_f16_run^⚠: RoPE apply BW, f16.
baracuda_kernels_rope_apply_backward_f32_can_implement^⚠: Implementability check for rope_apply_backward_f32.
baracuda_kernels_rope_apply_backward_f32_run^⚠: RoPE apply BW, f32. Uses the same cos/sin tables as FW and applies the inverse rotation (the rotation is orthogonal, so its inverse is its transpose).
baracuda_kernels_rope_apply_backward_f64_can_implement^⚠: Implementability check for rope_apply_backward_f64.
baracuda_kernels_rope_apply_backward_f64_run^⚠: RoPE apply BW, f64.
baracuda_kernels_rope_apply_bf16_can_implement^⚠: Implementability check for rope_apply_bf16.
baracuda_kernels_rope_apply_bf16_run^⚠: RoPE apply FW, bf16 (f32 trig table, f32 multiply detour).
baracuda_kernels_rope_apply_f16_can_implement^⚠: Implementability check for rope_apply_f16.
baracuda_kernels_rope_apply_f16_run^⚠: RoPE apply FW, f16 (f32 trig table, f32 multiply detour).
baracuda_kernels_rope_apply_f32_can_implement^⚠: Implementability check for rope_apply_f32. Host-side only.
baracuda_kernels_rope_apply_f32_run^⚠: RoPE apply FW, f32. Cos/sin tables provided by caller.
baracuda_kernels_rope_apply_f64_can_implement^⚠: Implementability check for rope_apply_f64.
baracuda_kernels_rope_apply_f64_run^⚠: RoPE apply FW, f64 (f32 trig table promoted to double at load).
baracuda_kernels_rope_apply_interleaved_backward_bf16_can_implement^⚠: Implementability check for rope_apply_interleaved_backward_bf16.
baracuda_kernels_rope_apply_interleaved_backward_bf16_run^⚠: RoPE apply interleaved BW, bf16.
baracuda_kernels_rope_apply_interleaved_backward_f16_can_implement^⚠: Implementability check for rope_apply_interleaved_backward_f16.
baracuda_kernels_rope_apply_interleaved_backward_f16_run^⚠: RoPE apply interleaved BW, f16.
baracuda_kernels_rope_apply_interleaved_backward_f32_can_implement^⚠: Implementability check for rope_apply_interleaved_backward_f32.
baracuda_kernels_rope_apply_interleaved_backward_f32_run^⚠: RoPE apply interleaved BW, f32.
baracuda_kernels_rope_apply_interleaved_backward_f64_can_implement^⚠: Implementability check for rope_apply_interleaved_backward_f64.
baracuda_kernels_rope_apply_interleaved_backward_f64_run^⚠: RoPE apply interleaved BW, f64.
baracuda_kernels_rope_apply_interleaved_bf16_can_implement^⚠: Implementability check for rope_apply_interleaved_bf16.
baracuda_kernels_rope_apply_interleaved_bf16_run^⚠: RoPE apply interleaved FW, bf16.
baracuda_kernels_rope_apply_interleaved_f16_can_implement^⚠: Implementability check for rope_apply_interleaved_f16.
baracuda_kernels_rope_apply_interleaved_f16_run^⚠: RoPE apply interleaved FW, f16.
baracuda_kernels_rope_apply_interleaved_f32_can_implement^⚠: Implementability check for rope_apply_interleaved_f32.
baracuda_kernels_rope_apply_interleaved_f32_run^⚠: RoPE apply interleaved FW, f32.
baracuda_kernels_rope_apply_interleaved_f64_can_implement^⚠: Implementability check for rope_apply_interleaved_f64.
baracuda_kernels_rope_apply_interleaved_f64_run^⚠: RoPE apply interleaved FW, f64.
baracuda_kernels_rope_apply_thd_backward_bf16_can_implement^⚠: Implementability check for rope_apply_thd_backward_bf16.
baracuda_kernels_rope_apply_thd_backward_bf16_run^⚠: RoPE apply THD BW, bf16.
baracuda_kernels_rope_apply_thd_backward_f16_can_implement^⚠: Implementability check for rope_apply_thd_backward_f16.
baracuda_kernels_rope_apply_thd_backward_f16_run^⚠: RoPE apply THD BW, f16.
baracuda_kernels_rope_apply_thd_backward_f32_can_implement^⚠: Implementability check for rope_apply_thd_backward_f32.
baracuda_kernels_rope_apply_thd_backward_f32_run^⚠: RoPE apply THD BW, f32.
baracuda_kernels_rope_apply_thd_backward_f64_can_implement^⚠: Implementability check for rope_apply_thd_backward_f64.
baracuda_kernels_rope_apply_thd_backward_f64_run^⚠: RoPE apply THD BW, f64.
baracuda_kernels_rope_apply_thd_bf16_can_implement^⚠: Implementability check for rope_apply_thd_bf16.
baracuda_kernels_rope_apply_thd_bf16_run^⚠: RoPE apply THD FW, bf16.
baracuda_kernels_rope_apply_thd_f16_can_implement^⚠: Implementability check for rope_apply_thd_f16.
baracuda_kernels_rope_apply_thd_f16_run^⚠: RoPE apply THD FW, f16.
baracuda_kernels_rope_apply_thd_f32_can_implement^⚠: Implementability check for rope_apply_thd_f32.
baracuda_kernels_rope_apply_thd_f32_run^⚠: RoPE apply THD FW, f32.
baracuda_kernels_rope_apply_thd_f64_can_implement^⚠: Implementability check for rope_apply_thd_f64.
baracuda_kernels_rope_apply_thd_f64_run^⚠: RoPE apply THD FW, f64.
baracuda_kernels_rope_backward_bf16_can_implement^⚠: Implementability check for rope_backward_bf16. Host-side only.
baracuda_kernels_rope_backward_bf16_run^⚠: RoPE BW, bf16.
baracuda_kernels_rope_backward_bf16_strided_can_implement^⚠: Implementability check for rope_backward_bf16_strided. Host-side only.
baracuda_kernels_rope_backward_bf16_strided_run^⚠: RoPE BW strided, bf16.
baracuda_kernels_rope_backward_f16_can_implement^⚠: Implementability check for rope_backward_f16. Host-side only.
baracuda_kernels_rope_backward_f16_run^⚠: RoPE BW, f16.
baracuda_kernels_rope_backward_f16_strided_can_implement^⚠: Implementability check for rope_backward_f16_strided. Host-side only.
baracuda_kernels_rope_backward_f16_strided_run^⚠: RoPE BW strided, f16.
baracuda_kernels_rope_backward_f32_can_implement^⚠: Implementability check for rope_backward_f32. Host-side only.
baracuda_kernels_rope_backward_f32_run^⚠: RoPE BW, f32. Same shape as FW; computes dx from dy by rotation through -θ.
baracuda_kernels_rope_backward_f32_strided_can_implement^⚠: Implementability check for rope_backward_f32_strided. Host-side only.
baracuda_kernels_rope_backward_f32_strided_run^⚠: RoPE BW strided, f32. Strides apply to dy (input) and dx (output).
baracuda_kernels_rope_backward_f64_can_implement^⚠: Implementability check for rope_backward_f64. Host-side only.
baracuda_kernels_rope_backward_f64_run^⚠: RoPE BW, f64.
baracuda_kernels_rope_backward_f64_strided_can_implement^⚠: Implementability check for rope_backward_f64_strided. Host-side only.
baracuda_kernels_rope_backward_f64_strided_run^⚠: RoPE BW strided, f64.
baracuda_kernels_rope_bf16_can_implement^⚠: Implementability check for rope_bf16. Host-side only.
baracuda_kernels_rope_bf16_run^⚠: RoPE FW, bf16.
baracuda_kernels_rope_bf16_strided_can_implement^⚠: Implementability check for rope_bf16_strided. Host-side only.
baracuda_kernels_rope_bf16_strided_run^⚠: RoPE FW strided, bf16.
baracuda_kernels_rope_f16_can_implement^⚠: Implementability check for rope_f16. Host-side only.
baracuda_kernels_rope_f16_run^⚠: RoPE FW, f16 (f32 trig detour internally).
baracuda_kernels_rope_f16_strided_can_implement^⚠: Implementability check for rope_f16_strided. Host-side only.
baracuda_kernels_rope_f16_strided_run^⚠: RoPE FW strided, f16.
baracuda_kernels_rope_f32_can_implement^⚠: Implementability check for rope_f32. Host-side only.
baracuda_kernels_rope_f32_run^⚠: RoPE FW, f32. Input/output are [B, H, S, D] contiguous row-major; head_dim (D) must be even. When pos_default_flag != 0, the kernel ignores positions and uses position index = sequence index; otherwise positions is int64_t[seq].
baracuda_kernels_rope_f32_strided_can_implement^⚠: Implementability check for rope_f32_strided. Host-side only.
baracuda_kernels_rope_f32_strided_run^⚠: RoPE FW strided, f32.
baracuda_kernels_rope_f64_can_implement^⚠: Implementability check for rope_f64. Host-side only.
baracuda_kernels_rope_f64_run^⚠: RoPE FW, f64.
baracuda_kernels_rope_f64_strided_can_implement^⚠: Implementability check for rope_f64_strided. Host-side only.
baracuda_kernels_rope_f64_strided_run^⚠: RoPE FW strided, f64.
baracuda_kernels_scale_inplace_c32_can_implement^⚠: Implementability check for baracuda_kernels_scale_inplace_c32. Host-side only.
baracuda_kernels_scale_inplace_c32_run^⚠: In-place scale of a cufftComplex buffer by a real scalar: y[i].x *= scale; y[i].y *= scale;. Applied after cufftExecC2C in the inverse direction to bake in the 1/N normalization PyTorch expects.
baracuda_kernels_scale_inplace_c64_can_implement^⚠: Implementability check for baracuda_kernels_scale_inplace_c64. Host-side only.
baracuda_kernels_scale_inplace_c64_run^⚠: In-place scale of a cufftDoubleComplex buffer by a real scalar. f64 analogue of baracuda_kernels_scale_inplace_c32_run.
baracuda_kernels_scale_inplace_real_f32_can_implement^⚠: Implementability check for baracuda_kernels_scale_inplace_real_f32. Host-side only.
baracuda_kernels_scale_inplace_real_f32_run^⚠: In-place scale of a real f32 buffer. Used to bake the 1/N normalization into the output of cufftExecC2R (IRFFT).
baracuda_kernels_scale_inplace_real_f64_can_implement^⚠: Implementability check for baracuda_kernels_scale_inplace_real_f64. Host-side only.
baracuda_kernels_scale_inplace_real_f64_run^⚠: In-place scale of a real f64 buffer. f64 analogue.
baracuda_kernels_scan_cummax_backward_bf16_can_implement^⚠: Pre-launch implementability check for scan_cummax_backward_bf16.
baracuda_kernels_scan_cummax_backward_bf16_run^⚠: Cummax backward, bf16.
baracuda_kernels_scan_cummax_backward_f16_can_implement^⚠: Pre-launch implementability check for scan_cummax_backward_f16.
baracuda_kernels_scan_cummax_backward_f16_run^⚠: Cummax backward, f16.
baracuda_kernels_scan_cummax_backward_f32_can_implement^⚠: Pre-launch implementability check for scan_cummax_backward_f32.
baracuda_kernels_scan_cummax_backward_f32_run^⚠: Cummax backward, f32. Walks the forward scan tracking first-occurrence argmax; gradient flows to the source position.
baracuda_kernels_scan_cummax_backward_f64_can_implement^⚠: Pre-launch implementability check for scan_cummax_backward_f64.
baracuda_kernels_scan_cummax_backward_f64_run^⚠: Cummax backward, f64.
baracuda_kernels_scan_cummax_bf16_can_implement^⚠: Pre-launch implementability check for scan_cummax_bf16.
baracuda_kernels_scan_cummax_bf16_run^⚠: Cummax, bf16.
baracuda_kernels_scan_cummax_f16_can_implement^⚠: Pre-launch implementability check for scan_cummax_f16.
baracuda_kernels_scan_cummax_f16_run^⚠: Cummax, f16.
baracuda_kernels_scan_cummax_f32_can_implement^⚠: Pre-launch implementability check for scan_cummax_f32.
baracuda_kernels_scan_cummax_f32_run^⚠: Cummax (inclusive prefix running max), f32.
baracuda_kernels_scan_cummax_f64_can_implement^⚠: Pre-launch implementability check for scan_cummax_f64.
baracuda_kernels_scan_cummax_f64_run^⚠: Cummax, f64.
baracuda_kernels_scan_cummin_backward_bf16_can_implement^⚠: Pre-launch implementability check for scan_cummin_backward_bf16.
baracuda_kernels_scan_cummin_backward_bf16_run^⚠: Cummin backward, bf16.
baracuda_kernels_scan_cummin_backward_f16_can_implement^⚠: Pre-launch implementability check for scan_cummin_backward_f16.
baracuda_kernels_scan_cummin_backward_f16_run^⚠: Cummin backward, f16.
baracuda_kernels_scan_cummin_backward_f32_can_implement^⚠: Pre-launch implementability check for scan_cummin_backward_f32.
baracuda_kernels_scan_cummin_backward_f32_run^⚠: Cummin backward, f32. Same kernel shape as Cummax BW with < instead of > for the tie-tracking comparison.
baracuda_kernels_scan_cummin_backward_f64_can_implement^⚠: Pre-launch implementability check for scan_cummin_backward_f64.
baracuda_kernels_scan_cummin_backward_f64_run^⚠: Cummin backward, f64.
baracuda_kernels_scan_cummin_bf16_can_implement^⚠: Pre-launch implementability check for scan_cummin_bf16.
baracuda_kernels_scan_cummin_bf16_run^⚠: Cummin, bf16.
baracuda_kernels_scan_cummin_f16_can_implement^⚠: Pre-launch implementability check for scan_cummin_f16.
baracuda_kernels_scan_cummin_f16_run^⚠: Cummin, f16.
baracuda_kernels_scan_cummin_f32_can_implement^⚠: Pre-launch implementability check for scan_cummin_f32.
baracuda_kernels_scan_cummin_f32_run^⚠: Cummin (inclusive prefix running min), f32.
baracuda_kernels_scan_cummin_f64_can_implement^⚠: Pre-launch implementability check for scan_cummin_f64.
baracuda_kernels_scan_cummin_f64_run^⚠: Cummin, f64.
baracuda_kernels_scan_cumprod_backward_bf16_can_implement^⚠: Pre-launch implementability check for scan_cumprod_backward_bf16.
baracuda_kernels_scan_cumprod_backward_bf16_run^⚠: Cumprod backward, bf16.
baracuda_kernels_scan_cumprod_backward_f16_can_implement^⚠: Pre-launch implementability check for scan_cumprod_backward_f16.
baracuda_kernels_scan_cumprod_backward_f16_run^⚠: Cumprod backward, f16. f32-detour accumulator.
baracuda_kernels_scan_cumprod_backward_f32_can_implement^⚠: Pre-launch implementability check for scan_cumprod_backward_f32.
baracuda_kernels_scan_cumprod_backward_f32_run^⚠: Cumprod backward, f32. Per-cell suffix accumulator of dy[i] * y[i] / x[j]. Caller must ensure x has no zeros along the scan axis.
baracuda_kernels_scan_cumprod_backward_f64_can_implement^⚠: Pre-launch implementability check for scan_cumprod_backward_f64.
baracuda_kernels_scan_cumprod_backward_f64_run^⚠: Cumprod backward, f64.
baracuda_kernels_scan_cumprod_bf16_can_implement^⚠: Pre-launch implementability check for scan_cumprod_bf16.
baracuda_kernels_scan_cumprod_bf16_run^⚠: Cumprod, bf16.
baracuda_kernels_scan_cumprod_f16_can_implement^⚠: Pre-launch implementability check for scan_cumprod_f16.
baracuda_kernels_scan_cumprod_f16_run^⚠: Cumprod, f16. f32-detour accumulator.
baracuda_kernels_scan_cumprod_f32_can_implement^⚠: Pre-launch implementability check for scan_cumprod_f32.
baracuda_kernels_scan_cumprod_f32_run^⚠: Cumprod (inclusive prefix product), f32. Same ABI as cumsum.
baracuda_kernels_scan_cumprod_f64_can_implement^⚠: Pre-launch implementability check for scan_cumprod_f64.
baracuda_kernels_scan_cumprod_f64_run^⚠: Cumprod, f64.
baracuda_kernels_scan_cumsum_bf16_can_implement^⚠: Pre-launch implementability check for scan_cumsum_bf16.
baracuda_kernels_scan_cumsum_bf16_run^⚠: Cumsum, bf16.
baracuda_kernels_scan_cumsum_f16_can_implement^⚠: Pre-launch implementability check for scan_cumsum_f16.
baracuda_kernels_scan_cumsum_f16_run^⚠: Cumsum, f16. f32-detour accumulator inside the kernel.
baracuda_kernels_scan_cumsum_f32_can_implement^⚠: Pre-launch implementability check for scan_cumsum_f32.
baracuda_kernels_scan_cumsum_f32_run^⚠: Inclusive prefix sum (cumsum) along a single axis, f32. reverse != 0 flips the scan direction.
baracuda_kernels_scan_cumsum_f64_can_implement^⚠: Pre-launch implementability check for scan_cumsum_f64.
baracuda_kernels_scan_cumsum_f64_run^⚠: Cumsum, f64.
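The inclusive scan and its reverse flag can be modeled on the host for one scan line. The sketch below assumes that a reversed scan is a suffix sum written back at the same positions (consistent with the "suffix-LSE" wording used for log_cumsum_exp); `cumsum_ref` is a hypothetical helper name and is not part of this crate.

```rust
/// CPU reference for an inclusive prefix sum along one scan line.
/// With `reverse` set, the scan runs from the last element to the first.
/// Hypothetical helper for illustration; not an entry point of this crate.
fn cumsum_ref(x: &[f32], reverse: bool) -> Vec<f32> {
    let mut out = vec![0.0; x.len()];
    let mut acc = 0.0f32;
    for step in 0..x.len() {
        // Visit indices in scan order; the output keeps the input layout.
        let i = if reverse { x.len() - 1 - step } else { step };
        acc += x[i];
        out[i] = acc;
    }
    out
}
```

For [1, 2, 3] the forward scan gives [1, 3, 6] and the reversed scan gives [6, 5, 3].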
baracuda_kernels_scan_log_cumsum_exp_backward_bf16_can_implement^⚠: Implementability check for scan_log_cumsum_exp_backward_bf16.
baracuda_kernels_scan_log_cumsum_exp_backward_bf16_run^⚠: LogCumsumExp BW, bf16.
baracuda_kernels_scan_log_cumsum_exp_backward_f16_can_implement^⚠: Implementability check for scan_log_cumsum_exp_backward_f16.
baracuda_kernels_scan_log_cumsum_exp_backward_f16_run^⚠: LogCumsumExp BW, f16. f32-detour accumulator.
baracuda_kernels_scan_log_cumsum_exp_backward_f32_can_implement^⚠: Implementability check for scan_log_cumsum_exp_backward_f32.
baracuda_kernels_scan_log_cumsum_exp_backward_f32_run^⚠: LogCumsumExp BW, f32. Per-cell accumulator of Σ dy[i] * exp(x[k] - y[i]) over the FW-direction-dependent i range. Needs both saved x and saved y (same shape since scans are length-preserving). Stable by construction: x[k] - y[i] ≤ 0 so exp(.) ∈ [0, 1].
baracuda_kernels_scan_log_cumsum_exp_backward_f64_can_implement^⚠: Implementability check for scan_log_cumsum_exp_backward_f64.
baracuda_kernels_scan_log_cumsum_exp_backward_f64_run^⚠: LogCumsumExp BW, f64.
baracuda_kernels_scan_log_cumsum_exp_bf16_can_implement^⚠: Implementability check for scan_log_cumsum_exp_bf16.
baracuda_kernels_scan_log_cumsum_exp_bf16_run^⚠: LogCumsumExp FW, bf16.
baracuda_kernels_scan_log_cumsum_exp_f16_can_implement^⚠: Implementability check for scan_log_cumsum_exp_f16.
baracuda_kernels_scan_log_cumsum_exp_f16_run^⚠: LogCumsumExp FW, f16. f32-detour accumulator inside the kernel.
baracuda_kernels_scan_log_cumsum_exp_f32_can_implement^⚠: Implementability check for scan_log_cumsum_exp_f32.
baracuda_kernels_scan_log_cumsum_exp_f32_run^⚠: LogCumsumExp FW, f32. y[k] = log(Σ_{j ≤ k} exp(x[j])) (or suffix-LSE when reverse != 0). Numerically stable via the online running-max algorithm. Same ABI as cumsum.
baracuda_kernels_scan_log_cumsum_exp_f64_can_implement^⚠: Implementability check for scan_log_cumsum_exp_f64.
baracuda_kernels_scan_log_cumsum_exp_f64_run^⚠: LogCumsumExp FW, f64.
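The online running-max approach named for LogCumsumExp FW can be illustrated with a sequential host-side sketch. It shows the general technique for the prefix direction only and is not a transcription of the kernel; `log_cumsum_exp_ref` is a hypothetical helper name, and inputs of negative infinity are not handled.

```rust
/// CPU reference: y[k] = log(sum over j <= k of exp(x[j])), computed with a
/// running max so that exp() never sees a positive argument.
/// Hypothetical helper for illustration; not an entry point of this crate.
fn log_cumsum_exp_ref(x: &[f32]) -> Vec<f32> {
    let mut out = Vec::with_capacity(x.len());
    let mut m = f32::NEG_INFINITY; // running max of the prefix
    let mut s = 0.0f32; // running sum of exp(x[j] - m)
    for &v in x {
        if v > m {
            // Rescale the old sum to the new max, then add exp(v - v) = 1.
            s = s * (m - v).exp() + 1.0;
            m = v;
        } else {
            s += (v - m).exp();
        }
        out.push(m + s.ln());
    }
    out
}
```

A naive log(sum(exp(x))) overflows f32 for inputs near 1000, while this form stays finite because every exponent is at most zero.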
baracuda_kernels_scatter_add_f32_can_implement^⚠: Implementability check for scatter_add_f32.
baracuda_kernels_scatter_add_f32_run^⚠: out[..., index[..., j, ...], ...] += updates[..., j, ...] along scatter_dim. f32 (atomicAdd).
baracuda_kernels_scatter_add_f64_can_implement^⚠: Implementability check for scatter_add_f64.
baracuda_kernels_scatter_add_f64_run^⚠: scatter_add — f64 (atomicAdd).
baracuda_kernels_scatter_add_i64idx_f32_can_implement^⚠: Implementability check for scatter_add_i64idx_f32.
baracuda_kernels_scatter_add_i64idx_f32_run^⚠: scatter_add — f32, i64 indices (atomicAdd).
baracuda_kernels_scatter_add_i64idx_f64_can_implement^⚠: Implementability check for scatter_add_i64idx_f64.
baracuda_kernels_scatter_add_i64idx_f64_run^⚠: scatter_add — f64, i64 indices.
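The accumulation rule for scatter_add can be sketched on the host for the 1-D case, where scatter_dim is the only axis. `scatter_add_ref` is a hypothetical helper name and is not part of this crate; the sequential loop stands in for the device-side atomicAdd.

```rust
/// CPU reference for a 1-D scatter_add: out[index[j]] += updates[j].
/// Hypothetical helper for illustration; not an entry point of this crate.
fn scatter_add_ref(out: &mut [f32], index: &[i32], updates: &[f32]) {
    assert_eq!(index.len(), updates.len());
    for (&idx, &u) in index.iter().zip(updates) {
        // Duplicate indices accumulate. A negative or out-of-range index
        // panics here on the slice bounds check; the raw kernels perform
        // no such check.
        out[idx as usize] += u;
    }
}
```

With duplicate indices the device result may differ from this reference in the last bits, since floating-point addition is not associative and the order in which atomic adds land is not fixed.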
baracuda_kernels_scatter_bf16_can_implement^⚠: Implementability check for scatter_bf16.
baracuda_kernels_scatter_bf16_run^⚠: scatter — bf16, i32 idx.
baracuda_kernels_scatter_f16_can_implement^⚠: Implementability check for scatter_f16.
baracuda_kernels_scatter_f16_run^⚠: scatter — f16, i32 idx.
baracuda_kernels_scatter_f32_can_implement^⚠: Implementability check for scatter_f32.
baracuda_kernels_scatter_f32_run^⚠: scatter — out[index] = updates, f32, i32 idx. NO accumulation.
baracuda_kernels_scatter_f64_can_implement^⚠: Implementability check for scatter_f64.
baracuda_kernels_scatter_f64_run^⚠: scatter — f64, i32 idx.
baracuda_kernels_scatter_i8_can_implement^⚠: Implementability check for scatter_i8.
baracuda_kernels_scatter_i8_run^⚠: scatter, i8, i32 idx.
baracuda_kernels_scatter_i16_can_implement^⚠: Implementability check for scatter_i16.
baracuda_kernels_scatter_i16_run^⚠: scatter, i16, i32 idx.
baracuda_kernels_scatter_i32_can_implement^⚠: Implementability check for scatter_i32.
baracuda_kernels_scatter_i32_run^⚠: scatter, i32, i32 idx.
baracuda_kernels_scatter_i64_can_implement^⚠: Implementability check for scatter_i64.
baracuda_kernels_scatter_i64_run^⚠: scatter, i64, i32 idx.
baracuda_kernels_scatter_i64idx_bf16_can_implement^⚠: Implementability check for scatter_i64idx_bf16.
baracuda_kernels_scatter_i64idx_bf16_run^⚠: scatter — bf16, i64 idx.
baracuda_kernels_scatter_i64idx_f16_can_implement^⚠: Implementability check for scatter_i64idx_f16.
baracuda_kernels_scatter_i64idx_f16_run^⚠: scatter — f16, i64 idx.
baracuda_kernels_scatter_i64idx_f32_can_implement^⚠: Implementability check for scatter_i64idx_f32.
baracuda_kernels_scatter_i64idx_f32_run^⚠: scatter — f32, i64 idx.
baracuda_kernels_scatter_i64idx_f64_can_implement^⚠: Implementability check for scatter_i64idx_f64.
baracuda_kernels_scatter_i64idx_f64_run^⚠: scatter — f64, i64 idx.
baracuda_kernels_scatter_i64idx_i8_can_implement^⚠: Implementability check for scatter_i64idx_i8.
baracuda_kernels_scatter_i64idx_i8_run^⚠: scatter, i8, i64 idx.
baracuda_kernels_scatter_i64idx_i16_can_implement^⚠: Implementability check for scatter_i64idx_i16.
baracuda_kernels_scatter_i64idx_i16_run^⚠: scatter, i16, i64 idx.
baracuda_kernels_scatter_i64idx_i32_can_implement^⚠: Implementability check for scatter_i64idx_i32.
baracuda_kernels_scatter_i64idx_i32_run^⚠: scatter, i32, i64 idx.
baracuda_kernels_scatter_i64idx_i64_can_implement^⚠: Implementability check for scatter_i64idx_i64.
baracuda_kernels_scatter_i64idx_i64_run^⚠: scatter, i64, i64 idx.
baracuda_kernels_scatter_i64idx_u8_can_implement^⚠: Implementability check for scatter_i64idx_u8.
baracuda_kernels_scatter_i64idx_u8_run^⚠: scatter, u8, i64 idx.
baracuda_kernels_scatter_i64idx_u16_can_implement^⚠: Implementability check for scatter_i64idx_u16.
baracuda_kernels_scatter_i64idx_u16_run^⚠: scatter, u16, i64 idx.
baracuda_kernels_scatter_i64idx_u32_can_implement^⚠: Implementability check for scatter_i64idx_u32.
baracuda_kernels_scatter_i64idx_u32_run^⚠: scatter, u32, i64 idx.
baracuda_kernels_scatter_u8_can_implement^⚠: Implementability check for scatter_u8.
baracuda_kernels_scatter_u8_run^⚠: scatter, u8, i32 idx.
baracuda_kernels_scatter_u16_can_implement^⚠: Implementability check for scatter_u16.
baracuda_kernels_scatter_u16_run^⚠: scatter, u16, i32 idx.
baracuda_kernels_scatter_u32_can_implement^⚠: Implementability check for scatter_u32.
baracuda_kernels_scatter_u32_run^⚠: scatter, u32, i32 idx.
baracuda_kernels_sdpa_backward_bf16_can_implement^⚠: Implementability check for sdpa_backward_bf16. Host-side only.
baracuda_kernels_sdpa_backward_bf16_run^⚠: SDPA BW, bf16.
baracuda_kernels_sdpa_backward_bf16_strided_can_implement^⚠: Implementability check for sdpa_backward_bf16_strided. Host-side only.
baracuda_kernels_sdpa_backward_bf16_strided_run^⚠: SDPA BW strided, bf16.
baracuda_kernels_sdpa_backward_f16_can_implement^⚠: Implementability check for sdpa_backward_f16. Host-side only.
baracuda_kernels_sdpa_backward_f16_run^⚠: SDPA BW, f16.
baracuda_kernels_sdpa_backward_f16_strided_can_implement^⚠: Implementability check for sdpa_backward_f16_strided. Host-side only.
baracuda_kernels_sdpa_backward_f16_strided_run^⚠: SDPA BW strided, f16.
baracuda_kernels_sdpa_backward_f32_can_implement^⚠: Implementability check for sdpa_backward_f32. Host-side only.
baracuda_kernels_sdpa_backward_f32_run^⚠: SDPA BW, f32. Given the FW-saved attn ([B, H, Q, K]), Q, K, V, and upstream dy, computes dQ, dK, dV. The dscores_ws argument is a caller-allocated [B, H, Q, K] scratch buffer reused as the dattn → dscores intermediate; size matches the FW attn tensor.
baracuda_kernels_sdpa_backward_f32_strided_can_implement^⚠: Implementability check for sdpa_backward_f32_strided. Host-side only.
baracuda_kernels_sdpa_backward_f32_strided_run^⚠: SDPA BW strided, f32.
baracuda_kernels_sdpa_backward_f64_can_implement^⚠: Implementability check for sdpa_backward_f64. Host-side only.
baracuda_kernels_sdpa_backward_f64_run^⚠: SDPA BW, f64.
baracuda_kernels_sdpa_backward_f64_strided_can_implement^⚠: Implementability check for sdpa_backward_f64_strided. Host-side only.
baracuda_kernels_sdpa_backward_f64_strided_run^⚠: SDPA BW strided, f64.
baracuda_kernels_sdpa_bf16_arbmask_can_implement^⚠: Arbitrary-mask SDPA host-side can-implement, bf16.
baracuda_kernels_sdpa_bf16_arbmask_run^⚠: Arbitrary additive-mask SDPA FW, bf16 (f32 accumulators).
baracuda_kernels_sdpa_bf16_can_implement^⚠: Implementability check for sdpa_bf16. Host-side only.
baracuda_kernels_sdpa_bf16_run^⚠: SDPA FW, bf16 (f32 accumulators).
baracuda_kernels_sdpa_bf16_strided_can_implement^⚠: Implementability check for sdpa_bf16_strided. Host-side only.
baracuda_kernels_sdpa_bf16_strided_run^⚠: SDPA FW strided, bf16.
baracuda_kernels_sdpa_f16_arbmask_can_implement^⚠: Arbitrary-mask SDPA host-side can-implement, f16.
baracuda_kernels_sdpa_f16_arbmask_run^⚠: Arbitrary additive-mask SDPA FW, f16 (f32 accumulators).
baracuda_kernels_sdpa_f16_can_implement^⚠: Implementability check for sdpa_f16. Host-side only.
baracuda_kernels_sdpa_f16_run^⚠: SDPA FW, f16 (f32 accumulators).
baracuda_kernels_sdpa_f16_strided_can_implement^⚠: Implementability check for sdpa_f16_strided. Host-side only.
baracuda_kernels_sdpa_f16_strided_run^⚠: SDPA FW strided, f16.
baracuda_kernels_sdpa_f32_arbmask_can_implement^⚠: Arbitrary-mask SDPA host-side can-implement, f32.
baracuda_kernels_sdpa_f32_arbmask_run^⚠: Arbitrary additive-mask SDPA FW, f32. mask shape [B, H, Q, K] f32, applied as an additive bias on the score tile before softmax.
baracuda_kernels_sdpa_f32_can_implement^⚠: Implementability check for sdpa_f32. Host-side only.
baracuda_kernels_sdpa_f32_run^⚠: SDPA FW, f32. Computes y = softmax(Q·K^T·scale + mask) · V. The attn buffer ([B, H, Q, K]) doubles as the scores intermediate and is overwritten in place with the softmax output (saved for BW). Pass has_mask = 0 and mask = nullptr to skip the mask add. is_causal = 1 applies an upper-triangular -inf mask inside the scores kernel.
baracuda_kernels_sdpa_f32_strided_can_implement^⚠: Implementability check for sdpa_f32_strided. Host-side only.
baracuda_kernels_sdpa_f32_strided_run^⚠: SDPA FW strided, f32.
baracuda_kernels_sdpa_f64_arbmask_can_implement^⚠: Arbitrary-mask SDPA host-side can-implement, f64.
baracuda_kernels_sdpa_f64_arbmask_run^⚠: Arbitrary additive-mask SDPA FW, f64.
baracuda_kernels_sdpa_f64_can_implement^⚠: Implementability check for sdpa_f64. Host-side only.
baracuda_kernels_sdpa_f64_run^⚠: SDPA FW, f64.
baracuda_kernels_sdpa_f64_strided_can_implement^⚠: Implementability check for sdpa_f64_strided. Host-side only.
baracuda_kernels_sdpa_f64_strided_run^⚠: SDPA FW strided, f64.
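The formula in the baracuda_kernels_sdpa_f32_run entry, y = softmax(Q·K^T·scale + mask) · V, can be checked against a host-side reference. The sketch below is not the crate's API: `sdpa_reference` and its parameter layout are hypothetical, it handles one (batch, head) slice with row-major buffers, and it assumes the causal mask is the strict upper triangle (j > i) aligned at the top-left corner.

```rust
/// Hypothetical CPU reference for one (batch, head) slice.
/// q: [q_len, d], k: [k_len, d], v: [k_len, d], mask: [q_len, k_len], all row-major.
/// Returns (attn [q_len, k_len], y [q_len, d]).
fn sdpa_reference(
    q: &[f32],
    k: &[f32],
    v: &[f32],
    q_len: usize,
    k_len: usize,
    d: usize,
    scale: f32,
    mask: Option<&[f32]>,
    is_causal: bool,
) -> (Vec<f32>, Vec<f32>) {
    let mut attn = vec![0.0f32; q_len * k_len];
    let mut y = vec![0.0f32; q_len * d];
    for i in 0..q_len {
        let row = &mut attn[i * k_len..(i + 1) * k_len];
        // Scores: scaled dot product plus optional additive mask.
        for j in 0..k_len {
            let mut s = 0.0f32;
            for c in 0..d {
                s += q[i * d + c] * k[j * d + c];
            }
            s *= scale;
            if let Some(m) = mask {
                s += m[i * k_len + j];
            }
            // Assumed causal convention: hide keys strictly after the query.
            if is_causal && j > i {
                s = f32::NEG_INFINITY;
            }
            row[j] = s;
        }
        // Softmax in place, so the scores buffer becomes the attn output.
        let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let mut sum = 0.0f32;
        for s in row.iter_mut() {
            *s = (*s - max).exp();
            sum += *s;
        }
        for s in row.iter_mut() {
            *s /= sum;
        }
        // y = attn · V
        for j in 0..k_len {
            for c in 0..d {
                y[i * d + c] += row[j] * v[j * d + c];
            }
        }
    }
    (attn, y)
}
```

Like the kernel described above, the reference reuses one buffer for scores and softmax output, which is why the returned attn can be compared directly with the buffer saved for the backward pass.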
baracuda_kernels_searchsorted_f32_can_implement^⚠: Implementability check for searchsorted_f32.
baracuda_kernels_searchsorted_f32_run^⚠: searchsorted, f32. right == 0 = lower_bound; right == 1 = upper_bound.
baracuda_kernels_searchsorted_f64_can_implement^⚠: Implementability check for searchsorted_f64.
baracuda_kernels_searchsorted_f64_run^⚠: searchsorted, f64.
baracuda_kernels_searchsorted_i32_can_implement^⚠: Implementability check for searchsorted_i32.
baracuda_kernels_searchsorted_i32_run^⚠: searchsorted, i32.
baracuda_kernels_searchsorted_i64_can_implement^⚠: Implementability check for searchsorted_i64.
baracuda_kernels_searchsorted_i64_run^⚠: searchsorted, i64.
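The `right` flag in the searchsorted entries selects between lower_bound and upper_bound. A minimal host-side sketch of that distinction, using the standard library's `partition_point`; the function name `searchsorted_ref` and its signature are illustrative only, not part of this crate.

```rust
/// Hypothetical host-side reference: index at which `value` would be
/// inserted into the ascending slice `sorted`.
/// right == 0: first position with element >= value (lower_bound).
/// right != 0: first position with element >  value (upper_bound).
fn searchsorted_ref(sorted: &[f32], value: f32, right: i32) -> usize {
    if right == 0 {
        sorted.partition_point(|&x| x < value)
    } else {
        sorted.partition_point(|&x| x <= value)
    }
}
```

The two modes differ only when `value` is already present: lower_bound lands before the run of equal elements, upper_bound after it.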
baracuda_kernels_segment_max_backward_f32_can_implement^⚠: Implementability check for segment_max_backward_f32.
baracuda_kernels_segment_max_backward_f32_run^⚠: d_input[k, d] = d_output[seg, d] iff k is the (first) max-argument of the segment in column d, else 0. Sorted seg ids. f32.
baracuda_kernels_segment_max_backward_f64_can_implement^⚠: Implementability check for segment_max_backward_f64.
baracuda_kernels_segment_max_backward_f64_run^⚠: segment_max_backward — f64.
baracuda_kernels_segment_max_f32_can_implement^⚠: Implementability check for segment_max_f32.
baracuda_kernels_segment_max_f32_run^⚠: out[s, d] = max_{n : seg[n] == s} input[n, d] — sorted. f32.
baracuda_kernels_segment_max_f64_can_implement^⚠: Implementability check for segment_max_f64.
baracuda_kernels_segment_max_f64_run^⚠: segment_max — f64.
baracuda_kernels_segment_max_i64idx_f32_can_implement^⚠: Implementability check for segment_max_i64idx_f32.
baracuda_kernels_segment_max_i64idx_f32_run^⚠: segment_max, f32, i64idx variant.
baracuda_kernels_segment_max_i64idx_f64_can_implement^⚠: Implementability check for segment_max_i64idx_f64.
baracuda_kernels_segment_max_i64idx_f64_run^⚠: segment_max, f64, i64idx variant.
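The segment_max entries above state the forward rule (per-segment, per-column maximum) and the backward rule (the gradient goes to the first max-argument only). A host-side sketch of both follows. The function names, the i32 segment-id type, and the choice of negative infinity for empty segments are assumptions made for illustration; the kernels' actual behaviour for empty segments is not documented here.

```rust
/// Hypothetical reference: out[s, d] = max over rows n with seg[n] == s.
/// input is [n, d] row-major; empty segments are left at -inf (assumption).
fn segment_max_ref(input: &[f32], seg: &[i32], num_segments: usize, d: usize) -> Vec<f32> {
    let mut out = vec![f32::NEG_INFINITY; num_segments * d];
    for (n, &s) in seg.iter().enumerate() {
        let s = s as usize;
        for c in 0..d {
            let o = &mut out[s * d + c];
            *o = (*o).max(input[n * d + c]);
        }
    }
    out
}

/// Hypothetical reference: route d_output[seg, d] to the first argmax row
/// of each (segment, column); every other entry of d_input stays 0.
fn segment_max_backward_ref(
    input: &[f32],
    seg: &[i32],
    d_output: &[f32],
    num_segments: usize,
    d: usize,
) -> Vec<f32> {
    // usize::MAX marks "no row seen yet" for a (segment, column) slot.
    let mut arg = vec![usize::MAX; num_segments * d];
    for (n, &s) in seg.iter().enumerate() {
        let s = s as usize;
        for c in 0..d {
            let slot = s * d + c;
            let best = arg[slot];
            // Strict '>' keeps the first row on ties.
            if best == usize::MAX || input[n * d + c] > input[best * d + c] {
                arg[slot] = n;
            }
        }
    }
    let mut d_input = vec![0.0f32; seg.len() * d];
    for s in 0..num_segments {
        for c in 0..d {
            let n = arg[s * d + c];
            if n != usize::MAX {
                d_input[n * d + c] = d_output[s * d + c];
            }
        }
    }
    d_input
}
```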
baracuda_kernels_segment_mean_backward_f32_can_implement^⚠: Implementability check for segment_mean_backward_f32.
baracuda_kernels_segment_mean_backward_f32_run^⚠: d_input[n, d] = d_output[seg[n], d] / count[seg[n]]. Workspace: num_segments * sizeof(i32). f32.
baracuda_kernels_segment_mean_backward_f64_can_implement^⚠: Implementability check for segment_mean_backward_f64.
baracuda_kernels_segment_mean_backward_f64_run^⚠: segment_mean_backward — f64.
baracuda_kernels_segment_mean_backward_i64idx_f32_can_implement^⚠: Implementability check for segment_mean_backward_i64idx_f32.
baracuda_kernels_segment_mean_backward_i64idx_f32_run^⚠: segment_mean_backward, f32, i64idx variant.
baracuda_kernels_segment_mean_backward_i64idx_f64_can_implement^⚠: Implementability check for segment_mean_backward_i64idx_f64.
baracuda_kernels_segment_mean_backward_i64idx_f64_run^⚠: segment_mean_backward, f64, i64idx variant.
baracuda_kernels_segment_mean_f32_can_implement^⚠: Implementability check for segment_mean_f32.
baracuda_kernels_segment_mean_f32_run^⚠: out[s, d] = mean_{n : seg[n] == s} input[n, d] — sorted. f32.
baracuda_kernels_segment_mean_f64_can_implement^⚠: Implementability check for segment_mean_f64.
baracuda_kernels_segment_mean_f64_run^⚠: segment_mean — f64.
baracuda_kernels_segment_mean_i64idx_f32_can_implement^⚠: Implementability check for segment_mean_i64idx_f32.
baracuda_kernels_segment_mean_i64idx_f32_run^⚠: segment_mean, f32, i64idx variant.
baracuda_kernels_segment_mean_i64idx_f64_can_implement^⚠: Implementability check for segment_mean_i64idx_f64.
baracuda_kernels_segment_mean_i64idx_f64_run^⚠: segment_mean, f64, i64idx variant.
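The segment_mean_backward rule, d_input[n, d] = d_output[seg[n], d] / count[seg[n]], needs a per-segment row count, which is presumably what the num_segments * sizeof(i32) workspace holds. A host-side sketch under that assumption; `segment_mean_backward_ref` is a hypothetical name and the count vector stands in for the device workspace.

```rust
/// Hypothetical reference for the mean backward rule.
/// d_output is [num_segments, d] row-major, seg has one id per input row.
fn segment_mean_backward_ref(
    d_output: &[f32],
    seg: &[i32],
    num_segments: usize,
    d: usize,
) -> Vec<f32> {
    // One i32 counter per segment, mirroring the documented workspace size.
    let mut count = vec![0i32; num_segments];
    for &s in seg {
        count[s as usize] += 1;
    }
    let mut d_input = vec![0.0f32; seg.len() * d];
    for (n, &s) in seg.iter().enumerate() {
        let s = s as usize;
        let inv = 1.0 / count[s] as f32;
        for c in 0..d {
            d_input[n * d + c] = d_output[s * d + c] * inv;
        }
    }
    d_input
}
```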
baracuda_kernels_segment_min_backward_f32_can_implement^⚠: Implementability check for segment_min_backward_f32.
baracuda_kernels_segment_min_backward_f32_run^⚠: segment_min_backward — f32.
baracuda_kernels_segment_min_backward_f64_can_implement^⚠: Implementability check for segment_min_backward_f64.
baracuda_kernels_segment_min_backward_f64_run^⚠: segment_min_backward — f64.
baracuda_kernels_segment_min_f32_can_implement^⚠: Implementability check for segment_min_f32.
baracuda_kernels_segment_min_f32_run^⚠: out[s, d] = min_{n : seg[n] == s} input[n, d] — sorted. f32.
baracuda_kernels_segment_min_f64_can_implement^⚠: Implementability check for segment_min_f64.
baracuda_kernels_segment_min_f64_run^⚠: segment_min — f64.
baracuda_kernels_segment_min_i64idx_f32_can_implement^⚠: Implementability check for segment_min_i64idx_f32.
baracuda_kernels_segment_min_i64idx_f32_run^⚠: segment_min, f32, i64idx variant.
baracuda_kernels_segment_min_i64idx_f64_can_implement^⚠: Implementability check for segment_min_i64idx_f64.
baracuda_kernels_segment_min_i64idx_f64_run^⚠: segment_min, f64, i64idx variant.
baracuda_kernels_segment_prod_backward_f32_can_implement^⚠: Implementability check for segment_prod_backward_f32.
baracuda_kernels_segment_prod_backward_f32_run^⚠: segment_prod_backward — f32.
baracuda_kernels_segment_prod_backward_f64_can_implement^⚠: Implementability check for segment_prod_backward_f64.
baracuda_kernels_segment_prod_backward_f64_run^⚠: segment_prod_backward — f64.
baracuda_kernels_segment_prod_f32_can_implement^⚠: Implementability check for segment_prod_f32.
baracuda_kernels_segment_prod_f32_run^⚠: out[s, d] = prod_{n : seg[n] == s} input[n, d] — sorted. f32.
baracuda_kernels_segment_prod_f64_can_implement^⚠: Implementability check for segment_prod_f64.
baracuda_kernels_segment_prod_f64_run^⚠: segment_prod — f64.
baracuda_kernels_segment_prod_i64idx_f32_can_implement^⚠: Implementability check for segment_prod_i64idx_f32.
baracuda_kernels_segment_prod_i64idx_f32_run^⚠: segment_prod, f32, i64idx variant.
baracuda_kernels_segment_prod_i64idx_f64_can_implement^⚠: Implementability check for segment_prod_i64idx_f64.
baracuda_kernels_segment_prod_i64idx_f64_run^⚠: segment_prod, f64, i64idx variant.
baracuda_kernels_segment_sum_backward_f32_can_implement^⚠: Implementability check for segment_sum_backward_f32.
baracuda_kernels_segment_sum_backward_f32_run^⚠: d_input[n, d] = d_output[seg[n], d]. f32.
baracuda_kernels_segment_sum_backward_f64_can_implement^⚠: Implementability check for segment_sum_backward_f64.
baracuda_kernels_segment_sum_backward_f64_run^⚠: segment_sum_backward — f64.
baracuda_kernels_segment_sum_backward_i64idx_f32_can_implement^⚠: Implementability check for segment_sum_backward_i64idx_f32.
baracuda_kernels_segment_sum_backward_i64idx_f32_run^⚠: segment_sum_backward, f32, i64idx variant.
baracuda_kernels_segment_sum_backward_i64idx_f64_can_implement^⚠: Implementability check for segment_sum_backward_i64idx_f64.
baracuda_kernels_segment_sum_backward_i64idx_f64_run^⚠: segment_sum_backward, f64, i64idx variant.
baracuda_kernels_segment_sum_f32_can_implement^⚠: Implementability check for segment_sum_f32.
baracuda_kernels_segment_sum_f32_run^⚠: out[s, d] = Σ_{n : seg[n] == s} input[n, d] — sorted seg ids (monotonically non-decreasing). f32.
baracuda_kernels_segment_sum_f64_can_implement^⚠: Implementability check for segment_sum_f64.
baracuda_kernels_segment_sum_f64_run^⚠: segment_sum — f64.
baracuda_kernels_segment_sum_i64idx_f32_can_implement^⚠: Implementability check for segment_sum_i64idx_f32.
baracuda_kernels_segment_sum_i64idx_f32_run^⚠: segment_sum, f32, i64idx variant.
baracuda_kernels_segment_sum_i64idx_f64_can_implement^⚠: Implementability check for segment_sum_i64idx_f64.
baracuda_kernels_segment_sum_i64idx_f64_run^⚠: segment_sum, f64, i64idx variant.
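The segment_sum forward and backward rules listed above are a gather/scatter pair: the forward pass adds each input row into its segment's row, and the backward pass copies each segment's gradient row back to every member. A host-side sketch, with hypothetical function names and i32 segment ids assumed:

```rust
/// Hypothetical reference: out[s, d] = sum of input[n, d] over seg[n] == s.
/// input is [n, d] row-major; segments with no rows stay at 0.
fn segment_sum_ref(input: &[f32], seg: &[i32], num_segments: usize, d: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; num_segments * d];
    for (n, &s) in seg.iter().enumerate() {
        let s = s as usize;
        for c in 0..d {
            out[s * d + c] += input[n * d + c];
        }
    }
    out
}

/// Hypothetical reference: d_input[n, d] = d_output[seg[n], d].
fn segment_sum_backward_ref(d_output: &[f32], seg: &[i32], d: usize) -> Vec<f32> {
    let mut d_input = vec![0.0f32; seg.len() * d];
    for (n, &s) in seg.iter().enumerate() {
        let s = s as usize;
        for c in 0..d {
            d_input[n * d + c] = d_output[s * d + c];
        }
    }
    d_input
}
```

The reference does not rely on the ids being sorted; the sorted (monotonically non-decreasing) requirement is a property of the device kernels.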
baracuda_kernels_softmax_backward_bf16_can_implement^⚠: Implementability check for softmax_backward_bf16.
baracuda_kernels_softmax_backward_bf16_run^⚠: Softmax BW, bf16.
baracuda_kernels_softmax_backward_bf16_strided_can_implement^⚠: Implementability check for softmax_backward_bf16_strided.
baracuda_kernels_softmax_backward_bf16_strided_run^⚠: Softmax BW strided sibling, bf16.
baracuda_kernels_softmax_backward_f16_can_implement^⚠: Implementability check for softmax_backward_f16.
baracuda_kernels_softmax_backward_f16_run^⚠: Softmax BW, f16.
baracuda_kernels_softmax_backward_f16_strided_can_implement^⚠: Implementability check for softmax_backward_f16_strided.
baracuda_kernels_softmax_backward_f16_strided_run^⚠: Softmax BW strided sibling, f16.
baracuda_kernels_softmax_backward_f32_can_implement^⚠: Implementability check for softmax_backward_f32.
baracuda_kernels_softmax_backward_f32_run^⚠: Softmax BW, f32. dx[k] = y[k] * (dy[k] - Σ_j y[j] * dy[j]). Caller passes the saved forward output y.
baracuda_kernels_softmax_backward_f32_strided_can_implement^⚠: Implementability check for softmax_backward_f32_strided.
baracuda_kernels_softmax_backward_f32_strided_run^⚠: Softmax BW strided sibling, f32.
baracuda_kernels_softmax_backward_f64_can_implement^⚠: Implementability check for softmax_backward_f64.
baracuda_kernels_softmax_backward_f64_run^⚠: Softmax BW, f64.
baracuda_kernels_softmax_backward_f64_strided_can_implement^⚠: Implementability check for softmax_backward_f64_strided.
baracuda_kernels_softmax_backward_f64_strided_run^⚠: Softmax BW strided sibling, f64.
baracuda_kernels_softmax_bf16_can_implement^⚠: Implementability check for softmax_bf16.
baracuda_kernels_softmax_bf16_run^⚠: Softmax FW, bf16.
baracuda_kernels_softmax_bf16_strided_can_implement^⚠: Implementability check for softmax_bf16_strided.
baracuda_kernels_softmax_bf16_strided_run^⚠: Softmax FW strided sibling, bf16.
baracuda_kernels_softmax_f16_can_implement^⚠: Implementability check for softmax_f16.
baracuda_kernels_softmax_f16_run^⚠: Softmax FW, f16. f32 accumulator inside the kernel.
baracuda_kernels_softmax_f16_strided_can_implement^⚠: Implementability check for softmax_f16_strided.
baracuda_kernels_softmax_f16_strided_run^⚠: Softmax FW strided sibling, f16.
baracuda_kernels_softmax_f32_can_implement^⚠: Implementability check for softmax_f32.
baracuda_kernels_softmax_f32_run^⚠: Softmax FW, f32. y[k] = exp(x[k] - max) / Σ exp(x[j] - max) along softmax_axis. Numerically stable.
baracuda_kernels_softmax_f32_strided_can_implement^⚠: Implementability check for softmax_f32_strided.
baracuda_kernels_softmax_f32_strided_run^⚠: Softmax FW strided sibling, f32. Same contract as baracuda_kernels_softmax_f32_run; identical underlying launcher.
baracuda_kernels_softmax_f64_can_implement^⚠: Implementability check for softmax_f64.
baracuda_kernels_softmax_f64_run^⚠: Softmax FW, f64.
baracuda_kernels_softmax_f64_strided_can_implement^⚠: Implementability check for softmax_f64_strided.
baracuda_kernels_softmax_f64_strided_run^⚠: Softmax FW strided sibling, f64.
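The two softmax formulas quoted in the f32 entries above can be written out for a single row. This is a host-side sketch for checking results, not the crate's API; `softmax_ref` and `softmax_backward_ref` are hypothetical names and operate on one contiguous row rather than an arbitrary softmax_axis.

```rust
/// Hypothetical reference: y[k] = exp(x[k] - max) / sum_j exp(x[j] - max).
/// Subtracting the row max keeps exp() from overflowing.
fn softmax_ref(x: &[f32]) -> Vec<f32> {
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = x.iter().map(|&v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Hypothetical reference: dx[k] = y[k] * (dy[k] - sum_j y[j] * dy[j]),
/// where y is the saved forward output.
fn softmax_backward_ref(y: &[f32], dy: &[f32]) -> Vec<f32> {
    let mut dot = 0.0f32;
    for j in 0..y.len() {
        dot += y[j] * dy[j];
    }
    (0..y.len()).map(|k| y[k] * (dy[k] - dot)).collect()
}
```

With inputs such as [1000, 1000] the unshifted exp(x) would overflow in f32, while the shifted form returns [0.5, 0.5].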
baracuda_kernels_solve_f32_run^⚠: Linear-system solve A · X = B via fused getrf + getrs. a_inout is overwritten with packed LU factors; b_inout is overwritten with the solution X. pivots_out is [n] (1-based per LAPACK convention).
baracuda_kernels_solve_f32_workspace_size^⚠: Solve workspace size — uses the getrf query (cuSOLVER’s getrs is workspace-free).
baracuda_kernels_solve_f64_run^⚠: Linear-system solve A · X = B via fused getrf + getrs. a_inout is overwritten with packed LU factors; b_inout is overwritten with the solution X. pivots_out is [n] (1-based per LAPACK convention).
baracuda_kernels_solve_f64_workspace_size^⚠: Solve workspace size — uses the getrf query (cuSOLVER’s getrs is workspace-free).
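The solve entries above note that pivots_out follows the 1-based LAPACK convention. In that convention, entry i (1-based) records that row i was swapped with row pivots[i] during factorization, and the swaps are applied in order. A sketch of replaying them on a host-side vector follows; `apply_pivots` is a hypothetical helper, and the i32 element type is an assumption based on the usual cuSOLVER `int` pivot array.

```rust
/// Hypothetical helper: replay LAPACK-style row interchanges on `b` in place.
/// `pivots` holds 1-based row indices, one per factorization step.
fn apply_pivots(b: &mut [f64], pivots: &[i32]) {
    for (i, &p) in pivots.iter().enumerate() {
        // Convert the 1-based pivot to a 0-based row index before swapping.
        let j = (p - 1) as usize;
        b.swap(i, j);
    }
}
```

Forgetting the `- 1` is the usual mistake when consuming these pivots from Rust, since an unadjusted index reads one row too far.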
baracuda_kernels_sort_backward_f32_can_implement^⚠: Implementability check for sort_backward_f32.
baracuda_kernels_sort_backward_f32_run^⚠: Sort BW, f32. dx[indices[i]] = dy[i]; launcher zeros dx first.
baracuda_kernels_sort_backward_f64_can_implement^⚠: Implementability check for sort_backward_f64.
baracuda_kernels_sort_backward_f64_run^⚠: Sort BW, f64.
baracuda_kernels_sort_f32_can_implement^⚠: Implementability check for sort_f32.
baracuda_kernels_sort_f32_run^⚠: Block-bitonic sort, f32. Emits sorted values + sorted indices (saved-indices contract for BW). descending == 0 = ascending.
baracuda_kernels_sort_f64_can_implement^⚠: Implementability check for sort_f64.
baracuda_kernels_sort_f64_run^⚠: Block-bitonic sort, f64.
baracuda_kernels_sort_i32_can_implement^⚠: Implementability check for sort_i32.
baracuda_kernels_sort_i32_run^⚠: Block-bitonic sort, i32.
baracuda_kernels_sort_i64_can_implement^⚠: Implementability check for sort_i64.
baracuda_kernels_sort_i64_run^⚠: Block-bitonic sort, i64.
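The sort entries above describe a saved-indices contract: the forward pass emits both the sorted values and the source index of each, and the backward pass scatters dy back through those indices. A host-side sketch of that contract follows. The function names are hypothetical, and it uses the standard library's stable sort rather than a bitonic network, so the order of equal elements may differ from the kernel's.

```rust
/// Hypothetical reference: returns (sorted values, source index of each value).
/// descending == 0 sorts ascending, matching the flag described above.
fn sort_with_indices(x: &[f32], descending: i32) -> (Vec<f32>, Vec<usize>) {
    let mut idx: Vec<usize> = (0..x.len()).collect();
    idx.sort_by(|&a, &b| {
        let ord = x[a].total_cmp(&x[b]);
        if descending != 0 { ord.reverse() } else { ord }
    });
    let vals = idx.iter().map(|&i| x[i]).collect();
    (vals, idx)
}

/// Hypothetical reference: dx[indices[i]] = dy[i], with dx zeroed first.
fn sort_backward_ref(dy: &[f32], indices: &[usize]) -> Vec<f32> {
    let mut dx = vec![0.0f32; dy.len()];
    for (i, &src) in indices.iter().enumerate() {
        dx[src] = dy[i];
    }
    dx
}
```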
baracuda_kernels_sparsemax_backward_bf16_can_implement^⚠: Implementability check for sparsemax_backward_bf16.
baracuda_kernels_sparsemax_backward_bf16_run^⚠: Sparsemax BW, bf16.
baracuda_kernels_sparsemax_backward_f16_can_implement^⚠: Implementability check for sparsemax_backward_f16.
baracuda_kernels_sparsemax_backward_f16_run^⚠: Sparsemax BW, f16.
baracuda_kernels_sparsemax_backward_f32_can_implement^⚠: Implementability check for sparsemax_backward_f32.
baracuda_kernels_sparsemax_backward_f32_run^⚠: Sparsemax BW, f32. dx[i] = dy[i] - sum_dy_active / n_active for active positions (y > 0), 0 elsewhere.
baracuda_kernels_sparsemax_backward_f64_can_implement^⚠: Implementability check for sparsemax_backward_f64.
baracuda_kernels_sparsemax_backward_f64_run^⚠: Sparsemax BW, f64.
baracuda_kernels_sparsemax_bf16_can_implement^⚠: Implementability check for sparsemax_bf16.
baracuda_kernels_sparsemax_bf16_run^⚠: Sparsemax FW, bf16.
baracuda_kernels_sparsemax_f16_can_implement^⚠: Implementability check for sparsemax_f16.
baracuda_kernels_sparsemax_f16_run^⚠: Sparsemax FW, f16.
baracuda_kernels_sparsemax_f32_can_implement^⚠: Implementability check for sparsemax_f32.
baracuda_kernels_sparsemax_f32_run^⚠: Sparsemax FW, f32. y = ProjSimplex(x) via threshold τ found after sorting the row descending. Row extent limited to 64.
baracuda_kernels_sparsemax_f64_can_implement^⚠: Implementability check for sparsemax_f64.
baracuda_kernels_sparsemax_f64_run^⚠: Sparsemax FW, f64.
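The sparsemax entries above describe a simplex projection via a threshold τ found on the descending-sorted row, and a backward rule that subtracts the mean gradient over active positions. A single-row host-side sketch using the standard sort-based threshold search; the function names are hypothetical and the 64-element row limit of the kernel is not enforced here.

```rust
/// Hypothetical reference: y = max(x - tau, 0), with tau chosen so that y sums to 1.
fn sparsemax_ref(x: &[f32]) -> Vec<f32> {
    let mut z = x.to_vec();
    z.sort_by(|a, b| b.total_cmp(a)); // descending
    let mut cumsum = 0.0f32;
    let mut tau = 0.0f32;
    for (j, &zj) in z.iter().enumerate() {
        cumsum += zj;
        let k = (j + 1) as f32;
        // Support condition: the k largest entries all stay positive.
        if 1.0 + k * zj > cumsum {
            tau = (cumsum - 1.0) / k;
        }
    }
    x.iter().map(|&v| (v - tau).max(0.0)).collect()
}

/// Hypothetical reference: dx[i] = dy[i] - mean(dy over active) where y[i] > 0, else 0.
fn sparsemax_backward_ref(y: &[f32], dy: &[f32]) -> Vec<f32> {
    let mut sum = 0.0f32;
    let mut n_active = 0usize;
    for i in 0..y.len() {
        if y[i] > 0.0 {
            sum += dy[i];
            n_active += 1;
        }
    }
    if n_active == 0 {
        return vec![0.0; y.len()];
    }
    let mean = sum / n_active as f32;
    (0..y.len())
        .map(|i| if y[i] > 0.0 { dy[i] - mean } else { 0.0 })
        .collect()
}
```

Unlike softmax, the output can contain exact zeros, which is what makes the "active positions" distinction in the backward rule meaningful.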
baracuda_kernels_svd_batched_f32_run^⚠: Batched Jacobi-SVD on square input. Returns V (not V^T). When jobz == 0, u_out / v_out may be null.
baracuda_kernels_svd_batched_f32_workspace_size^⚠: Batched Jacobi-SVD workspace size in bytes.
baracuda_kernels_svd_batched_f64_run^⚠: Batched Jacobi-SVD on square input. Returns V (not V^T). When jobz == 0, u_out / v_out may be null.
baracuda_kernels_svd_batched_f64_workspace_size^⚠: Batched Jacobi-SVD workspace size in bytes.
baracuda_kernels_svd_f32_run^⚠: SVD A = U · diag(S) · V^T. Requires m >= n. a_inout is overwritten by cuSOLVER as scratch.
baracuda_kernels_svd_f32_workspace_size^⚠: SVD workspace size in bytes for gesvd.
baracuda_kernels_svd_f64_run^⚠: SVD A = U · diag(S) · V^T. Requires m >= n. a_inout is overwritten by cuSOLVER as scratch.
baracuda_kernels_svd_f64_workspace_size^⚠: SVD workspace size in bytes for gesvd.
baracuda_kernels_svda_batched_f32_run^⚠: Approximate (Jacobi-bidiagonal) batched SVD on rectangular input. Returns V (not V^T). The h_r_nrm_f_out buffer is host-resident and receives per-slot residual Frobenius norms (cuSOLVER signature). Null nominally discards the norms, but cuSOLVER may dereference the pointer even then, so callers should pass a real buffer of batch_size f64s.
baracuda_kernels_svda_batched_f32_workspace_size^⚠: Approximate batched SVD workspace size in bytes.
baracuda_kernels_svda_batched_f64_run^⚠: Approximate (Jacobi-bidiagonal) batched SVD on rectangular input. Returns V (not V^T). The h_r_nrm_f_out buffer is host-resident and receives per-slot residual Frobenius norms (cuSOLVER signature). Null nominally discards the norms, but cuSOLVER may dereference the pointer even then, so callers should pass a real buffer of batch_size f64s.
baracuda_kernels_svda_batched_f64_workspace_size^⚠: Approximate batched SVD workspace size in bytes.
baracuda_kernels_ternary_addcdiv_backward_bf16_can_implement^⚠: Pre-launch check for ternary_addcdiv_backward_bf16.
baracuda_kernels_ternary_addcdiv_backward_bf16_run^⚠: Addcdiv backward, bf16.
baracuda_kernels_ternary_addcdiv_backward_f16_can_implement^⚠: Pre-launch check for ternary_addcdiv_backward_f16.
baracuda_kernels_ternary_addcdiv_backward_f16_run^⚠: Addcdiv backward, f16.
baracuda_kernels_ternary_addcdiv_backward_f32_can_implement^⚠: Pre-launch check for ternary_addcdiv_backward_f32.
baracuda_kernels_ternary_addcdiv_backward_f32_run^⚠: Addcdiv backward, f32. Reads desc.scale. Writes da = dy, db = dy*scale/c, dc = -dy*scale*b/c².
baracuda_kernels_ternary_addcdiv_backward_f64_can_implement^⚠: Pre-launch check for ternary_addcdiv_backward_f64.
baracuda_kernels_ternary_addcdiv_backward_f64_run^⚠: Addcdiv backward, f64.
baracuda_kernels_ternary_addcdiv_bf16_can_implement^⚠: Pre-launch check for addcdiv_bf16.
baracuda_kernels_ternary_addcdiv_bf16_run^⚠: addcdiv, bf16, contig.
baracuda_kernels_ternary_addcdiv_bf16_strided_can_implement^⚠: Pre-launch check for addcdiv_bf16_strided.
baracuda_kernels_ternary_addcdiv_bf16_strided_run^⚠: addcdiv, bf16, strided.
baracuda_kernels_ternary_addcdiv_f16_can_implement^⚠: Pre-launch check for addcdiv_f16.
baracuda_kernels_ternary_addcdiv_f16_run^⚠: addcdiv, f16, contig.
baracuda_kernels_ternary_addcdiv_f16_strided_can_implement^⚠: Pre-launch check for addcdiv_f16_strided.
baracuda_kernels_ternary_addcdiv_f16_strided_run^⚠: addcdiv, f16, strided.
baracuda_kernels_ternary_addcdiv_f32_can_implement^⚠: Pre-launch check for addcdiv_f32.
baracuda_kernels_ternary_addcdiv_f32_run^⚠: y = a + scale * (b / c), f32, contig.
baracuda_kernels_ternary_addcdiv_f32_strided_can_implement^⚠: Pre-launch check for addcdiv_f32_strided.
baracuda_kernels_ternary_addcdiv_f32_strided_run^⚠: addcdiv, f32, strided.
baracuda_kernels_ternary_addcdiv_f64_can_implement^⚠: Pre-launch check for addcdiv_f64.
baracuda_kernels_ternary_addcdiv_f64_run^⚠: addcdiv, f64, contig.
baracuda_kernels_ternary_addcdiv_f64_strided_can_implement^⚠: Pre-launch check for addcdiv_f64_strided.
baracuda_kernels_ternary_addcdiv_f64_strided_run^⚠: addcdiv, f64, strided.
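The addcdiv entries above give the forward formula y = a + scale * (b / c) and the three gradients da = dy, db = dy*scale/c, dc = -dy*scale*b/c². A scalar host-side sketch of both, useful for spot-checking a few elements; the function names are hypothetical and `scale` is passed directly rather than through the descriptor the kernels read it from.

```rust
/// Hypothetical scalar reference: y = a + scale * (b / c).
fn addcdiv_ref(a: f32, b: f32, c: f32, scale: f32) -> f32 {
    a + scale * (b / c)
}

/// Hypothetical scalar reference: returns (da, db, dc) for upstream gradient dy.
fn addcdiv_backward_ref(dy: f32, b: f32, c: f32, scale: f32) -> (f32, f32, f32) {
    let da = dy;
    let db = dy * scale / c;
    let dc = -dy * scale * b / (c * c);
    (da, db, dc)
}
```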
baracuda_kernels_ternary_addcmul_backward_bf16_can_implement^⚠: Pre-launch check for ternary_addcmul_backward_bf16.
baracuda_kernels_ternary_addcmul_backward_bf16_run^⚠: Addcmul backward, bf16.
baracuda_kernels_ternary_addcmul_backward_f16_can_implement^⚠: Pre-launch check for ternary_addcmul_backward_f16.
baracuda_kernels_ternary_addcmul_backward_f16_run^⚠: Addcmul backward, f16.
baracuda_kernels_ternary_addcmul_backward_f32_can_implement^⚠: Pre-launch check for ternary_addcmul_backward_f32.
baracuda_kernels_ternary_addcmul_backward_f32_run^⚠: Addcmul backward, f32. Reads desc.scale. Writes da = dy, db = dy*scale*c, dc = dy*scale*b.
baracuda_kernels_ternary_addcmul_backward_f64_can_implement^⚠: Pre-launch check for ternary_addcmul_backward_f64.
baracuda_kernels_ternary_addcmul_backward_f64_run^⚠: Addcmul backward, f64.
baracuda_kernels_ternary_addcmul_bf16_can_implement^⚠: Pre-launch check for addcmul_bf16.
baracuda_kernels_ternary_addcmul_bf16_run^⚠: addcmul, bf16, contig.
baracuda_kernels_ternary_addcmul_bf16_strided_can_implement^⚠: Pre-launch check for addcmul_bf16_strided.
baracuda_kernels_ternary_addcmul_bf16_strided_run^⚠: addcmul, bf16, strided.
baracuda_kernels_ternary_addcmul_f16_can_implement^⚠: Pre-launch check for addcmul_f16.
baracuda_kernels_ternary_addcmul_f16_run^⚠: addcmul, f16, contig.
baracuda_kernels_ternary_addcmul_f16_strided_can_implement^⚠: Pre-launch check for addcmul_f16_strided.
baracuda_kernels_ternary_addcmul_f16_strided_run^⚠: addcmul, f16, strided.
baracuda_kernels_ternary_addcmul_f32_can_implement^⚠: Pre-launch implementability check for addcmul_f32.
baracuda_kernels_ternary_addcmul_f32_run^⚠: y = a + scale * b * c, f32, contig fast path.
baracuda_kernels_ternary_addcmul_f32_strided_can_implement^⚠: Pre-launch check for addcmul_f32_strided.
baracuda_kernels_ternary_addcmul_f32_strided_run^⚠: y = a + scale * b * c, f32, strided / broadcast path.
baracuda_kernels_ternary_addcmul_f64_can_implement^⚠: Pre-launch check for addcmul_f64.
baracuda_kernels_ternary_addcmul_f64_run^⚠: addcmul, f64, contig.
baracuda_kernels_ternary_addcmul_f64_strided_can_implement^⚠: Pre-launch check for addcmul_f64_strided.
baracuda_kernels_ternary_addcmul_f64_strided_run^⚠: addcmul, f64, strided.
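The addcmul entries above give y = a + scale * b * c with gradients da = dy, db = dy*scale*c, dc = dy*scale*b. A scalar host-side sketch of both follows; the function names are hypothetical, and `scale` is a plain argument here instead of the descriptor field the kernels read.

```rust
/// Hypothetical scalar reference: y = a + scale * b * c.
fn addcmul_ref(a: f32, b: f32, c: f32, scale: f32) -> f32 {
    a + scale * b * c
}

/// Hypothetical scalar reference: returns (da, db, dc) for upstream gradient dy.
fn addcmul_backward_ref(dy: f32, b: f32, c: f32, scale: f32) -> (f32, f32, f32) {
    (dy, dy * scale * c, dy * scale * b)
}
```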
baracuda_kernels_ternary_clamp_backward_bf16_can_implement^⚠: Pre-launch check for ternary_clamp_backward_bf16.
baracuda_kernels_ternary_clamp_backward_bf16_run^⚠: Clamp backward, bf16.
baracuda_kernels_ternary_clamp_backward_f16_can_implement^⚠: Pre-launch check for ternary_clamp_backward_f16.
baracuda_kernels_ternary_clamp_backward_f16_run^⚠: Clamp backward, f16.
baracuda_kernels_ternary_clamp_backward_f32_can_implement^⚠: Pre-launch check for ternary_clamp_backward_f32.
baracuda_kernels_ternary_clamp_backward_f32_run^⚠: Clamp backward, f32. Writes mask × dy per axis (a/b/c).
baracuda_kernels_ternary_clamp_backward_f64_can_implement^⚠: Pre-launch check for ternary_clamp_backward_f64.
baracuda_kernels_ternary_clamp_backward_f64_run^⚠: Clamp backward, f64.
baracuda_kernels_ternary_clamp_bf16_can_implement^⚠: Pre-launch implementability check for ternary_clamp_bf16.
baracuda_kernels_ternary_clamp_bf16_run^⚠: Ternary elementwise clamp, bf16, contig fast path.
baracuda_kernels_ternary_clamp_bf16_strided_can_implement^⚠: Pre-launch implementability check for ternary_clamp_bf16_strided.
baracuda_kernels_ternary_clamp_bf16_strided_run^⚠: Ternary elementwise clamp, bf16, strided / broadcast path.
baracuda_kernels_ternary_clamp_f16_can_implement^⚠: Pre-launch implementability check for ternary_clamp_f16.
baracuda_kernels_ternary_clamp_f16_run^⚠: Ternary elementwise clamp, f16, contig fast path.
baracuda_kernels_ternary_clamp_f16_strided_can_implement^⚠: Pre-launch implementability check for ternary_clamp_f16_strided.
baracuda_kernels_ternary_clamp_f16_strided_run^⚠: Ternary elementwise clamp, f16, strided / broadcast path.
baracuda_kernels_ternary_clamp_f32_can_implement^⚠: Pre-launch implementability check for ternary_clamp_f32.
baracuda_kernels_ternary_clamp_f32_run^⚠: Ternary elementwise clamp, f32, contig fast path.
baracuda_kernels_ternary_clamp_f32_strided_can_implement^⚠: Pre-launch implementability check for ternary_clamp_f32_strided.
baracuda_kernels_ternary_clamp_f32_strided_run^⚠: Ternary elementwise clamp, f32, strided / broadcast path. This is the ternary-strided trailblazer — its safety contract (including aliasing) carries over to every ternary strided launcher across all dtypes.
baracuda_kernels_ternary_clamp_f64_can_implement^⚠: Pre-launch implementability check for ternary_clamp_f64.
baracuda_kernels_ternary_clamp_f64_run^⚠: Ternary elementwise clamp, f64, contig fast path.
baracuda_kernels_ternary_clamp_f64_strided_can_implement^⚠: Pre-launch implementability check for ternary_clamp_f64_strided.
baracuda_kernels_ternary_clamp_f64_strided_run^⚠: Ternary elementwise clamp, f64, strided / broadcast path.
baracuda_kernels_ternary_fma_backward_bf16_can_implement^⚠: Pre-launch check for ternary_fma_backward_bf16.
baracuda_kernels_ternary_fma_backward_bf16_run^⚠: Fma backward, bf16.
baracuda_kernels_ternary_fma_backward_f16_can_implement^⚠: Pre-launch check for ternary_fma_backward_f16.
baracuda_kernels_ternary_fma_backward_f16_run^⚠: Fma backward, f16.
baracuda_kernels_ternary_fma_backward_f32_can_implement^⚠: Pre-launch check for ternary_fma_backward_f32.
baracuda_kernels_ternary_fma_backward_f32_run^⚠: Fma backward, f32. Writes da = dy*b, db = dy*a, dc = dy.
baracuda_kernels_ternary_fma_backward_f64_can_implement^⚠: Pre-launch check for ternary_fma_backward_f64.
baracuda_kernels_ternary_fma_backward_f64_run^⚠: Fma backward, f64.
baracuda_kernels_ternary_fma_bf16_can_implement^⚠: Pre-launch implementability check for ternary_fma_bf16.
baracuda_kernels_ternary_fma_bf16_run^⚠: Ternary elementwise fma, bf16, contig fast path.
baracuda_kernels_ternary_fma_bf16_strided_can_implement^⚠: Pre-launch implementability check for ternary_fma_bf16_strided.
baracuda_kernels_ternary_fma_bf16_strided_run^⚠: Ternary elementwise fma, bf16, strided / broadcast path.
baracuda_kernels_ternary_fma_f16_can_implement^⚠: Pre-launch implementability check for ternary_fma_f16.
baracuda_kernels_ternary_fma_f16_run^⚠: Ternary elementwise fma, f16, contig fast path.
baracuda_kernels_ternary_fma_f16_strided_can_implement^⚠: Pre-launch implementability check for ternary_fma_f16_strided.
baracuda_kernels_ternary_fma_f16_strided_run^⚠: Ternary elementwise fma, f16, strided / broadcast path.
baracuda_kernels_ternary_fma_f32_can_implement^⚠: Pre-launch implementability check for ternary_fma_f32.
baracuda_kernels_ternary_fma_f32_run^⚠: Ternary elementwise fma, f32, contig fast path.
baracuda_kernels_ternary_fma_f32_strided_can_implement^⚠: Pre-launch implementability check for ternary_fma_f32_strided.
baracuda_kernels_ternary_fma_f32_strided_run^⚠: Ternary elementwise fma, f32, strided / broadcast path.
baracuda_kernels_ternary_fma_f64_can_implement^⚠: Pre-launch implementability check for ternary_fma_f64.
baracuda_kernels_ternary_fma_f64_run^⚠: Ternary elementwise fma, f64, contig fast path.
baracuda_kernels_ternary_fma_f64_strided_can_implement^⚠: Pre-launch implementability check for ternary_fma_f64_strided.
baracuda_kernels_ternary_fma_f64_strided_run^⚠: Ternary elementwise fma, f64, strided / broadcast path.
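The fma backward entry above lists da = dy*b, db = dy*a, dc = dy. Those gradients are consistent with a forward of y = a * b + c, which is assumed in the sketch below since the forward entries do not spell the formula out. The function names are hypothetical.

```rust
/// Hypothetical scalar reference, assuming the forward is y = a * b + c.
fn fma_ref(a: f32, b: f32, c: f32) -> f32 {
    a.mul_add(b, c)
}

/// Hypothetical scalar reference: returns (da, db, dc) as listed above.
fn fma_backward_ref(dy: f32, a: f32, b: f32) -> (f32, f32, f32) {
    (dy * b, dy * a, dy)
}
```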
baracuda_kernels_topk_backward_f32_can_implement^⚠: Implementability check for topk_backward_f32.
baracuda_kernels_topk_backward_f32_run^⚠: Top-k BW, f32. Scatter k-wide dy into row_len-wide dx (zero-init) via saved indices.
baracuda_kernels_topk_backward_f64_can_implement^⚠: Implementability check for topk_backward_f64.
baracuda_kernels_topk_backward_f64_run^⚠: Top-k BW, f64.
baracuda_kernels_topk_f32_can_implement^⚠: Implementability check for topk_f32.
baracuda_kernels_topk_f32_run^⚠: Block-bitonic top-k, f32. Caps k ≤ 64 and row_len ≤ 1024. largest == 1 = top-k by value; largest == 0 = bottom-k.
baracuda_kernels_topk_f64_can_implement^⚠: Implementability check for topk_f64.
baracuda_kernels_topk_f64_run^⚠: Block-bitonic top-k, f64.
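
A host-side reference for one row of top-k can help when validating the kernels above. The sketch below is an assumption-laden illustration: topk_row_ref is a hypothetical name, the bool flag stands in for the integer largest argument, and the sorted output order and tie-breaking shown here are choices of this sketch, not documented behavior of the kernel.

```rust
/// Host-side reference for top-k over a single row (hypothetical helper,
/// not part of baracuda-kernels-sys). Returns (values, indices).
fn topk_row_ref(row: &[f32], k: usize, largest: bool) -> (Vec<f32>, Vec<usize>) {
    // Mirror the limits stated in the listing: k <= 64 and row_len <= 1024.
    assert!(k <= 64 && row.len() <= 1024 && k <= row.len());
    let mut order: Vec<usize> = (0..row.len()).collect();
    if largest {
        // Descending by value; the stable sort keeps earlier indices first on ties.
        order.sort_by(|&a, &b| row[b].total_cmp(&row[a]));
    } else {
        // Ascending by value for the bottom-k case.
        order.sort_by(|&a, &b| row[a].total_cmp(&row[b]));
    }
    order.truncate(k);
    let values: Vec<f32> = order.iter().map(|&i| row[i]).collect();
    (values, order)
}
```

The returned indices are what a matching backward pass would consume as its saved indices.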
baracuda_kernels_trace_bf16_can_implement^⚠: Implementability check for trace_bf16.
baracuda_kernels_trace_bf16_run^⚠: Trace, bf16 (accumulates in f32).
baracuda_kernels_trace_f16_can_implement^⚠: Implementability check for trace_f16.
baracuda_kernels_trace_f16_run^⚠: Trace, f16 (accumulates in f32).
baracuda_kernels_trace_f32_can_implement^⚠: Implementability check for trace_f32.
baracuda_kernels_trace_f32_run^⚠: Trace of a 2-D square matrix, f32. y[0] = Σ x[i * stride_row + i * stride_col] for i in 0..rows. Output is a single scalar.
baracuda_kernels_trace_f64_can_implement^⚠: Implementability check for trace_f64.
baracuda_kernels_trace_f64_run^⚠: Trace, f64.
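
The strided trace formula given for trace_f32 can be written as a short host-side reference. This is a sketch only: trace_ref is a hypothetical name, and the use of usize element strides is an assumption rather than the crate's actual parameter types.

```rust
/// Host-side reference for the strided trace (hypothetical helper, not part
/// of baracuda-kernels-sys): sums x[i * stride_row + i * stride_col] for
/// i in 0..rows, with strides counted in elements.
fn trace_ref(x: &[f32], rows: usize, stride_row: usize, stride_col: usize) -> f32 {
    (0..rows)
        .map(|i| x[i * stride_row + i * stride_col])
        .sum()
}
```

With a row-major 3 × 3 matrix holding 1 through 9, strides (3, 1) visit elements 0, 4, and 8, giving 1 + 5 + 9 = 15. The same buffer read with column-major strides (1, 3) visits the same diagonal.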
baracuda_kernels_tril_bf16_can_implement^⚠: Implementability check for tril_bf16.
baracuda_kernels_tril_bf16_run^⚠: Tril, bf16.
baracuda_kernels_tril_bf16_strided_can_implement^⚠: Implementability check for tril_bf16_strided.
baracuda_kernels_tril_bf16_strided_run^⚠: Tril strided, bf16.
baracuda_kernels_tril_bool_can_implement^⚠: Implementability check for tril_bool.
baracuda_kernels_tril_bool_run^⚠: Tril, Bool (storage = u8).
baracuda_kernels_tril_bool_strided_can_implement^⚠: Implementability check for tril_bool_strided.
baracuda_kernels_tril_bool_strided_run^⚠: Tril strided, Bool (storage = u8).
baracuda_kernels_tril_f16_can_implement^⚠: Implementability check for tril_f16.
baracuda_kernels_tril_f16_run^⚠: Tril, f16.
baracuda_kernels_tril_f16_strided_can_implement^⚠: Implementability check for tril_f16_strided.
baracuda_kernels_tril_f16_strided_run^⚠: Tril strided, f16.
baracuda_kernels_tril_f32_can_implement^⚠: Implementability check for tril_f32.
baracuda_kernels_tril_f32_run^⚠: Tril, f32.
baracuda_kernels_tril_f32_strided_can_implement^⚠: Implementability check for tril_f32_strided.
baracuda_kernels_tril_f32_strided_run^⚠: Tril strided, f32.
baracuda_kernels_tril_f64_can_implement^⚠: Implementability check for tril_f64.
baracuda_kernels_tril_f64_run^⚠: Tril, f64.
baracuda_kernels_tril_f64_strided_can_implement^⚠: Implementability check for tril_f64_strided.
baracuda_kernels_tril_f64_strided_run^⚠: Tril strided, f64.
baracuda_kernels_tril_i32_can_implement^⚠: Implementability check for tril_i32.
baracuda_kernels_tril_i32_run^⚠: Tril, i32.
baracuda_kernels_tril_i32_strided_can_implement^⚠: Implementability check for tril_i32_strided.
baracuda_kernels_tril_i32_strided_run^⚠: Tril strided, i32.
baracuda_kernels_tril_i64_can_implement^⚠: Implementability check for tril_i64.
baracuda_kernels_tril_i64_run^⚠: Tril, i64.
baracuda_kernels_tril_i64_strided_can_implement^⚠: Implementability check for tril_i64_strided.
baracuda_kernels_tril_i64_strided_run^⚠: Tril strided, i64.
baracuda_kernels_triu_bf16_can_implement^⚠: Implementability check for triu_bf16.
baracuda_kernels_triu_bf16_run^⚠: Triu, bf16.
baracuda_kernels_triu_bf16_strided_can_implement^⚠: Implementability check for triu_bf16_strided.
baracuda_kernels_triu_bf16_strided_run^⚠: Triu strided, bf16.
baracuda_kernels_triu_bool_can_implement^⚠: Implementability check for triu_bool.
baracuda_kernels_triu_bool_run^⚠: Triu, Bool (storage = u8).
baracuda_kernels_triu_bool_strided_can_implement^⚠: Implementability check for triu_bool_strided.
baracuda_kernels_triu_bool_strided_run^⚠: Triu strided, Bool (storage = u8).
baracuda_kernels_triu_f16_can_implement^⚠: Implementability check for triu_f16.
baracuda_kernels_triu_f16_run^⚠: Triu, f16.
baracuda_kernels_triu_f16_strided_can_implement^⚠: Implementability check for triu_f16_strided.
baracuda_kernels_triu_f16_strided_run^⚠: Triu strided, f16.
baracuda_kernels_triu_f32_can_implement^⚠: Implementability check for triu_f32.
baracuda_kernels_triu_f32_run^⚠: Triu, f32. This is the triu trailblazer — its aliasing contract carries over to every other triu_<dt>_run, triu_<dt>_strided_run, and the sibling tril_* family.
baracuda_kernels_triu_f32_strided_can_implement^⚠: Implementability check for triu_f32_strided.
baracuda_kernels_triu_f32_strided_run^⚠: Triu strided, f32.
baracuda_kernels_triu_f64_can_implement^⚠: Implementability check for triu_f64.
baracuda_kernels_triu_f64_run^⚠: Triu, f64.
baracuda_kernels_triu_f64_strided_can_implement^⚠: Implementability check for triu_f64_strided.
baracuda_kernels_triu_f64_strided_run^⚠: Triu strided, f64.
baracuda_kernels_triu_i32_can_implement^⚠: Implementability check for triu_i32.
baracuda_kernels_triu_i32_run^⚠: Triu, i32.
baracuda_kernels_triu_i32_strided_can_implement^⚠: Implementability check for triu_i32_strided.
baracuda_kernels_triu_i32_strided_run^⚠: Triu strided, i32.
baracuda_kernels_triu_i64_can_implement^⚠: Implementability check for triu_i64.
baracuda_kernels_triu_i64_run^⚠: Triu, i64.
baracuda_kernels_triu_i64_strided_can_implement^⚠: Implementability check for triu_i64_strided.
baracuda_kernels_triu_i64_strided_run^⚠: Triu strided, i64.
baracuda_kernels_unary_abs_bf16_can_implement^⚠: Pre-launch implementability check for unary_abs_bf16.
baracuda_kernels_unary_abs_bf16_run^⚠: Unary elementwise abs, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_abs_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_abs_bf16_strided.
baracuda_kernels_unary_abs_bf16_strided_run^⚠: Unary elementwise abs, bf16 dtype, strided path.
baracuda_kernels_unary_abs_f16_can_implement^⚠: Pre-launch implementability check for unary_abs_f16.
baracuda_kernels_unary_abs_f16_run^⚠: Unary elementwise abs, f16 dtype, contiguous fast path.
baracuda_kernels_unary_abs_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_abs_f16_strided.
baracuda_kernels_unary_abs_f16_strided_run^⚠: Unary elementwise abs, f16 dtype, strided path.
baracuda_kernels_unary_abs_f32_can_implement^⚠: Pre-launch implementability check for unary_abs_f32.
baracuda_kernels_unary_abs_f32_run^⚠: Unary elementwise abs, f32 dtype, contiguous fast path.
baracuda_kernels_unary_abs_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_abs_f32_strided.
baracuda_kernels_unary_abs_f32_strided_run^⚠: Unary elementwise abs, f32 dtype, strided path.
baracuda_kernels_unary_abs_f64_can_implement^⚠: Pre-launch implementability check for unary_abs_f64.
baracuda_kernels_unary_abs_f64_run^⚠: Unary elementwise abs, f64 dtype, contiguous fast path.
baracuda_kernels_unary_abs_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_abs_f64_strided.
baracuda_kernels_unary_abs_f64_strided_run^⚠: Unary elementwise abs, f64 dtype, strided path.
baracuda_kernels_unary_acos_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_acos_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_acos_backward_bf16_run^⚠: Acos backward, bf16.
baracuda_kernels_unary_acos_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_acos_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_acos_backward_f16_run^⚠: Acos backward, f16.
baracuda_kernels_unary_acos_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_acos_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_acos_backward_f32_run^⚠: Acos backward, f32. dx = -dy / sqrt(1 - x²). Saved-x. Domain: |x| < 1.
baracuda_kernels_unary_acos_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_acos_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_acos_backward_f64_run^⚠: Acos backward, f64.
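
The acos backward formula, dx = -dy / sqrt(1 - x²), can be mirrored on the host to spot-check kernel output. The helper below is a hypothetical reference, not this crate's API, and it assumes contiguous x and dy slices of equal length.

```rust
/// Host-side reference for acos backward (hypothetical helper, not part of
/// baracuda-kernels-sys): dx = -dy / sqrt(1 - x^2), using the saved input x.
fn acos_backward_ref(x: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(x.len(), dy.len());
    x.iter()
        .zip(dy)
        // For |x| >= 1 this reference yields infinity or NaN under IEEE 754
        // arithmetic, which is why the listing restricts the domain to |x| < 1.
        .map(|(&x, &dy)| -dy / (1.0 - x * x).sqrt())
        .collect()
}
```

The asin backward entry differs only in sign, so the same reference with the leading minus removed covers it.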
baracuda_kernels_unary_acos_bf16_can_implement^⚠: Pre-launch implementability check for unary_acos_bf16.
baracuda_kernels_unary_acos_bf16_run^⚠: Unary elementwise acos, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_acos_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_acos_bf16_strided.
baracuda_kernels_unary_acos_bf16_strided_run^⚠: Unary elementwise acos, bf16 dtype, strided path.
baracuda_kernels_unary_acos_f16_can_implement^⚠: Pre-launch implementability check for unary_acos_f16.
baracuda_kernels_unary_acos_f16_run^⚠: Unary elementwise acos, f16 dtype, contiguous fast path.
baracuda_kernels_unary_acos_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_acos_f16_strided.
baracuda_kernels_unary_acos_f16_strided_run^⚠: Unary elementwise acos, f16 dtype, strided path.
baracuda_kernels_unary_acos_f32_can_implement^⚠: Pre-launch implementability check for unary_acos_f32.
baracuda_kernels_unary_acos_f32_run^⚠: Unary elementwise acos, f32 dtype, contiguous fast path.
baracuda_kernels_unary_acos_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_acos_f32_strided.
baracuda_kernels_unary_acos_f32_strided_run^⚠: Unary elementwise acos, f32 dtype, strided path.
baracuda_kernels_unary_acos_f64_can_implement^⚠: Pre-launch implementability check for unary_acos_f64.
baracuda_kernels_unary_acos_f64_run^⚠: Unary elementwise acos, f64 dtype, contiguous fast path.
baracuda_kernels_unary_acos_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_acos_f64_strided.
baracuda_kernels_unary_acos_f64_strided_run^⚠: Unary elementwise acos, f64 dtype, strided path.
baracuda_kernels_unary_acosh_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_acosh_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_acosh_backward_bf16_run^⚠: Acosh backward, bf16.
baracuda_kernels_unary_acosh_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_acosh_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_acosh_backward_f16_run^⚠: Acosh backward, f16.
baracuda_kernels_unary_acosh_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_acosh_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_acosh_backward_f32_run^⚠: Acosh backward, f32. dx = dy / sqrt(x² - 1). Saved-x. Domain: x > 1.
baracuda_kernels_unary_acosh_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_acosh_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_acosh_backward_f64_run^⚠: Acosh backward, f64.
baracuda_kernels_unary_acosh_bf16_can_implement^⚠: Pre-launch implementability check for unary_acosh_bf16.
baracuda_kernels_unary_acosh_bf16_run^⚠: Unary elementwise acosh, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_acosh_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_acosh_bf16_strided.
baracuda_kernels_unary_acosh_bf16_strided_run^⚠: Unary elementwise acosh, bf16 dtype, strided path.
baracuda_kernels_unary_acosh_f16_can_implement^⚠: Pre-launch implementability check for unary_acosh_f16.
baracuda_kernels_unary_acosh_f16_run^⚠: Unary elementwise acosh, f16 dtype, contiguous fast path.
baracuda_kernels_unary_acosh_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_acosh_f16_strided.
baracuda_kernels_unary_acosh_f16_strided_run^⚠: Unary elementwise acosh, f16 dtype, strided path.
baracuda_kernels_unary_acosh_f32_can_implement^⚠: Pre-launch implementability check for unary_acosh_f32.
baracuda_kernels_unary_acosh_f32_run^⚠: Unary elementwise acosh, f32 dtype, contiguous fast path.
baracuda_kernels_unary_acosh_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_acosh_f32_strided.
baracuda_kernels_unary_acosh_f32_strided_run^⚠: Unary elementwise acosh, f32 dtype, strided path.
baracuda_kernels_unary_acosh_f64_can_implement^⚠: Pre-launch implementability check for unary_acosh_f64.
baracuda_kernels_unary_acosh_f64_run^⚠: Unary elementwise acosh, f64 dtype, contiguous fast path.
baracuda_kernels_unary_acosh_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_acosh_f64_strided.
baracuda_kernels_unary_acosh_f64_strided_run^⚠: Unary elementwise acosh, f64 dtype, strided path.
baracuda_kernels_unary_asin_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_asin_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_asin_backward_bf16_run^⚠: Asin backward, bf16.
baracuda_kernels_unary_asin_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_asin_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_asin_backward_f16_run^⚠: Asin backward, f16.
baracuda_kernels_unary_asin_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_asin_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_asin_backward_f32_run^⚠: Asin backward, f32. dx = dy / sqrt(1 - x²). Saved-x. Domain: |x| < 1.
baracuda_kernels_unary_asin_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_asin_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_asin_backward_f64_run^⚠: Asin backward, f64.
baracuda_kernels_unary_asin_bf16_can_implement^⚠: Pre-launch implementability check for unary_asin_bf16.
baracuda_kernels_unary_asin_bf16_run^⚠: Unary elementwise asin, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_asin_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_asin_bf16_strided.
baracuda_kernels_unary_asin_bf16_strided_run^⚠: Unary elementwise asin, bf16 dtype, strided path.
baracuda_kernels_unary_asin_f16_can_implement^⚠: Pre-launch implementability check for unary_asin_f16.
baracuda_kernels_unary_asin_f16_run^⚠: Unary elementwise asin, f16 dtype, contiguous fast path.
baracuda_kernels_unary_asin_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_asin_f16_strided.
baracuda_kernels_unary_asin_f16_strided_run^⚠: Unary elementwise asin, f16 dtype, strided path.
baracuda_kernels_unary_asin_f32_can_implement^⚠: Pre-launch implementability check for unary_asin_f32.
baracuda_kernels_unary_asin_f32_run^⚠: Unary elementwise asin, f32 dtype, contiguous fast path.
baracuda_kernels_unary_asin_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_asin_f32_strided.
baracuda_kernels_unary_asin_f32_strided_run^⚠: Unary elementwise asin, f32 dtype, strided path.
baracuda_kernels_unary_asin_f64_can_implement^⚠: Pre-launch implementability check for unary_asin_f64.
baracuda_kernels_unary_asin_f64_run^⚠: Unary elementwise asin, f64 dtype, contiguous fast path.
baracuda_kernels_unary_asin_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_asin_f64_strided.
baracuda_kernels_unary_asin_f64_strided_run^⚠: Unary elementwise asin, f64 dtype, strided path.
baracuda_kernels_unary_asinh_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_asinh_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_asinh_backward_bf16_run^⚠: Asinh backward, bf16.
baracuda_kernels_unary_asinh_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_asinh_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_asinh_backward_f16_run^⚠: Asinh backward, f16.
baracuda_kernels_unary_asinh_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_asinh_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_asinh_backward_f32_run^⚠: Asinh backward, f32. dx = dy / sqrt(1 + x²). Saved-x.
baracuda_kernels_unary_asinh_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_asinh_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_asinh_backward_f64_run^⚠: Asinh backward, f64.
baracuda_kernels_unary_asinh_bf16_can_implement^⚠: Pre-launch implementability check for unary_asinh_bf16.
baracuda_kernels_unary_asinh_bf16_run^⚠: Unary elementwise asinh, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_asinh_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_asinh_bf16_strided.
baracuda_kernels_unary_asinh_bf16_strided_run^⚠: Unary elementwise asinh, bf16 dtype, strided path.
baracuda_kernels_unary_asinh_f16_can_implement^⚠: Pre-launch implementability check for unary_asinh_f16.
baracuda_kernels_unary_asinh_f16_run^⚠: Unary elementwise asinh, f16 dtype, contiguous fast path.
baracuda_kernels_unary_asinh_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_asinh_f16_strided.
baracuda_kernels_unary_asinh_f16_strided_run^⚠: Unary elementwise asinh, f16 dtype, strided path.
baracuda_kernels_unary_asinh_f32_can_implement^⚠: Pre-launch implementability check for unary_asinh_f32.
baracuda_kernels_unary_asinh_f32_run^⚠: Unary elementwise asinh, f32 dtype, contiguous fast path.
baracuda_kernels_unary_asinh_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_asinh_f32_strided.
baracuda_kernels_unary_asinh_f32_strided_run^⚠: Unary elementwise asinh, f32 dtype, strided path.
baracuda_kernels_unary_asinh_f64_can_implement^⚠: Pre-launch implementability check for unary_asinh_f64.
baracuda_kernels_unary_asinh_f64_run^⚠: Unary elementwise asinh, f64 dtype, contiguous fast path.
baracuda_kernels_unary_asinh_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_asinh_f64_strided.
baracuda_kernels_unary_asinh_f64_strided_run^⚠: Unary elementwise asinh, f64 dtype, strided path.
baracuda_kernels_unary_atan_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_atan_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_atan_backward_bf16_run^⚠: Atan backward, bf16.
baracuda_kernels_unary_atan_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_atan_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_atan_backward_f16_run^⚠: Atan backward, f16.
baracuda_kernels_unary_atan_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_atan_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_atan_backward_f32_run^⚠: Atan backward, f32. dx = dy / (1 + x²).
baracuda_kernels_unary_atan_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_atan_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_atan_backward_f64_run^⚠: Atan backward, f64.
baracuda_kernels_unary_atan_bf16_can_implement^⚠: Pre-launch implementability check for unary_atan_bf16.
baracuda_kernels_unary_atan_bf16_run^⚠: Unary elementwise atan, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_atan_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_atan_bf16_strided.
baracuda_kernels_unary_atan_bf16_strided_run^⚠: Unary elementwise atan, bf16 dtype, strided path.
baracuda_kernels_unary_atan_f16_can_implement^⚠: Pre-launch implementability check for unary_atan_f16.
baracuda_kernels_unary_atan_f16_run^⚠: Unary elementwise atan, f16 dtype, contiguous fast path.
baracuda_kernels_unary_atan_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_atan_f16_strided.
baracuda_kernels_unary_atan_f16_strided_run^⚠: Unary elementwise atan, f16 dtype, strided path.
baracuda_kernels_unary_atan_f32_can_implement^⚠: Pre-launch implementability check for unary_atan_f32.
baracuda_kernels_unary_atan_f32_run^⚠: Unary elementwise atan, f32 dtype, contiguous fast path.
baracuda_kernels_unary_atan_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_atan_f32_strided.
baracuda_kernels_unary_atan_f32_strided_run^⚠: Unary elementwise atan, f32 dtype, strided path.
baracuda_kernels_unary_atan_f64_can_implement^⚠: Pre-launch implementability check for unary_atan_f64.
baracuda_kernels_unary_atan_f64_run^⚠: Unary elementwise atan, f64 dtype, contiguous fast path.
baracuda_kernels_unary_atan_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_atan_f64_strided.
baracuda_kernels_unary_atan_f64_strided_run^⚠: Unary elementwise atan, f64 dtype, strided path.
baracuda_kernels_unary_atanh_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_atanh_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_atanh_backward_bf16_run^⚠: Atanh backward, bf16.
baracuda_kernels_unary_atanh_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_atanh_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_atanh_backward_f16_run^⚠: Atanh backward, f16.
baracuda_kernels_unary_atanh_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_atanh_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_atanh_backward_f32_run^⚠: Atanh backward, f32. dx = dy / (1 - x²). Saved-x. Domain: |x| < 1.
baracuda_kernels_unary_atanh_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_atanh_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_atanh_backward_f64_run^⚠: Atanh backward, f64.
baracuda_kernels_unary_atanh_bf16_can_implement^⚠: Pre-launch implementability check for unary_atanh_bf16.
baracuda_kernels_unary_atanh_bf16_run^⚠: Unary elementwise atanh, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_atanh_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_atanh_bf16_strided.
baracuda_kernels_unary_atanh_bf16_strided_run^⚠: Unary elementwise atanh, bf16 dtype, strided path.
baracuda_kernels_unary_atanh_f16_can_implement^⚠: Pre-launch implementability check for unary_atanh_f16.
baracuda_kernels_unary_atanh_f16_run^⚠: Unary elementwise atanh, f16 dtype, contiguous fast path.
baracuda_kernels_unary_atanh_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_atanh_f16_strided.
baracuda_kernels_unary_atanh_f16_strided_run^⚠: Unary elementwise atanh, f16 dtype, strided path.
baracuda_kernels_unary_atanh_f32_can_implement^⚠: Pre-launch implementability check for unary_atanh_f32.
baracuda_kernels_unary_atanh_f32_run^⚠: Unary elementwise atanh, f32 dtype, contiguous fast path.
baracuda_kernels_unary_atanh_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_atanh_f32_strided.
baracuda_kernels_unary_atanh_f32_strided_run^⚠: Unary elementwise atanh, f32 dtype, strided path.
baracuda_kernels_unary_atanh_f64_can_implement^⚠: Pre-launch implementability check for unary_atanh_f64.
baracuda_kernels_unary_atanh_f64_run^⚠: Unary elementwise atanh, f64 dtype, contiguous fast path.
baracuda_kernels_unary_atanh_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_atanh_f64_strided.
baracuda_kernels_unary_atanh_f64_strided_run^⚠: Unary elementwise atanh, f64 dtype, strided path.
baracuda_kernels_unary_cbrt_bf16_can_implement^⚠: Pre-launch implementability check for unary_cbrt_bf16.
baracuda_kernels_unary_cbrt_bf16_run^⚠: Unary elementwise cbrt, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_cbrt_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_cbrt_bf16_strided.
baracuda_kernels_unary_cbrt_bf16_strided_run^⚠: Unary elementwise cbrt, bf16 dtype, strided path.
baracuda_kernels_unary_cbrt_f16_can_implement^⚠: Pre-launch implementability check for unary_cbrt_f16.
baracuda_kernels_unary_cbrt_f16_run^⚠: Unary elementwise cbrt, f16 dtype, contiguous fast path.
baracuda_kernels_unary_cbrt_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_cbrt_f16_strided.
baracuda_kernels_unary_cbrt_f16_strided_run^⚠: Unary elementwise cbrt, f16 dtype, strided path.
baracuda_kernels_unary_cbrt_f32_can_implement^⚠: Pre-launch implementability check for unary_cbrt_f32.
baracuda_kernels_unary_cbrt_f32_run^⚠: Unary elementwise cbrt, f32 dtype, contiguous fast path.
baracuda_kernels_unary_cbrt_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_cbrt_f32_strided.
baracuda_kernels_unary_cbrt_f32_strided_run^⚠: Unary elementwise cbrt, f32 dtype, strided path.
baracuda_kernels_unary_cbrt_f64_can_implement^⚠: Pre-launch implementability check for unary_cbrt_f64.
baracuda_kernels_unary_cbrt_f64_run^⚠: Unary elementwise cbrt, f64 dtype, contiguous fast path.
baracuda_kernels_unary_cbrt_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_cbrt_f64_strided.
baracuda_kernels_unary_cbrt_f64_strided_run^⚠: Unary elementwise cbrt, f64 dtype, strided path.
baracuda_kernels_unary_ceil_bf16_can_implement^⚠: Pre-launch implementability check for unary_ceil_bf16.
baracuda_kernels_unary_ceil_bf16_run^⚠: Unary elementwise ceil, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_ceil_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_ceil_bf16_strided.
baracuda_kernels_unary_ceil_bf16_strided_run^⚠: Unary elementwise ceil, bf16 dtype, strided path.
baracuda_kernels_unary_ceil_f16_can_implement^⚠: Pre-launch implementability check for unary_ceil_f16.
baracuda_kernels_unary_ceil_f16_run^⚠: Unary elementwise ceil, f16 dtype, contiguous fast path.
baracuda_kernels_unary_ceil_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_ceil_f16_strided.
baracuda_kernels_unary_ceil_f16_strided_run^⚠: Unary elementwise ceil, f16 dtype, strided path.
baracuda_kernels_unary_ceil_f32_can_implement^⚠: Pre-launch implementability check for unary_ceil_f32.
baracuda_kernels_unary_ceil_f32_run^⚠: Unary elementwise ceil, f32 dtype, contiguous fast path.
baracuda_kernels_unary_ceil_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_ceil_f32_strided.
baracuda_kernels_unary_ceil_f32_strided_run^⚠: Unary elementwise ceil, f32 dtype, strided path.
baracuda_kernels_unary_ceil_f64_can_implement^⚠: Pre-launch implementability check for unary_ceil_f64.
baracuda_kernels_unary_ceil_f64_run^⚠: Unary elementwise ceil, f64 dtype, contiguous fast path.
baracuda_kernels_unary_ceil_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_ceil_f64_strided.
baracuda_kernels_unary_ceil_f64_strided_run^⚠: Unary elementwise ceil, f64 dtype, strided path.
baracuda_kernels_unary_cos_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_cos_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_cos_backward_bf16_run^⚠: Cos backward, bf16.
baracuda_kernels_unary_cos_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_cos_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_cos_backward_f16_run^⚠: Cos backward, f16.
baracuda_kernels_unary_cos_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_cos_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_cos_backward_f32_run^⚠: Cos backward, f32. dx = -dy * sin(x). Saved-x.
baracuda_kernels_unary_cos_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_cos_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_cos_backward_f64_run^⚠: Cos backward, f64.
baracuda_kernels_unary_cos_bf16_can_implement^⚠: Pre-launch implementability check for unary_cos_bf16.
baracuda_kernels_unary_cos_bf16_run^⚠: Unary elementwise cos, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_cos_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_cos_bf16_strided.
baracuda_kernels_unary_cos_bf16_strided_run^⚠: Unary elementwise cos, bf16 dtype, strided path.
baracuda_kernels_unary_cos_f16_can_implement^⚠: Pre-launch implementability check for unary_cos_f16.
baracuda_kernels_unary_cos_f16_run^⚠: Unary elementwise cos, f16 dtype, contiguous fast path.
baracuda_kernels_unary_cos_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_cos_f16_strided.
baracuda_kernels_unary_cos_f16_strided_run^⚠: Unary elementwise cos, f16 dtype, strided path.
baracuda_kernels_unary_cos_f32_can_implement^⚠: Pre-launch implementability check for unary_cos_f32.
baracuda_kernels_unary_cos_f32_run^⚠: Unary elementwise cos, f32 dtype, contiguous fast path.
baracuda_kernels_unary_cos_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_cos_f32_strided.
baracuda_kernels_unary_cos_f32_strided_run^⚠: Unary elementwise cos, f32 dtype, strided path.
baracuda_kernels_unary_cos_f64_can_implement^⚠: Pre-launch implementability check for unary_cos_f64.
baracuda_kernels_unary_cos_f64_run^⚠: Unary elementwise cos, f64 dtype, contiguous fast path.
baracuda_kernels_unary_cos_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_cos_f64_strided.
baracuda_kernels_unary_cos_f64_strided_run^⚠: Unary elementwise cos, f64 dtype, strided path.
baracuda_kernels_unary_cosh_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_cosh_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_cosh_backward_bf16_run^⚠: Cosh backward, bf16.
baracuda_kernels_unary_cosh_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_cosh_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_cosh_backward_f16_run^⚠: Cosh backward, f16.
baracuda_kernels_unary_cosh_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_cosh_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_cosh_backward_f32_run^⚠: Cosh backward, f32. dx = dy * sinh(x). Saved-x.
baracuda_kernels_unary_cosh_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_cosh_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_cosh_backward_f64_run^⚠: Cosh backward, f64.
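
Backward formulas such as dx = dy * sinh(x) for cosh can be cross-checked against a central-difference estimate. The sketch below shows that check on the host in f64. Both function names and the step size h are illustrative assumptions, not part of this crate.

```rust
/// Analytic cosh backward as stated in the listing: dx = dy * sinh(x).
/// Hypothetical host-side helper, not part of baracuda-kernels-sys.
fn cosh_backward_ref(x: f64, dy: f64) -> f64 {
    dy * x.sinh()
}

/// Central-difference estimate of dy * d/dx cosh(x), used only as a
/// cross-check of the analytic formula. `h` is the finite-difference step.
fn cosh_backward_numeric(x: f64, dy: f64, h: f64) -> f64 {
    dy * ((x + h).cosh() - (x - h).cosh()) / (2.0 * h)
}
```

Comparing the two at a handful of points, for example x = 0.7 with h = 1e-5, is a cheap way to catch a sign or factor error before comparing against device results.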
baracuda_kernels_unary_cosh_bf16_can_implement^⚠: Pre-launch implementability check for unary_cosh_bf16.
baracuda_kernels_unary_cosh_bf16_run^⚠: Unary elementwise cosh, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_cosh_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_cosh_bf16_strided.
baracuda_kernels_unary_cosh_bf16_strided_run^⚠: Unary elementwise cosh, bf16 dtype, strided path.
baracuda_kernels_unary_cosh_f16_can_implement^⚠: Pre-launch implementability check for unary_cosh_f16.
baracuda_kernels_unary_cosh_f16_run^⚠: Unary elementwise cosh, f16 dtype, contiguous fast path.
baracuda_kernels_unary_cosh_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_cosh_f16_strided.
baracuda_kernels_unary_cosh_f16_strided_run^⚠: Unary elementwise cosh, f16 dtype, strided path.
baracuda_kernels_unary_cosh_f32_can_implement^⚠: Pre-launch implementability check for unary_cosh_f32.
baracuda_kernels_unary_cosh_f32_run^⚠: Unary elementwise cosh, f32 dtype, contiguous fast path.
baracuda_kernels_unary_cosh_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_cosh_f32_strided.
baracuda_kernels_unary_cosh_f32_strided_run^⚠: Unary elementwise cosh, f32 dtype, strided path.
baracuda_kernels_unary_cosh_f64_can_implement^⚠: Pre-launch implementability check for unary_cosh_f64.
baracuda_kernels_unary_cosh_f64_run^⚠: Unary elementwise cosh, f64 dtype, contiguous fast path.
baracuda_kernels_unary_cosh_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_cosh_f64_strided.
baracuda_kernels_unary_cosh_f64_strided_run^⚠: Unary elementwise cosh, f64 dtype, strided path.
baracuda_kernels_unary_cube_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_cube_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_cube_backward_bf16_run^⚠: Cube backward, bf16.
baracuda_kernels_unary_cube_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_cube_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_cube_backward_f16_run^⚠: Cube backward, f16.
baracuda_kernels_unary_cube_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_cube_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_cube_backward_f32_run^⚠: Cube backward, f32. dx = dy * 3 * x².
baracuda_kernels_unary_cube_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_cube_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_cube_backward_f64_run^⚠: Cube backward, f64.
baracuda_kernels_unary_cube_bf16_can_implement^⚠: Pre-launch implementability check for unary_cube_bf16.
baracuda_kernels_unary_cube_bf16_run^⚠: Unary elementwise cube, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_cube_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_cube_bf16_strided.
baracuda_kernels_unary_cube_bf16_strided_run^⚠: Unary elementwise cube, bf16 dtype, strided path.
baracuda_kernels_unary_cube_f16_can_implement^⚠: Pre-launch implementability check for unary_cube_f16.
baracuda_kernels_unary_cube_f16_run^⚠: Unary elementwise cube, f16 dtype, contiguous fast path.
baracuda_kernels_unary_cube_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_cube_f16_strided.
baracuda_kernels_unary_cube_f16_strided_run^⚠: Unary elementwise cube, f16 dtype, strided path.
baracuda_kernels_unary_cube_f32_can_implement^⚠: Pre-launch implementability check for unary_cube_f32.
baracuda_kernels_unary_cube_f32_run^⚠: Unary elementwise cube, f32 dtype, contiguous fast path.
baracuda_kernels_unary_cube_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_cube_f32_strided.
baracuda_kernels_unary_cube_f32_strided_run^⚠: Unary elementwise cube, f32 dtype, strided path.
baracuda_kernels_unary_cube_f64_can_implement^⚠: Pre-launch implementability check for unary_cube_f64.
baracuda_kernels_unary_cube_f64_run^⚠: Unary elementwise cube, f64 dtype, contiguous fast path.
baracuda_kernels_unary_cube_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_cube_f64_strided.
baracuda_kernels_unary_cube_f64_strided_run^⚠: Unary elementwise cube, f64 dtype, strided path.
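As a host-side cross-check for the cube family above, the documented math (forward y = x³, backward dx = dy * 3 * x²) can be sketched in plain Rust. The names `cube_ref` and `cube_backward_ref` are hypothetical and not part of this crate; they only illustrate what a test might compare device output against.

```rust
/// Hypothetical CPU reference for the cube forward op: y = x^3.
fn cube_ref(x: &[f32]) -> Vec<f32> {
    x.iter().map(|&v| v * v * v).collect()
}

/// Hypothetical CPU reference for cube backward: dx = dy * 3 * x^2,
/// where `x` is the forward input.
fn cube_backward_ref(dy: &[f32], x: &[f32]) -> Vec<f32> {
    assert_eq!(dy.len(), x.len(), "dy and x must have the same length");
    dy.iter().zip(x).map(|(&g, &v)| g * 3.0 * v * v).collect()
}
```

A test would typically copy the device result back to the host and compare it element by element against these vectors, with a tolerance suited to the dtype.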
baracuda_kernels_unary_elu_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_elu_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_elu_backward_bf16_run^⚠: ELU backward, bf16.
baracuda_kernels_unary_elu_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_elu_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_elu_backward_f16_run^⚠: ELU backward, f16.
baracuda_kernels_unary_elu_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_elu_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_elu_backward_f32_run^⚠: ELU backward, f32. dx = (x > 0) ? dy : dy·α·exp(x) with α=1.0. Saved-x.
baracuda_kernels_unary_elu_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_elu_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_elu_backward_f64_run^⚠: ELU backward, f64.
baracuda_kernels_unary_elu_bf16_can_implement^⚠: Pre-launch implementability check for unary_elu_bf16.
baracuda_kernels_unary_elu_bf16_run^⚠: Unary elementwise elu(x; α), bf16, contig.
baracuda_kernels_unary_elu_bf16_strided_can_implement^⚠: Implementability check for baracuda_kernels_unary_elu_bf16_strided. Host-side only.
baracuda_kernels_unary_elu_bf16_strided_run^⚠: Unary elementwise elu(x; α), bf16, strided.
baracuda_kernels_unary_elu_f16_can_implement^⚠: Pre-launch implementability check for unary_elu_f16.
baracuda_kernels_unary_elu_f16_run^⚠: Unary elementwise elu(x; α), f16, contig.
baracuda_kernels_unary_elu_f16_strided_can_implement^⚠: Implementability check for baracuda_kernels_unary_elu_f16_strided. Host-side only.
baracuda_kernels_unary_elu_f16_strided_run^⚠: Unary elementwise elu(x; α), f16, strided.
baracuda_kernels_unary_elu_f32_can_implement^⚠: Pre-launch implementability check for unary_elu_f32.
baracuda_kernels_unary_elu_f32_run^⚠: Unary elementwise elu(x; α) = x if x>0 else α·(exp(x)-1), f32, contig.
baracuda_kernels_unary_elu_f32_strided_can_implement^⚠: Implementability check for baracuda_kernels_unary_elu_f32_strided. Host-side only.
baracuda_kernels_unary_elu_f32_strided_run^⚠: Unary elementwise elu(x; α), f32, strided.
baracuda_kernels_unary_elu_f64_can_implement^⚠: Pre-launch implementability check for unary_elu_f64.
baracuda_kernels_unary_elu_f64_run^⚠: Unary elementwise elu(x; α), f64, contig.
baracuda_kernels_unary_elu_f64_strided_can_implement^⚠: Implementability check for baracuda_kernels_unary_elu_f64_strided. Host-side only.
baracuda_kernels_unary_elu_f64_strided_run^⚠: Unary elementwise elu(x; α), f64, strided.
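The ELU formulas quoted in the f32 entries above (forward x if x > 0 else α·(exp(x) − 1); backward dy if x > 0 else dy·α·exp(x)) can be written as a scalar CPU reference. This is a minimal sketch: `elu_ref` and `elu_backward_ref` are hypothetical helper names, and taking α as a parameter is an illustration choice, since the backward entry documents α = 1.0.

```rust
/// Hypothetical CPU reference for ELU forward:
/// elu(x; alpha) = x if x > 0, else alpha * (exp(x) - 1).
fn elu_ref(x: f32, alpha: f32) -> f32 {
    if x > 0.0 {
        x
    } else {
        // exp_m1 computes exp(x) - 1 with better accuracy near zero.
        alpha * x.exp_m1()
    }
}

/// Hypothetical CPU reference for ELU backward using the saved input x:
/// dx = dy if x > 0, else dy * alpha * exp(x).
fn elu_backward_ref(dy: f32, x: f32, alpha: f32) -> f32 {
    if x > 0.0 {
        dy
    } else {
        dy * alpha * x.exp()
    }
}
```

Passing `alpha = 1.0` reproduces the value documented for the backward kernels.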
baracuda_kernels_unary_erf_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_erf_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_erf_backward_bf16_run^⚠: Erf backward, bf16.
baracuda_kernels_unary_erf_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_erf_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_erf_backward_f16_run^⚠: Erf backward, f16.
baracuda_kernels_unary_erf_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_erf_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_erf_backward_f32_run^⚠: Erf backward, f32. dx = dy * (2/√π) * exp(-x²).
baracuda_kernels_unary_erf_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_erf_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_erf_backward_f64_run^⚠: Erf backward, f64.
baracuda_kernels_unary_erf_bf16_can_implement^⚠: Pre-launch implementability check for unary_erf_bf16.
baracuda_kernels_unary_erf_bf16_run^⚠: Unary elementwise erf, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_erf_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_erf_bf16_strided.
baracuda_kernels_unary_erf_bf16_strided_run^⚠: Unary elementwise erf, bf16 dtype, strided path.
baracuda_kernels_unary_erf_f16_can_implement^⚠: Pre-launch implementability check for unary_erf_f16.
baracuda_kernels_unary_erf_f16_run^⚠: Unary elementwise erf, f16 dtype, contiguous fast path.
baracuda_kernels_unary_erf_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_erf_f16_strided.
baracuda_kernels_unary_erf_f16_strided_run^⚠: Unary elementwise erf, f16 dtype, strided path.
baracuda_kernels_unary_erf_f32_can_implement^⚠: Pre-launch implementability check for unary_erf_f32.
baracuda_kernels_unary_erf_f32_run^⚠: Unary elementwise erf, f32 dtype, contiguous fast path.
baracuda_kernels_unary_erf_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_erf_f32_strided.
baracuda_kernels_unary_erf_f32_strided_run^⚠: Unary elementwise erf, f32 dtype, strided path.
baracuda_kernels_unary_erf_f64_can_implement^⚠: Pre-launch implementability check for unary_erf_f64.
baracuda_kernels_unary_erf_f64_run^⚠: Unary elementwise erf, f64 dtype, contiguous fast path.
baracuda_kernels_unary_erf_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_erf_f64_strided.
baracuda_kernels_unary_erf_f64_strided_run^⚠: Unary elementwise erf, f64 dtype, strided path.
baracuda_kernels_unary_erfc_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_erfc_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_erfc_backward_bf16_run^⚠: Erfc backward, bf16.
baracuda_kernels_unary_erfc_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_erfc_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_erfc_backward_f16_run^⚠: Erfc backward, f16.
baracuda_kernels_unary_erfc_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_erfc_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_erfc_backward_f32_run^⚠: Erfc backward, f32. dx = -dy * (2/√π) * exp(-x²).
baracuda_kernels_unary_erfc_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_erfc_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_erfc_backward_f64_run^⚠: Erfc backward, f64.
baracuda_kernels_unary_erfc_bf16_can_implement^⚠: Pre-launch implementability check for unary_erfc_bf16.
baracuda_kernels_unary_erfc_bf16_run^⚠: Unary elementwise erfc, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_erfc_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_erfc_bf16_strided.
baracuda_kernels_unary_erfc_bf16_strided_run^⚠: Unary elementwise erfc, bf16 dtype, strided path.
baracuda_kernels_unary_erfc_f16_can_implement^⚠: Pre-launch implementability check for unary_erfc_f16.
baracuda_kernels_unary_erfc_f16_run^⚠: Unary elementwise erfc, f16 dtype, contiguous fast path.
baracuda_kernels_unary_erfc_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_erfc_f16_strided.
baracuda_kernels_unary_erfc_f16_strided_run^⚠: Unary elementwise erfc, f16 dtype, strided path.
baracuda_kernels_unary_erfc_f32_can_implement^⚠: Pre-launch implementability check for unary_erfc_f32.
baracuda_kernels_unary_erfc_f32_run^⚠: Unary elementwise erfc, f32 dtype, contiguous fast path.
baracuda_kernels_unary_erfc_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_erfc_f32_strided.
baracuda_kernels_unary_erfc_f32_strided_run^⚠: Unary elementwise erfc, f32 dtype, strided path.
baracuda_kernels_unary_erfc_f64_can_implement^⚠: Pre-launch implementability check for unary_erfc_f64.
baracuda_kernels_unary_erfc_f64_run^⚠: Unary elementwise erfc, f64 dtype, contiguous fast path.
baracuda_kernels_unary_erfc_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_erfc_f64_strided.
baracuda_kernels_unary_erfc_f64_strided_run^⚠: Unary elementwise erfc, f64 dtype, strided path.
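The erf and erfc backward entries above differ only in sign: dx = ±dy * (2/√π) * exp(-x²). A minimal host-side sketch of both follows; the function names are hypothetical, and only the formulas come from the entries above.

```rust
use std::f32::consts::FRAC_2_SQRT_PI; // 2 / sqrt(pi)

/// Hypothetical CPU reference for erf backward:
/// dx = dy * (2/sqrt(pi)) * exp(-x^2).
fn erf_backward_ref(dy: f32, x: f32) -> f32 {
    dy * FRAC_2_SQRT_PI * (-(x * x)).exp()
}

/// Hypothetical CPU reference for erfc backward. Because
/// erfc(x) = 1 - erf(x), its gradient is the negated erf gradient.
fn erfc_backward_ref(dy: f32, x: f32) -> f32 {
    -erf_backward_ref(dy, x)
}
```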
baracuda_kernels_unary_exp2_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_exp2_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_exp2_backward_bf16_run^⚠: Exp2 backward, bf16.
baracuda_kernels_unary_exp2_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_exp2_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_exp2_backward_f16_run^⚠: Exp2 backward, f16.
baracuda_kernels_unary_exp2_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_exp2_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_exp2_backward_f32_run^⚠: Exp2 backward, f32. dx = dy * y * ln(2).
baracuda_kernels_unary_exp2_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_exp2_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_exp2_backward_f64_run^⚠: Exp2 backward, f64.
baracuda_kernels_unary_exp2_bf16_can_implement^⚠: Pre-launch implementability check for unary_exp2_bf16.
baracuda_kernels_unary_exp2_bf16_run^⚠: Unary elementwise exp2, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_exp2_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_exp2_bf16_strided.
baracuda_kernels_unary_exp2_bf16_strided_run^⚠: Unary elementwise exp2, bf16 dtype, strided path.
baracuda_kernels_unary_exp2_f16_can_implement^⚠: Pre-launch implementability check for unary_exp2_f16.
baracuda_kernels_unary_exp2_f16_run^⚠: Unary elementwise exp2, f16 dtype, contiguous fast path.
baracuda_kernels_unary_exp2_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_exp2_f16_strided.
baracuda_kernels_unary_exp2_f16_strided_run^⚠: Unary elementwise exp2, f16 dtype, strided path.
baracuda_kernels_unary_exp2_f32_can_implement^⚠: Pre-launch implementability check for unary_exp2_f32.
baracuda_kernels_unary_exp2_f32_run^⚠: Unary elementwise exp2, f32 dtype, contiguous fast path.
baracuda_kernels_unary_exp2_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_exp2_f32_strided.
baracuda_kernels_unary_exp2_f32_strided_run^⚠: Unary elementwise exp2, f32 dtype, strided path.
baracuda_kernels_unary_exp2_f64_can_implement^⚠: Pre-launch implementability check for unary_exp2_f64.
baracuda_kernels_unary_exp2_f64_run^⚠: Unary elementwise exp2, f64 dtype, contiguous fast path.
baracuda_kernels_unary_exp2_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_exp2_f64_strided.
baracuda_kernels_unary_exp2_f64_strided_run^⚠: Unary elementwise exp2, f64 dtype, strided path.
baracuda_kernels_unary_exp_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_exp_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_exp_backward_bf16_run^⚠: Exp backward, bf16.
baracuda_kernels_unary_exp_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_exp_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_exp_backward_f16_run^⚠: Exp backward, f16.
baracuda_kernels_unary_exp_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_exp_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_exp_backward_f32_run^⚠: Exp backward, f32. dx = dy * y. Caller must pass the forward output y as saved.
baracuda_kernels_unary_exp_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_exp_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_exp_backward_f64_run^⚠: Exp backward, f64.
baracuda_kernels_unary_exp_bf16_can_implement^⚠: Pre-launch implementability check for unary_exp_bf16.
baracuda_kernels_unary_exp_bf16_run^⚠: Unary elementwise exp, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_exp_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_exp_bf16_strided.
baracuda_kernels_unary_exp_bf16_strided_run^⚠: Unary elementwise exp, bf16 dtype, strided path.
baracuda_kernels_unary_exp_f16_can_implement^⚠: Pre-launch implementability check for unary_exp_f16.
baracuda_kernels_unary_exp_f16_run^⚠: Unary elementwise exp, f16 dtype, contiguous fast path.
baracuda_kernels_unary_exp_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_exp_f16_strided.
baracuda_kernels_unary_exp_f16_strided_run^⚠: Unary elementwise exp, f16 dtype, strided path.
baracuda_kernels_unary_exp_f32_can_implement^⚠: Pre-launch implementability check for unary_exp_f32.
baracuda_kernels_unary_exp_f32_run^⚠: Unary elementwise exp, f32 dtype, contiguous fast path.
baracuda_kernels_unary_exp_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_exp_f32_strided.
baracuda_kernels_unary_exp_f32_strided_run^⚠: Unary elementwise exp, f32 dtype, strided path.
baracuda_kernels_unary_exp_f64_can_implement^⚠: Pre-launch implementability check for unary_exp_f64.
baracuda_kernels_unary_exp_f64_run^⚠: Unary elementwise exp, f64 dtype, contiguous fast path.
baracuda_kernels_unary_exp_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_exp_f64_strided.
baracuda_kernels_unary_exp_f64_strided_run^⚠: Unary elementwise exp, f64 dtype, strided path.
baracuda_kernels_unary_expm1_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_expm1_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_expm1_backward_bf16_run^⚠: Expm1 backward, bf16.
baracuda_kernels_unary_expm1_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_expm1_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_expm1_backward_f16_run^⚠: Expm1 backward, f16.
baracuda_kernels_unary_expm1_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_expm1_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_expm1_backward_f32_run^⚠: Expm1 backward, f32. dx = dy * (y + 1). Saved-y.
baracuda_kernels_unary_expm1_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_expm1_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_expm1_backward_f64_run^⚠: Expm1 backward, f64.
baracuda_kernels_unary_expm1_bf16_can_implement^⚠: Pre-launch implementability check for unary_expm1_bf16.
baracuda_kernels_unary_expm1_bf16_run^⚠: Unary elementwise expm1, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_expm1_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_expm1_bf16_strided.
baracuda_kernels_unary_expm1_bf16_strided_run^⚠: Unary elementwise expm1, bf16 dtype, strided path.
baracuda_kernels_unary_expm1_f16_can_implement^⚠: Pre-launch implementability check for unary_expm1_f16.
baracuda_kernels_unary_expm1_f16_run^⚠: Unary elementwise expm1, f16 dtype, contiguous fast path.
baracuda_kernels_unary_expm1_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_expm1_f16_strided.
baracuda_kernels_unary_expm1_f16_strided_run^⚠: Unary elementwise expm1, f16 dtype, strided path.
baracuda_kernels_unary_expm1_f32_can_implement^⚠: Pre-launch implementability check for unary_expm1_f32.
baracuda_kernels_unary_expm1_f32_run^⚠: Unary elementwise expm1, f32 dtype, contiguous fast path.
baracuda_kernels_unary_expm1_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_expm1_f32_strided.
baracuda_kernels_unary_expm1_f32_strided_run^⚠: Unary elementwise expm1, f32 dtype, strided path.
baracuda_kernels_unary_expm1_f64_can_implement^⚠: Pre-launch implementability check for unary_expm1_f64.
baracuda_kernels_unary_expm1_f64_run^⚠: Unary elementwise expm1, f64 dtype, contiguous fast path.
baracuda_kernels_unary_expm1_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_expm1_f64_strided.
baracuda_kernels_unary_expm1_f64_strided_run^⚠: Unary elementwise expm1, f64 dtype, strided path.
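The exp, exp2, and expm1 backward entries above all work from the saved forward output y rather than the input x. A minimal CPU sketch of the three documented formulas is shown here; the helper names are hypothetical and exist only for illustration.

```rust
use std::f32::consts::LN_2;

/// exp backward: dx = dy * y, where y = exp(x) is the saved forward output.
fn exp_backward_ref(dy: f32, y: f32) -> f32 {
    dy * y
}

/// exp2 backward: dx = dy * y * ln(2), where y = 2^x is the saved output.
fn exp2_backward_ref(dy: f32, y: f32) -> f32 {
    dy * y * LN_2
}

/// expm1 backward: dx = dy * (y + 1), where y = exp(x) - 1 is the saved output.
fn expm1_backward_ref(dy: f32, y: f32) -> f32 {
    dy * (y + 1.0)
}
```

Working from y avoids recomputing the exponential in the backward pass, which is presumably why these kernels ask the caller to pass the forward output.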
baracuda_kernels_unary_floor_bf16_can_implement^⚠: Pre-launch implementability check for unary_floor_bf16.
baracuda_kernels_unary_floor_bf16_run^⚠: Unary elementwise floor, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_floor_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_floor_bf16_strided.
baracuda_kernels_unary_floor_bf16_strided_run^⚠: Unary elementwise floor, bf16 dtype, strided path.
baracuda_kernels_unary_floor_f16_can_implement^⚠: Pre-launch implementability check for unary_floor_f16.
baracuda_kernels_unary_floor_f16_run^⚠: Unary elementwise floor, f16 dtype, contiguous fast path.
baracuda_kernels_unary_floor_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_floor_f16_strided.
baracuda_kernels_unary_floor_f16_strided_run^⚠: Unary elementwise floor, f16 dtype, strided path.
baracuda_kernels_unary_floor_f32_can_implement^⚠: Pre-launch implementability check for unary_floor_f32.
baracuda_kernels_unary_floor_f32_run^⚠: Unary elementwise floor, f32 dtype, contiguous fast path.
baracuda_kernels_unary_floor_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_floor_f32_strided.
baracuda_kernels_unary_floor_f32_strided_run^⚠: Unary elementwise floor, f32 dtype, strided path.
baracuda_kernels_unary_floor_f64_can_implement^⚠: Pre-launch implementability check for unary_floor_f64.
baracuda_kernels_unary_floor_f64_run^⚠: Unary elementwise floor, f64 dtype, contiguous fast path.
baracuda_kernels_unary_floor_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_floor_f64_strided.
baracuda_kernels_unary_floor_f64_strided_run^⚠: Unary elementwise floor, f64 dtype, strided path.
baracuda_kernels_unary_frac_bf16_can_implement^⚠: Pre-launch implementability check for unary_frac_bf16.
baracuda_kernels_unary_frac_bf16_run^⚠: Unary elementwise frac, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_frac_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_frac_bf16_strided.
baracuda_kernels_unary_frac_bf16_strided_run^⚠: Unary elementwise frac, bf16 dtype, strided path.
baracuda_kernels_unary_frac_f16_can_implement^⚠: Pre-launch implementability check for unary_frac_f16.
baracuda_kernels_unary_frac_f16_run^⚠: Unary elementwise frac, f16 dtype, contiguous fast path.
baracuda_kernels_unary_frac_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_frac_f16_strided.
baracuda_kernels_unary_frac_f16_strided_run^⚠: Unary elementwise frac, f16 dtype, strided path.
baracuda_kernels_unary_frac_f32_can_implement^⚠: Pre-launch implementability check for unary_frac_f32.
baracuda_kernels_unary_frac_f32_run^⚠: Unary elementwise frac, f32 dtype, contiguous fast path.
baracuda_kernels_unary_frac_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_frac_f32_strided.
baracuda_kernels_unary_frac_f32_strided_run^⚠: Unary elementwise frac, f32 dtype, strided path.
baracuda_kernels_unary_frac_f64_can_implement^⚠: Pre-launch implementability check for unary_frac_f64.
baracuda_kernels_unary_frac_f64_run^⚠: Unary elementwise frac, f64 dtype, contiguous fast path.
baracuda_kernels_unary_frac_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_frac_f64_strided.
baracuda_kernels_unary_frac_f64_strided_run^⚠: Unary elementwise frac, f64 dtype, strided path.
baracuda_kernels_unary_gelu_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_gelu_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_gelu_backward_bf16_run^⚠: GELU (erf-based) backward, bf16.
baracuda_kernels_unary_gelu_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_gelu_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_gelu_backward_f16_run^⚠: GELU (erf-based) backward, f16.
baracuda_kernels_unary_gelu_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_gelu_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_gelu_backward_f32_run^⚠: GELU (exact / erf-based) backward, f32. dx = dy * (Φ(x) + x*φ(x)). Saved-x.
baracuda_kernels_unary_gelu_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_gelu_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_gelu_backward_f64_run^⚠: GELU (erf-based) backward, f64.
baracuda_kernels_unary_gelu_bf16_can_implement^⚠: Pre-launch implementability check for unary_gelu_bf16.
baracuda_kernels_unary_gelu_bf16_run^⚠: Unary elementwise gelu, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_gelu_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_gelu_bf16_strided.
baracuda_kernels_unary_gelu_bf16_strided_run^⚠: Unary elementwise gelu, bf16 dtype, strided path.
baracuda_kernels_unary_gelu_erf_bf16_can_implement^⚠: Pre-launch implementability check for unary_gelu_erf_bf16.
baracuda_kernels_unary_gelu_erf_bf16_run^⚠: unary_gelu_erf, bf16, contig.
baracuda_kernels_unary_gelu_erf_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_gelu_erf_bf16_strided.
baracuda_kernels_unary_gelu_erf_bf16_strided_run^⚠: unary_gelu_erf, bf16, strided.
baracuda_kernels_unary_gelu_erf_f16_can_implement^⚠: Pre-launch implementability check for unary_gelu_erf_f16.
baracuda_kernels_unary_gelu_erf_f16_run^⚠: unary_gelu_erf, f16, contig.
baracuda_kernels_unary_gelu_erf_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_gelu_erf_f16_strided.
baracuda_kernels_unary_gelu_erf_f16_strided_run^⚠: unary_gelu_erf, f16, strided.
baracuda_kernels_unary_gelu_erf_f32_can_implement^⚠: Pre-launch implementability check for unary_gelu_erf_f32.
baracuda_kernels_unary_gelu_erf_f32_run^⚠: unary_gelu_erf, f32, contig.
baracuda_kernels_unary_gelu_erf_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_gelu_erf_f32_strided.
baracuda_kernels_unary_gelu_erf_f32_strided_run^⚠: unary_gelu_erf, f32, strided.
baracuda_kernels_unary_gelu_erf_f64_can_implement^⚠: Pre-launch implementability check for unary_gelu_erf_f64.
baracuda_kernels_unary_gelu_erf_f64_run^⚠: unary_gelu_erf, f64, contig.
baracuda_kernels_unary_gelu_erf_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_gelu_erf_f64_strided.
baracuda_kernels_unary_gelu_erf_f64_strided_run^⚠: unary_gelu_erf, f64, strided.
baracuda_kernels_unary_gelu_f16_can_implement^⚠: Pre-launch implementability check for unary_gelu_f16.
baracuda_kernels_unary_gelu_f16_run^⚠: Unary elementwise gelu, f16 dtype, contiguous fast path.
baracuda_kernels_unary_gelu_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_gelu_f16_strided.
baracuda_kernels_unary_gelu_f16_strided_run^⚠: Unary elementwise gelu, f16 dtype, strided path.
baracuda_kernels_unary_gelu_f32_can_implement^⚠: Pre-launch implementability check for unary_gelu_f32.
baracuda_kernels_unary_gelu_f32_run^⚠: Unary elementwise gelu, f32 dtype, contiguous fast path.
baracuda_kernels_unary_gelu_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_gelu_f32_strided.
baracuda_kernels_unary_gelu_f32_strided_run^⚠: Unary elementwise gelu, f32 dtype, strided path.
baracuda_kernels_unary_gelu_f64_can_implement^⚠: Pre-launch implementability check for unary_gelu_f64.
baracuda_kernels_unary_gelu_f64_run^⚠: Unary elementwise gelu, f64 dtype, contiguous fast path.
baracuda_kernels_unary_gelu_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_gelu_f64_strided.
baracuda_kernels_unary_gelu_f64_strided_run^⚠: Unary elementwise gelu, f64 dtype, strided path.
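The exact GELU backward formula listed above, dx = dy * (Φ(x) + x*φ(x)), uses the standard normal CDF Φ and PDF φ. A host-side sketch follows. Rust's standard library has no stable `erf`, so this sketch substitutes the Abramowitz and Stegun 7.1.26 polynomial approximation (absolute error around 1.5e-7); that substitution and the names `erf_approx` and `gelu_backward_ref` are assumptions for illustration and say nothing about how the device kernels evaluate erf.

```rust
use std::f64::consts::{PI, SQRT_2};

/// Approximate erf via Abramowitz & Stegun 7.1.26 (assumed stand-in,
/// absolute error about 1.5e-7). Odd symmetry handles negative inputs.
fn erf_approx(x: f64) -> f64 {
    let sign = if x < 0.0 { -1.0 } else { 1.0 };
    let ax = x.abs();
    let t = 1.0 / (1.0 + 0.3275911 * ax);
    // Horner evaluation of a1*t + a2*t^2 + ... + a5*t^5.
    let poly = t
        * (0.254829592
            + t * (-0.284496736
                + t * (1.421413741 + t * (-1.453152027 + t * 1.061405429))));
    sign * (1.0 - poly * (-(ax * ax)).exp())
}

/// Hypothetical CPU reference for exact GELU backward with saved input x:
/// dx = dy * (Phi(x) + x * phi(x)).
fn gelu_backward_ref(dy: f64, x: f64) -> f64 {
    let cdf = 0.5 * (1.0 + erf_approx(x / SQRT_2)); // Phi(x)
    let pdf = (-0.5 * x * x).exp() / (2.0 * PI).sqrt(); // phi(x)
    dy * (cdf + x * pdf)
}
```

Because of the approximation, comparisons against device output should use a tolerance no tighter than about 1e-6, and looser for the f16 and bf16 variants.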
baracuda_kernels_unary_gelu_tanh_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_gelu_tanh_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_gelu_tanh_backward_bf16_run^⚠: GELU (tanh approximation) backward, bf16.
baracuda_kernels_unary_gelu_tanh_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_gelu_tanh_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_gelu_tanh_backward_f16_run^⚠: GELU (tanh approximation) backward, f16.
baracuda_kernels_unary_gelu_tanh_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_gelu_tanh_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_gelu_tanh_backward_f32_run^⚠: GELU (tanh approximation) backward, f32. Saved-x.
baracuda_kernels_unary_gelu_tanh_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_gelu_tanh_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_gelu_tanh_backward_f64_run^⚠: GELU (tanh approximation) backward, f64.
baracuda_kernels_unary_gelu_tanh_bf16_can_implement^⚠: Pre-launch implementability check for unary_gelu_tanh_bf16.
baracuda_kernels_unary_gelu_tanh_bf16_run^⚠: Unary elementwise gelu_tanh, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_gelu_tanh_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_gelu_tanh_bf16_strided.
baracuda_kernels_unary_gelu_tanh_bf16_strided_run^⚠: Unary elementwise gelu_tanh, bf16 dtype, strided path.
baracuda_kernels_unary_gelu_tanh_f16_can_implement^⚠: Pre-launch implementability check for unary_gelu_tanh_f16.
baracuda_kernels_unary_gelu_tanh_f16_run^⚠: Unary elementwise gelu_tanh, f16 dtype, contiguous fast path.
baracuda_kernels_unary_gelu_tanh_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_gelu_tanh_f16_strided.
baracuda_kernels_unary_gelu_tanh_f16_strided_run^⚠: Unary elementwise gelu_tanh, f16 dtype, strided path.
baracuda_kernels_unary_gelu_tanh_f32_can_implement^⚠: Pre-launch implementability check for unary_gelu_tanh_f32.
baracuda_kernels_unary_gelu_tanh_f32_run^⚠: Unary elementwise gelu_tanh, f32 dtype, contiguous fast path.
baracuda_kernels_unary_gelu_tanh_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_gelu_tanh_f32_strided.
baracuda_kernels_unary_gelu_tanh_f32_strided_run^⚠: Unary elementwise gelu_tanh, f32 dtype, strided path.
baracuda_kernels_unary_gelu_tanh_f64_can_implement^⚠: Pre-launch implementability check for unary_gelu_tanh_f64.
baracuda_kernels_unary_gelu_tanh_f64_run^⚠: Unary elementwise gelu_tanh, f64 dtype, contiguous fast path.
baracuda_kernels_unary_gelu_tanh_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_gelu_tanh_f64_strided.
baracuda_kernels_unary_gelu_tanh_f64_strided_run^⚠: Unary elementwise gelu_tanh, f64 dtype, strided path.
baracuda_kernels_unary_hardshrink_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_hardshrink_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardshrink_backward_bf16_run^⚠: Hardshrink backward, bf16.
baracuda_kernels_unary_hardshrink_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_hardshrink_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardshrink_backward_f16_run^⚠: Hardshrink backward, f16.
baracuda_kernels_unary_hardshrink_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_hardshrink_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardshrink_backward_f32_run^⚠: Hardshrink backward, f32. dx = (|x| > λ) ? dy : 0 with λ=0.5. Saved-x.
baracuda_kernels_unary_hardshrink_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_hardshrink_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardshrink_backward_f64_run^⚠: Hardshrink backward, f64.
baracuda_kernels_unary_hardshrink_bf16_can_implement^⚠: Pre-launch implementability check for unary_hardshrink_bf16.
baracuda_kernels_unary_hardshrink_bf16_run^⚠: Unary elementwise hardshrink (λ=0.5), bf16, contig.
baracuda_kernels_unary_hardshrink_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_hardshrink_bf16_strided.
baracuda_kernels_unary_hardshrink_bf16_strided_run^⚠: Unary elementwise hardshrink (λ=0.5), bf16, strided.
baracuda_kernels_unary_hardshrink_f16_can_implement^⚠: Pre-launch implementability check for unary_hardshrink_f16.
baracuda_kernels_unary_hardshrink_f16_run^⚠: Unary elementwise hardshrink (λ=0.5), f16, contig.
baracuda_kernels_unary_hardshrink_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_hardshrink_f16_strided.
baracuda_kernels_unary_hardshrink_f16_strided_run^⚠: Unary elementwise hardshrink (λ=0.5), f16, strided.
baracuda_kernels_unary_hardshrink_f32_can_implement^⚠: Pre-launch implementability check for unary_hardshrink_f32.
baracuda_kernels_unary_hardshrink_f32_run^⚠: Unary elementwise hardshrink (λ=0.5), f32, contig.
baracuda_kernels_unary_hardshrink_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_hardshrink_f32_strided.
baracuda_kernels_unary_hardshrink_f32_strided_run^⚠: Unary elementwise hardshrink (λ=0.5), f32, strided.
baracuda_kernels_unary_hardshrink_f64_can_implement^⚠: Pre-launch implementability check for unary_hardshrink_f64.
baracuda_kernels_unary_hardshrink_f64_run^⚠: Unary elementwise hardshrink (λ=0.5), f64, contig.
baracuda_kernels_unary_hardshrink_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_hardshrink_f64_strided.
baracuda_kernels_unary_hardshrink_f64_strided_run^⚠: Unary elementwise hardshrink (λ=0.5), f64, strided.
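For the hardshrink family above, the backward mask dx = (|x| > λ) ? dy : 0 with λ = 0.5 is taken from the f32 backward entry. The forward definition used below (x if |x| > λ, else 0) is the conventional one and is an assumption here, since the listing does not spell it out. Function names are hypothetical.

```rust
/// Threshold documented for these kernels.
const LAMBDA: f32 = 0.5;

/// Hypothetical CPU reference for hardshrink forward (conventional
/// definition, assumed): y = x if |x| > lambda, else 0.
fn hardshrink_ref(x: f32) -> f32 {
    if x.abs() > LAMBDA { x } else { 0.0 }
}

/// Hypothetical CPU reference for hardshrink backward with saved input x:
/// dx = dy if |x| > lambda, else 0.
fn hardshrink_backward_ref(dy: f32, x: f32) -> f32 {
    if x.abs() > LAMBDA { dy } else { 0.0 }
}
```

Note the strict inequality: an input exactly at ±0.5 maps to zero in both directions.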
baracuda_kernels_unary_hardsigmoid_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_hardsigmoid_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardsigmoid_backward_bf16_run^⚠: Hardsigmoid backward, bf16.
baracuda_kernels_unary_hardsigmoid_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_hardsigmoid_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardsigmoid_backward_f16_run^⚠: Hardsigmoid backward, f16.
baracuda_kernels_unary_hardsigmoid_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_hardsigmoid_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardsigmoid_backward_f32_run^⚠: Hardsigmoid backward, f32. dx = (-3 < x < 3) ? dy / 6 : 0. Saved-x.
baracuda_kernels_unary_hardsigmoid_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_hardsigmoid_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardsigmoid_backward_f64_run^⚠: Hardsigmoid backward, f64.
baracuda_kernels_unary_hardsigmoid_bf16_can_implement^⚠: Pre-launch implementability check for unary_hardsigmoid_bf16.
baracuda_kernels_unary_hardsigmoid_bf16_run^⚠: Unary elementwise hardsigmoid, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_hardsigmoid_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_hardsigmoid_bf16_strided.
baracuda_kernels_unary_hardsigmoid_bf16_strided_run^⚠: Unary elementwise hardsigmoid, bf16 dtype, strided path.
baracuda_kernels_unary_hardsigmoid_f16_can_implement^⚠: Pre-launch implementability check for unary_hardsigmoid_f16.
baracuda_kernels_unary_hardsigmoid_f16_run^⚠: Unary elementwise hardsigmoid, f16 dtype, contiguous fast path.
baracuda_kernels_unary_hardsigmoid_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_hardsigmoid_f16_strided.
baracuda_kernels_unary_hardsigmoid_f16_strided_run^⚠: Unary elementwise hardsigmoid, f16 dtype, strided path.
baracuda_kernels_unary_hardsigmoid_f32_can_implement^⚠: Pre-launch implementability check for unary_hardsigmoid_f32.
baracuda_kernels_unary_hardsigmoid_f32_run^⚠: Unary elementwise hardsigmoid, f32 dtype, contiguous fast path.
baracuda_kernels_unary_hardsigmoid_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_hardsigmoid_f32_strided.
baracuda_kernels_unary_hardsigmoid_f32_strided_run^⚠: Unary elementwise hardsigmoid, f32 dtype, strided path.
baracuda_kernels_unary_hardsigmoid_f64_can_implement^⚠: Pre-launch implementability check for unary_hardsigmoid_f64.
baracuda_kernels_unary_hardsigmoid_f64_run^⚠: Unary elementwise hardsigmoid, f64 dtype, contiguous fast path.
baracuda_kernels_unary_hardsigmoid_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_hardsigmoid_f64_strided.
baracuda_kernels_unary_hardsigmoid_f64_strided_run^⚠: Unary elementwise hardsigmoid, f64 dtype, strided path.
baracuda_kernels_unary_hardswish_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_hardswish_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardswish_backward_bf16_run^⚠: Hardswish backward, bf16.
baracuda_kernels_unary_hardswish_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_hardswish_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardswish_backward_f16_run^⚠: Hardswish backward, f16.
baracuda_kernels_unary_hardswish_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_hardswish_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardswish_backward_f32_run^⚠: Hardswish backward, f32. Three-region piecewise: dx = 0 below -3, dx = dy · (2x+3)/6 in the middle region, dx = dy above 3. Saved-x.
baracuda_kernels_unary_hardswish_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_hardswish_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardswish_backward_f64_run^⚠: Hardswish backward, f64.
baracuda_kernels_unary_hardswish_bf16_can_implement^⚠: Pre-launch implementability check for unary_hardswish_bf16.
baracuda_kernels_unary_hardswish_bf16_run^⚠: Unary elementwise hardswish, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_hardswish_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_hardswish_bf16_strided.
baracuda_kernels_unary_hardswish_bf16_strided_run^⚠: Unary elementwise hardswish, bf16 dtype, strided path.
baracuda_kernels_unary_hardswish_f16_can_implement^⚠: Pre-launch implementability check for unary_hardswish_f16.
baracuda_kernels_unary_hardswish_f16_run^⚠: Unary elementwise hardswish, f16 dtype, contiguous fast path.
baracuda_kernels_unary_hardswish_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_hardswish_f16_strided.
baracuda_kernels_unary_hardswish_f16_strided_run^⚠: Unary elementwise hardswish, f16 dtype, strided path.
baracuda_kernels_unary_hardswish_f32_can_implement^⚠: Pre-launch implementability check for unary_hardswish_f32.
baracuda_kernels_unary_hardswish_f32_run^⚠: Unary elementwise hardswish, f32 dtype, contiguous fast path.
baracuda_kernels_unary_hardswish_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_hardswish_f32_strided.
baracuda_kernels_unary_hardswish_f32_strided_run^⚠: Unary elementwise hardswish, f32 dtype, strided path.
baracuda_kernels_unary_hardswish_f64_can_implement^⚠: Pre-launch implementability check for unary_hardswish_f64.
baracuda_kernels_unary_hardswish_f64_run^⚠: Unary elementwise hardswish, f64 dtype, contiguous fast path.
baracuda_kernels_unary_hardswish_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_hardswish_f64_strided.
baracuda_kernels_unary_hardswish_f64_strided_run^⚠: Unary elementwise hardswish, f64 dtype, strided path.
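The three-region hardswish gradient listed for unary_hardswish_backward_f32_run can be sketched as a host-side reference, for example to check device output in a test. The function names hardswish and hardswish_backward_ref are hypothetical and not part of this crate. Which region the boundary points x = ±3 belong to is an assumption here; the entries above do not specify it.

```rust
/// Host-side hardswish: x * clamp(x + 3, 0, 6) / 6.
/// Hypothetical reference helper, not part of this crate.
fn hardswish(x: f64) -> f64 {
    x * (x + 3.0).clamp(0.0, 6.0) / 6.0
}

/// Host-side hardswish gradient using the saved input x.
/// Assumption: the boundary points x = -3 and x = 3 fall in the outer regions.
fn hardswish_backward_ref(x: f64, dy: f64) -> f64 {
    if x <= -3.0 {
        0.0 // flat region: no gradient flows
    } else if x >= 3.0 {
        dy // identity region: gradient passes through unchanged
    } else {
        dy * (2.0 * x + 3.0) / 6.0 // middle region
    }
}
```

A central finite difference of hardswish at an interior point, such as x = 1, should agree with hardswish_backward_ref(1.0, 1.0) = 5/6 up to rounding.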
baracuda_kernels_unary_hardtanh_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_hardtanh_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardtanh_backward_bf16_run^⚠: Hardtanh backward, bf16.
baracuda_kernels_unary_hardtanh_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_hardtanh_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardtanh_backward_f16_run^⚠: Hardtanh backward, f16.
baracuda_kernels_unary_hardtanh_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_hardtanh_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardtanh_backward_f32_run^⚠: Hardtanh backward, f32. dx = (-1 < x < 1) ? dy : 0. Saved-x.
baracuda_kernels_unary_hardtanh_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_hardtanh_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_hardtanh_backward_f64_run^⚠: Hardtanh backward, f64.
baracuda_kernels_unary_hardtanh_bf16_can_implement^⚠: Pre-launch implementability check for unary_hardtanh_bf16.
baracuda_kernels_unary_hardtanh_bf16_run^⚠: Unary elementwise hardtanh, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_hardtanh_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_hardtanh_bf16_strided.
baracuda_kernels_unary_hardtanh_bf16_strided_run^⚠: Unary elementwise hardtanh, bf16 dtype, strided path.
baracuda_kernels_unary_hardtanh_f16_can_implement^⚠: Pre-launch implementability check for unary_hardtanh_f16.
baracuda_kernels_unary_hardtanh_f16_run^⚠: Unary elementwise hardtanh, f16 dtype, contiguous fast path.
baracuda_kernels_unary_hardtanh_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_hardtanh_f16_strided.
baracuda_kernels_unary_hardtanh_f16_strided_run^⚠: Unary elementwise hardtanh, f16 dtype, strided path.
baracuda_kernels_unary_hardtanh_f32_can_implement^⚠: Pre-launch implementability check for unary_hardtanh_f32.
baracuda_kernels_unary_hardtanh_f32_run^⚠: Unary elementwise hardtanh, f32 dtype, contiguous fast path.
baracuda_kernels_unary_hardtanh_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_hardtanh_f32_strided.
baracuda_kernels_unary_hardtanh_f32_strided_run^⚠: Unary elementwise hardtanh, f32 dtype, strided path.
baracuda_kernels_unary_hardtanh_f64_can_implement^⚠: Pre-launch implementability check for unary_hardtanh_f64.
baracuda_kernels_unary_hardtanh_f64_run^⚠: Unary elementwise hardtanh, f64 dtype, contiguous fast path.
baracuda_kernels_unary_hardtanh_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_hardtanh_f64_strided.
baracuda_kernels_unary_hardtanh_f64_strided_run^⚠: Unary elementwise hardtanh, f64 dtype, strided path.
baracuda_kernels_unary_leaky_relu_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_leaky_relu_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_leaky_relu_backward_bf16_run^⚠: LeakyReLU backward, bf16.
baracuda_kernels_unary_leaky_relu_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_leaky_relu_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_leaky_relu_backward_f16_run^⚠: LeakyReLU backward, f16.
baracuda_kernels_unary_leaky_relu_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_leaky_relu_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_leaky_relu_backward_f32_run^⚠: LeakyReLU backward, f32. dx = (x > 0) ? dy : dy·α with α=0.01. Saved-x.
baracuda_kernels_unary_leaky_relu_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_leaky_relu_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_leaky_relu_backward_f64_run^⚠: LeakyReLU backward, f64.
baracuda_kernels_unary_leaky_relu_bf16_can_implement^⚠: Pre-launch implementability check for unary_leaky_relu_bf16.
baracuda_kernels_unary_leaky_relu_bf16_run^⚠: Unary elementwise leaky_relu (α=0.01), bf16, contig.
baracuda_kernels_unary_leaky_relu_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_leaky_relu_bf16_strided.
baracuda_kernels_unary_leaky_relu_bf16_strided_run^⚠: Unary elementwise leaky_relu (α=0.01), bf16, strided.
baracuda_kernels_unary_leaky_relu_f16_can_implement^⚠: Pre-launch implementability check for unary_leaky_relu_f16.
baracuda_kernels_unary_leaky_relu_f16_run^⚠: Unary elementwise leaky_relu (α=0.01), f16, contig.
baracuda_kernels_unary_leaky_relu_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_leaky_relu_f16_strided.
baracuda_kernels_unary_leaky_relu_f16_strided_run^⚠: Unary elementwise leaky_relu (α=0.01), f16, strided.
baracuda_kernels_unary_leaky_relu_f32_can_implement^⚠: Pre-launch implementability check for unary_leaky_relu_f32.
baracuda_kernels_unary_leaky_relu_f32_run^⚠: Unary elementwise leaky_relu (α=0.01), f32, contig.
baracuda_kernels_unary_leaky_relu_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_leaky_relu_f32_strided.
baracuda_kernels_unary_leaky_relu_f32_strided_run^⚠: Unary elementwise leaky_relu (α=0.01), f32, strided.
baracuda_kernels_unary_leaky_relu_f64_can_implement^⚠: Pre-launch implementability check for unary_leaky_relu_f64.
baracuda_kernels_unary_leaky_relu_f64_run^⚠: Unary elementwise leaky_relu (α=0.01), f64, contig.
baracuda_kernels_unary_leaky_relu_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_leaky_relu_f64_strided.
baracuda_kernels_unary_leaky_relu_f64_strided_run^⚠: Unary elementwise leaky_relu (α=0.01), f64, strided.
baracuda_kernels_unary_lgamma_bf16_can_implement^⚠: Pre-launch implementability check for unary_lgamma_bf16.
baracuda_kernels_unary_lgamma_bf16_run^⚠: Unary elementwise lgamma, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_lgamma_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_lgamma_bf16_strided.
baracuda_kernels_unary_lgamma_bf16_strided_run^⚠: Unary elementwise lgamma, bf16 dtype, strided path.
baracuda_kernels_unary_lgamma_f16_can_implement^⚠: Pre-launch implementability check for unary_lgamma_f16.
baracuda_kernels_unary_lgamma_f16_run^⚠: Unary elementwise lgamma, f16 dtype, contiguous fast path.
baracuda_kernels_unary_lgamma_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_lgamma_f16_strided.
baracuda_kernels_unary_lgamma_f16_strided_run^⚠: Unary elementwise lgamma, f16 dtype, strided path.
baracuda_kernels_unary_lgamma_f32_can_implement^⚠: Pre-launch implementability check for unary_lgamma_f32.
baracuda_kernels_unary_lgamma_f32_run^⚠: Unary elementwise lgamma, f32 dtype, contiguous fast path.
baracuda_kernels_unary_lgamma_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_lgamma_f32_strided.
baracuda_kernels_unary_lgamma_f32_strided_run^⚠: Unary elementwise lgamma, f32 dtype, strided path.
baracuda_kernels_unary_lgamma_f64_can_implement^⚠: Pre-launch implementability check for unary_lgamma_f64.
baracuda_kernels_unary_lgamma_f64_run^⚠: Unary elementwise lgamma, f64 dtype, contiguous fast path.
baracuda_kernels_unary_lgamma_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_lgamma_f64_strided.
baracuda_kernels_unary_lgamma_f64_strided_run^⚠: Unary elementwise lgamma, f64 dtype, strided path.
baracuda_kernels_unary_log1p_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_log1p_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_log1p_backward_bf16_run^⚠: Log1p backward, bf16.
baracuda_kernels_unary_log1p_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_log1p_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_log1p_backward_f16_run^⚠: Log1p backward, f16.
baracuda_kernels_unary_log1p_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_log1p_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_log1p_backward_f32_run^⚠: Log1p backward, f32. dx = dy / (1 + x).
baracuda_kernels_unary_log1p_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_log1p_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_log1p_backward_f64_run^⚠: Log1p backward, f64.
baracuda_kernels_unary_log1p_bf16_can_implement^⚠: Pre-launch implementability check for unary_log1p_bf16.
baracuda_kernels_unary_log1p_bf16_run^⚠: Unary elementwise log1p, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_log1p_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_log1p_bf16_strided.
baracuda_kernels_unary_log1p_bf16_strided_run^⚠: Unary elementwise log1p, bf16 dtype, strided path.
baracuda_kernels_unary_log1p_f16_can_implement^⚠: Pre-launch implementability check for unary_log1p_f16.
baracuda_kernels_unary_log1p_f16_run^⚠: Unary elementwise log1p, f16 dtype, contiguous fast path.
baracuda_kernels_unary_log1p_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_log1p_f16_strided.
baracuda_kernels_unary_log1p_f16_strided_run^⚠: Unary elementwise log1p, f16 dtype, strided path.
baracuda_kernels_unary_log1p_f32_can_implement^⚠: Pre-launch implementability check for unary_log1p_f32.
baracuda_kernels_unary_log1p_f32_run^⚠: Unary elementwise log1p, f32 dtype, contiguous fast path.
baracuda_kernels_unary_log1p_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_log1p_f32_strided.
baracuda_kernels_unary_log1p_f32_strided_run^⚠: Unary elementwise log1p, f32 dtype, strided path.
baracuda_kernels_unary_log1p_f64_can_implement^⚠: Pre-launch implementability check for unary_log1p_f64.
baracuda_kernels_unary_log1p_f64_run^⚠: Unary elementwise log1p, f64 dtype, contiguous fast path.
baracuda_kernels_unary_log1p_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_log1p_f64_strided.
baracuda_kernels_unary_log1p_f64_strided_run^⚠: Unary elementwise log1p, f64 dtype, strided path.
baracuda_kernels_unary_log2_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_log2_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_log2_backward_bf16_run^⚠: Log2 backward, bf16.
baracuda_kernels_unary_log2_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_log2_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_log2_backward_f16_run^⚠: Log2 backward, f16.
baracuda_kernels_unary_log2_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_log2_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_log2_backward_f32_run^⚠: Log2 backward, f32. dx = dy / (x * ln(2)).
baracuda_kernels_unary_log2_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_log2_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_log2_backward_f64_run^⚠: Log2 backward, f64.
baracuda_kernels_unary_log2_bf16_can_implement^⚠: Pre-launch implementability check for unary_log2_bf16.
baracuda_kernels_unary_log2_bf16_run^⚠: Unary elementwise log2, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_log2_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_log2_bf16_strided.
baracuda_kernels_unary_log2_bf16_strided_run^⚠: Unary elementwise log2, bf16 dtype, strided path.
baracuda_kernels_unary_log2_f16_can_implement^⚠: Pre-launch implementability check for unary_log2_f16.
baracuda_kernels_unary_log2_f16_run^⚠: Unary elementwise log2, f16 dtype, contiguous fast path.
baracuda_kernels_unary_log2_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_log2_f16_strided.
baracuda_kernels_unary_log2_f16_strided_run^⚠: Unary elementwise log2, f16 dtype, strided path.
baracuda_kernels_unary_log2_f32_can_implement^⚠: Pre-launch implementability check for unary_log2_f32.
baracuda_kernels_unary_log2_f32_run^⚠: Unary elementwise log2, f32 dtype, contiguous fast path.
baracuda_kernels_unary_log2_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_log2_f32_strided.
baracuda_kernels_unary_log2_f32_strided_run^⚠: Unary elementwise log2, f32 dtype, strided path.
baracuda_kernels_unary_log2_f64_can_implement^⚠: Pre-launch implementability check for unary_log2_f64.
baracuda_kernels_unary_log2_f64_run^⚠: Unary elementwise log2, f64 dtype, contiguous fast path.
baracuda_kernels_unary_log2_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_log2_f64_strided.
baracuda_kernels_unary_log2_f64_strided_run^⚠: Unary elementwise log2, f64 dtype, strided path.
baracuda_kernels_unary_log10_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_log10_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_log10_backward_bf16_run^⚠: Log10 backward, bf16.
baracuda_kernels_unary_log10_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_log10_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_log10_backward_f16_run^⚠: Log10 backward, f16.
baracuda_kernels_unary_log10_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_log10_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_log10_backward_f32_run^⚠: Log10 backward, f32. dx = dy / (x * ln(10)).
baracuda_kernels_unary_log10_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_log10_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_log10_backward_f64_run^⚠: Log10 backward, f64.
baracuda_kernels_unary_log10_bf16_can_implement^⚠: Pre-launch implementability check for unary_log10_bf16.
baracuda_kernels_unary_log10_bf16_run^⚠: Unary elementwise log10, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_log10_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_log10_bf16_strided.
baracuda_kernels_unary_log10_bf16_strided_run^⚠: Unary elementwise log10, bf16 dtype, strided path.
baracuda_kernels_unary_log10_f16_can_implement^⚠: Pre-launch implementability check for unary_log10_f16.
baracuda_kernels_unary_log10_f16_run^⚠: Unary elementwise log10, f16 dtype, contiguous fast path.
baracuda_kernels_unary_log10_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_log10_f16_strided.
baracuda_kernels_unary_log10_f16_strided_run^⚠: Unary elementwise log10, f16 dtype, strided path.
baracuda_kernels_unary_log10_f32_can_implement^⚠: Pre-launch implementability check for unary_log10_f32.
baracuda_kernels_unary_log10_f32_run^⚠: Unary elementwise log10, f32 dtype, contiguous fast path.
baracuda_kernels_unary_log10_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_log10_f32_strided.
baracuda_kernels_unary_log10_f32_strided_run^⚠: Unary elementwise log10, f32 dtype, strided path.
baracuda_kernels_unary_log10_f64_can_implement^⚠: Pre-launch implementability check for unary_log10_f64.
baracuda_kernels_unary_log10_f64_run^⚠: Unary elementwise log10, f64 dtype, contiguous fast path.
baracuda_kernels_unary_log10_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_log10_f64_strided.
baracuda_kernels_unary_log10_f64_strided_run^⚠: Unary elementwise log10, f64 dtype, strided path.
baracuda_kernels_unary_log_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_log_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_log_backward_bf16_run^⚠: Log backward, bf16.
baracuda_kernels_unary_log_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_log_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_log_backward_f16_run^⚠: Log backward, f16.
baracuda_kernels_unary_log_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_log_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_log_backward_f32_run^⚠: Log backward, f32. dx = dy / x.
baracuda_kernels_unary_log_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_log_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_log_backward_f64_run^⚠: Log backward, f64.
baracuda_kernels_unary_log_bf16_can_implement^⚠: Pre-launch implementability check for unary_log_bf16.
baracuda_kernels_unary_log_bf16_run^⚠: Unary elementwise log, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_log_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_log_bf16_strided.
baracuda_kernels_unary_log_bf16_strided_run^⚠: Unary elementwise log, bf16 dtype, strided path.
baracuda_kernels_unary_log_f16_can_implement^⚠: Pre-launch implementability check for unary_log_f16.
baracuda_kernels_unary_log_f16_run^⚠: Unary elementwise log, f16 dtype, contiguous fast path.
baracuda_kernels_unary_log_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_log_f16_strided.
baracuda_kernels_unary_log_f16_strided_run^⚠: Unary elementwise log, f16 dtype, strided path.
baracuda_kernels_unary_log_f32_can_implement^⚠: Pre-launch implementability check for unary_log_f32.
baracuda_kernels_unary_log_f32_run^⚠: Unary elementwise log, f32 dtype, contiguous fast path.
baracuda_kernels_unary_log_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_log_f32_strided.
baracuda_kernels_unary_log_f32_strided_run^⚠: Unary elementwise log, f32 dtype, strided path.
baracuda_kernels_unary_log_f64_can_implement^⚠: Pre-launch implementability check for unary_log_f64.
baracuda_kernels_unary_log_f64_run^⚠: Unary elementwise log, f64 dtype, contiguous fast path.
baracuda_kernels_unary_log_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_log_f64_strided.
baracuda_kernels_unary_log_f64_strided_run^⚠: Unary elementwise log, f64 dtype, strided path.
baracuda_kernels_unary_logit_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_logit_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_logit_backward_bf16_run^⚠: Logit backward, bf16.
baracuda_kernels_unary_logit_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_logit_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_logit_backward_f16_run^⚠: Logit backward, f16.
baracuda_kernels_unary_logit_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_logit_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_logit_backward_f32_run^⚠: Logit backward, f32. dx = dy / (x * (1 - x)). Domain 0 < x < 1.
baracuda_kernels_unary_logit_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_logit_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_logit_backward_f64_run^⚠: Logit backward, f64.
baracuda_kernels_unary_logit_bf16_can_implement^⚠: Pre-launch implementability check for unary_logit_bf16.
baracuda_kernels_unary_logit_bf16_run^⚠: Unary elementwise logit, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_logit_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_logit_bf16_strided.
baracuda_kernels_unary_logit_bf16_strided_run^⚠: Unary elementwise logit, bf16 dtype, strided path.
baracuda_kernels_unary_logit_f16_can_implement^⚠: Pre-launch implementability check for unary_logit_f16.
baracuda_kernels_unary_logit_f16_run^⚠: Unary elementwise logit, f16 dtype, contiguous fast path.
baracuda_kernels_unary_logit_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_logit_f16_strided.
baracuda_kernels_unary_logit_f16_strided_run^⚠: Unary elementwise logit, f16 dtype, strided path.
baracuda_kernels_unary_logit_f32_can_implement^⚠: Pre-launch implementability check for unary_logit_f32.
baracuda_kernels_unary_logit_f32_run^⚠: Unary elementwise logit, f32 dtype, contiguous fast path.
baracuda_kernels_unary_logit_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_logit_f32_strided.
baracuda_kernels_unary_logit_f32_strided_run^⚠: Unary elementwise logit, f32 dtype, strided path.
baracuda_kernels_unary_logit_f64_can_implement^⚠: Pre-launch implementability check for unary_logit_f64.
baracuda_kernels_unary_logit_f64_run^⚠: Unary elementwise logit, f64 dtype, contiguous fast path.
baracuda_kernels_unary_logit_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_logit_f64_strided.
baracuda_kernels_unary_logit_f64_strided_run^⚠: Unary elementwise logit, f64 dtype, strided path.
baracuda_kernels_unary_mish_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_mish_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_mish_backward_bf16_run^⚠: Mish backward, bf16.
baracuda_kernels_unary_mish_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_mish_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_mish_backward_f16_run^⚠: Mish backward, f16.
baracuda_kernels_unary_mish_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_mish_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_mish_backward_f32_run^⚠: Mish backward, f32. dx = dy * (tanh(sp) + x*s*(1 - tanh(sp)^2)), where sp = softplus(x) and s = sigmoid(x). Saved-x.
baracuda_kernels_unary_mish_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_mish_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_mish_backward_f64_run^⚠: Mish backward, f64.
baracuda_kernels_unary_mish_bf16_can_implement^⚠: Pre-launch implementability check for unary_mish_bf16.
baracuda_kernels_unary_mish_bf16_run^⚠: Unary elementwise mish, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_mish_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_mish_bf16_strided.
baracuda_kernels_unary_mish_bf16_strided_run^⚠: Unary elementwise mish, bf16 dtype, strided path.
baracuda_kernels_unary_mish_f16_can_implement^⚠: Pre-launch implementability check for unary_mish_f16.
baracuda_kernels_unary_mish_f16_run^⚠: Unary elementwise mish, f16 dtype, contiguous fast path.
baracuda_kernels_unary_mish_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_mish_f16_strided.
baracuda_kernels_unary_mish_f16_strided_run^⚠: Unary elementwise mish, f16 dtype, strided path.
baracuda_kernels_unary_mish_f32_can_implement^⚠: Pre-launch implementability check for unary_mish_f32.
baracuda_kernels_unary_mish_f32_run^⚠: Unary elementwise mish, f32 dtype, contiguous fast path.
baracuda_kernels_unary_mish_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_mish_f32_strided.
baracuda_kernels_unary_mish_f32_strided_run^⚠: Unary elementwise mish, f32 dtype, strided path.
baracuda_kernels_unary_mish_f64_can_implement^⚠: Pre-launch implementability check for unary_mish_f64.
baracuda_kernels_unary_mish_f64_run^⚠: Unary elementwise mish, f64 dtype, contiguous fast path.
baracuda_kernels_unary_mish_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_mish_f64_strided.
baracuda_kernels_unary_mish_f64_strided_run^⚠: Unary elementwise mish, f64 dtype, strided path.
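The mish gradient formula listed for unary_mish_backward_f32_run can be sketched as a host-side reference. This sketch reads s as sigmoid(x), which is the derivative of softplus. All function names are hypothetical and not part of this crate. The arithmetic is done in f64 for clarity, so it makes no claim about the precision or rounding of the device kernels.

```rust
/// Numerically stable softplus: ln(1 + e^x).
/// Hypothetical reference helper, not part of this crate.
fn softplus(x: f64) -> f64 {
    // max(x, 0) + ln(1 + e^{-|x|}) avoids overflow for large |x|.
    x.max(0.0) + (-x.abs()).exp().ln_1p()
}

/// Logistic sigmoid, which is the derivative of softplus.
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

/// Forward mish: x * tanh(softplus(x)).
fn mish(x: f64) -> f64 {
    x * softplus(x).tanh()
}

/// Host-side mish gradient using the saved input x:
/// dx = dy * (tanh(sp) + x * s * (1 - tanh(sp)^2)).
fn mish_backward_ref(x: f64, dy: f64) -> f64 {
    let t = softplus(x).tanh();
    dy * (t + x * sigmoid(x) * (1.0 - t * t))
}
```

At x = 0 the second term vanishes and the gradient reduces to tanh(ln 2) = 0.6, which makes a convenient spot check.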
baracuda_kernels_unary_neg_bf16_can_implement^⚠: Pre-launch implementability check for unary_neg_bf16.
baracuda_kernels_unary_neg_bf16_run^⚠: Unary elementwise neg, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_neg_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_neg_bf16_strided.
baracuda_kernels_unary_neg_bf16_strided_run^⚠: Unary elementwise neg, bf16 dtype, strided path.
baracuda_kernels_unary_neg_f16_can_implement^⚠: Pre-launch implementability check for unary_neg_f16.
baracuda_kernels_unary_neg_f16_run^⚠: Unary elementwise neg, f16 dtype, contiguous fast path.
baracuda_kernels_unary_neg_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_neg_f16_strided.
baracuda_kernels_unary_neg_f16_strided_run^⚠: Unary elementwise neg, f16 dtype, strided path.
baracuda_kernels_unary_neg_f32_can_implement^⚠: Pre-launch implementability check for unary_neg_f32. Validates the problem size without launching a kernel. Returns the standard status code mapping.
baracuda_kernels_unary_neg_f32_run^⚠: Unary elementwise neg, f32 dtype, contiguous fast path. This is the unary-pointwise trailblazer — its safety contract carries over to every plain unary launcher (neg, abs, sqr, sqrt, rsqrt, recip, exp, log, sin, cos, tan, sign, floor, ceil, round, erf, relu, silu, gelu, tanh, sigmoid, etc.) AND every parameterized-unary launcher (unary_param_* family: powi, threshold, elu, prelu, lerp, etc.) across all dtypes. See also binary_add_f32_run for the binary contig aliasing contract and ternary_clamp_f32_run for the ternary one.
baracuda_kernels_unary_neg_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_neg_f32_strided.
baracuda_kernels_unary_neg_f32_strided_run^⚠: Unary elementwise neg, f32 dtype, strided path. This is the unary-strided trailblazer — its safety contract (including aliasing) carries over to every other unary strided launcher AND every parameterized-unary strided launcher (powi, threshold, elu, prelu, lerp) across all dtypes.
baracuda_kernels_unary_neg_f64_can_implement^⚠: Pre-launch implementability check for unary_neg_f64.
baracuda_kernels_unary_neg_f64_run^⚠: Unary elementwise neg, f64 dtype, contiguous fast path.
baracuda_kernels_unary_neg_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_neg_f64_strided.
baracuda_kernels_unary_neg_f64_strided_run^⚠: Unary elementwise neg, f64 dtype, strided path.
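The i32 status returned by the *_can_implement and *_run entry points can be decoded into a typed value on the caller's side. The sketch below uses only the code-to-meaning mapping from the crate-level "Status codes" section. The KernelStatus enum and its methods are hypothetical names for illustration, not items exported by this crate.

```rust
/// Hypothetical typed view of the raw i32 status codes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelStatus {
    Success,           // 0
    MisalignedOperand, // 1
    InvalidProblem,    // 2
    NotSupported,      // 3
    WorkspaceTooSmall, // 4: too small, or null when required
    InternalError,     // 5: typically a launch failure
    Unknown(i32),      // any code outside the documented range
}

impl KernelStatus {
    /// Decode a raw status as returned by a `*_run` or `*_can_implement` call.
    pub fn from_raw(code: i32) -> Self {
        match code {
            0 => KernelStatus::Success,
            1 => KernelStatus::MisalignedOperand,
            2 => KernelStatus::InvalidProblem,
            3 => KernelStatus::NotSupported,
            4 => KernelStatus::WorkspaceTooSmall,
            5 => KernelStatus::InternalError,
            other => KernelStatus::Unknown(other),
        }
    }

    /// Convert to a `Result` so callers can use `?` after a check or launch.
    pub fn into_result(self) -> Result<(), KernelStatus> {
        match self {
            KernelStatus::Success => Ok(()),
            err => Err(err),
        }
    }
}
```

A caller would typically decode the result of a can_implement check first and only invoke the matching run function when it yields Success.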
baracuda_kernels_unary_powf_bf16_can_implement^⚠: Implementability check for unary_powf_bf16.
baracuda_kernels_unary_powf_bf16_run^⚠: unary_powf, bf16, contig.
baracuda_kernels_unary_powf_bf16_strided_can_implement^⚠: Implementability check for baracuda_kernels_unary_powf_bf16_strided. Host-side only.
baracuda_kernels_unary_powf_bf16_strided_run^⚠: unary_powf, bf16, strided sibling.
baracuda_kernels_unary_powf_f16_can_implement^⚠: Implementability check for unary_powf_f16.
baracuda_kernels_unary_powf_f16_run^⚠: unary_powf, f16, contig. Takes an f32 detour.
baracuda_kernels_unary_powf_f16_strided_can_implement^⚠: Implementability check for baracuda_kernels_unary_powf_f16_strided. Host-side only.
baracuda_kernels_unary_powf_f16_strided_run^⚠: unary_powf, f16, strided sibling.
baracuda_kernels_unary_powf_f32_can_implement^⚠: Implementability check for unary_powf_f32.
baracuda_kernels_unary_powf_f32_run^⚠: Unary elementwise pow(x, exponent), f32, contig.
baracuda_kernels_unary_powf_f32_strided_can_implement^⚠: Implementability check for baracuda_kernels_unary_powf_f32_strided. Host-side only.
baracuda_kernels_unary_powf_f32_strided_run^⚠: Unary elementwise pow(x, exponent), f32, strided sibling of the contiguous launcher.
baracuda_kernels_unary_powf_f64_can_implement^⚠: Implementability check for unary_powf_f64.
baracuda_kernels_unary_powf_f64_run^⚠: Unary elementwise pow(x, exponent), f64, contiguous. pow (libdevice) is full double precision; the f32 exponent is widened once at kernel entry.
baracuda_kernels_unary_powf_f64_strided_can_implement^⚠: Implementability check for baracuda_kernels_unary_powf_f64_strided. Host-side only.
baracuda_kernels_unary_powf_f64_strided_run^⚠: Unary elementwise pow(x, exponent), f64, strided path.
baracuda_kernels_unary_powi_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_powi_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_powi_backward_bf16_run^⚠: powi backward, bf16.
baracuda_kernels_unary_powi_backward_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_powi_backward_bf16_strided. Host-side validation; no kernel launch.
baracuda_kernels_unary_powi_backward_bf16_strided_run^⚠: powi backward, bf16, strided path.
baracuda_kernels_unary_powi_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_powi_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_powi_backward_f16_run^⚠: powi backward, f16.
baracuda_kernels_unary_powi_backward_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_powi_backward_f16_strided. Host-side validation; no kernel launch.
baracuda_kernels_unary_powi_backward_f16_strided_run^⚠: powi backward, f16, strided path.
baracuda_kernels_unary_powi_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_powi_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_powi_backward_f32_run^⚠: powi backward: dx = n · x^(n-1) · dy, f32. Saved-x.
baracuda_kernels_unary_powi_backward_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_powi_backward_f32_strided. Host-side validation; no kernel launch.
baracuda_kernels_unary_powi_backward_f32_strided_run^⚠: powi backward, f32, strided path.
baracuda_kernels_unary_powi_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_powi_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_powi_backward_f64_run^⚠: powi backward, f64.
baracuda_kernels_unary_powi_backward_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_powi_backward_f64_strided. Host-side validation; no kernel launch.
baracuda_kernels_unary_powi_backward_f64_strided_run^⚠: powi backward, f64, strided path.
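The powi backward entries above state the gradient as dx = n · x^(n-1) · dy with the forward input x saved. A minimal host-side sketch of that formula can serve as a reference when checking device output. The helper name `powi_backward_ref` is hypothetical and not an item of this crate, and the sketch makes no claim about how the kernels handle n = 0 or x = 0.

```rust
/// Host-side reference for dx = n * x^(n-1) * dy (saved-x form).
/// `powi_backward_ref` is a hypothetical helper, not part of this crate.
fn powi_backward_ref(x: &[f32], dy: &[f32], n: i32) -> Vec<f32> {
    assert_eq!(x.len(), dy.len(), "saved x and dy must have equal length");
    x.iter()
        .zip(dy.iter())
        // Evaluate n * x^(n-1) * dy per element with the std integer-power routine.
        .map(|(&xi, &dyi)| (n as f32) * xi.powi(n - 1) * dyi)
        .collect()
}
```

For n = 3, x = [2, 3] and dy = [1, 0.5], the reference evaluates 3·4·1 and 3·9·0.5.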
baracuda_kernels_unary_powi_bf16_can_implement^⚠: Implementability check for baracuda_kernels_unary_powi_bf16. Host-side only.
baracuda_kernels_unary_powi_bf16_run^⚠: powi forward, bf16.
baracuda_kernels_unary_powi_bf16_strided_can_implement^⚠: Implementability check for baracuda_kernels_unary_powi_bf16_strided. Host-side only.
baracuda_kernels_unary_powi_bf16_strided_run^⚠: powi forward, bf16, strided path.
baracuda_kernels_unary_powi_f16_can_implement^⚠: Implementability check for baracuda_kernels_unary_powi_f16. Host-side only.
baracuda_kernels_unary_powi_f16_run^⚠: powi forward, f16.
baracuda_kernels_unary_powi_f16_strided_can_implement^⚠: Implementability check for baracuda_kernels_unary_powi_f16_strided. Host-side only.
baracuda_kernels_unary_powi_f16_strided_run^⚠: powi forward, f16, strided path.
baracuda_kernels_unary_powi_f32_can_implement^⚠: Implementability check for baracuda_kernels_unary_powi_f32. Host-side only.
baracuda_kernels_unary_powi_f32_run^⚠: Unary elementwise powi(x; n) = x^n (integer exponent), f32, contiguous.
baracuda_kernels_unary_powi_f32_strided_can_implement^⚠: Implementability check for baracuda_kernels_unary_powi_f32_strided. Host-side only.
baracuda_kernels_unary_powi_f32_strided_run^⚠: powi forward, f32, strided path.
baracuda_kernels_unary_powi_f64_can_implement^⚠: Implementability check for baracuda_kernels_unary_powi_f64. Host-side only.
baracuda_kernels_unary_powi_f64_run^⚠: powi forward, f64.
baracuda_kernels_unary_powi_f64_strided_can_implement^⚠: Implementability check for baracuda_kernels_unary_powi_f64_strided. Host-side only.
baracuda_kernels_unary_powi_f64_strided_run^⚠: powi forward, f64, strided path.
baracuda_kernels_unary_reciprocal_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_reciprocal_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_reciprocal_backward_bf16_run^⚠: Reciprocal backward, bf16.
baracuda_kernels_unary_reciprocal_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_reciprocal_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_reciprocal_backward_f16_run^⚠: Reciprocal backward, f16.
baracuda_kernels_unary_reciprocal_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_reciprocal_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_reciprocal_backward_f32_run^⚠: Reciprocal backward, f32. dx = -dy / x². Domain x != 0.
baracuda_kernels_unary_reciprocal_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_reciprocal_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_reciprocal_backward_f64_run^⚠: Reciprocal backward, f64.
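The reciprocal backward entries above give dx = -dy / x² on the domain x != 0. The following is a small host-side sketch of that formula, assuming plain slices on the CPU. The name `reciprocal_backward_ref` is hypothetical, and the sketch does not model what the device kernels return outside the stated domain.

```rust
/// Host-side reference for dx = -dy / x^2 (saved-x form).
/// `reciprocal_backward_ref` is a hypothetical helper, not part of this crate.
fn reciprocal_backward_ref(x: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(x.len(), dy.len(), "saved x and dy must have equal length");
    x.iter()
        .zip(dy.iter())
        // Callers are expected to keep x away from zero, per the stated domain.
        .map(|(&xi, &dyi)| -dyi / (xi * xi))
        .collect()
}
```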
baracuda_kernels_unary_reciprocal_bf16_can_implement^⚠: Pre-launch implementability check for unary_reciprocal_bf16.
baracuda_kernels_unary_reciprocal_bf16_run^⚠: Unary elementwise reciprocal, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_reciprocal_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_reciprocal_bf16_strided.
baracuda_kernels_unary_reciprocal_bf16_strided_run^⚠: Unary elementwise reciprocal, bf16 dtype, strided path.
baracuda_kernels_unary_reciprocal_f16_can_implement^⚠: Pre-launch implementability check for unary_reciprocal_f16.
baracuda_kernels_unary_reciprocal_f16_run^⚠: Unary elementwise reciprocal, f16 dtype, contiguous fast path.
baracuda_kernels_unary_reciprocal_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_reciprocal_f16_strided.
baracuda_kernels_unary_reciprocal_f16_strided_run^⚠: Unary elementwise reciprocal, f16 dtype, strided path.
baracuda_kernels_unary_reciprocal_f32_can_implement^⚠: Pre-launch implementability check for unary_reciprocal_f32.
baracuda_kernels_unary_reciprocal_f32_run^⚠: Unary elementwise reciprocal, f32 dtype, contiguous fast path.
baracuda_kernels_unary_reciprocal_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_reciprocal_f32_strided.
baracuda_kernels_unary_reciprocal_f32_strided_run^⚠: Unary elementwise reciprocal, f32 dtype, strided path.
baracuda_kernels_unary_reciprocal_f64_can_implement^⚠: Pre-launch implementability check for unary_reciprocal_f64.
baracuda_kernels_unary_reciprocal_f64_run^⚠: Unary elementwise reciprocal, f64 dtype, contiguous fast path.
baracuda_kernels_unary_reciprocal_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_reciprocal_f64_strided.
baracuda_kernels_unary_reciprocal_f64_strided_run^⚠: Unary elementwise reciprocal, f64 dtype, strided path.
baracuda_kernels_unary_relu6_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_relu6_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_relu6_backward_bf16_run^⚠: ReLU6 backward, bf16.
baracuda_kernels_unary_relu6_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_relu6_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_relu6_backward_f16_run^⚠: ReLU6 backward, f16.
baracuda_kernels_unary_relu6_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_relu6_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_relu6_backward_f32_run^⚠: ReLU6 backward, f32. dx = (0 < x < 6) ? dy : 0. Saved-x.
baracuda_kernels_unary_relu6_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_relu6_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_relu6_backward_f64_run^⚠: ReLU6 backward, f64.
baracuda_kernels_unary_relu6_bf16_can_implement^⚠: Pre-launch implementability check for unary_relu6_bf16.
baracuda_kernels_unary_relu6_bf16_run^⚠: Unary elementwise relu6, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_relu6_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_relu6_bf16_strided.
baracuda_kernels_unary_relu6_bf16_strided_run^⚠: Unary elementwise relu6, bf16 dtype, strided path.
baracuda_kernels_unary_relu6_f16_can_implement^⚠: Pre-launch implementability check for unary_relu6_f16.
baracuda_kernels_unary_relu6_f16_run^⚠: Unary elementwise relu6, f16 dtype, contiguous fast path.
baracuda_kernels_unary_relu6_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_relu6_f16_strided.
baracuda_kernels_unary_relu6_f16_strided_run^⚠: Unary elementwise relu6, f16 dtype, strided path.
baracuda_kernels_unary_relu6_f32_can_implement^⚠: Pre-launch implementability check for unary_relu6_f32.
baracuda_kernels_unary_relu6_f32_run^⚠: Unary elementwise relu6, f32 dtype, contiguous fast path.
baracuda_kernels_unary_relu6_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_relu6_f32_strided.
baracuda_kernels_unary_relu6_f32_strided_run^⚠: Unary elementwise relu6, f32 dtype, strided path.
baracuda_kernels_unary_relu6_f64_can_implement^⚠: Pre-launch implementability check for unary_relu6_f64.
baracuda_kernels_unary_relu6_f64_run^⚠: Unary elementwise relu6, f64 dtype, contiguous fast path.
baracuda_kernels_unary_relu6_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_relu6_f64_strided.
baracuda_kernels_unary_relu6_f64_strided_run^⚠: Unary elementwise relu6, f64 dtype, strided path.
baracuda_kernels_unary_relu_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_relu_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_relu_backward_bf16_run^⚠: ReLU backward, bf16.
baracuda_kernels_unary_relu_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_relu_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_relu_backward_f16_run^⚠: ReLU backward, f16.
baracuda_kernels_unary_relu_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_relu_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_relu_backward_f32_run^⚠: ReLU backward, f32. dx = (x > 0) ? dy : 0. Saved-x. This is the activation-BW trailblazer — its aliasing contract carries over to every other unary_<op>_backward_<dt>_run (gelu, silu, tanh, sigmoid, elu, leaky_relu, mish, hardswish, hardsigmoid, gelu_tanh, erf, erfc, etc.) across all dtypes, both saved-x and saved-y variants.
baracuda_kernels_unary_relu_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_relu_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_relu_backward_f64_run^⚠: ReLU backward, f64.
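The ReLU and ReLU6 backward entries above describe masked gradients: dx = (x > 0) ? dy : 0 and dx = (0 < x < 6) ? dy : 0, both taking the saved forward input x. A hedged host-side sketch of the two masks follows. Both function names are hypothetical, and the strict inequalities are taken directly from the formulas in the listing.

```rust
/// Host-side reference for ReLU backward: dx = dy where x > 0, else 0.
/// Hypothetical helper, not part of this crate.
fn relu_backward_ref(x: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(x.len(), dy.len(), "saved x and dy must have equal length");
    x.iter()
        .zip(dy.iter())
        .map(|(&xi, &dyi)| if xi > 0.0 { dyi } else { 0.0 })
        .collect()
}

/// Host-side reference for ReLU6 backward: dx = dy where 0 < x < 6, else 0.
/// Hypothetical helper, not part of this crate.
fn relu6_backward_ref(x: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(x.len(), dy.len(), "saved x and dy must have equal length");
    x.iter()
        .zip(dy.iter())
        .map(|(&xi, &dyi)| if xi > 0.0 && xi < 6.0 { dyi } else { 0.0 })
        .collect()
}
```

Under these formulas the boundary points x = 0 and x = 6 receive a zero gradient.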
baracuda_kernels_unary_relu_bf16_can_implement^⚠: Pre-launch implementability check for unary_relu_bf16.
baracuda_kernels_unary_relu_bf16_run^⚠: Unary elementwise relu, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_relu_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_relu_bf16_strided.
baracuda_kernels_unary_relu_bf16_strided_run^⚠: Unary elementwise relu, bf16 dtype, strided path.
baracuda_kernels_unary_relu_f16_can_implement^⚠: Pre-launch implementability check for unary_relu_f16.
baracuda_kernels_unary_relu_f16_run^⚠: Unary elementwise relu, f16 dtype, contiguous fast path.
baracuda_kernels_unary_relu_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_relu_f16_strided.
baracuda_kernels_unary_relu_f16_strided_run^⚠: Unary elementwise relu, f16 dtype, strided path.
baracuda_kernels_unary_relu_f32_can_implement^⚠: Pre-launch implementability check for unary_relu_f32.
baracuda_kernels_unary_relu_f32_run^⚠: Unary elementwise relu, f32 dtype, contiguous fast path.
baracuda_kernels_unary_relu_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_relu_f32_strided.
baracuda_kernels_unary_relu_f32_strided_run^⚠: Unary elementwise relu, f32 dtype, strided path.
baracuda_kernels_unary_relu_f64_can_implement^⚠: Pre-launch implementability check for unary_relu_f64.
baracuda_kernels_unary_relu_f64_run^⚠: Unary elementwise relu, f64 dtype, contiguous fast path.
baracuda_kernels_unary_relu_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_relu_f64_strided.
baracuda_kernels_unary_relu_f64_strided_run^⚠: Unary elementwise relu, f64 dtype, strided path.
baracuda_kernels_unary_round_bf16_can_implement^⚠: Pre-launch implementability check for unary_round_bf16.
baracuda_kernels_unary_round_bf16_run^⚠: Unary elementwise round, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_round_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_round_bf16_strided.
baracuda_kernels_unary_round_bf16_strided_run^⚠: Unary elementwise round, bf16 dtype, strided path.
baracuda_kernels_unary_round_f16_can_implement^⚠: Pre-launch implementability check for unary_round_f16.
baracuda_kernels_unary_round_f16_run^⚠: Unary elementwise round, f16 dtype, contiguous fast path.
baracuda_kernels_unary_round_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_round_f16_strided.
baracuda_kernels_unary_round_f16_strided_run^⚠: Unary elementwise round, f16 dtype, strided path.
baracuda_kernels_unary_round_f32_can_implement^⚠: Pre-launch implementability check for unary_round_f32.
baracuda_kernels_unary_round_f32_run^⚠: Unary elementwise round, f32 dtype, contiguous fast path.
baracuda_kernels_unary_round_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_round_f32_strided.
baracuda_kernels_unary_round_f32_strided_run^⚠: Unary elementwise round, f32 dtype, strided path.
baracuda_kernels_unary_round_f64_can_implement^⚠: Pre-launch implementability check for unary_round_f64.
baracuda_kernels_unary_round_f64_run^⚠: Unary elementwise round, f64 dtype, contiguous fast path.
baracuda_kernels_unary_round_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_round_f64_strided.
baracuda_kernels_unary_round_f64_strided_run^⚠: Unary elementwise round, f64 dtype, strided path.
baracuda_kernels_unary_rsqrt_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_rsqrt_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_rsqrt_backward_bf16_run^⚠: Rsqrt backward, bf16.
baracuda_kernels_unary_rsqrt_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_rsqrt_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_rsqrt_backward_f16_run^⚠: Rsqrt backward, f16.
baracuda_kernels_unary_rsqrt_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_rsqrt_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_rsqrt_backward_f32_run^⚠: Rsqrt backward, f32. dx = -0.5 * dy * y³. Saved-y.
baracuda_kernels_unary_rsqrt_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_rsqrt_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_rsqrt_backward_f64_run^⚠: Rsqrt backward, f64.
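The rsqrt backward entries above are saved-y: with y = x^(-1/2), the derivative is dy/dx = -0.5 · x^(-3/2) = -0.5 · y³, so the forward output alone is enough. A minimal host-side sketch, with the hypothetical name `rsqrt_backward_ref`:

```rust
/// Host-side reference for dx = -0.5 * dy * y^3, where y is the saved
/// forward output rsqrt(x). Hypothetical helper, not part of this crate.
fn rsqrt_backward_ref(y: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(y.len(), dy.len(), "saved y and dy must have equal length");
    y.iter()
        .zip(dy.iter())
        .map(|(&yi, &dyi)| -0.5 * dyi * yi * yi * yi)
        .collect()
}
```

For x = 4 the saved output is y = 0.5, and an upstream gradient of 1 gives -0.5 · 0.125.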
baracuda_kernels_unary_rsqrt_bf16_can_implement^⚠: Pre-launch implementability check for unary_rsqrt_bf16.
baracuda_kernels_unary_rsqrt_bf16_run^⚠: Unary elementwise rsqrt, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_rsqrt_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_rsqrt_bf16_strided.
baracuda_kernels_unary_rsqrt_bf16_strided_run^⚠: Unary elementwise rsqrt, bf16 dtype, strided path.
baracuda_kernels_unary_rsqrt_f16_can_implement^⚠: Pre-launch implementability check for unary_rsqrt_f16.
baracuda_kernels_unary_rsqrt_f16_run^⚠: Unary elementwise rsqrt, f16 dtype, contiguous fast path.
baracuda_kernels_unary_rsqrt_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_rsqrt_f16_strided.
baracuda_kernels_unary_rsqrt_f16_strided_run^⚠: Unary elementwise rsqrt, f16 dtype, strided path.
baracuda_kernels_unary_rsqrt_f32_can_implement^⚠: Pre-launch implementability check for unary_rsqrt_f32.
baracuda_kernels_unary_rsqrt_f32_run^⚠: Unary elementwise rsqrt, f32 dtype, contiguous fast path.
baracuda_kernels_unary_rsqrt_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_rsqrt_f32_strided.
baracuda_kernels_unary_rsqrt_f32_strided_run^⚠: Unary elementwise rsqrt, f32 dtype, strided path.
baracuda_kernels_unary_rsqrt_f64_can_implement^⚠: Pre-launch implementability check for unary_rsqrt_f64.
baracuda_kernels_unary_rsqrt_f64_run^⚠: Unary elementwise rsqrt, f64 dtype, contiguous fast path.
baracuda_kernels_unary_rsqrt_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_rsqrt_f64_strided.
baracuda_kernels_unary_rsqrt_f64_strided_run^⚠: Unary elementwise rsqrt, f64 dtype, strided path.
baracuda_kernels_unary_selu_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_selu_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_selu_backward_bf16_run^⚠: SELU backward, bf16.
baracuda_kernels_unary_selu_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_selu_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_selu_backward_f16_run^⚠: SELU backward, f16.
baracuda_kernels_unary_selu_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_selu_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_selu_backward_f32_run^⚠: SELU backward, f32. x>0 → dy*scale; x<=0 → dy*scale*alpha*exp(x). Saved-x.
baracuda_kernels_unary_selu_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_selu_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_selu_backward_f64_run^⚠: SELU backward, f64.
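The SELU backward entries above branch on the saved input: dy·scale for x > 0 and dy·scale·alpha·exp(x) otherwise. The sketch below evaluates that on the host in f64. It assumes the conventional SELU constants (alpha ≈ 1.67326, scale ≈ 1.05070); the listing does not state which values the kernels use, so treat the constants and the name `selu_backward_ref` as assumptions.

```rust
// Conventional SELU constants; assumed here, not confirmed by this crate's docs.
const SELU_ALPHA: f64 = 1.6732632423543772;
const SELU_SCALE: f64 = 1.0507009873554805;

/// Host-side reference for SELU backward (saved-x form).
/// Hypothetical helper, not part of this crate.
fn selu_backward_ref(x: &[f64], dy: &[f64]) -> Vec<f64> {
    assert_eq!(x.len(), dy.len(), "saved x and dy must have equal length");
    x.iter()
        .zip(dy.iter())
        .map(|(&xi, &dyi)| {
            if xi > 0.0 {
                dyi * SELU_SCALE
            } else {
                dyi * SELU_SCALE * SELU_ALPHA * xi.exp()
            }
        })
        .collect()
}
```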
baracuda_kernels_unary_selu_bf16_can_implement^⚠: Pre-launch implementability check for unary_selu_bf16.
baracuda_kernels_unary_selu_bf16_run^⚠: Unary elementwise selu, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_selu_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_selu_bf16_strided.
baracuda_kernels_unary_selu_bf16_strided_run^⚠: Unary elementwise selu, bf16 dtype, strided path.
baracuda_kernels_unary_selu_f16_can_implement^⚠: Pre-launch implementability check for unary_selu_f16.
baracuda_kernels_unary_selu_f16_run^⚠: Unary elementwise selu, f16 dtype, contiguous fast path.
baracuda_kernels_unary_selu_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_selu_f16_strided.
baracuda_kernels_unary_selu_f16_strided_run^⚠: Unary elementwise selu, f16 dtype, strided path.
baracuda_kernels_unary_selu_f32_can_implement^⚠: Pre-launch implementability check for unary_selu_f32.
baracuda_kernels_unary_selu_f32_run^⚠: Unary elementwise selu, f32 dtype, contiguous fast path.
baracuda_kernels_unary_selu_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_selu_f32_strided.
baracuda_kernels_unary_selu_f32_strided_run^⚠: Unary elementwise selu, f32 dtype, strided path.
baracuda_kernels_unary_selu_f64_can_implement^⚠: Pre-launch implementability check for unary_selu_f64.
baracuda_kernels_unary_selu_f64_run^⚠: Unary elementwise selu, f64 dtype, contiguous fast path.
baracuda_kernels_unary_selu_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_selu_f64_strided.
baracuda_kernels_unary_selu_f64_strided_run^⚠: Unary elementwise selu, f64 dtype, strided path.
baracuda_kernels_unary_sigmoid_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_sigmoid_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_sigmoid_backward_bf16_run^⚠: Sigmoid backward, bf16.
baracuda_kernels_unary_sigmoid_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_sigmoid_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_sigmoid_backward_f16_run^⚠: Sigmoid backward, f16.
baracuda_kernels_unary_sigmoid_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_sigmoid_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_sigmoid_backward_f32_run^⚠: Sigmoid backward, f32. dx = dy * y * (1 - y). Saved-y.
baracuda_kernels_unary_sigmoid_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_sigmoid_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_sigmoid_backward_f64_run^⚠: Sigmoid backward, f64.
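The sigmoid backward entries above are saved-y: dx = dy · y · (1 - y), where y is the forward output. A short host-side sketch of that formula, using the hypothetical name `sigmoid_backward_ref`:

```rust
/// Host-side reference for dx = dy * y * (1 - y), where y is the saved
/// forward output sigmoid(x). Hypothetical helper, not part of this crate.
fn sigmoid_backward_ref(y: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(y.len(), dy.len(), "saved y and dy must have equal length");
    y.iter()
        .zip(dy.iter())
        .map(|(&yi, &dyi)| dyi * yi * (1.0 - yi))
        .collect()
}
```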
baracuda_kernels_unary_sigmoid_bf16_can_implement^⚠: Pre-launch implementability check for unary_sigmoid_bf16.
baracuda_kernels_unary_sigmoid_bf16_run^⚠: Unary elementwise sigmoid, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_sigmoid_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_sigmoid_bf16_strided.
baracuda_kernels_unary_sigmoid_bf16_strided_run^⚠: Unary elementwise sigmoid, bf16 dtype, strided path.
baracuda_kernels_unary_sigmoid_f16_can_implement^⚠: Pre-launch implementability check for unary_sigmoid_f16.
baracuda_kernels_unary_sigmoid_f16_run^⚠: Unary elementwise sigmoid, f16 dtype, contiguous fast path.
baracuda_kernels_unary_sigmoid_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_sigmoid_f16_strided.
baracuda_kernels_unary_sigmoid_f16_strided_run^⚠: Unary elementwise sigmoid, f16 dtype, strided path.
baracuda_kernels_unary_sigmoid_f32_can_implement^⚠: Pre-launch implementability check for unary_sigmoid_f32.
baracuda_kernels_unary_sigmoid_f32_run^⚠: Unary elementwise sigmoid, f32 dtype, contiguous fast path.
baracuda_kernels_unary_sigmoid_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_sigmoid_f32_strided.
baracuda_kernels_unary_sigmoid_f32_strided_run^⚠: Unary elementwise sigmoid, f32 dtype, strided path.
baracuda_kernels_unary_sigmoid_f64_can_implement^⚠: Pre-launch implementability check for unary_sigmoid_f64.
baracuda_kernels_unary_sigmoid_f64_run^⚠: Unary elementwise sigmoid, f64 dtype, contiguous fast path.
baracuda_kernels_unary_sigmoid_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_sigmoid_f64_strided.
baracuda_kernels_unary_sigmoid_f64_strided_run^⚠: Unary elementwise sigmoid, f64 dtype, strided path.
baracuda_kernels_unary_sign_bf16_can_implement^⚠: Pre-launch implementability check for unary_sign_bf16.
baracuda_kernels_unary_sign_bf16_run^⚠: Unary elementwise sign, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_sign_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_sign_bf16_strided.
baracuda_kernels_unary_sign_bf16_strided_run^⚠: Unary elementwise sign, bf16 dtype, strided path.
baracuda_kernels_unary_sign_f16_can_implement^⚠: Pre-launch implementability check for unary_sign_f16.
baracuda_kernels_unary_sign_f16_run^⚠: Unary elementwise sign, f16 dtype, contiguous fast path.
baracuda_kernels_unary_sign_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_sign_f16_strided.
baracuda_kernels_unary_sign_f16_strided_run^⚠: Unary elementwise sign, f16 dtype, strided path.
baracuda_kernels_unary_sign_f32_can_implement^⚠: Pre-launch implementability check for unary_sign_f32.
baracuda_kernels_unary_sign_f32_run^⚠: Unary elementwise sign, f32 dtype, contiguous fast path.
baracuda_kernels_unary_sign_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_sign_f32_strided.
baracuda_kernels_unary_sign_f32_strided_run^⚠: Unary elementwise sign, f32 dtype, strided path.
baracuda_kernels_unary_sign_f64_can_implement^⚠: Pre-launch implementability check for unary_sign_f64.
baracuda_kernels_unary_sign_f64_run^⚠: Unary elementwise sign, f64 dtype, contiguous fast path.
baracuda_kernels_unary_sign_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_sign_f64_strided.
baracuda_kernels_unary_sign_f64_strided_run^⚠: Unary elementwise sign, f64 dtype, strided path.
baracuda_kernels_unary_silu_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_silu_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_silu_backward_bf16_run^⚠: SiLU backward, bf16.
baracuda_kernels_unary_silu_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_silu_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_silu_backward_f16_run^⚠: SiLU backward, f16.
baracuda_kernels_unary_silu_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_silu_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_silu_backward_f32_run^⚠: SiLU (Swish) backward, f32. dx = dy * s * (1 + x*(1-s)) with s = sigmoid(x). Saved-x.
baracuda_kernels_unary_silu_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_silu_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_silu_backward_f64_run^⚠: SiLU backward, f64.
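The SiLU backward entries above use dx = dy · s · (1 + x·(1 - s)) with s = sigmoid(x), computed from the saved input x. The sketch below pairs that formula with a forward helper so the gradient can be compared against a finite difference. Both `silu_ref` and `silu_backward_ref` are hypothetical names, and f64 is used only to keep the host comparison tight.

```rust
/// Forward SiLU on the host: x * sigmoid(x). Hypothetical helper.
fn silu_ref(x: f64) -> f64 {
    x / (1.0 + (-x).exp())
}

/// Host-side reference for dx = dy * s * (1 + x * (1 - s)), s = sigmoid(x).
/// Hypothetical helper, not part of this crate.
fn silu_backward_ref(x: &[f64], dy: &[f64]) -> Vec<f64> {
    assert_eq!(x.len(), dy.len(), "saved x and dy must have equal length");
    x.iter()
        .zip(dy.iter())
        .map(|(&xi, &dyi)| {
            let s = 1.0 / (1.0 + (-xi).exp());
            dyi * s * (1.0 + xi * (1.0 - s))
        })
        .collect()
}
```

A central difference of `silu_ref` around a sample point should agree with `silu_backward_ref` when dy = 1.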
baracuda_kernels_unary_silu_bf16_can_implement^⚠: Pre-launch implementability check for unary_silu_bf16.
baracuda_kernels_unary_silu_bf16_run^⚠: Unary elementwise silu, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_silu_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_silu_bf16_strided.
baracuda_kernels_unary_silu_bf16_strided_run^⚠: Unary elementwise silu, bf16 dtype, strided path.
baracuda_kernels_unary_silu_f16_can_implement^⚠: Pre-launch implementability check for unary_silu_f16.
baracuda_kernels_unary_silu_f16_run^⚠: Unary elementwise silu, f16 dtype, contiguous fast path.
baracuda_kernels_unary_silu_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_silu_f16_strided.
baracuda_kernels_unary_silu_f16_strided_run^⚠: Unary elementwise silu, f16 dtype, strided path.
baracuda_kernels_unary_silu_f32_can_implement^⚠: Pre-launch implementability check for unary_silu_f32.
baracuda_kernels_unary_silu_f32_run^⚠: Unary elementwise silu, f32 dtype, contiguous fast path.
baracuda_kernels_unary_silu_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_silu_f32_strided.
baracuda_kernels_unary_silu_f32_strided_run^⚠: Unary elementwise silu, f32 dtype, strided path.
baracuda_kernels_unary_silu_f64_can_implement^⚠: Pre-launch implementability check for unary_silu_f64.
baracuda_kernels_unary_silu_f64_run^⚠: Unary elementwise silu, f64 dtype, contiguous fast path.
baracuda_kernels_unary_silu_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_silu_f64_strided.
baracuda_kernels_unary_silu_f64_strided_run^⚠: Unary elementwise silu, f64 dtype, strided path.
baracuda_kernels_unary_sin_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_sin_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_sin_backward_bf16_run^⚠: Sin backward, bf16.
baracuda_kernels_unary_sin_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_sin_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_sin_backward_f16_run^⚠: Sin backward, f16.
baracuda_kernels_unary_sin_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_sin_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_sin_backward_f32_run^⚠: Sin backward, f32. dx = dy * cos(x). Caller must pass the forward input x as saved.
baracuda_kernels_unary_sin_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_sin_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_sin_backward_f64_run^⚠: Sin backward, f64.
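The sin backward entries above require the caller to pass the forward input x as the saved tensor, because dx = dy · cos(x) cannot be recovered from y = sin(x) without losing the sign of the cosine. A minimal host-side sketch of the saved-x formula, under the hypothetical name `sin_backward_ref`:

```rust
/// Host-side reference for dx = dy * cos(x) (saved-x form).
/// Hypothetical helper, not part of this crate.
fn sin_backward_ref(x: &[f64], dy: &[f64]) -> Vec<f64> {
    assert_eq!(x.len(), dy.len(), "saved x and dy must have equal length");
    x.iter()
        .zip(dy.iter())
        .map(|(&xi, &dyi)| dyi * xi.cos())
        .collect()
}
```

As an illustration, x = 0 and x = π produce the same forward value up to rounding, but their gradients have opposite signs.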
baracuda_kernels_unary_sin_bf16_can_implement^⚠: Pre-launch implementability check for unary_sin_bf16.
baracuda_kernels_unary_sin_bf16_run^⚠: Unary elementwise sin, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_sin_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_sin_bf16_strided.
baracuda_kernels_unary_sin_bf16_strided_run^⚠: Unary elementwise sin, bf16 dtype, strided path.
baracuda_kernels_unary_sin_f16_can_implement^⚠: Pre-launch implementability check for unary_sin_f16.
baracuda_kernels_unary_sin_f16_run^⚠: Unary elementwise sin, f16 dtype, contiguous fast path.
baracuda_kernels_unary_sin_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_sin_f16_strided.
baracuda_kernels_unary_sin_f16_strided_run^⚠: Unary elementwise sin, f16 dtype, strided path.
baracuda_kernels_unary_sin_f32_can_implement^⚠: Pre-launch implementability check for unary_sin_f32.
baracuda_kernels_unary_sin_f32_run^⚠: Unary elementwise sin, f32 dtype, contiguous fast path.
baracuda_kernels_unary_sin_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_sin_f32_strided.
baracuda_kernels_unary_sin_f32_strided_run^⚠: Unary elementwise sin, f32 dtype, strided path.
baracuda_kernels_unary_sin_f64_can_implement^⚠: Pre-launch implementability check for unary_sin_f64.
baracuda_kernels_unary_sin_f64_run^⚠: Unary elementwise sin, f64 dtype, contiguous fast path.
baracuda_kernels_unary_sin_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_sin_f64_strided.
baracuda_kernels_unary_sin_f64_strided_run^⚠: Unary elementwise sin, f64 dtype, strided path.
baracuda_kernels_unary_sinh_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_sinh_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_sinh_backward_bf16_run^⚠: Sinh backward, bf16.
baracuda_kernels_unary_sinh_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_sinh_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_sinh_backward_f16_run^⚠: Sinh backward, f16.
baracuda_kernels_unary_sinh_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_sinh_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_sinh_backward_f32_run^⚠: Sinh backward, f32. dx = dy * cosh(x). Saved-x.
baracuda_kernels_unary_sinh_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_sinh_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_sinh_backward_f64_run^⚠: Sinh backward, f64.
baracuda_kernels_unary_sinh_bf16_can_implement^⚠: Pre-launch implementability check for unary_sinh_bf16.
baracuda_kernels_unary_sinh_bf16_run^⚠: Unary elementwise sinh, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_sinh_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_sinh_bf16_strided.
baracuda_kernels_unary_sinh_bf16_strided_run^⚠: Unary elementwise sinh, bf16 dtype, strided path.
baracuda_kernels_unary_sinh_f16_can_implement^⚠: Pre-launch implementability check for unary_sinh_f16.
baracuda_kernels_unary_sinh_f16_run^⚠: Unary elementwise sinh, f16 dtype, contiguous fast path.
baracuda_kernels_unary_sinh_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_sinh_f16_strided.
baracuda_kernels_unary_sinh_f16_strided_run^⚠: Unary elementwise sinh, f16 dtype, strided path.
baracuda_kernels_unary_sinh_f32_can_implement^⚠: Pre-launch implementability check for unary_sinh_f32.
baracuda_kernels_unary_sinh_f32_run^⚠: Unary elementwise sinh, f32 dtype, contiguous fast path.
baracuda_kernels_unary_sinh_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_sinh_f32_strided.
baracuda_kernels_unary_sinh_f32_strided_run^⚠: Unary elementwise sinh, f32 dtype, strided path.
baracuda_kernels_unary_sinh_f64_can_implement^⚠: Pre-launch implementability check for unary_sinh_f64.
baracuda_kernels_unary_sinh_f64_run^⚠: Unary elementwise sinh, f64 dtype, contiguous fast path.
baracuda_kernels_unary_sinh_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_sinh_f64_strided.
baracuda_kernels_unary_sinh_f64_strided_run^⚠: Unary elementwise sinh, f64 dtype, strided path.
baracuda_kernels_unary_softplus_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_softplus_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_softplus_backward_bf16_run^⚠: Softplus backward, bf16.
baracuda_kernels_unary_softplus_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_softplus_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_softplus_backward_f16_run^⚠: Softplus backward, f16.
baracuda_kernels_unary_softplus_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_softplus_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_softplus_backward_f32_run^⚠: Softplus backward, f32. dx = dy / (1 + exp(-x)). Saved-x.
baracuda_kernels_unary_softplus_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_softplus_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_softplus_backward_f64_run^⚠: Softplus backward, f64.
baracuda_kernels_unary_softplus_bf16_can_implement^⚠: Pre-launch implementability check for unary_softplus_bf16.
baracuda_kernels_unary_softplus_bf16_run^⚠: Unary elementwise softplus, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_softplus_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_softplus_bf16_strided.
baracuda_kernels_unary_softplus_bf16_strided_run^⚠: Unary elementwise softplus, bf16 dtype, strided path.
baracuda_kernels_unary_softplus_f16_can_implement^⚠: Pre-launch implementability check for unary_softplus_f16.
baracuda_kernels_unary_softplus_f16_run^⚠: Unary elementwise softplus, f16 dtype, contiguous fast path.
baracuda_kernels_unary_softplus_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_softplus_f16_strided.
baracuda_kernels_unary_softplus_f16_strided_run^⚠: Unary elementwise softplus, f16 dtype, strided path.
baracuda_kernels_unary_softplus_f32_can_implement^⚠: Pre-launch implementability check for unary_softplus_f32.
baracuda_kernels_unary_softplus_f32_run^⚠: Unary elementwise softplus, f32 dtype, contiguous fast path.
baracuda_kernels_unary_softplus_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_softplus_f32_strided.
baracuda_kernels_unary_softplus_f32_strided_run^⚠: Unary elementwise softplus, f32 dtype, strided path.
baracuda_kernels_unary_softplus_f64_can_implement^⚠: Pre-launch implementability check for unary_softplus_f64.
baracuda_kernels_unary_softplus_f64_run^⚠: Unary elementwise softplus, f64 dtype, contiguous fast path.
baracuda_kernels_unary_softplus_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_softplus_f64_strided.
baracuda_kernels_unary_softplus_f64_strided_run^⚠: Unary elementwise softplus, f64 dtype, strided path.
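As a host-side cross-check for the softplus backward entries above, the listed formula dx = dy / (1 + exp(-x)) can be sketched in plain Rust. This is only a CPU reference under stated assumptions: the name softplus_backward_ref is hypothetical and not part of this crate, and it does not call or reproduce the device kernel.

```rust
/// Hypothetical CPU reference (not part of baracuda-kernels-sys).
/// Applies dx = dy / (1 + exp(-x)) elementwise, where `x` is the saved
/// forward input and `dy` is the upstream gradient.
fn softplus_backward_ref(x: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(x.len(), dy.len(), "x and dy must have the same length");
    x.iter()
        .zip(dy)
        .map(|(&xi, &dyi)| dyi / (1.0 + (-xi).exp()))
        .collect()
}
```

A reference like this is mainly useful in tests that copy a kernel's output back to the host and compare it within a dtype-appropriate tolerance.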
baracuda_kernels_unary_softshrink_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_softshrink_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_softshrink_backward_bf16_run^⚠: Softshrink backward, bf16.
baracuda_kernels_unary_softshrink_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_softshrink_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_softshrink_backward_f16_run^⚠: Softshrink backward, f16.
baracuda_kernels_unary_softshrink_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_softshrink_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_softshrink_backward_f32_run^⚠: Softshrink backward, f32. dx = (|x| > λ) ? dy : 0 with λ=0.5. Saved-x.
baracuda_kernels_unary_softshrink_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_softshrink_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_softshrink_backward_f64_run^⚠: Softshrink backward, f64.
baracuda_kernels_unary_softshrink_bf16_can_implement^⚠: Pre-launch implementability check for unary_softshrink_bf16.
baracuda_kernels_unary_softshrink_bf16_run^⚠: Unary elementwise softshrink (λ=0.5), bf16, contig.
baracuda_kernels_unary_softshrink_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_softshrink_bf16_strided.
baracuda_kernels_unary_softshrink_bf16_strided_run^⚠: Unary elementwise softshrink (λ=0.5), bf16, strided.
baracuda_kernels_unary_softshrink_f16_can_implement^⚠: Pre-launch implementability check for unary_softshrink_f16.
baracuda_kernels_unary_softshrink_f16_run^⚠: Unary elementwise softshrink (λ=0.5), f16, contig.
baracuda_kernels_unary_softshrink_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_softshrink_f16_strided.
baracuda_kernels_unary_softshrink_f16_strided_run^⚠: Unary elementwise softshrink (λ=0.5), f16, strided.
baracuda_kernels_unary_softshrink_f32_can_implement^⚠: Pre-launch implementability check for unary_softshrink_f32.
baracuda_kernels_unary_softshrink_f32_run^⚠: Unary elementwise softshrink (λ=0.5), f32, contig.
baracuda_kernels_unary_softshrink_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_softshrink_f32_strided.
baracuda_kernels_unary_softshrink_f32_strided_run^⚠: Unary elementwise softshrink (λ=0.5), f32, strided.
baracuda_kernels_unary_softshrink_f64_can_implement^⚠: Pre-launch implementability check for unary_softshrink_f64.
baracuda_kernels_unary_softshrink_f64_run^⚠: Unary elementwise softshrink (λ=0.5), f64, contig.
baracuda_kernels_unary_softshrink_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_softshrink_f64_strided.
baracuda_kernels_unary_softshrink_f64_strided_run^⚠: Unary elementwise softshrink (λ=0.5), f64, strided.
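The softshrink entries above fix λ = 0.5 and give the backward rule dx = (|x| > λ) ? dy : 0. The sketch below is a scalar CPU reference for both directions. The forward definition used here (shift toward zero by λ outside the dead zone) is the conventional one and is an assumption, since the index does not spell it out; the function names are hypothetical.

```rust
/// λ as stated in the index entries above.
const LAMBDA: f32 = 0.5;

/// Hypothetical CPU reference. Assumes the conventional softshrink
/// definition: x - λ for x > λ, x + λ for x < -λ, and 0 otherwise.
fn softshrink_ref(x: f32) -> f32 {
    if x > LAMBDA {
        x - LAMBDA
    } else if x < -LAMBDA {
        x + LAMBDA
    } else {
        0.0
    }
}

/// Backward rule as listed: dx = (|x| > λ) ? dy : 0, using the saved input x.
fn softshrink_backward_ref(x: f32, dy: f32) -> f32 {
    if x.abs() > LAMBDA {
        dy
    } else {
        0.0
    }
}
```

Note that the comparison is strict, so an input exactly at ±λ contributes no gradient in this sketch.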
baracuda_kernels_unary_softsign_bf16_can_implement^⚠: Pre-launch implementability check for unary_softsign_bf16.
baracuda_kernels_unary_softsign_bf16_run^⚠: Unary elementwise softsign, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_softsign_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_softsign_bf16_strided.
baracuda_kernels_unary_softsign_bf16_strided_run^⚠: Unary elementwise softsign, bf16 dtype, strided path.
baracuda_kernels_unary_softsign_f16_can_implement^⚠: Pre-launch implementability check for unary_softsign_f16.
baracuda_kernels_unary_softsign_f16_run^⚠: Unary elementwise softsign, f16 dtype, contiguous fast path.
baracuda_kernels_unary_softsign_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_softsign_f16_strided.
baracuda_kernels_unary_softsign_f16_strided_run^⚠: Unary elementwise softsign, f16 dtype, strided path.
baracuda_kernels_unary_softsign_f32_can_implement^⚠: Pre-launch implementability check for unary_softsign_f32.
baracuda_kernels_unary_softsign_f32_run^⚠: Unary elementwise softsign, f32 dtype, contiguous fast path.
baracuda_kernels_unary_softsign_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_softsign_f32_strided.
baracuda_kernels_unary_softsign_f32_strided_run^⚠: Unary elementwise softsign, f32 dtype, strided path.
baracuda_kernels_unary_softsign_f64_can_implement^⚠: Pre-launch implementability check for unary_softsign_f64.
baracuda_kernels_unary_softsign_f64_run^⚠: Unary elementwise softsign, f64 dtype, contiguous fast path.
baracuda_kernels_unary_softsign_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_softsign_f64_strided.
baracuda_kernels_unary_softsign_f64_strided_run^⚠: Unary elementwise softsign, f64 dtype, strided path.
baracuda_kernels_unary_sqrt_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_sqrt_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_sqrt_backward_bf16_run^⚠: Sqrt backward, bf16.
baracuda_kernels_unary_sqrt_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_sqrt_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_sqrt_backward_f16_run^⚠: Sqrt backward, f16.
baracuda_kernels_unary_sqrt_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_sqrt_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_sqrt_backward_f32_run^⚠: Sqrt backward, f32. dx = dy / (2 * y). Saved-y. Callers must ensure y[i] != 0.
baracuda_kernels_unary_sqrt_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_sqrt_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_sqrt_backward_f64_run^⚠: Sqrt backward, f64.
baracuda_kernels_unary_sqrt_bf16_can_implement^⚠: Pre-launch implementability check for unary_sqrt_bf16.
baracuda_kernels_unary_sqrt_bf16_run^⚠: Unary elementwise sqrt, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_sqrt_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_sqrt_bf16_strided.
baracuda_kernels_unary_sqrt_bf16_strided_run^⚠: Unary elementwise sqrt, bf16 dtype, strided path.
baracuda_kernels_unary_sqrt_f16_can_implement^⚠: Pre-launch implementability check for unary_sqrt_f16.
baracuda_kernels_unary_sqrt_f16_run^⚠: Unary elementwise sqrt, f16 dtype, contiguous fast path.
baracuda_kernels_unary_sqrt_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_sqrt_f16_strided.
baracuda_kernels_unary_sqrt_f16_strided_run^⚠: Unary elementwise sqrt, f16 dtype, strided path.
baracuda_kernels_unary_sqrt_f32_can_implement^⚠: Pre-launch implementability check for unary_sqrt_f32.
baracuda_kernels_unary_sqrt_f32_run^⚠: Unary elementwise sqrt, f32 dtype, contiguous fast path.
baracuda_kernels_unary_sqrt_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_sqrt_f32_strided.
baracuda_kernels_unary_sqrt_f32_strided_run^⚠: Unary elementwise sqrt, f32 dtype, strided path.
baracuda_kernels_unary_sqrt_f64_can_implement^⚠: Pre-launch implementability check for unary_sqrt_f64.
baracuda_kernels_unary_sqrt_f64_run^⚠: Unary elementwise sqrt, f64 dtype, contiguous fast path.
baracuda_kernels_unary_sqrt_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_sqrt_f64_strided.
baracuda_kernels_unary_sqrt_f64_strided_run^⚠: Unary elementwise sqrt, f64 dtype, strided path.
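The sqrt backward entry above uses the saved output y rather than the input, and asks callers to ensure y[i] != 0. A small CPU sketch makes the reason visible: with IEEE arithmetic, a zero in y turns the division into an infinity or a NaN. The function name is hypothetical, and the behaviour shown for y = 0 is that of this host code only; what the device kernel does in that case is not stated in the source.

```rust
/// Hypothetical CPU reference for dx = dy / (2 * y), where `y` is the
/// saved forward output sqrt(x).
fn sqrt_backward_ref(y: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(y.len(), dy.len(), "y and dy must have the same length");
    y.iter()
        .zip(dy)
        .map(|(&yi, &dyi)| dyi / (2.0 * yi))
        .collect()
}

/// Returns true when every saved output is non-zero, i.e. when the
/// precondition stated in the index entry holds for this host copy.
fn saved_y_is_safe(y: &[f32]) -> bool {
    y.iter().all(|&yi| yi != 0.0)
}
```

A caller-side guard such as saved_y_is_safe is one possible way to check the precondition on a host copy before trusting the gradient.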
baracuda_kernels_unary_square_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_square_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_square_backward_bf16_run^⚠: Square backward, bf16.
baracuda_kernels_unary_square_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_square_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_square_backward_f16_run^⚠: Square backward, f16.
baracuda_kernels_unary_square_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_square_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_square_backward_f32_run^⚠: Square backward, f32. dx = dy * 2 * x.
baracuda_kernels_unary_square_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_square_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_square_backward_f64_run^⚠: Square backward, f64.
baracuda_kernels_unary_square_bf16_can_implement^⚠: Pre-launch implementability check for unary_square_bf16.
baracuda_kernels_unary_square_bf16_run^⚠: Unary elementwise square, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_square_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_square_bf16_strided.
baracuda_kernels_unary_square_bf16_strided_run^⚠: Unary elementwise square, bf16 dtype, strided path.
baracuda_kernels_unary_square_f16_can_implement^⚠: Pre-launch implementability check for unary_square_f16.
baracuda_kernels_unary_square_f16_run^⚠: Unary elementwise square, f16 dtype, contiguous fast path.
baracuda_kernels_unary_square_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_square_f16_strided.
baracuda_kernels_unary_square_f16_strided_run^⚠: Unary elementwise square, f16 dtype, strided path.
baracuda_kernels_unary_square_f32_can_implement^⚠: Pre-launch implementability check for unary_square_f32.
baracuda_kernels_unary_square_f32_run^⚠: Unary elementwise square, f32 dtype, contiguous fast path.
baracuda_kernels_unary_square_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_square_f32_strided.
baracuda_kernels_unary_square_f32_strided_run^⚠: Unary elementwise square, f32 dtype, strided path.
baracuda_kernels_unary_square_f64_can_implement^⚠: Pre-launch implementability check for unary_square_f64.
baracuda_kernels_unary_square_f64_run^⚠: Unary elementwise square, f64 dtype, contiguous fast path.
baracuda_kernels_unary_square_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_square_f64_strided.
baracuda_kernels_unary_square_f64_strided_run^⚠: Unary elementwise square, f64 dtype, strided path.
baracuda_kernels_unary_step_bf16_can_implement^⚠: Pre-launch implementability check for unary_step_bf16.
baracuda_kernels_unary_step_bf16_run^⚠: Unary elementwise step, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_step_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_step_bf16_strided.
baracuda_kernels_unary_step_bf16_strided_run^⚠: Unary elementwise step, bf16 dtype, strided path.
baracuda_kernels_unary_step_f16_can_implement^⚠: Pre-launch implementability check for unary_step_f16.
baracuda_kernels_unary_step_f16_run^⚠: Unary elementwise step, f16 dtype, contiguous fast path.
baracuda_kernels_unary_step_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_step_f16_strided.
baracuda_kernels_unary_step_f16_strided_run^⚠: Unary elementwise step, f16 dtype, strided path.
baracuda_kernels_unary_step_f32_can_implement^⚠: Pre-launch implementability check for unary_step_f32.
baracuda_kernels_unary_step_f32_run^⚠: Unary elementwise step, f32 dtype, contiguous fast path.
baracuda_kernels_unary_step_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_step_f32_strided.
baracuda_kernels_unary_step_f32_strided_run^⚠: Unary elementwise step, f32 dtype, strided path.
baracuda_kernels_unary_step_f64_can_implement^⚠: Pre-launch implementability check for unary_step_f64.
baracuda_kernels_unary_step_f64_run^⚠: Unary elementwise step, f64 dtype, contiguous fast path.
baracuda_kernels_unary_step_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_step_f64_strided.
baracuda_kernels_unary_step_f64_strided_run^⚠: Unary elementwise step, f64 dtype, strided path.
baracuda_kernels_unary_tan_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_tan_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_tan_backward_bf16_run^⚠: Tan backward, bf16.
baracuda_kernels_unary_tan_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_tan_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_tan_backward_f16_run^⚠: Tan backward, f16.
baracuda_kernels_unary_tan_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_tan_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_tan_backward_f32_run^⚠: Tan backward, f32. dx = dy * (1 + tan(x)²). Saved-x.
baracuda_kernels_unary_tan_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_tan_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_tan_backward_f64_run^⚠: Tan backward, f64.
baracuda_kernels_unary_tan_bf16_can_implement^⚠: Pre-launch implementability check for unary_tan_bf16.
baracuda_kernels_unary_tan_bf16_run^⚠: Unary elementwise tan, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_tan_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_tan_bf16_strided.
baracuda_kernels_unary_tan_bf16_strided_run^⚠: Unary elementwise tan, bf16 dtype, strided path.
baracuda_kernels_unary_tan_f16_can_implement^⚠: Pre-launch implementability check for unary_tan_f16.
baracuda_kernels_unary_tan_f16_run^⚠: Unary elementwise tan, f16 dtype, contiguous fast path.
baracuda_kernels_unary_tan_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_tan_f16_strided.
baracuda_kernels_unary_tan_f16_strided_run^⚠: Unary elementwise tan, f16 dtype, strided path.
baracuda_kernels_unary_tan_f32_can_implement^⚠: Pre-launch implementability check for unary_tan_f32.
baracuda_kernels_unary_tan_f32_run^⚠: Unary elementwise tan, f32 dtype, contiguous fast path.
baracuda_kernels_unary_tan_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_tan_f32_strided.
baracuda_kernels_unary_tan_f32_strided_run^⚠: Unary elementwise tan, f32 dtype, strided path.
baracuda_kernels_unary_tan_f64_can_implement^⚠: Pre-launch implementability check for unary_tan_f64.
baracuda_kernels_unary_tan_f64_run^⚠: Unary elementwise tan, f64 dtype, contiguous fast path.
baracuda_kernels_unary_tan_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_tan_f64_strided.
baracuda_kernels_unary_tan_f64_strided_run^⚠: Unary elementwise tan, f64 dtype, strided path.
baracuda_kernels_unary_tanh_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_tanh_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_tanh_backward_bf16_run^⚠: Tanh backward, bf16.
baracuda_kernels_unary_tanh_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_tanh_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_tanh_backward_f16_run^⚠: Tanh backward, f16.
baracuda_kernels_unary_tanh_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_tanh_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_tanh_backward_f32_run^⚠: Tanh backward, f32. dx = dy * (1 - y²). Saved-y.
baracuda_kernels_unary_tanh_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_tanh_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_tanh_backward_f64_run^⚠: Tanh backward, f64.
baracuda_kernels_unary_tanh_bf16_can_implement^⚠: Pre-launch implementability check for unary_tanh_bf16.
baracuda_kernels_unary_tanh_bf16_run^⚠: Unary elementwise tanh, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_tanh_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_tanh_bf16_strided.
baracuda_kernels_unary_tanh_bf16_strided_run^⚠: Unary elementwise tanh, bf16 dtype, strided path.
baracuda_kernels_unary_tanh_f16_can_implement^⚠: Pre-launch implementability check for unary_tanh_f16.
baracuda_kernels_unary_tanh_f16_run^⚠: Unary elementwise tanh, f16 dtype, contiguous fast path.
baracuda_kernels_unary_tanh_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_tanh_f16_strided.
baracuda_kernels_unary_tanh_f16_strided_run^⚠: Unary elementwise tanh, f16 dtype, strided path.
baracuda_kernels_unary_tanh_f32_can_implement^⚠: Pre-launch implementability check for unary_tanh_f32.
baracuda_kernels_unary_tanh_f32_run^⚠: Unary elementwise tanh, f32 dtype, contiguous fast path.
baracuda_kernels_unary_tanh_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_tanh_f32_strided.
baracuda_kernels_unary_tanh_f32_strided_run^⚠: Unary elementwise tanh, f32 dtype, strided path.
baracuda_kernels_unary_tanh_f64_can_implement^⚠: Pre-launch implementability check for unary_tanh_f64.
baracuda_kernels_unary_tanh_f64_run^⚠: Unary elementwise tanh, f64 dtype, contiguous fast path.
baracuda_kernels_unary_tanh_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_tanh_f64_strided.
baracuda_kernels_unary_tanh_f64_strided_run^⚠: Unary elementwise tanh, f64 dtype, strided path.
baracuda_kernels_unary_tanhshrink_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_tanhshrink_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_tanhshrink_backward_bf16_run^⚠: Tanhshrink backward, bf16.
baracuda_kernels_unary_tanhshrink_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_tanhshrink_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_tanhshrink_backward_f16_run^⚠: Tanhshrink backward, f16.
baracuda_kernels_unary_tanhshrink_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_tanhshrink_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_tanhshrink_backward_f32_run^⚠: Tanhshrink backward, f32. dx = dy * tanh(x)².
baracuda_kernels_unary_tanhshrink_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_tanhshrink_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_tanhshrink_backward_f64_run^⚠: Tanhshrink backward, f64.
baracuda_kernels_unary_tanhshrink_bf16_can_implement^⚠: Pre-launch implementability check for unary_tanhshrink_bf16.
baracuda_kernels_unary_tanhshrink_bf16_run^⚠: Unary elementwise tanhshrink, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_tanhshrink_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_tanhshrink_bf16_strided.
baracuda_kernels_unary_tanhshrink_bf16_strided_run^⚠: Unary elementwise tanhshrink, bf16 dtype, strided path.
baracuda_kernels_unary_tanhshrink_f16_can_implement^⚠: Pre-launch implementability check for unary_tanhshrink_f16.
baracuda_kernels_unary_tanhshrink_f16_run^⚠: Unary elementwise tanhshrink, f16 dtype, contiguous fast path.
baracuda_kernels_unary_tanhshrink_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_tanhshrink_f16_strided.
baracuda_kernels_unary_tanhshrink_f16_strided_run^⚠: Unary elementwise tanhshrink, f16 dtype, strided path.
baracuda_kernels_unary_tanhshrink_f32_can_implement^⚠: Pre-launch implementability check for unary_tanhshrink_f32.
baracuda_kernels_unary_tanhshrink_f32_run^⚠: Unary elementwise tanhshrink, f32 dtype, contiguous fast path.
baracuda_kernels_unary_tanhshrink_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_tanhshrink_f32_strided.
baracuda_kernels_unary_tanhshrink_f32_strided_run^⚠: Unary elementwise tanhshrink, f32 dtype, strided path.
baracuda_kernels_unary_tanhshrink_f64_can_implement^⚠: Pre-launch implementability check for unary_tanhshrink_f64.
baracuda_kernels_unary_tanhshrink_f64_run^⚠: Unary elementwise tanhshrink, f64 dtype, contiguous fast path.
baracuda_kernels_unary_tanhshrink_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_tanhshrink_f64_strided.
baracuda_kernels_unary_tanhshrink_f64_strided_run^⚠: Unary elementwise tanhshrink, f64 dtype, strided path.
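The tanhshrink backward rule above, dx = dy * tanh(x)², follows from differentiating x - tanh(x): the derivative is 1 - (1 - tanh(x)²) = tanh(x)². The sketch below checks that identity numerically on the host with a central difference. It assumes the conventional forward definition tanhshrink(x) = x - tanh(x), which the index does not state explicitly, and all function names are hypothetical.

```rust
/// Assumed forward definition: tanhshrink(x) = x - tanh(x).
fn tanhshrink(x: f64) -> f64 {
    x - x.tanh()
}

/// Backward rule as listed in the index: dx = dy * tanh(x)^2.
fn tanhshrink_backward_ref(x: f64, dy: f64) -> f64 {
    let t = x.tanh();
    dy * t * t
}

/// Central-difference estimate of f'(x) with step h.
fn central_difference(f: impl Fn(f64) -> f64, x: f64, h: f64) -> f64 {
    (f(x + h) - f(x - h)) / (2.0 * h)
}
```

The same finite-difference helper can be pointed at any of the other unary backward formulas in this index when writing host-side tests.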
baracuda_kernels_unary_threshold_backward_bf16_can_implement^⚠: Pre-launch implementability check for unary_threshold_backward_bf16. Host-side validation; no kernel launch.
baracuda_kernels_unary_threshold_backward_bf16_run^⚠: Threshold backward, bf16.
baracuda_kernels_unary_threshold_backward_f16_can_implement^⚠: Pre-launch implementability check for unary_threshold_backward_f16. Host-side validation; no kernel launch.
baracuda_kernels_unary_threshold_backward_f16_run^⚠: Threshold backward, f16.
baracuda_kernels_unary_threshold_backward_f32_can_implement^⚠: Pre-launch implementability check for unary_threshold_backward_f32. Host-side validation; no kernel launch.
baracuda_kernels_unary_threshold_backward_f32_run^⚠: Threshold backward, f32. dx = (x > t) ? dy : 0. Saved-x.
baracuda_kernels_unary_threshold_backward_f64_can_implement^⚠: Pre-launch implementability check for unary_threshold_backward_f64. Host-side validation; no kernel launch.
baracuda_kernels_unary_threshold_backward_f64_run^⚠: Threshold backward, f64.
baracuda_kernels_unary_threshold_bf16_can_implement^⚠: Implementability check for baracuda_kernels_unary_threshold_bf16. Host-side only.
baracuda_kernels_unary_threshold_bf16_run^⚠: Threshold forward, bf16.
baracuda_kernels_unary_threshold_f16_can_implement^⚠: Implementability check for baracuda_kernels_unary_threshold_f16. Host-side only.
baracuda_kernels_unary_threshold_f16_run^⚠: Threshold forward, f16.
baracuda_kernels_unary_threshold_f32_can_implement^⚠: Implementability check for baracuda_kernels_unary_threshold_f32. Host-side only.
baracuda_kernels_unary_threshold_f32_run^⚠: Unary elementwise threshold(x; t, v) = (x > t) ? x : v, f32, contig.
baracuda_kernels_unary_threshold_f64_can_implement^⚠: Implementability check for baracuda_kernels_unary_threshold_f64. Host-side only.
baracuda_kernels_unary_threshold_f64_run^⚠: Threshold forward, f64. The f32 parameters t and v widen to f64 losslessly.
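The threshold entries above define the forward as threshold(x; t, v) = (x > t) ? x : v and the backward as dx = (x > t) ? dy : 0, and note that the f64 variant takes its parameters as f32. A scalar CPU sketch of the f64 case follows. The function names are hypothetical, and the parameter order is chosen for illustration rather than taken from the extern signatures, which this index does not show.

```rust
/// Hypothetical CPU reference for the f64 forward. The f32 parameters are
/// widened with f64::from, which is exact for every f32 value.
fn threshold_ref(x: f64, t: f32, v: f32) -> f64 {
    let (t, v) = (f64::from(t), f64::from(v));
    if x > t {
        x
    } else {
        v
    }
}

/// Backward rule as listed: dx = (x > t) ? dy : 0, using the saved input x.
fn threshold_backward_ref(x: f64, dy: f64, t: f32) -> f64 {
    if x > f64::from(t) {
        dy
    } else {
        0.0
    }
}
```

Because the comparison is strict, an input equal to t takes the replacement value v in the forward and receives zero gradient in the backward.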
baracuda_kernels_unary_trunc_bf16_can_implement^⚠: Pre-launch implementability check for unary_trunc_bf16.
baracuda_kernels_unary_trunc_bf16_run^⚠: Unary elementwise trunc, bf16 dtype, contiguous fast path.
baracuda_kernels_unary_trunc_bf16_strided_can_implement^⚠: Pre-launch implementability check for unary_trunc_bf16_strided.
baracuda_kernels_unary_trunc_bf16_strided_run^⚠: Unary elementwise trunc, bf16 dtype, strided path.
baracuda_kernels_unary_trunc_f16_can_implement^⚠: Pre-launch implementability check for unary_trunc_f16.
baracuda_kernels_unary_trunc_f16_run^⚠: Unary elementwise trunc, f16 dtype, contiguous fast path.
baracuda_kernels_unary_trunc_f16_strided_can_implement^⚠: Pre-launch implementability check for unary_trunc_f16_strided.
baracuda_kernels_unary_trunc_f16_strided_run^⚠: Unary elementwise trunc, f16 dtype, strided path.
baracuda_kernels_unary_trunc_f32_can_implement^⚠: Pre-launch implementability check for unary_trunc_f32.
baracuda_kernels_unary_trunc_f32_run^⚠: Unary elementwise trunc, f32 dtype, contiguous fast path.
baracuda_kernels_unary_trunc_f32_strided_can_implement^⚠: Pre-launch implementability check for unary_trunc_f32_strided.
baracuda_kernels_unary_trunc_f32_strided_run^⚠: Unary elementwise trunc, f32 dtype, strided path.
baracuda_kernels_unary_trunc_f64_can_implement^⚠: Pre-launch implementability check for unary_trunc_f64.
baracuda_kernels_unary_trunc_f64_run^⚠: Unary elementwise trunc, f64 dtype, contiguous fast path.
baracuda_kernels_unary_trunc_f64_strided_can_implement^⚠: Pre-launch implementability check for unary_trunc_f64_strided.
baracuda_kernels_unary_trunc_f64_strided_run^⚠: Unary elementwise trunc, f64 dtype, strided path.
baracuda_kernels_unique_consecutive_f32_can_implement^⚠: Pre-launch implementability check for unique_consecutive_f32.
baracuda_kernels_unique_consecutive_f32_run^⚠: Unique-consecutive, f32. Emits one cell per run-start; output slot order is atomic-counter race order. counter[row] holds the actual unique count post-launch.
baracuda_kernels_unique_consecutive_f64_can_implement^⚠: Pre-launch implementability check for unique_consecutive_f64.
baracuda_kernels_unique_consecutive_f64_run^⚠: Unique-consecutive, f64.
baracuda_kernels_unique_consecutive_i32_can_implement^⚠: Pre-launch implementability check for unique_consecutive_i32.
baracuda_kernels_unique_consecutive_i32_run^⚠: Unique-consecutive, i32.
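The unique-consecutive entries above emit one value per run start, but in atomic-counter race order, so a host test cannot compare the device output position by position. One possible approach, sketched below for a single i32 row, is to compute the run starts sequentially and compare the two results as multisets. Both function names are hypothetical and nothing here calls the device kernel.

```rust
/// Hypothetical CPU reference: keeps the first element of every run of
/// equal consecutive values, in input order.
fn unique_consecutive_ref(row: &[i32]) -> Vec<i32> {
    let mut out = Vec::new();
    for (i, &v) in row.iter().enumerate() {
        if i == 0 || row[i - 1] != v {
            out.push(v);
        }
    }
    out
}

/// Order-insensitive comparison, for outputs whose slot order depends on
/// which thread won the atomic counter.
fn same_multiset(a: &[i32], b: &[i32]) -> bool {
    let (mut a, mut b) = (a.to_vec(), b.to_vec());
    a.sort_unstable();
    b.sort_unstable();
    a == b
}
```

In such a test, the length of the reference vector is the value to compare against counter[row], and only the first counter[row] output slots would be passed to same_multiset.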
baracuda_kernels_unsorted_segment_max_backward_f32_can_implement^⚠: Implementability check for unsorted_segment_max_backward_f32.
baracuda_kernels_unsorted_segment_max_backward_f32_run^⚠: unsorted_segment_max_backward — f32.
baracuda_kernels_unsorted_segment_max_backward_f64_can_implement^⚠: Implementability check for unsorted_segment_max_backward_f64.
baracuda_kernels_unsorted_segment_max_backward_f64_run^⚠: unsorted_segment_max_backward — f64.
baracuda_kernels_unsorted_segment_max_f32_can_implement^⚠: Implementability check for unsorted_segment_max_f32.
baracuda_kernels_unsorted_segment_max_f32_run^⚠: out[s, d] = max_{n : seg[n] == s} input[n, d] — unsorted; atomicMax-via-CAS. Output pre-initialized to -inf by the launcher. f32.
baracuda_kernels_unsorted_segment_max_f64_can_implement^⚠: Implementability check for unsorted_segment_max_f64.
baracuda_kernels_unsorted_segment_max_f64_run^⚠: unsorted_segment_max — f64.
baracuda_kernels_unsorted_segment_max_i64idx_f32_can_implement^⚠: Implementability check for unsorted_segment_max_i64idx_f32.
baracuda_kernels_unsorted_segment_max_i64idx_f32_run^⚠: unsorted_segment_max, i64idx variant, f32.
baracuda_kernels_unsorted_segment_max_i64idx_f64_can_implement^⚠: Implementability check for unsorted_segment_max_i64idx_f64.
baracuda_kernels_unsorted_segment_max_i64idx_f64_run^⚠: unsorted_segment_max, i64idx variant, f64.
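The unsorted_segment_max entries above compute out[s, d] = max over all n with seg[n] == s of input[n, d], with the output pre-initialized to -inf. A sequential CPU sketch of that definition is given below. It assumes row-major [N, D] input and [num_segments, D] output, i32 segment ids in the range 0..num_segments, and a hypothetical function name; how the device kernel treats out-of-range ids is not stated in the source.

```rust
/// Hypothetical CPU reference. `input` is row-major [N, D], `seg` has N
/// entries, and the result is row-major [num_segments, D].
fn unsorted_segment_max_ref(
    input: &[f32],
    seg: &[i32],
    num_segments: usize,
    d: usize,
) -> Vec<f32> {
    assert_eq!(input.len(), seg.len() * d, "input must be N * D elements");
    // Mirrors the -inf pre-initialization described in the index entry.
    let mut out = vec![f32::NEG_INFINITY; num_segments * d];
    for (n, &s) in seg.iter().enumerate() {
        // Assumption of this sketch: ids are valid and non-negative.
        assert!(s >= 0 && (s as usize) < num_segments, "segment id out of range");
        let s = s as usize;
        for j in 0..d {
            let v = input[n * d + j];
            if v > out[s * d + j] {
                out[s * d + j] = v;
            }
        }
    }
    out
}
```

Segments that receive no rows keep the -inf initial value in this sketch, which follows directly from the pre-initialization.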
baracuda_kernels_unsorted_segment_mean_backward_f32_can_implement^⚠: Implementability check for unsorted_segment_mean_backward_f32.
baracuda_kernels_unsorted_segment_mean_backward_f32_run^⚠: unsorted_segment_mean_backward — f32.
baracuda_kernels_unsorted_segment_mean_backward_f64_can_implement^⚠: Implementability check for unsorted_segment_mean_backward_f64.
baracuda_kernels_unsorted_segment_mean_backward_f64_run^⚠: unsorted_segment_mean_backward — f64.
baracuda_kernels_unsorted_segment_mean_backward_i64idx_f32_can_implement^⚠: Implementability check for unsorted_segment_mean_backward_i64idx_f32.
baracuda_kernels_unsorted_segment_mean_backward_i64idx_f32_run^⚠: unsorted_segment_mean_backward, i64idx variant, f32.
baracuda_kernels_unsorted_segment_mean_backward_i64idx_f64_can_implement^⚠: Implementability check for unsorted_segment_mean_backward_i64idx_f64.
baracuda_kernels_unsorted_segment_mean_backward_i64idx_f64_run^⚠: unsorted_segment_mean_backward, i64idx variant, f64.
baracuda_kernels_unsorted_segment_mean_f32_can_implement^⚠: Implementability check for unsorted_segment_mean_f32.
baracuda_kernels_unsorted_segment_mean_f32_run^⚠: out[s, d] = mean_{n : seg[n] == s} input[n, d] — unsorted. Workspace: num_segments * sizeof(i32) for per-segment counts. f32.
baracuda_kernels_unsorted_segment_mean_f64_can_implement^⚠: Implementability check for unsorted_segment_mean_f64.
baracuda_kernels_unsorted_segment_mean_f64_run^⚠: unsorted_segment_mean — f64.
baracuda_kernels_unsorted_segment_mean_i64idx_f32_can_implement^⚠: Implementability check for unsorted_segment_mean_i64idx_f32.
baracuda_kernels_unsorted_segment_mean_i64idx_f32_run^⚠: unsorted_segment_mean, i64idx variant, f32.
baracuda_kernels_unsorted_segment_mean_i64idx_f64_can_implement^⚠: Implementability check for unsorted_segment_mean_i64idx_f64.
baracuda_kernels_unsorted_segment_mean_i64idx_f64_run^⚠: unsorted_segment_mean, i64idx variant, f64.
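The unsorted_segment_mean entries above state a workspace requirement of num_segments * sizeof(i32) bytes for per-segment counts. The sketch below shows that size computation together with a sequential CPU reference for the mean itself. The function names are hypothetical, the layouts are assumed row-major, and leaving empty segments at 0 is an assumption of this sketch rather than documented kernel behaviour.

```rust
use std::mem::size_of;

/// Workspace size as stated in the index: one i32 count per segment.
fn mean_workspace_bytes(num_segments: usize) -> usize {
    num_segments * size_of::<i32>()
}

/// Hypothetical CPU reference. `input` is row-major [N, D]; the result is
/// row-major [num_segments, D]. Empty segments are left at 0 (assumption).
fn unsorted_segment_mean_ref(
    input: &[f32],
    seg: &[i32],
    num_segments: usize,
    d: usize,
) -> Vec<f32> {
    assert_eq!(input.len(), seg.len() * d, "input must be N * D elements");
    let mut out = vec![0.0f32; num_segments * d];
    // Plays the role of the per-segment count workspace.
    let mut counts = vec![0i32; num_segments];
    for (n, &s) in seg.iter().enumerate() {
        assert!(s >= 0 && (s as usize) < num_segments, "segment id out of range");
        let s = s as usize;
        counts[s] += 1;
        for j in 0..d {
            out[s * d + j] += input[n * d + j];
        }
    }
    for s in 0..num_segments {
        if counts[s] > 0 {
            for j in 0..d {
                out[s * d + j] /= counts[s] as f32;
            }
        }
    }
    out
}
```

A caller would allocate at least mean_workspace_bytes(num_segments) bytes of device memory before invoking the run function; otherwise the status-code table for this crate lists code 4 for a workspace that is too small or null.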
baracuda_kernels_unsorted_segment_min_backward_f32_can_implement^⚠: Implementability check for unsorted_segment_min_backward_f32.
baracuda_kernels_unsorted_segment_min_backward_f32_run^⚠: unsorted_segment_min_backward — f32.
baracuda_kernels_unsorted_segment_min_backward_f64_can_implement^⚠: Implementability check for unsorted_segment_min_backward_f64.
baracuda_kernels_unsorted_segment_min_backward_f64_run^⚠: unsorted_segment_min_backward — f64.
baracuda_kernels_unsorted_segment_min_f32_can_implement^⚠: Implementability check for unsorted_segment_min_f32.
baracuda_kernels_unsorted_segment_min_f32_run^⚠: out[s, d] = min_{n : seg[n] == s} input[n, d] — unsorted. f32.
baracuda_kernels_unsorted_segment_min_f64_can_implement^⚠: Implementability check for unsorted_segment_min_f64.
baracuda_kernels_unsorted_segment_min_f64_run^⚠: unsorted_segment_min — f64.
baracuda_kernels_unsorted_segment_min_i64idx_f32_can_implement^⚠: Implementability check for unsorted_segment_min_i64idx_f32.
baracuda_kernels_unsorted_segment_min_i64idx_f32_run^⚠: unsorted_segment_min, f32 values with i64 segment ids.
baracuda_kernels_unsorted_segment_min_i64idx_f64_can_implement^⚠: Implementability check for unsorted_segment_min_i64idx_f64.
baracuda_kernels_unsorted_segment_min_i64idx_f64_run^⚠: unsorted_segment_min, f64 values with i64 segment ids.
baracuda_kernels_unsorted_segment_prod_backward_f32_can_implement^⚠: Implementability check for unsorted_segment_prod_backward_f32.
baracuda_kernels_unsorted_segment_prod_backward_f32_run^⚠: unsorted_segment_prod_backward — f32. Shares the kernel with the sorted variant; distinct symbol for SKU tagging.
baracuda_kernels_unsorted_segment_prod_backward_f64_can_implement^⚠: Implementability check for unsorted_segment_prod_backward_f64.
baracuda_kernels_unsorted_segment_prod_backward_f64_run^⚠: unsorted_segment_prod_backward — f64.
baracuda_kernels_unsorted_segment_prod_f32_can_implement^⚠: Implementability check for unsorted_segment_prod_f32.
baracuda_kernels_unsorted_segment_prod_f32_run^⚠: unsorted_segment_prod FW — f32.
baracuda_kernels_unsorted_segment_prod_f64_can_implement^⚠: Implementability check for unsorted_segment_prod_f64.
baracuda_kernels_unsorted_segment_prod_f64_run^⚠: unsorted_segment_prod FW — f64.
baracuda_kernels_unsorted_segment_sum_backward_f32_can_implement^⚠: Implementability check for unsorted_segment_sum_backward_f32.
baracuda_kernels_unsorted_segment_sum_backward_f32_run^⚠: Same kernel as segment_sum_backward_f32; distinct symbol for SKU-tagging differentiation.
baracuda_kernels_unsorted_segment_sum_backward_f64_can_implement^⚠: Implementability check for unsorted_segment_sum_backward_f64.
baracuda_kernels_unsorted_segment_sum_backward_f64_run^⚠: unsorted_segment_sum_backward — f64.
baracuda_kernels_unsorted_segment_sum_backward_i64idx_f32_can_implement^⚠: Implementability check for unsorted_segment_sum_backward_i64idx_f32.
baracuda_kernels_unsorted_segment_sum_backward_i64idx_f32_run^⚠: unsorted_segment_sum_backward, f32 values with i64 segment ids.
baracuda_kernels_unsorted_segment_sum_backward_i64idx_f64_can_implement^⚠: Implementability check for unsorted_segment_sum_backward_i64idx_f64.
baracuda_kernels_unsorted_segment_sum_backward_i64idx_f64_run^⚠: unsorted_segment_sum_backward, f64 values with i64 segment ids.
baracuda_kernels_unsorted_segment_sum_f32_can_implement^⚠: Implementability check for unsorted_segment_sum_f32.
baracuda_kernels_unsorted_segment_sum_f32_run^⚠: out[s, d] = Σ_{n : seg[n] == s} input[n, d] — unsorted seg ids; atomicAdd into output. Output pre-zeroed by the launcher. f32.
baracuda_kernels_unsorted_segment_sum_f64_can_implement^⚠: Implementability check for unsorted_segment_sum_f64.
baracuda_kernels_unsorted_segment_sum_f64_run^⚠: unsorted_segment_sum — f64.
baracuda_kernels_unsorted_segment_sum_i64idx_f32_can_implement^⚠: Implementability check for unsorted_segment_sum_i64idx_f32.
baracuda_kernels_unsorted_segment_sum_i64idx_f32_run^⚠: unsorted_segment_sum, f32 values with i64 segment ids.
baracuda_kernels_unsorted_segment_sum_i64idx_f64_can_implement^⚠: Implementability check for unsorted_segment_sum_i64idx_f64.
baracuda_kernels_unsorted_segment_sum_i64idx_f64_run^⚠: unsorted_segment_sum, f64 values with i64 segment ids.
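As a host-side illustration of the reduction the unsorted_segment_* entries describe, the sketch below computes the sum and the mean on the CPU. It is not the kernel code: the function names, the row-major [n, d] layout, the skipping of out-of-range ids, and the choice to emit 0 for an empty segment in the mean are all assumptions made for this example.

```rust
/// CPU reference for out[s, d] = sum over {n : seg[n] == s} of input[n, d].
/// Assumes a row-major input of shape [seg.len(), d].
fn unsorted_segment_sum_ref(input: &[f32], seg: &[i32], d: usize, num_segments: usize) -> Vec<f32> {
    assert_eq!(input.len(), seg.len() * d);
    // Start from zeros, mirroring the pre-zeroed output the sum kernel accumulates into.
    let mut out = vec![0.0f32; num_segments * d];
    for (n, &s) in seg.iter().enumerate() {
        if s < 0 || s as usize >= num_segments {
            continue; // assumption: out-of-range ids contribute nothing
        }
        let s = s as usize;
        for j in 0..d {
            out[s * d + j] += input[n * d + j];
        }
    }
    out
}

/// CPU reference for the mean: segment sums divided by per-segment counts.
/// The `counts` vector plays the role of the i32 count workspace.
fn unsorted_segment_mean_ref(input: &[f32], seg: &[i32], d: usize, num_segments: usize) -> Vec<f32> {
    let mut out = unsorted_segment_sum_ref(input, seg, d, num_segments);
    let mut counts = vec![0i32; num_segments];
    for &s in seg {
        if s >= 0 && (s as usize) < num_segments {
            counts[s as usize] += 1;
        }
    }
    for s in 0..num_segments {
        if counts[s] > 0 {
            for j in 0..d {
                out[s * d + j] /= counts[s] as f32;
            }
        }
        // assumption: empty segments are left at 0
    }
    out
}
```

A reference like this can be handy for checking device results in tests, since the segment ids do not need to be sorted for either function.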
baracuda_kernels_upsample_bilinear_2d_bw_bf16_run^⚠: Alias for baracuda_kernels_interpolate_bilinear_2d_backward_bf16_run.
baracuda_kernels_upsample_bilinear_2d_bw_f16_run^⚠: Alias for baracuda_kernels_interpolate_bilinear_2d_backward_f16_run.
baracuda_kernels_upsample_bilinear_2d_bw_f32_run^⚠: Alias for baracuda_kernels_interpolate_bilinear_2d_backward_f32_run.
baracuda_kernels_upsample_bilinear_2d_bw_f64_run^⚠: Alias for baracuda_kernels_interpolate_bilinear_2d_backward_f64_run.
baracuda_kernels_upsample_bilinear_2d_fw_bf16_run^⚠: Alias for baracuda_kernels_interpolate_bilinear_2d_bf16_run.
baracuda_kernels_upsample_bilinear_2d_fw_f16_run^⚠: Alias for baracuda_kernels_interpolate_bilinear_2d_f16_run.
baracuda_kernels_upsample_bilinear_2d_fw_f32_run^⚠: Alias for baracuda_kernels_interpolate_bilinear_2d_f32_run under the new Phase 19.2 upsample_* naming convention.
baracuda_kernels_upsample_bilinear_2d_fw_f64_run^⚠: Alias for baracuda_kernels_interpolate_bilinear_2d_f64_run.
baracuda_kernels_upsample_nearest_2d_bw_bf16_can_implement^⚠: Implementability check for upsample_nearest_2d_bw_bf16.
baracuda_kernels_upsample_nearest_2d_bw_bf16_run^⚠: upsample_nearest_2d BW, bf16. Safety: same as the f32 BW variant.
baracuda_kernels_upsample_nearest_2d_bw_f16_can_implement^⚠: Implementability check for upsample_nearest_2d_bw_f16.
baracuda_kernels_upsample_nearest_2d_bw_f16_run^⚠: upsample_nearest_2d BW, f16. Safety: same as the f32 BW variant. Uses the baracuda::atomic::add<__half> (CAS-based) helper.
baracuda_kernels_upsample_nearest_2d_bw_f32_can_implement^⚠: Implementability check for upsample_nearest_2d_bw_f32.
baracuda_kernels_upsample_nearest_2d_bw_f32_run^⚠: upsample_nearest_2d BW, f32. Caller pre-zeros dinput.
baracuda_kernels_upsample_nearest_2d_bw_f64_can_implement^⚠: Implementability check for upsample_nearest_2d_bw_f64.
baracuda_kernels_upsample_nearest_2d_bw_f64_run^⚠: upsample_nearest_2d BW, f64. Safety: same as the f32 BW variant.
baracuda_kernels_upsample_nearest_2d_fw_bf16_can_implement^⚠: Implementability check for upsample_nearest_2d_fw_bf16.
baracuda_kernels_upsample_nearest_2d_fw_bf16_run^⚠: upsample_nearest_2d FW, bf16. Safety: same as the f32 FW variant.
baracuda_kernels_upsample_nearest_2d_fw_f16_can_implement^⚠: Implementability check for upsample_nearest_2d_fw_f16.
baracuda_kernels_upsample_nearest_2d_fw_f16_run^⚠: upsample_nearest_2d FW, f16. Safety: same as the f32 FW variant.
baracuda_kernels_upsample_nearest_2d_fw_f32_can_implement^⚠: Implementability check for upsample_nearest_2d_fw_f32.
baracuda_kernels_upsample_nearest_2d_fw_f32_run^⚠: upsample(x, mode='nearest') FW, f32. input: [N, C, IH, IW]; output: [N, C, OH, OW]. NCHW. Coordinate mapping: nearest under align_corners=false.
baracuda_kernels_upsample_nearest_2d_fw_f64_can_implement^⚠: Implementability check for upsample_nearest_2d_fw_f64.
baracuda_kernels_upsample_nearest_2d_fw_f64_run^⚠: upsample_nearest_2d FW, f64. Safety: same as the f32 FW variant.
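The nearest coordinate mapping mentioned for the f32 forward entry can be sketched on the CPU as follows. The exact rounding used by the kernel is not stated here, so this example assumes the common floor(dst * in_size / out_size) rule, evaluated in integer arithmetic rather than with a floating-point scale. Both function names are hypothetical.

```rust
/// Source index for a destination index under a nearest rule with
/// align_corners = false: floor(dst * in_size / out_size), clamped to in_size - 1.
/// Assumes in_size >= 1 and out_size >= 1.
fn nearest_src_index(dst: usize, in_size: usize, out_size: usize) -> usize {
    (dst * in_size / out_size).min(in_size - 1)
}

/// CPU reference for a single [IH, IW] plane; a full NCHW tensor would
/// repeat this for every (n, c) pair.
fn upsample_nearest_2d_plane(input: &[f32], ih: usize, iw: usize, oh: usize, ow: usize) -> Vec<f32> {
    assert_eq!(input.len(), ih * iw);
    let mut out = Vec::with_capacity(oh * ow);
    for y in 0..oh {
        let sy = nearest_src_index(y, ih, oh);
        for x in 0..ow {
            let sx = nearest_src_index(x, iw, ow);
            out.push(input[sy * iw + sx]);
        }
    }
    out
}
```

The backward pass is the transpose of this gather: every output gradient is added into the source cell it was read from, which is why the BW entries expect dinput to be zeroed first.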
baracuda_kernels_where_backward_bf16_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_backward_bf16_run^⚠: where backward, bf16.
baracuda_kernels_where_backward_f16_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_backward_f16_run^⚠: where backward, f16.
baracuda_kernels_where_backward_f32_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_backward_f32_run^⚠: where backward, f32. Writes da = cond ? dy : 0 and db = cond ? 0 : dy.
baracuda_kernels_where_backward_f64_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_backward_f64_run^⚠: where backward, f64.
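The gradient routing stated for where backward (da = cond ? dy : 0, db = cond ? 0 : dy) is small enough to write out as a CPU reference. This sketch assumes a u8 condition buffer in which any non-zero byte counts as true; the function name is hypothetical.

```rust
/// CPU reference for where backward: each upstream gradient element is
/// routed to exactly one of the two operands, the other receives 0.
fn where_backward_ref(cond: &[u8], dy: &[f32]) -> (Vec<f32>, Vec<f32>) {
    assert_eq!(cond.len(), dy.len());
    let mut da = vec![0.0f32; dy.len()];
    let mut db = vec![0.0f32; dy.len()];
    for i in 0..dy.len() {
        if cond[i] != 0 {
            da[i] = dy[i]; // assumption: non-zero cond byte means "take a"
        } else {
            db[i] = dy[i];
        }
    }
    (da, db)
}
```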
baracuda_kernels_where_bf16_can_implement^⚠: Pre-launch check for where_bf16.
baracuda_kernels_where_bf16_run^⚠: where(cond, a, b), bf16 values + u8 cond, contig fast path.
baracuda_kernels_where_bf16_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_bf16_strided_run^⚠: where(cond, a, b), bf16 values, strided / broadcast path.
baracuda_kernels_where_f16_can_implement^⚠: Pre-launch check for where_f16.
baracuda_kernels_where_f16_run^⚠: where(cond, a, b), f16 values + u8 cond, contig fast path.
baracuda_kernels_where_f16_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_f16_strided_run^⚠: where(cond, a, b), f16 values, strided / broadcast path.
baracuda_kernels_where_f32_can_implement^⚠: Pre-launch check for where_f32.
baracuda_kernels_where_f32_run^⚠: where(cond, a, b), f32 values + u8 cond, contig fast path. This is the where-ternary trailblazer — its safety + aliasing contract carries over to every other where-family launcher across all value dtypes and cond-dtype variants (where_u32cond_*, where_i64cond_*).
baracuda_kernels_where_f32_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_f32_strided_run^⚠: where(cond, a, b), f32 values, strided / broadcast path.
baracuda_kernels_where_f64_can_implement^⚠: Pre-launch check for where_f64.
baracuda_kernels_where_f64_run^⚠: where(cond, a, b), f64 values + u8 cond, contig fast path.
baracuda_kernels_where_f64_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_f64_strided_run^⚠: where(cond, a, b), f64 values, strided / broadcast path.
baracuda_kernels_where_i64cond_bf16_can_implement^⚠: Pre-launch check for where_i64cond_bf16.
baracuda_kernels_where_i64cond_bf16_run^⚠: where(cond, a, b), i64 cond + bf16 values, contig fast path.
baracuda_kernels_where_i64cond_bf16_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_i64cond_bf16_strided_run^⚠: where(cond, a, b), i64 cond + bf16 values, strided / broadcast.
baracuda_kernels_where_i64cond_f16_can_implement^⚠: Pre-launch check for where_i64cond_f16.
baracuda_kernels_where_i64cond_f16_run^⚠: where(cond, a, b), i64 cond + f16 values, contig fast path.
baracuda_kernels_where_i64cond_f16_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_i64cond_f16_strided_run^⚠: where(cond, a, b), i64 cond + f16 values, strided / broadcast.
baracuda_kernels_where_i64cond_f32_can_implement^⚠: Pre-launch check for where_i64cond_f32.
baracuda_kernels_where_i64cond_f32_run^⚠: where(cond, a, b), i64 cond + f32 values, contig fast path.
baracuda_kernels_where_i64cond_f32_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_i64cond_f32_strided_run^⚠: where(cond, a, b), i64 cond + f32 values, strided / broadcast.
baracuda_kernels_where_i64cond_f64_can_implement^⚠: Pre-launch check for where_i64cond_f64.
baracuda_kernels_where_i64cond_f64_run^⚠: where(cond, a, b), i64 cond + f64 values, contig fast path.
baracuda_kernels_where_i64cond_f64_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_i64cond_f64_strided_run^⚠: where(cond, a, b), i64 cond + f64 values, strided / broadcast.
baracuda_kernels_where_i64cond_fp8e4m3_can_implement^⚠: Pre-launch check for where_i64cond_fp8e4m3.
baracuda_kernels_where_i64cond_fp8e4m3_run^⚠: where(cond, a, b), i64 cond + Fp8E4M3 values, contig fast path.
baracuda_kernels_where_i64cond_fp8e4m3_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_i64cond_fp8e4m3_strided_run^⚠: where(cond, a, b), i64 cond + Fp8E4M3 values, strided / broadcast.
baracuda_kernels_where_i64cond_i8_can_implement^⚠: Pre-launch check for where_i64cond_i8.
baracuda_kernels_where_i64cond_i8_run^⚠: where(cond, a, b), i64 cond + i8 values, contig fast path.
baracuda_kernels_where_i64cond_i8_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_i64cond_i8_strided_run^⚠: where(cond, a, b), i64 cond + i8 values, strided / broadcast.
baracuda_kernels_where_i64cond_i16_can_implement^⚠: Pre-launch check for where_i64cond_i16.
baracuda_kernels_where_i64cond_i16_run^⚠: where(cond, a, b), i64 cond + i16 values, contig fast path.
baracuda_kernels_where_i64cond_i16_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_i64cond_i16_strided_run^⚠: where(cond, a, b), i64 cond + i16 values, strided / broadcast.
baracuda_kernels_where_i64cond_i32_can_implement^⚠: Pre-launch check for where_i64cond_i32.
baracuda_kernels_where_i64cond_i32_run^⚠: where(cond, a, b), i64 cond + i32 values, contig fast path.
baracuda_kernels_where_i64cond_i32_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_i64cond_i32_strided_run^⚠: where(cond, a, b), i64 cond + i32 values, strided / broadcast.
baracuda_kernels_where_i64cond_i64_can_implement^⚠: Pre-launch check for where_i64cond_i64.
baracuda_kernels_where_i64cond_i64_run^⚠: where(cond, a, b), i64 cond + i64 values, contig fast path.
baracuda_kernels_where_i64cond_i64_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_i64cond_i64_strided_run^⚠: where(cond, a, b), i64 cond + i64 values, strided / broadcast.
baracuda_kernels_where_i64cond_u8_can_implement^⚠: Pre-launch check for where_i64cond_u8.
baracuda_kernels_where_i64cond_u8_run^⚠: where(cond, a, b), i64 cond + u8 values, contig fast path.
baracuda_kernels_where_i64cond_u8_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_i64cond_u8_strided_run^⚠: where(cond, a, b), i64 cond + u8 values, strided / broadcast.
baracuda_kernels_where_i64cond_u32_can_implement^⚠: Pre-launch check for where_i64cond_u32.
baracuda_kernels_where_i64cond_u32_run^⚠: where(cond, a, b), i64 cond + u32 values, contig fast path.
baracuda_kernels_where_i64cond_u32_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_i64cond_u32_strided_run^⚠: where(cond, a, b), i64 cond + u32 values, strided / broadcast.
baracuda_kernels_where_u8cond_fp8e4m3_can_implement^⚠: Pre-launch check for where_u8cond_fp8e4m3.
baracuda_kernels_where_u8cond_fp8e4m3_run^⚠: where(cond, a, b), u8 cond + Fp8E4M3 values, contig fast path.
baracuda_kernels_where_u8cond_fp8e4m3_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_u8cond_fp8e4m3_strided_run^⚠: where(cond, a, b), u8 cond + Fp8E4M3 values, strided / broadcast.
baracuda_kernels_where_u8cond_i8_can_implement^⚠: Pre-launch check for where_u8cond_i8.
baracuda_kernels_where_u8cond_i8_run^⚠: where(cond, a, b), u8 cond + i8 values, contig fast path.
baracuda_kernels_where_u8cond_i8_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_u8cond_i8_strided_run^⚠: where(cond, a, b), u8 cond + i8 values, strided / broadcast.
baracuda_kernels_where_u8cond_i16_can_implement^⚠: Pre-launch check for where_u8cond_i16.
baracuda_kernels_where_u8cond_i16_run^⚠: where(cond, a, b), u8 cond + i16 values, contig fast path.
baracuda_kernels_where_u8cond_i16_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_u8cond_i16_strided_run^⚠: where(cond, a, b), u8 cond + i16 values, strided / broadcast.
baracuda_kernels_where_u8cond_i32_can_implement^⚠: Pre-launch check for where_u8cond_i32.
baracuda_kernels_where_u8cond_i32_run^⚠: where(cond, a, b), u8 cond + i32 values, contig fast path.
baracuda_kernels_where_u8cond_i32_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_u8cond_i32_strided_run^⚠: where(cond, a, b), u8 cond + i32 values, strided / broadcast.
baracuda_kernels_where_u8cond_i64_can_implement^⚠: Pre-launch check for where_u8cond_i64.
baracuda_kernels_where_u8cond_i64_run^⚠: where(cond, a, b), u8 cond + i64 values, contig fast path.
baracuda_kernels_where_u8cond_i64_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_u8cond_i64_strided_run^⚠: where(cond, a, b), u8 cond + i64 values, strided / broadcast.
baracuda_kernels_where_u8cond_u8_can_implement^⚠: Pre-launch check for where_u8cond_u8.
baracuda_kernels_where_u8cond_u8_run^⚠: where(cond, a, b), u8 cond + u8 values, contig fast path.
baracuda_kernels_where_u8cond_u8_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_u8cond_u8_strided_run^⚠: where(cond, a, b), u8 cond + u8 values, strided / broadcast.
baracuda_kernels_where_u8cond_u32_can_implement^⚠: Pre-launch check for where_u8cond_u32.
baracuda_kernels_where_u8cond_u32_run^⚠: where(cond, a, b), u8 cond + u32 values, contig fast path.
baracuda_kernels_where_u8cond_u32_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_u8cond_u32_strided_run^⚠: where(cond, a, b), u8 cond + u32 values, strided / broadcast.
baracuda_kernels_where_u32cond_bf16_can_implement^⚠: Pre-launch check for where_u32cond_bf16.
baracuda_kernels_where_u32cond_bf16_run^⚠: where(cond, a, b), u32 cond + bf16 values, contig fast path.
baracuda_kernels_where_u32cond_bf16_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_u32cond_bf16_strided_run^⚠: where(cond, a, b), u32 cond + bf16 values, strided / broadcast.
baracuda_kernels_where_u32cond_f16_can_implement^⚠: Pre-launch check for where_u32cond_f16.
baracuda_kernels_where_u32cond_f16_run^⚠: where(cond, a, b), u32 cond + f16 values, contig fast path.
baracuda_kernels_where_u32cond_f16_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_u32cond_f16_strided_run^⚠: where(cond, a, b), u32 cond + f16 values, strided / broadcast.
baracuda_kernels_where_u32cond_f32_can_implement^⚠: Pre-launch check for where_u32cond_f32.
baracuda_kernels_where_u32cond_f32_run^⚠: where(cond, a, b), u32 cond + f32 values, contig fast path.
baracuda_kernels_where_u32cond_f32_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_u32cond_f32_strided_run^⚠: where(cond, a, b), u32 cond + f32 values, strided / broadcast path. Each operand carries its own stride array.
baracuda_kernels_where_u32cond_f64_can_implement^⚠: Pre-launch check for where_u32cond_f64.
baracuda_kernels_where_u32cond_f64_run^⚠: where(cond, a, b), u32 cond + f64 values, contig fast path.
baracuda_kernels_where_u32cond_f64_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_u32cond_f64_strided_run^⚠: where(cond, a, b), u32 cond + f64 values, strided / broadcast.
baracuda_kernels_where_u32cond_fp8e4m3_can_implement^⚠: Pre-launch check for where_u32cond_fp8e4m3.
baracuda_kernels_where_u32cond_fp8e4m3_run^⚠: where(cond, a, b), u32 cond + Fp8E4M3 values, contig fast path.
baracuda_kernels_where_u32cond_fp8e4m3_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_u32cond_fp8e4m3_strided_run^⚠: where(cond, a, b), u32 cond + Fp8E4M3 values, strided / broadcast.
baracuda_kernels_where_u32cond_i8_can_implement^⚠: Pre-launch check for where_u32cond_i8.
baracuda_kernels_where_u32cond_i8_run^⚠: where(cond, a, b), u32 cond + i8 values, contig fast path.
baracuda_kernels_where_u32cond_i8_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_u32cond_i8_strided_run^⚠: where(cond, a, b), u32 cond + i8 values, strided / broadcast.
baracuda_kernels_where_u32cond_i16_can_implement^⚠: Pre-launch check for where_u32cond_i16.
baracuda_kernels_where_u32cond_i16_run^⚠: where(cond, a, b), u32 cond + i16 values, contig fast path.
baracuda_kernels_where_u32cond_i16_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_u32cond_i16_strided_run^⚠: where(cond, a, b), u32 cond + i16 values, strided / broadcast.
baracuda_kernels_where_u32cond_i32_can_implement^⚠: Pre-launch check for where_u32cond_i32.
baracuda_kernels_where_u32cond_i32_run^⚠: where(cond, a, b), u32 cond + i32 values, contig fast path.
baracuda_kernels_where_u32cond_i32_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_u32cond_i32_strided_run^⚠: where(cond, a, b), u32 cond + i32 values, strided / broadcast.
baracuda_kernels_where_u32cond_i64_can_implement^⚠: Pre-launch check for where_u32cond_i64.
baracuda_kernels_where_u32cond_i64_run^⚠: where(cond, a, b), u32 cond + i64 values, contig fast path.
baracuda_kernels_where_u32cond_i64_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_u32cond_i64_strided_run^⚠: where(cond, a, b), u32 cond + i64 values, strided / broadcast.
baracuda_kernels_where_u32cond_u8_can_implement^⚠: Pre-launch check for where_u32cond_u8.
baracuda_kernels_where_u32cond_u8_run^⚠: where(cond, a, b), u32 cond + u8 values, contig fast path.
baracuda_kernels_where_u32cond_u8_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_u32cond_u8_strided_run^⚠: where(cond, a, b), u32 cond + u8 values, strided / broadcast.
baracuda_kernels_where_u32cond_u32_can_implement^⚠: Pre-launch check for where_u32cond_u32.
baracuda_kernels_where_u32cond_u32_run^⚠: where(cond, a, b), u32 cond + u32 values, contig fast path.
baracuda_kernels_where_u32cond_u32_strided_can_implement^⚠: Pre-launch check companion.
baracuda_kernels_where_u32cond_u32_strided_run^⚠: where(cond, a, b), u32 cond + u32 values, strided / broadcast.
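The strided / broadcast entries differ from the contiguous fast path in that each operand is addressed through its own stride array. The CPU sketch below illustrates that addressing under two assumptions that are not confirmed by the entries above: strides are counted in elements, and a stride of 0 broadcasts an operand along that axis. The function names and the u8 condition type are chosen for the example only.

```rust
/// Element offset of a logical multi-index in a buffer with per-axis strides.
fn offset(index: &[usize], strides: &[usize]) -> usize {
    index.iter().zip(strides).map(|(i, s)| i * s).sum()
}

/// CPU reference for where(cond, a, b) with independent strides per operand.
/// The output is written contiguously in row-major order over `shape`.
fn where_strided_ref(
    shape: &[usize],
    cond: &[u8],
    cond_strides: &[usize],
    a: &[f32],
    a_strides: &[usize],
    b: &[f32],
    b_strides: &[usize],
) -> Vec<f32> {
    let total: usize = shape.iter().product();
    let mut out = Vec::with_capacity(total);
    let mut index = vec![0usize; shape.len()];
    for _ in 0..total {
        let take_a = cond[offset(&index, cond_strides)] != 0;
        out.push(if take_a {
            a[offset(&index, a_strides)]
        } else {
            b[offset(&index, b_strides)]
        });
        // Advance the multi-index in row-major order (innermost axis fastest).
        for axis in (0..shape.len()).rev() {
            index[axis] += 1;
            if index[axis] < shape[axis] {
                break;
            }
            index[axis] = 0;
        }
    }
    out
}
```

With a [2, 3] output, a condition of logical shape [1, 3] would use strides [0, 1], and a scalar b would use strides [0, 0].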
baracuda_kernels_write_slice_b1_can_implement^⚠: Implementability check for baracuda_kernels_write_slice_b1. Host-side only.
baracuda_kernels_write_slice_b1_run^⚠: WriteSlice, 1-byte element (i8 / u8 / S8 / U8 / Bool / Fp8E4M3 / Fp8E5M2). Generic per-slab-element memcpy kernel.
baracuda_kernels_write_slice_b2_can_implement^⚠: Implementability check for baracuda_kernels_write_slice_b2. Host-side only.
baracuda_kernels_write_slice_b2_run^⚠: WriteSlice, 2-byte element (f16 / bf16). See b1 variant for the contract.
baracuda_kernels_write_slice_b4_can_implement^⚠: Implementability check for baracuda_kernels_write_slice_b4. Host-side only.
baracuda_kernels_write_slice_b4_run^⚠: WriteSlice, 4-byte element (f32 / F32Strict / i32).
baracuda_kernels_write_slice_b8_can_implement^⚠: Implementability check for baracuda_kernels_write_slice_b8. Host-side only.
baracuda_kernels_write_slice_b8_run^⚠: WriteSlice, 8-byte element (f64 / i64 / Complex32).
baracuda_kernels_write_slice_b16_can_implement^⚠: Implementability check for baracuda_kernels_write_slice_b16. Host-side only.
baracuda_kernels_write_slice_b16_run^⚠: WriteSlice, 16-byte element (Complex64).
baracuda_kernels_write_slice_nibble_can_implement^⚠: Implementability check for baracuda_kernels_write_slice_nibble. Host-side only.
baracuda_kernels_write_slice_nibble_run^⚠: WriteSlice, nibble-packed (S4 / U4 — two elements per byte). Constraint: range_start[rank-1] and range_end[rank-1] must both be even so no read-modify-write straddles a byte boundary. Shape / range_start arrays passed in are byte-counted on the innermost axis (Rust side halves before calling).
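The evenness constraint on the nibble-packed variant follows from two 4-bit elements sharing one byte: an odd bound would land in the middle of a byte. A minimal sketch of the check and the halving step, assuming a half-open element range [start, end) on the innermost axis and using a hypothetical helper name:

```rust
/// Converts an innermost-axis range over 4-bit elements into a byte range.
/// Returns None when either bound is odd (the slice would split a byte)
/// or when the range is inverted.
fn nibble_range_to_bytes(start: usize, end: usize) -> Option<(usize, usize)> {
    if start % 2 != 0 || end % 2 != 0 || start > end {
        return None;
    }
    // Two elements per byte, so both bounds are simply halved.
    Some((start / 2, end / 2))
}
```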
cublasCgemmStridedBatched^⚠: Single-precision complex strided-batched matrix-matrix multiply. Complex32 (== cuComplex == cuFloatComplex) analogue of cublasSgemmStridedBatched. Used by the WY-blocked batched-unmqr plan (crate’s BatchedOrmqrWyPlan<Complex32>): transa = CUBLAS_OP_C selects V^H for the first GEMM and T^H for the second GEMM when applying Q^H.
cublasCgeqrfBatched^⚠: cublasCgeqrfBatched. Complex32 analogue. tau_array[b] is cuComplex (NOT real-typed even though tau is real-magnitude for real Householder — cuBLAS uses complex tau across the complex family so the same apply routines can dispatch uniformly).
cublasCreate_v2^⚠: cublasCreate_v2 — create a cuBLAS handle.
cublasDestroy_v2^⚠: cublasDestroy_v2 — destroy a cuBLAS handle.
cublasDgemmStridedBatched^⚠: cublasDgemmStridedBatched — double-precision strided-batched matrix-matrix multiply. f64 analogue of cublasSgemmStridedBatched.
cublasDgeqrfBatched^⚠: cublasDgeqrfBatched. f64 analogue.
cublasDtrsm^⚠: cublasDtrsm — double-precision triangular solve. f64 analogue of cublasStrsm.
cublasGemmEx^⚠: cublasGemmEx — mixed-precision GEMM with explicit dtype tags.
cublasGemmStridedBatchedEx^⚠: cublasGemmStridedBatchedEx — mixed-precision strided-batched GEMM with explicit dtype tags (Phase 74). The Ex sibling of cublasSgemmStridedBatched: each batch slot i computes C[i] := α · op(A[i]) · op(B[i]) + β · C[i] where the slot-i operand is reached by adding i * stride_* (in elements) to the base pointer. stride_a / stride_b may be 0 to broadcast one matrix across all slots; stride_c must step disjoint output regions.
cublasSetStream_v2^⚠: cublasSetStream_v2 — bind a CUDA stream to the cuBLAS handle.
cublasSgemmStridedBatched^⚠: cublasSgemmStridedBatched — single-precision strided-batched matrix-matrix multiply. Each slot computes C[i] := α · op(A[i]) · op(B[i]) + β · C[i] where A[i], B[i], C[i] are reached by stepping stride{A,B,C} element counts from the respective base pointers.
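The slot addressing used by the strided-batched GEMM entries can be illustrated with a CPU reference. This sketch is not a cuBLAS binding: it omits the transpose options, assumes column-major storage with tight leading dimensions (lda = m, ldb = k, ldc = m), and the function name is hypothetical. A stride of 0 for A or B reuses one matrix for every slot.

```rust
/// CPU reference for C[i] = alpha * A[i] * B[i] + beta * C[i], i in 0..batch.
/// A is m x k, B is k x n, C is m x n, all column-major.
/// Strides are element counts from one slot to the next.
#[allow(clippy::too_many_arguments)]
fn gemm_strided_batched_ref(
    m: usize, n: usize, k: usize,
    alpha: f32,
    a: &[f32], stride_a: usize,
    b: &[f32], stride_b: usize,
    beta: f32,
    c: &mut [f32], stride_c: usize,
    batch: usize,
) {
    for i in 0..batch {
        // Base offset of slot i for each operand.
        let (a0, b0, c0) = (i * stride_a, i * stride_b, i * stride_c);
        for col in 0..n {
            for row in 0..m {
                let mut acc = 0.0f32;
                for p in 0..k {
                    acc += a[a0 + row + p * m] * b[b0 + p + col * k];
                }
                let idx = c0 + row + col * m;
                c[idx] = alpha * acc + beta * c[idx];
            }
        }
    }
}
```

Because every slot writes through stride_c, the output regions must not overlap; a zero stride is only meaningful for the read-only operands.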
cublasSgeqrfBatched^⚠: cublasSgeqrfBatched — batched QR factorization (single precision). Each Aarray[b] is overwritten in place with the geqrf-packed R (upper) + Householder reflectors (strict lower); TauArray[b] receives the Householder scalars.
cublasStrsm^⚠: cublasStrsm — single-precision triangular solve.
cublasZgemmStridedBatched^⚠: Double-precision complex strided-batched matrix-matrix multiply. Complex64 analogue of cublasCgemmStridedBatched.
cublasZgeqrfBatched^⚠: cublasZgeqrfBatched. Complex64 analogue.
cufftDestroy^⚠: cufftDestroy(plan). Frees the plan’s internal workspace.
cufftExecC2C^⚠: cufftExecC2C(plan, idata, odata, direction): complex-to-complex single-precision exec. direction is CUFFT_FORWARD or CUFFT_INVERSE. Inverse is unnormalized.
cufftExecC2R^⚠: cufftExecC2R(plan, idata, odata) — complex-to-real single precision. Input length is nx/2 + 1, output length is nx. Unnormalized — caller must scale by 1/nx.
cufftExecD2Z^⚠: cufftExecD2Z(plan, idata, odata) — real-to-complex double precision. Same semantics as cufftExecR2C.
cufftExecR2C^⚠: cufftExecR2C(plan, idata, odata) — real-to-complex single precision. Input length is nx, output length is nx/2 + 1 (Hermitian-half).
cufftExecZ2D^⚠: cufftExecZ2D(plan, idata, odata) — complex-to-real double precision. Unnormalized.
cufftExecZ2Z^⚠: cufftExecZ2Z(plan, idata, odata, direction): complex-to-complex double precision. Same semantics as cufftExecC2C.
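The length and scaling conventions quoted for the R2C and C2R entries (nx real samples in, nx/2 + 1 complex bins out, and an unnormalized inverse) can be checked with a naive CPU transform. This is an O(nx²) illustration in f64 with hypothetical function names, not a cuFFT call; complex values are plain (re, im) tuples.

```rust
/// Naive real-to-complex DFT returning only the Hermitian half:
/// nx / 2 + 1 bins for nx real inputs.
fn r2c_ref(x: &[f64]) -> Vec<(f64, f64)> {
    let nx = x.len();
    (0..=nx / 2)
        .map(|k| {
            let (mut re, mut im) = (0.0f64, 0.0f64);
            for (n, &v) in x.iter().enumerate() {
                let theta = -2.0 * std::f64::consts::PI * (k * n) as f64 / nx as f64;
                re += v * theta.cos();
                im += v * theta.sin();
            }
            (re, im)
        })
        .collect()
}

/// Naive unnormalized complex-to-real inverse. The result is nx times the
/// original signal, so the caller scales by 1 / nx.
fn c2r_unnormalized_ref(half: &[(f64, f64)], nx: usize) -> Vec<f64> {
    assert_eq!(half.len(), nx / 2 + 1);
    (0..nx)
        .map(|n| {
            let mut acc = half[0].0;
            for k in 1..half.len() {
                let theta = 2.0 * std::f64::consts::PI * (k * n) as f64 / nx as f64;
                let term = half[k].0 * theta.cos() - half[k].1 * theta.sin();
                // The Nyquist bin (even nx only) has no mirrored partner;
                // every other bin stands for itself and its conjugate.
                acc += if nx % 2 == 0 && k == nx / 2 { term } else { 2.0 * term };
            }
            acc
        })
        .collect()
}
```

Running the forward transform followed by the inverse and dividing each sample by nx recovers the input up to rounding error.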
cufftPlan1d^⚠: cufftPlan1d(plan, nx, type, batch). Allocates a 1-D plan (single FFT of length nx, or batch independent FFTs each of length nx laid out contiguously). cuFFT’s plan struct owns its own workspace internally — no caller-supplied workspace is required for the basic 1-D APIs.
cufftPlanMany^⚠: cufftPlanMany(plan, rank, n, inembed, istride, idist, onembed, ostride, odist, type, batch).
cufftSetStream^⚠: cufftSetStream(plan, stream). Binds subsequent exec calls on this plan to the given CUDA stream. Returns 0 on success.
curandCreateGenerator^⚠: curandCreateGenerator(generator, rng_type). Returns 0 on success.
curandDestroyGenerator^⚠: curandDestroyGenerator(generator). Returns 0 on success.
curandGenerateNormal^⚠: curandGenerateNormal(generator, ptr, n, mean, stddev) — writes n normally-distributed float samples to ptr. Note: cuRAND requires n be even for the Box-Muller pair generator. Returns 0 on success.
curandGenerateNormalDouble^⚠: curandGenerateNormalDouble(generator, ptr, n, mean, stddev). Same parity contract as curandGenerateNormal. Returns 0 on success.
curandGenerateUniform^⚠: curandGenerateUniform(generator, ptr, n) — writes n float samples in (0, 1] to ptr. Returns 0 on success.
curandGenerateUniformDouble^⚠: curandGenerateUniformDouble(generator, ptr, n) — writes n double samples in (0, 1] to ptr. Returns 0 on success.
curandSetPseudoRandomGeneratorSeed^⚠: curandSetPseudoRandomGeneratorSeed(generator, seed). Returns 0 on success.
curandSetStream^⚠: curandSetStream(generator, stream). Binds subsequent generator calls to the given CUDA stream. Returns 0 on success.
cusolverDnCgeqrf^⚠: cusolverDnCgeqrf — single-precision complex QR factorization, in place. The packed output uses the same convention as the real variant: strict lower triangle + tau encode the Householder reflectors; the upper triangle holds R.
cusolverDnCgeqrf_bufferSize^⚠: cusolverDnCgeqrf_bufferSize — workspace query for single-precision complex QR factorization. Mirrors cusolverDnSgeqrf_bufferSize.
cusolverDnCheevd^⚠: cusolverDnCheevd — complex-Hermitian eigh (Complex32). A is overwritten in place with the eigenvectors (column-major); W receives the n real eigenvalues sorted ascending.
cusolverDnCheevd_bufferSize^⚠: cusolverDnCheevd_bufferSize — complex-Hermitian divide-and-conquer eigh, single precision (Complex32). Eigenvalues are real-valued float.
cusolverDnCreate^⚠: cusolverDnCreate(handle). Returns 0 on success.
cusolverDnCreateGesvdjInfo^⚠: cusolverDnCreateGesvdjInfo — allocate a Jacobi-SVD params object with cuSOLVER’s defaults (tol = 1e-7 for f32 / 1e-12 for f64, max_sweeps = 100, sort_eig = 1).
cusolverDnCreateParams^⚠: cusolverDnCreateParams — allocate the opaque params struct used by all 64-bit cuSOLVER APIs. Plan layer creates one lazily on first run (mirroring the handle lifecycle).
cusolverDnCunmqr^⚠: cusolverDnCunmqr — apply Q, Q^T, or Q^H from a complex geqrf factorization to a complex C in place.
cusolverDnCunmqr_bufferSize^⚠: cusolverDnCunmqr_bufferSize.
cusolverDnDDgels^⚠: cusolverDnDDgels. f64 analogue.
cusolverDnDDgels_bufferSize^⚠: cusolverDnDDgels_bufferSize. f64 analogue.
cusolverDnDestroy^⚠: cusolverDnDestroy(handle). Returns 0 on success.
cusolverDnDestroyGesvdjInfo^⚠: cusolverDnDestroyGesvdjInfo. Returns 0 on success.
cusolverDnDestroyParams^⚠: cusolverDnDestroyParams. Returns 0 on success.
cusolverDnDgeqrf^⚠: cusolverDnDgeqrf. f64 analogue.
cusolverDnDgeqrf_bufferSize^⚠: cusolverDnDgeqrf_bufferSize. f64 analogue.
cusolverDnDgesvd^⚠: cusolverDnDgesvd. f64 analogue.
cusolverDnDgesvd_bufferSize^⚠: cusolverDnDgesvd_bufferSize. f64 analogue.
cusolverDnDgesvdaStridedBatched^⚠: cusolverDnDgesvdaStridedBatched. f64 analogue.
cusolverDnDgesvdaStridedBatched_bufferSize^⚠: cusolverDnDgesvdaStridedBatched_bufferSize. f64 analogue.
cusolverDnDgesvdjBatched^⚠: cusolverDnDgesvdjBatched. f64 analogue.
cusolverDnDgesvdjBatched_bufferSize^⚠: cusolverDnDgesvdjBatched_bufferSize. f64 analogue.
cusolverDnDgetrf^⚠: cusolverDnDgetrf. f64 analogue.
cusolverDnDgetrf_bufferSize^⚠: cusolverDnDgetrf_bufferSize. f64 analogue.
cusolverDnDgetrs^⚠: cusolverDnDgetrs. f64 analogue.
cusolverDnDormqr^⚠: cusolverDnDormqr. f64 analogue.
cusolverDnDormqr_bufferSize^⚠: cusolverDnDormqr_bufferSize. f64 analogue.
cusolverDnDpotrf^⚠: cusolverDnDpotrf. f64 analogue.
cusolverDnDpotrfBatched^⚠: cusolverDnDpotrfBatched. f64 analogue.
cusolverDnDpotrf_bufferSize^⚠: cusolverDnDpotrf_bufferSize. f64 analogue.
cusolverDnDsyevd^⚠: cusolverDnDsyevd. f64 analogue.
cusolverDnDsyevd_bufferSize^⚠: cusolverDnDsyevd_bufferSize. f64 analogue.
cusolverDnSSgels^⚠: cusolverDnSSgels — least-squares solve min ||A·x - b||² for m ≥ n full-rank A. Iterative refinement; returns niters ≥ 0 on convergence, -N on fallback-needed. Single precision.
cusolverDnSSgels_bufferSize^⚠: cusolverDnSSgels_bufferSize — query the workspace size in bytes. The routine’s workspace is supplied as a raw byte buffer, not a typed element count, unlike the legacy *_bufferSize entries.
cusolverDnSetStream^⚠: cusolverDnSetStream(handle, stream). Binds subsequent cuSOLVER calls to the given CUDA stream. Returns 0 on success.
cusolverDnSgeqrf^⚠: cusolverDnSgeqrf — QR factorization in place. A is overwritten: upper triangle = R, strict lower triangle + tau = Householder reflectors that encode Q. To materialize Q as a dense matrix, follow with ormqr against an identity.
cusolverDnSgeqrf_bufferSize^⚠: cusolverDnSgeqrf_bufferSize.
cusolverDnSgesvd^⚠: cusolverDnSgesvd — SVD: A = U · diag(S) · V^T. The jobu / jobv characters are ASCII bytes: 'A' (full U/V^T), 'S' (thin U/V^T), 'O' (overwrite A — disallowed at plan layer), 'N' (skip).
cusolverDnSgesvd_bufferSize^⚠: cusolverDnSgesvd_bufferSize.
cusolverDnSgesvdaStridedBatched^⚠: cusolverDnSgesvdaStridedBatched — f32 rectangular-batched approximate-SVD. Each batch slot factors a [m, n] matrix into U: [m, rank], S: [rank], V: [n, rank] (column-major; cuSOLVER returns V, not V^T). The host array h_R_nrmF (size batch_size) receives per-slot residual Frobenius norms.
cusolverDnSgesvdaStridedBatched_bufferSize^⚠: cusolverDnSgesvdaStridedBatched_bufferSize — query the device workspace size (in elements, multiply by sizeof(f32) for bytes) for the f32 rectangular-batched approximate-SVD.
cusolverDnSgesvdjBatched^⚠: cusolverDnSgesvdjBatched — batched Jacobi SVD A = U · diag(S) · V^T (single precision). Each matrix is square [m, m] (cuSOLVER’s Jacobi-batched API requires square input; thin rectangular is achievable via gesvdaStridedBatched — deferred). The plan surfaces V (not V^T); callers apply the transpose if needed.
cusolverDnSgesvdjBatched_bufferSize^⚠: cusolverDnSgesvdjBatched_bufferSize. jobz is 0 (no vectors) or 1 (compute U / V). For batched, each matrix in A is independently SVD’d; outputs are packed [batch * m * m] etc.
cusolverDnSgetrf^⚠: cusolverDnSgetrf — LU factorization with partial pivoting in place. A is overwritten with the factors of P · A = L · U (L unit-diagonal, stored in the strict lower triangle; U in the upper triangle). ipiv[i] is the row swap performed at step i (1-based per LAPACK convention).
cusolverDnSgetrf_bufferSize^⚠: cusolverDnSgetrf_bufferSize — query workspace element count.
cusolverDnSgetrs^⚠: cusolverDnSgetrs — solve op(A) · X = B using the packed LU factors produced by getrf.
cusolverDnSormqr^⚠: cusolverDnSormqr — apply Q (or Q^T) from geqrf output to a matrix C in place. With C = I this materializes Q as a dense matrix for the “thin” or “full” QR.
cusolverDnSormqr_bufferSize^⚠: cusolverDnSormqr_bufferSize. trans selects Q vs Q^T; side selects left vs right multiply.
cusolverDnSpotrf^⚠: cusolverDnSpotrf — Cholesky factorization in place (A := L or A := U). Leaves the unused triangle untouched. dev_info returns 0 on success, or k > 0 if the leading minor of order k is not positive definite (factorization halted at step k).
cusolverDnSpotrfBatched^⚠: cusolverDnSpotrfBatched(handle, uplo, n, Aarray, lda, infoArray, batchSize). Each matrix in Aarray[batch_size] is factored independently in-place. Returns 0 on success; per-matrix factor info lands in infoArray[i].
cusolverDnSpotrf_bufferSize^⚠: cusolverDnSpotrf_bufferSize — query the workspace size, returned as an element count (multiply by sizeof(T) to get the byte count for cudaMalloc).
cusolverDnSsyevd^⚠: cusolverDnSsyevd — real-symmetric eigh, f32. A is overwritten in place with the eigenvectors (column-major) when jobz == VECTOR. W receives the n eigenvalues sorted ascending.
cusolverDnSsyevd_bufferSize^⚠: cusolverDnSsyevd_bufferSize — query workspace element count for real-symmetric divide-and-conquer eigh, f32.
cusolverDnXgeev^⚠: cusolverDnXgeev — general (non-symmetric) eigendecomposition. A is destroyed in place (used as scratch by the LAPACK-equivalent algorithm). W receives the n complex eigenvalues; VL / VR (when requested) receive the column-major left / right complex eigenvectors. For non-Hermitian input the eigenvalues can be complex even when the input is real, hence the always-complex W storage.
cusolverDnXgeev_bufferSize^⚠: cusolverDnXgeev_bufferSize — query the host + device byte counts for cusolverDnXgeev at the given problem size and element types. The two output pointers receive byte counts (NOT element counts — different from the legacy _bufferSize APIs).
cusolverDnZgeqrf^⚠: cusolverDnZgeqrf — double-precision complex QR factorization.
cusolverDnZgeqrf_bufferSize^⚠: cusolverDnZgeqrf_bufferSize. f64-complex analogue of the C variant.
cusolverDnZheevd^⚠: cusolverDnZheevd. Complex64 analogue.
cusolverDnZheevd_bufferSize^⚠: cusolverDnZheevd_bufferSize. Complex64 analogue.
cusolverDnZunmqr^⚠: cusolverDnZunmqr. f64-complex analogue.
cusolverDnZunmqr_bufferSize^⚠: cusolverDnZunmqr_bufferSize. f64-complex analogue.
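
The *_bufferSize entries above report workspace sizes in two different units: the legacy typed queries (getrf, potrf, geqrf, syevd, ...) return an element count, while SSgels / DDgels and Xgeev return bytes. A minimal sketch of normalizing both to a byte count before allocating follows. `WorkspaceQuery` and `workspace_bytes` are hypothetical helpers for illustration and are not part of this crate.

```rust
use std::mem::size_of;

/// How a `*_bufferSize` query reported its result.
/// Hypothetical helper type, not part of baracuda-kernels-sys.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum WorkspaceQuery {
    /// Legacy typed queries: a count of `T` elements.
    Elements(i32),
    /// Byte-count queries (SSgels / DDgels, Xgeev): already in bytes.
    Bytes(usize),
}

/// Convert a query result into the number of bytes to allocate for
/// element type `T`. Returns `None` for a negative element count or
/// if the multiplication would overflow.
pub fn workspace_bytes<T>(query: WorkspaceQuery) -> Option<usize> {
    match query {
        WorkspaceQuery::Elements(n) => {
            if n < 0 {
                None
            } else {
                (n as usize).checked_mul(size_of::<T>())
            }
        }
        WorkspaceQuery::Bytes(b) => Some(b),
    }
}
```

Keeping the unit in the type makes it harder to pass an element count straight to an allocator that expects bytes.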

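The getrf entries describe ipiv as a sequence of 1-based row swaps rather than an explicit permutation. The sketch below expands such a pivot vector, once copied back to the host, into a row permutation. `pivots_to_permutation` is a hypothetical host-side helper, not a function exported by this crate.

```rust
/// Expand LAPACK-style pivots into an explicit row permutation.
///
/// `ipiv[i] = r` (1-based) means rows `i` and `r - 1` were swapped at
/// step `i`. The result `perm` satisfies: row `i` of the pivoted matrix
/// is row `perm[i]` of the original matrix.
/// Returns `None` if a pivot is out of range for `rows`.
pub fn pivots_to_permutation(ipiv: &[i32], rows: usize) -> Option<Vec<usize>> {
    // Start from the identity and replay the swaps in order.
    let mut perm: Vec<usize> = (0..rows).collect();
    for (i, &p) in ipiv.iter().enumerate() {
        if i >= rows || p < 1 || p as usize > rows {
            return None;
        }
        perm.swap(i, p as usize - 1);
    }
    Some(perm)
}
```

For example, the pivots `[3, 3, 3]` on a 3-row matrix replay as swap(0, 2), swap(1, 2), and a no-op, giving the permutation `[2, 0, 1]`.
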
Type Aliases§

cuFloatComplex: cuFloatComplex is the canonical CUDA name for the single-precision complex struct — an alias for cuComplex. Surfaced so cuSOLVER’s complex APIs (cusolverDn{C,Z}unmqr, …) can spell their signatures in the same vocabulary as the NVIDIA headers.
cublasDiagType_t: cuBLAS diag-type tag for triangular solves (trsm). CUBLAS_DIAG_NON_UNIT = 0, CUBLAS_DIAG_UNIT = 1.
cublasFillMode_t: cuBLAS fill-mode tag re-used by cuSOLVER for triangular factorizations. CUBLAS_FILL_MODE_LOWER = 0, CUBLAS_FILL_MODE_UPPER = 1.
cublasHandle_t: Opaque cuBLAS handle. Used by cublas*geqrfBatched (which lives in cuBLAS, not cuSOLVER) and any future cuBLAS-routed linalg paths.
cudaDataType: cudaDataType tag used by the 64-bit cuSOLVER APIs (Xgeev, Xgesvd, …) to identify tensor element types. These constants originate in <library_types.h> and are stable across CUDA versions.
cufftHandle: Opaque cuFFT plan handle. Unusually for CUDA libraries this is an integer ID (int), not a pointer. A value of -1 is reserved at the safe-plan layer as the “not yet created” sentinel — cuFFT itself returns small non-negative integers for live handles.
cufftResult: cuFFT result code type. CUFFT_SUCCESS = 0. Any non-zero return is mapped to a negative status at the safe-plan layer for distinct error reporting.
curandGenerator_t: Opaque cuRAND generator handle. Treated as a stateful object owned by safe Rust at the plan layer — never inspect its internals here.
cusolverDnHandle_t: Opaque cuSOLVER dense handle. Stateful object; the plan layer creates one lazily on first run and reuses across launches.
cusolverDnParams_t: Opaque parameter struct used by the 64-bit cuSOLVER APIs (Xgeev, Xpotrf, …). The struct holds advanced configuration (algorithm choice, precision modes) — for the trailblazer the plan layer leaves it at defaults. Created via cusolverDnCreateParams and destroyed via cusolverDnDestroyParams.
cusolverEigMode_t: cuSOLVER eig-mode enum tag (used by syevd / heevd / Xgeev). 0 = NOVECTOR (compute eigenvalues only), 1 = VECTOR (eigenvalues + eigenvectors). Routed through as an i32 for the legacy syevd / heevd APIs. The CUSOLVER_EIG_MODE_NOVECTOR / _VECTOR constants live further down (originally introduced for gesvdjBatched’s jobz argument; reused verbatim here for the eig family).
gesvdjInfo_t: Opaque cuSOLVER Jacobi-SVD parameter object. Stateful; created once per plan, reused across launches, destroyed on plan drop. Used by cusolverDn*gesvdjBatched for the batched-SVD path.
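
The cufftHandle and cufftResult entries above mention a -1 "not yet created" sentinel and a mapping from non-zero results to negative statuses. The sketch below shows one way those two conventions could look. The local type aliases are stand-ins with assumed underlying types, and negating the result code is an assumed mapping chosen only for illustration; the safe-plan layer may do something different.

```rust
// Local stand-ins for the crate's aliases; the underlying integer
// types are assumptions made for this sketch.
type CufftHandle = i32;
type CufftResult = i32;

const CUFFT_SUCCESS: CufftResult = 0;
/// Sentinel for "plan not yet created" (the value -1 comes from the docs).
const HANDLE_NOT_CREATED: CufftHandle = -1;

/// Map a cuFFT result to a status: 0 on success, negative otherwise.
/// Negating positive codes is an assumed mapping, not the crate's.
fn map_cufft_result(result: CufftResult) -> i32 {
    match result {
        CUFFT_SUCCESS => 0,
        r if r > 0 => -r,
        r => r, // already negative: pass through unchanged
    }
}

/// A handle is live once it no longer holds the sentinel value.
fn is_live(handle: CufftHandle) -> bool {
    handle != HANDLE_NOT_CREATED
}
```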

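The enum-like aliases above (cusolverEigMode_t, cublasFillMode_t, cublasDiagType_t) are plain integer tags with the values listed in their descriptions. A caller who prefers typed values could wrap them as in the sketch below. The enum names are hypothetical, and treating every tag as an i32 is an assumption; the docs only state this explicitly for the eig mode.

```rust
/// Typed stand-in for the cusolverEigMode_t tag (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(i32)]
pub enum EigMode {
    NoVector = 0, // eigenvalues only
    Vector = 1,   // eigenvalues + eigenvectors
}

/// Typed stand-in for the cublasFillMode_t tag (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(i32)]
pub enum FillMode {
    Lower = 0,
    Upper = 1,
}

/// Typed stand-in for the cublasDiagType_t tag (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(i32)]
pub enum DiagType {
    NonUnit = 0,
    Unit = 1,
}

impl EigMode {
    /// Raw integer to pass across the FFI boundary.
    pub fn as_raw(self) -> i32 {
        self as i32
    }

    /// Parse a raw tag, rejecting values outside the documented set.
    pub fn from_raw(raw: i32) -> Option<Self> {
        match raw {
            0 => Some(EigMode::NoVector),
            1 => Some(EigMode::Vector),
            _ => None,
        }
    }
}
```
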