§baracuda-kernels-sys
Raw extern "C" entry points for compiled bespoke kernels.
You almost certainly want baracuda-kernels instead — that
crate wraps these unsafe calls with typed plans, lifetime-checked
device buffers, and a proper Rust API.
Functions in this crate take raw void* pointers, integer
dimensions, and a cudaStream_t cast as *mut c_void. They are
unsafe because:
- They dereference the pointer arguments without bounds-checking.
- They assume the pointers are valid device addresses.
- They assume the workspace pointer (when non-null) points to at least workspace_bytes of writable device memory.
- They assume the stream is a valid CUDA stream owned by the calling thread’s current context.
§Status codes
All *_run and *_can_implement functions return an i32 status:
- 0: success.
- 1: misaligned operand.
- 2: invalid problem (e.g. M, N, or K is non-positive).
- 3: not supported (this kernel doesn’t implement the requested shape).
- 4: workspace too small or null when required.
- 5: internal kernel error (typically a launch failure).
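The status list above can be given a typed view on the caller side. The following is a minimal sketch only: the names KernelStatus, decode_status, and check_status are hypothetical and are not part of this crate, and the Unknown variant is an assumption for values the list does not describe.

```rust
/// Typed view of the `i32` status codes documented above.
/// Hypothetical helper: these names do not exist in baracuda-kernels-sys.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelStatus {
    Success,
    MisalignedOperand,
    InvalidProblem,
    NotSupported,
    WorkspaceTooSmall,
    InternalError,
    /// Any value outside 0..=5 (assumption: not described by the list).
    Unknown(i32),
}

/// Map a raw status code to its typed meaning.
pub fn decode_status(code: i32) -> KernelStatus {
    match code {
        0 => KernelStatus::Success,
        1 => KernelStatus::MisalignedOperand,
        2 => KernelStatus::InvalidProblem,
        3 => KernelStatus::NotSupported,
        4 => KernelStatus::WorkspaceTooSmall,
        5 => KernelStatus::InternalError,
        other => KernelStatus::Unknown(other),
    }
}

/// Turn a raw status into a `Result` so callers can use `?`.
pub fn check_status(code: i32) -> Result<(), KernelStatus> {
    match decode_status(code) {
        KernelStatus::Success => Ok(()),
        err => Err(err),
    }
}
```

A caller would typically pass the return value of a *_can_implement call through check_status before issuing the matching *_run call.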
Structs§
- cuComplex
  ABI-compatible single-precision complex struct, matching cuComplex from <cuComplex.h> (interleaved real/imag f32). Identical layout to crate::cufftComplex and to the safe-side Complex32 from baracuda-kernels-types: a DeviceBuffer<Complex32> can be cast to a *mut cuComplex for the cuSOLVER complex APIs without a copy.
- cuDoubleComplex
  ABI-compatible double-precision complex struct, matching cuDoubleComplex from <cuComplex.h>. Sibling to cuComplex.
- cufftComplex
  Single-precision complex element layout. Interleaved real/imag pairs; #[repr(C)] matches NVIDIA’s cufftComplex struct exactly (which is itself an alias for float2 in <vector_types.h>). The plan layer pairs this with the crate-level Complex32 newtype.
- cufftDoubleComplex
  Double-precision complex element layout. ABI-compatible with cuFFT’s cufftDoubleComplex (alias for double2).
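The "cast without copy" claim rests on two types sharing size, alignment, and field order. The sketch below illustrates that idea on the host with two stand-in types; RawComplex32 and SafeComplex32 and their field names are hypothetical, not the crate's actual definitions.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical stand-in for an FFI complex type: two interleaved `f32`s.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RawComplex32 {
    pub re: f32,
    pub im: f32,
}

/// Hypothetical stand-in for a safe-side complex type with the same layout.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SafeComplex32 {
    pub re: f32,
    pub im: f32,
}

/// Reinterpret a slice of one type as the other without copying.
pub fn as_raw(xs: &[SafeComplex32]) -> &[RawComplex32] {
    // Guard the layout assumption the cast relies on.
    assert_eq!(size_of::<SafeComplex32>(), size_of::<RawComplex32>());
    assert_eq!(align_of::<SafeComplex32>(), align_of::<RawComplex32>());
    // SAFETY: both types are #[repr(C)] structs of two f32 fields in the
    // same order, so size, alignment, and field offsets are identical.
    unsafe { std::slice::from_raw_parts(xs.as_ptr() as *const RawComplex32, xs.len()) }
}
```

The same reasoning would apply to a device pointer, where only the pointer type changes and no element is touched on the host.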
Constants§
- CUBLAS_COMPUTE_32F
  fp32 accumulator.
- CUBLAS_COMPUTE_64F
  fp64 accumulator.
- CUBLAS_DIAG_NON_UNIT
  trsm reads the actual diagonal of A. Used by the LstSq QR-fallback path for the back-substitution R · X = Q^T · B, where R’s diagonal holds the meaningful pivots.
- CUBLAS_DIAG_UNIT
  trsm treats the diagonal as all-1s (unit-triangular). Not used by the current plan layer; surfaced for completeness.
- CUBLAS_FILL_MODE_LOWER
  Pass to potrf to request the lower-triangular Cholesky factor.
- CUBLAS_FILL_MODE_UPPER
  Pass to potrf to request the upper-triangular Cholesky factor.
- CUBLAS_GEMM_DEFAULT
  Let cuBLAS pick the algorithm.
- CUBLAS_OP_C
  Conjugate transpose (only meaningful for complex dtypes). Used by cusolverDn{C,Z}unmqr to apply Q^H.
- CUBLAS_OP_N
  No transpose. Used by ormqr to control whether to apply Q or Q^T.
- CUBLAS_OP_T
  Transpose.
- CUBLAS_SIDE_LEFT
  Q is applied from the left in ormqr (C := Q · C or C := Q^T · C).
- CUBLAS_SIDE_RIGHT
  Q is applied from the right.
- CUDA_C_32F
  Complex f32 (interleaved real/imag).
- CUDA_C_64F
  Complex f64 (interleaved real/imag).
- CUDA_R_16BF
  bfloat16 (real). Storage tag for __nv_bfloat16.
- CUDA_R_16F
  Real f16.
- CUDA_R_32F
  Real f32.
- CUDA_R_64F
  Real f64.
- CUFFT_C2C
  cuFFT plan type: complex-to-complex (single precision). Direction is supplied to cufftExecC2C.
- CUFFT_C2R
  cuFFT plan type: complex-to-real (single precision). Input is N/2 + 1 complex cells (Hermitian-half), output is N real cells.
- CUFFT_D2Z
  cuFFT plan type: double-precision real-to-complex.
- CUFFT_FORWARD
  Forward FFT direction tag for cufftExecC2C / cufftExecZ2Z. cuFFT’s forward transform is unnormalized.
- CUFFT_INVERSE
  Inverse FFT direction tag for cufftExecC2C / cufftExecZ2Z. cuFFT’s inverse transform is also unnormalized; the safe-plan layer multiplies the output by 1/N after exec to match PyTorch’s norm="backward" (forward unnormalized, inverse normalized by N) convention.
- CUFFT_R2C
  cuFFT plan type: real-to-complex (single precision). Output buffer size is N/2 + 1 complex cells for an N-long real input (Hermitian symmetry).
- CUFFT_SUCCESS
  The only success code.
- CUFFT_Z2D
  cuFFT plan type: double-precision complex-to-real.
- CUFFT_Z2Z
  cuFFT plan type: double-precision complex-to-complex.
- CURAND_RNG_PSEUDO_DEFAULT
  XORWOW pseudo-random generator. Adequate for the dropout / sampling use cases this milestone targets; future QRNG / Philox / MT19937 work can extend the descriptor surface.
- CURAND_STATUS_SUCCESS
  The only success code. Any non-zero return from the cuRAND host API is mapped to status 5 (“internal kernel error”) at the safe-plan layer.
- CUSOLVER_EIG_MODE_NOVECTOR
  gesvdjBatched jobz value for computing singular values only (skip U / V).
- CUSOLVER_EIG_MODE_VECTOR
  gesvdjBatched jobz value for computing both singular values and singular vectors.
- CUSOLVER_STATUS_SUCCESS
  The only success code. Any non-zero return from a cuSOLVER routine is mapped to a negative status at the safe-plan layer for distinct error reporting.
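The FFT entries above describe two numeric conventions: both directions are unnormalized, so a forward-then-inverse round trip scales the data by N until the output is multiplied by 1/N, and a real-to-complex transform of length N stores N/2 + 1 complex cells. The host-side sketch below illustrates both with a naive O(N²) DFT. It does not call cuFFT, and the function names are hypothetical.

```rust
use std::f64::consts::PI;

/// Naive unnormalized DFT over (re, im) pairs.
/// `sign = -1.0` gives the forward transform, `sign = 1.0` the inverse.
pub fn dft(input: &[(f64, f64)], sign: f64) -> Vec<(f64, f64)> {
    let n = input.len();
    (0..n)
        .map(|k| {
            let mut acc: (f64, f64) = (0.0, 0.0);
            for (j, &(re, im)) in input.iter().enumerate() {
                let angle = sign * 2.0 * PI * ((j * k) % n) as f64 / n as f64;
                let (s, c) = angle.sin_cos();
                // (re + i*im) * (c + i*s)
                acc.0 += re * c - im * s;
                acc.1 += re * s + im * c;
            }
            acc
        })
        .collect()
}

/// Apply the 1/N scale that turns an unnormalized inverse into the
/// "backward" convention (forward unnormalized, inverse divided by N).
pub fn normalize_backward(output: &mut [(f64, f64)]) {
    let scale = 1.0 / output.len() as f64;
    for v in output.iter_mut() {
        v.0 *= scale;
        v.1 *= scale;
    }
}

/// Number of complex cells in the Hermitian half of an N-long real signal.
pub fn hermitian_half_len(n: usize) -> usize {
    n / 2 + 1
}
```

For example, hermitian_half_len(8) is 5 and hermitian_half_len(7) is 4, which is the buffer length an R2C output or C2R input would need under the sizes stated above.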
Functions§
- baracuda_kernels_adaptive_avg_pool_bf16_bw_can_implement ⚠
  baracuda_kernels_adaptive_avg_pool_bf16_bw_can_implement (baracuda kernels adaptive avg pool bf16 bw can implement).
- baracuda_kernels_adaptive_avg_pool_bf16_bw_run ⚠
  Adaptive AvgPool BW, bf16.
- baracuda_kernels_adaptive_avg_pool_bf16_fw_can_implement ⚠
  baracuda_kernels_adaptive_avg_pool_bf16_fw_can_implement (baracuda kernels adaptive avg pool bf16 fw can implement).
- baracuda_kernels_adaptive_avg_pool_bf16_fw_run ⚠
  Adaptive AvgPool FW, bf16.
- baracuda_kernels_adaptive_avg_pool_f16_bw_can_implement ⚠
  baracuda_kernels_adaptive_avg_pool_f16_bw_can_implement (baracuda kernels adaptive avg pool f16 bw can implement).
- baracuda_kernels_adaptive_avg_pool_f16_bw_run ⚠
  Adaptive AvgPool BW, f16. Zeros dx internally, then atomic-scatters.
- baracuda_kernels_adaptive_avg_pool_f16_fw_can_implement ⚠
  baracuda_kernels_adaptive_avg_pool_f16_fw_can_implement (baracuda kernels adaptive avg pool f16 fw can implement).
- baracuda_kernels_adaptive_avg_pool_f16_fw_run ⚠
  Adaptive AvgPool FW, f16. Rank-agnostic (spatial_rank ∈ {1,2,3}).
- baracuda_kernels_adaptive_avg_pool_f32_bw_can_implement ⚠
  baracuda_kernels_adaptive_avg_pool_f32_bw_can_implement (baracuda kernels adaptive avg pool f32 bw can implement).
- baracuda_kernels_adaptive_avg_pool_f32_bw_run ⚠
  Adaptive AvgPool BW, f32.
- baracuda_kernels_adaptive_avg_pool_f32_fw_can_implement ⚠
  baracuda_kernels_adaptive_avg_pool_f32_fw_can_implement (baracuda kernels adaptive avg pool f32 fw can implement).
- baracuda_kernels_adaptive_avg_pool_f32_fw_run ⚠
  Adaptive AvgPool FW, f32.
- baracuda_kernels_adaptive_avg_pool_f64_bw_can_implement ⚠
  baracuda_kernels_adaptive_avg_pool_f64_bw_can_implement (baracuda kernels adaptive avg pool f64 bw can implement).
- baracuda_kernels_adaptive_avg_pool_f64_bw_run ⚠
  Adaptive AvgPool BW, f64.
- baracuda_kernels_adaptive_avg_pool_f64_fw_can_implement ⚠
  baracuda_kernels_adaptive_avg_pool_f64_fw_can_implement (baracuda kernels adaptive avg pool f64 fw can implement).
- baracuda_kernels_adaptive_avg_pool_f64_fw_run ⚠
  Adaptive AvgPool FW, f64.
- baracuda_kernels_adaptive_max_pool_bf16_bw_can_implement ⚠
  baracuda_kernels_adaptive_max_pool_bf16_bw_can_implement (baracuda kernels adaptive max pool bf16 bw can implement).
- baracuda_kernels_adaptive_max_pool_bf16_bw_run ⚠
  Adaptive MaxPool BW, bf16.
- baracuda_kernels_adaptive_max_pool_bf16_fw_can_implement ⚠
  baracuda_kernels_adaptive_max_pool_bf16_fw_can_implement (baracuda kernels adaptive max pool bf16 fw can implement).
- baracuda_kernels_adaptive_max_pool_bf16_fw_run ⚠
  Adaptive MaxPool FW, bf16.
- baracuda_kernels_adaptive_max_pool_f16_bw_can_implement ⚠
  baracuda_kernels_adaptive_max_pool_f16_bw_can_implement (baracuda kernels adaptive max pool f16 bw can implement).
- baracuda_kernels_adaptive_max_pool_f16_bw_run ⚠
  Adaptive MaxPool BW, f16. Recomputes the per-window argmax from the saved x, zeros dx internally, then atomic-scatters dy into the argmax positions.
- baracuda_kernels_adaptive_max_pool_f16_fw_can_implement ⚠
  baracuda_kernels_adaptive_max_pool_f16_fw_can_implement (baracuda kernels adaptive max pool f16 fw can implement).
- baracuda_kernels_adaptive_max_pool_f16_fw_run ⚠
  Adaptive MaxPool FW, f16. Writes y only; the matching BW recomputes the argmax internally from the saved x (keeps the Phase 11.8 args shape; no separate indices tensor).
- baracuda_kernels_adaptive_max_pool_f32_bw_can_implement ⚠
  baracuda_kernels_adaptive_max_pool_f32_bw_can_implement (baracuda kernels adaptive max pool f32 bw can implement).
- baracuda_kernels_adaptive_max_pool_f32_bw_run ⚠
  Adaptive MaxPool BW, f32.
- baracuda_kernels_adaptive_max_pool_f32_fw_can_implement ⚠
  baracuda_kernels_adaptive_max_pool_f32_fw_can_implement (baracuda kernels adaptive max pool f32 fw can implement).
- baracuda_kernels_adaptive_max_pool_f32_fw_run ⚠
  Adaptive MaxPool FW, f32.
- baracuda_kernels_adaptive_max_pool_f64_bw_can_implement ⚠
  baracuda_kernels_adaptive_max_pool_f64_bw_can_implement (baracuda kernels adaptive max pool f64 bw can implement).
- baracuda_kernels_adaptive_max_pool_f64_bw_run ⚠
  Adaptive MaxPool BW, f64.
- baracuda_kernels_adaptive_max_pool_f64_fw_can_implement ⚠
  baracuda_kernels_adaptive_max_pool_f64_fw_can_implement (baracuda kernels adaptive max pool f64 fw can implement).
- baracuda_kernels_adaptive_max_pool_f64_fw_run ⚠
  Adaptive MaxPool FW, f64.
- baracuda_kernels_affine_bf16_can_implement ⚠
  Implementability check for affine_bf16. Host-side only.
- baracuda_kernels_affine_bf16_run ⚠
  Affine y = a*x + b, bf16 storage / f32 compute. a/b arrive as f32.
- baracuda_kernels_affine_bf16_strided_can_implement ⚠
  baracuda_kernels_affine_bf16_strided_can_implement (baracuda kernels affine bf16 strided can implement).
- baracuda_kernels_affine_bf16_strided_run ⚠
  Strided affine y = a*x + b, bf16 storage / f32 compute. a/b arrive as f32.
- baracuda_kernels_affine_f16_can_implement ⚠
  Implementability check for affine_f16. Host-side only.
- baracuda_kernels_affine_f16_run ⚠
  Affine y = a*x + b, f16 storage / f32 compute. a/b arrive as f32.
- baracuda_kernels_affine_f16_strided_can_implement ⚠
  baracuda_kernels_affine_f16_strided_can_implement (baracuda kernels affine f16 strided can implement).
- baracuda_kernels_affine_f16_strided_run ⚠
  Strided affine y = a*x + b, f16 storage / f32 compute. a/b arrive as f32.
- baracuda_kernels_affine_f32_can_implement ⚠
  Implementability check for affine_f32. Host-side only.
- baracuda_kernels_affine_f32_run ⚠
  Affine y = a*x + b, f32 dtype.
- baracuda_kernels_affine_f32_strided_can_implement ⚠
  baracuda_kernels_affine_f32_strided_can_implement (baracuda kernels affine f32 strided can implement).
- baracuda_kernels_affine_f32_strided_run ⚠
  Strided affine y = a*x + b, f32 dtype.
- baracuda_kernels_affine_f64_can_implement ⚠
  Implementability check for affine_f64. Host-side only.
- baracuda_kernels_affine_f64_run ⚠
  Affine y = a*x + b, f64 dtype.
- baracuda_kernels_affine_f64_strided_can_implement ⚠
  baracuda_kernels_affine_f64_strided_can_implement (baracuda kernels affine f64 strided can implement).
- baracuda_kernels_affine_f64_strided_run ⚠
  Strided affine y = a*x + b, f64 dtype.
- baracuda_kernels_affine_grid_2d_f32_can_implement ⚠
  baracuda_kernels_affine_grid_2d_f32_can_implement (baracuda kernels affine grid 2d f32 can implement).
- baracuda_kernels_affine_grid_2d_f32_run ⚠
  affine_grid(theta, size): produce [N, OH, OW, 2] grid from theta: [N, 2, 3]. f32. Safety: as above.
- baracuda_kernels_affine_grid_2d_f64_can_implement ⚠
  baracuda_kernels_affine_grid_2d_f64_can_implement (baracuda kernels affine grid 2d f64 can implement).
- baracuda_kernels_affine_grid_2d_f64_run ⚠
  affine_grid_2d, f64. Safety: as f32.
- baracuda_kernels_affine_i8_can_implement ⚠
  Implementability check for affine_i8. Host-side only.
- baracuda_kernels_affine_i8_run ⚠
  Affine y = a*x + b, i8 dtype.
- baracuda_kernels_affine_i32_can_implement ⚠
  Implementability check for affine_i32. Host-side only.
- baracuda_kernels_affine_i32_run ⚠
  Affine y = a*x + b, i32 dtype.
- baracuda_kernels_affine_i32_strided_can_implement ⚠
  baracuda_kernels_affine_i32_strided_can_implement (baracuda kernels affine i32 strided can implement).
- baracuda_kernels_affine_i32_strided_run ⚠
  Strided affine y = a*x + b, i32 dtype.
- baracuda_kernels_affine_i64_can_implement ⚠
  Implementability check for affine_i64. Host-side only.
- baracuda_kernels_affine_i64_run ⚠
  Affine y = a*x + b, i64 dtype.
- baracuda_kernels_affine_i64_strided_can_implement ⚠
  baracuda_kernels_affine_i64_strided_can_implement (baracuda kernels affine i64 strided can implement).
- baracuda_kernels_affine_i64_strided_run ⚠
  Strided affine y = a*x + b, i64 dtype.
- baracuda_kernels_affine_inplace_bf16_can_implement ⚠
  Implementability check for baracuda_kernels_affine_inplace_bf16. Host-side only.
- baracuda_kernels_affine_inplace_bf16_run ⚠
  In-place affine y = scale * y + offset (bf16). Phase 61: added for Fuel’s INPLACE_AFFINE op family completion (bf16/f16 weight-decay scaling, Op::AddScalar / Op::MulScalar on bf16 model weights).
- baracuda_kernels_affine_inplace_bf16_strided_can_implement ⚠
  Implementability check for baracuda_kernels_affine_inplace_bf16_strided. Host-side only.
- baracuda_kernels_affine_inplace_bf16_strided_run ⚠
  Strided in-place affine (bf16; f32 scalars). Phase 62.
- baracuda_kernels_affine_inplace_f16_can_implement ⚠
  Implementability check for baracuda_kernels_affine_inplace_f16. Host-side only.
- baracuda_kernels_affine_inplace_f16_run ⚠
  In-place affine y = scale * y + offset (f16). Phase 61: added for Fuel’s INPLACE_AFFINE op family completion.
- baracuda_kernels_affine_inplace_f16_strided_can_implement ⚠
  Implementability check for baracuda_kernels_affine_inplace_f16_strided. Host-side only.
- baracuda_kernels_affine_inplace_f16_strided_run ⚠
  Strided in-place affine (f16; f32 scalars). Phase 62.
- baracuda_kernels_affine_inplace_f32_can_implement ⚠
  Implementability check for baracuda_kernels_affine_inplace_f32. Host-side only.
- baracuda_kernels_affine_inplace_f32_run ⚠
  In-place affine y = scale * y + offset (f32). Used by the safe-plan layer to remap a cuRAND uniform-(0, 1] buffer into Uniform(low, high].
- baracuda_kernels_affine_inplace_f32_strided_can_implement ⚠
  Implementability check for baracuda_kernels_affine_inplace_f32_strided. Host-side only.
- baracuda_kernels_affine_inplace_f32_strided_run ⚠
  In-place affine y[off] = scale * y[off] + offset over a strided view (f32). Phase 62.
- baracuda_kernels_affine_inplace_f64_can_implement ⚠
  Implementability check for baracuda_kernels_affine_inplace_f64. Host-side only.
- baracuda_kernels_affine_inplace_f64_run ⚠
  In-place affine y = scale * y + offset (f64).
- baracuda_kernels_affine_inplace_f64_strided_can_implement ⚠
  Implementability check for baracuda_kernels_affine_inplace_f64_strided. Host-side only.
- baracuda_kernels_affine_inplace_f64_strided_run ⚠
  Strided in-place affine (f64). Phase 62.
- baracuda_kernels_affine_inplace_i8_can_implement ⚠
  Implementability check for baracuda_kernels_affine_inplace_i8. Host-side only.
- baracuda_kernels_affine_inplace_i8_run ⚠
  In-place affine y = scale * y + offset (i8). Phase 62.
- baracuda_kernels_affine_inplace_i32_can_implement ⚠
  Implementability check for baracuda_kernels_affine_inplace_i32. Host-side only.
- baracuda_kernels_affine_inplace_i32_run ⚠
  In-place affine y = scale * y + offset (i32). Phase 62.
- baracuda_kernels_affine_inplace_i32_strided_can_implement ⚠
  Implementability check for baracuda_kernels_affine_inplace_i32_strided. Host-side only.
- baracuda_kernels_affine_inplace_i32_strided_run ⚠
  Strided in-place affine (i32). Phase 62.
- baracuda_kernels_affine_inplace_i64_can_implement ⚠
  Implementability check for baracuda_kernels_affine_inplace_i64. Host-side only.
- baracuda_kernels_affine_inplace_i64_run ⚠
  In-place affine y = scale * y + offset (i64). Phase 62.
- baracuda_kernels_affine_inplace_i64_strided_can_implement ⚠
  Implementability check for baracuda_kernels_affine_inplace_i64_strided. Host-side only.
- baracuda_kernels_affine_inplace_i64_strided_run ⚠
  Strided in-place affine (i64). Phase 62.
- baracuda_kernels_affine_inplace_u8_can_implement ⚠
  Implementability check for baracuda_kernels_affine_inplace_u8. Host-side only.
- baracuda_kernels_affine_inplace_u8_run ⚠
  In-place affine y = scale * y + offset (u8). Phase 62.
- baracuda_kernels_affine_inplace_u8_strided_can_implement ⚠
  Implementability check for baracuda_kernels_affine_inplace_u8_strided. Host-side only.
- baracuda_kernels_affine_inplace_u8_strided_run ⚠
  Strided in-place affine (u8). Phase 62.
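The in-place affine entries above state the formula y = scale * y + offset and mention remapping a uniform (0, 1] buffer into (low, high]. As a host-side illustration only, the sketch below picks scale = high - low and offset = low; that choice is an assumption about how such a remap could be done, and both function names are hypothetical.

```rust
/// Host-side reference for the in-place affine `y = scale * y + offset`.
/// Hypothetical helper, not the crate's kernel.
pub fn affine_inplace(y: &mut [f32], scale: f32, offset: f32) {
    for v in y.iter_mut() {
        *v = scale * *v + offset;
    }
}

/// Remap samples from (0, 1] to (low, high].
/// Assumption: scale = high - low and offset = low.
pub fn remap_uniform(y: &mut [f32], low: f32, high: f32) {
    affine_inplace(y, high - low, low);
}
```

With low = -2.0 and high = 2.0, the inputs 0.25, 0.5, and 1.0 map to -1.0, 0.0, and 2.0. The open lower bound is preserved because an input of exactly 0 never occurs in a (0, 1] buffer.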
- baracuda_kernels_affine_u8_can_implement ⚠
  Implementability check for affine_u8. Host-side only.
- baracuda_kernels_affine_u8_run ⚠
  Affine y = a*x + b, u8 dtype.
- baracuda_kernels_affine_u8_strided_can_implement ⚠
  baracuda_kernels_affine_u8_strided_can_implement (baracuda kernels affine u8 strided can implement).
- baracuda_kernels_affine_u8_strided_run ⚠
  Strided affine y = a*x + b, u8 dtype.
- baracuda_kernels_alibi_backward_bf16_can_implement ⚠
  Implementability check for alibi_backward_bf16. Host-side only.
- baracuda_kernels_alibi_backward_bf16_run ⚠
  ALiBi BW, bf16.
- baracuda_kernels_alibi_backward_f16_can_implement ⚠
  Implementability check for alibi_backward_f16. Host-side only.
- baracuda_kernels_alibi_backward_f16_run ⚠
  ALiBi BW, f16.
- baracuda_kernels_alibi_backward_f32_can_implement ⚠
  Implementability check for alibi_backward_f32. Host-side only.
- baracuda_kernels_alibi_backward_f32_run ⚠
  ALiBi BW, f32. da[b, h, i, j] = dy[b, h, i, j] (pass-through); dslope[h] = Σ_{b, i, j} dy[b, h, i, j] · (j - i). Either da or dslope may be null to skip; both null is rejected.
- baracuda_kernels_alibi_backward_f64_can_implement ⚠
  Implementability check for alibi_backward_f64. Host-side only.
- baracuda_kernels_alibi_backward_f64_run ⚠
  ALiBi BW, f64.
- baracuda_kernels_alibi_bf16_can_implement ⚠
  Implementability check for alibi_bf16. Host-side only.
- baracuda_kernels_alibi_bf16_run ⚠
  ALiBi FW, bf16.
- baracuda_kernels_alibi_f16_can_implement ⚠
  Implementability check for alibi_f16. Host-side only.
- baracuda_kernels_alibi_f16_run ⚠
  ALiBi FW, f16.
- baracuda_kernels_alibi_f32_can_implement ⚠
  Implementability check for alibi_f32. Host-side only.
- baracuda_kernels_alibi_f32_run ⚠
  ALiBi FW, f32. y[b, h, i, j] = scores[b, h, i, j] + slopes[h] · (j - i).
- baracuda_kernels_alibi_f64_can_implement ⚠
  Implementability check for alibi_f64. Host-side only.
- baracuda_kernels_alibi_f64_run ⚠
  ALiBi FW, f64.
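The ALiBi forward and backward formulas given in the f32 entries above can be checked against a small host-side reference. The sketch below assumes a contiguous row-major [B, H, I, J] layout, which the entries do not state, and the function names are hypothetical. It always produces both gradients rather than modelling the null-pointer skip.

```rust
/// Forward: y[b,h,i,j] = scores[b,h,i,j] + slopes[h] * (j - i).
/// Assumption: contiguous row-major [B, H, I, J] storage.
pub fn alibi_forward(
    scores: &[f32], slopes: &[f32],
    b: usize, h: usize, q: usize, k: usize,
) -> Vec<f32> {
    assert_eq!(scores.len(), b * h * q * k);
    assert_eq!(slopes.len(), h);
    let mut y = vec![0.0f32; scores.len()];
    for bi in 0..b {
        for hi in 0..h {
            for i in 0..q {
                for j in 0..k {
                    let idx = ((bi * h + hi) * q + i) * k + j;
                    y[idx] = scores[idx] + slopes[hi] * (j as f32 - i as f32);
                }
            }
        }
    }
    y
}

/// Backward: da = dy (pass-through) and
/// dslope[h] = sum over b, i, j of dy[b,h,i,j] * (j - i).
pub fn alibi_backward(
    dy: &[f32], b: usize, h: usize, q: usize, k: usize,
) -> (Vec<f32>, Vec<f32>) {
    assert_eq!(dy.len(), b * h * q * k);
    let da = dy.to_vec();
    let mut dslope = vec![0.0f32; h];
    for bi in 0..b {
        for hi in 0..h {
            for i in 0..q {
                for j in 0..k {
                    let idx = ((bi * h + hi) * q + i) * k + j;
                    dslope[hi] += dy[idx] * (j as f32 - i as f32);
                }
            }
        }
    }
    (da, dslope)
}
```

A reference like this is mainly useful for comparing kernel output on small shapes, where the sequential summation order on the host will match the device result only up to floating-point rounding.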
- baracuda_kernels_apply_token_penalty_f32_can_implement ⚠
  baracuda_kernels_apply_token_penalty_f32_can_implement (baracuda kernels apply token penalty f32 can implement).
- baracuda_kernels_apply_token_penalty_f32_run ⚠
  baracuda_kernels_apply_token_penalty_f32_run (baracuda kernels apply token penalty f32 run).
- baracuda_kernels_arg_reduce_argmax_bf16_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_bf16.
- baracuda_kernels_arg_reduce_argmax_bf16_i32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_bf16_i32.
- baracuda_kernels_arg_reduce_argmax_bf16_i32_run ⚠
  argmax(x, axis=k), bf16 input, i32 output.
- baracuda_kernels_arg_reduce_argmax_bf16_run ⚠
  argmax(x, axis=k), bf16 input, i64 output.
- baracuda_kernels_arg_reduce_argmax_bf16_u32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_bf16_u32.
- baracuda_kernels_arg_reduce_argmax_bf16_u32_run ⚠
  argmax(x, axis=k), bf16 input, u32 output.
- baracuda_kernels_arg_reduce_argmax_f16_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_f16.
- baracuda_kernels_arg_reduce_argmax_f16_i32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_f16_i32.
- baracuda_kernels_arg_reduce_argmax_f16_i32_run ⚠
  argmax(x, axis=k), f16 input, i32 output.
- baracuda_kernels_arg_reduce_argmax_f16_run ⚠
  argmax(x, axis=k), f16 input, i64 output.
- baracuda_kernels_arg_reduce_argmax_f16_u32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_f16_u32.
- baracuda_kernels_arg_reduce_argmax_f16_u32_run ⚠
  argmax(x, axis=k), f16 input, u32 output.
- baracuda_kernels_arg_reduce_argmax_f32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_f32.
- baracuda_kernels_arg_reduce_argmax_f32_i32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_f32_i32.
- baracuda_kernels_arg_reduce_argmax_f32_i32_run ⚠
  argmax(x, axis=k), f32 input, i32 output.
- baracuda_kernels_arg_reduce_argmax_f32_run ⚠
  argmax(x, axis=k), f32 input, i64 output. Ties broken by first occurrence (smallest index wins).
- baracuda_kernels_arg_reduce_argmax_f32_u32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_f32_u32.
- baracuda_kernels_arg_reduce_argmax_f32_u32_run ⚠
  argmax(x, axis=k), f32 input, u32 output.
- baracuda_kernels_arg_reduce_argmax_f64_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_f64.
- baracuda_kernels_arg_reduce_argmax_f64_i32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_f64_i32.
- baracuda_kernels_arg_reduce_argmax_f64_i32_run ⚠
  argmax(x, axis=k), f64 input, i32 output.
- baracuda_kernels_arg_reduce_argmax_f64_run ⚠
  argmax(x, axis=k), f64 input, i64 output.
- baracuda_kernels_arg_reduce_argmax_f64_u32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_f64_u32.
- baracuda_kernels_arg_reduce_argmax_f64_u32_run ⚠
  argmax(x, axis=k), f64 input, u32 output.
- baracuda_kernels_arg_reduce_argmax_i8_i32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_i8_i32.
- baracuda_kernels_arg_reduce_argmax_i8_i32_run ⚠
  argmax(x, axis=k), i8 input, i32 idx output.
- baracuda_kernels_arg_reduce_argmax_i8_i64_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_i8_i64.
- baracuda_kernels_arg_reduce_argmax_i8_i64_run ⚠
  argmax(x, axis=k), i8 input, i64 idx output.
- baracuda_kernels_arg_reduce_argmax_i16_i32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_i16_i32.
- baracuda_kernels_arg_reduce_argmax_i16_i32_run ⚠
  argmax(x, axis=k), i16 input, i32 idx output.
- baracuda_kernels_arg_reduce_argmax_i16_i64_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_i16_i64.
- baracuda_kernels_arg_reduce_argmax_i16_i64_run ⚠
  argmax(x, axis=k), i16 input, i64 idx output.
- baracuda_kernels_arg_reduce_argmax_i32_i32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_i32_i32.
- baracuda_kernels_arg_reduce_argmax_i32_i32_run ⚠
  argmax(x, axis=k), i32 input, i32 idx output.
- baracuda_kernels_arg_reduce_argmax_i32_i64_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_i32_i64.
- baracuda_kernels_arg_reduce_argmax_i32_i64_run ⚠
  argmax(x, axis=k), i32 input, i64 idx output.
- baracuda_kernels_arg_reduce_argmax_i64_i32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_i64_i32.
- baracuda_kernels_arg_reduce_argmax_i64_i32_run ⚠
  argmax(x, axis=k), i64 input, i32 idx output.
- baracuda_kernels_arg_reduce_argmax_i64_i64_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_i64_i64.
- baracuda_kernels_arg_reduce_argmax_i64_i64_run ⚠
  argmax(x, axis=k), i64 input, i64 idx output.
- baracuda_kernels_arg_reduce_argmax_u8_i32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_u8_i32.
- baracuda_kernels_arg_reduce_argmax_u8_i32_run ⚠
  argmax(x, axis=k), u8 input, i32 idx output.
- baracuda_kernels_arg_reduce_argmax_u8_i64_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_u8_i64.
- baracuda_kernels_arg_reduce_argmax_u8_i64_run ⚠
  argmax(x, axis=k), u8 input, i64 idx output.
- baracuda_kernels_arg_reduce_argmax_u32_i32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_u32_i32.
- baracuda_kernels_arg_reduce_argmax_u32_i32_run ⚠
  argmax(x, axis=k), u32 input, i32 idx output.
- baracuda_kernels_arg_reduce_argmax_u32_i64_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmax_u32_i64.
- baracuda_kernels_arg_reduce_argmax_u32_i64_run ⚠
  argmax(x, axis=k), u32 input, i64 idx output.
- baracuda_kernels_arg_reduce_argmin_bf16_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_bf16.
- baracuda_kernels_arg_reduce_argmin_bf16_i32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_bf16_i32.
- baracuda_kernels_arg_reduce_argmin_bf16_i32_run ⚠
  argmin(x, axis=k), bf16 input, i32 output.
- baracuda_kernels_arg_reduce_argmin_bf16_run ⚠
  argmin(x, axis=k), bf16 input, i64 output.
- baracuda_kernels_arg_reduce_argmin_bf16_u32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_bf16_u32.
- baracuda_kernels_arg_reduce_argmin_bf16_u32_run ⚠
  argmin(x, axis=k), bf16 input, u32 output.
- baracuda_kernels_arg_reduce_argmin_f16_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_f16.
- baracuda_kernels_arg_reduce_argmin_f16_i32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_f16_i32.
- baracuda_kernels_arg_reduce_argmin_f16_i32_run ⚠
  argmin(x, axis=k), f16 input, i32 output.
- baracuda_kernels_arg_reduce_argmin_f16_run ⚠
  argmin(x, axis=k), f16 input, i64 output.
- baracuda_kernels_arg_reduce_argmin_f16_u32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_f16_u32.
- baracuda_kernels_arg_reduce_argmin_f16_u32_run ⚠
  argmin(x, axis=k), f16 input, u32 output.
- baracuda_kernels_arg_reduce_argmin_f32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_f32.
- baracuda_kernels_arg_reduce_argmin_f32_i32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_f32_i32.
- baracuda_kernels_arg_reduce_argmin_f32_i32_run ⚠
  argmin(x, axis=k), f32 input, i32 output.
- baracuda_kernels_arg_reduce_argmin_f32_run ⚠
  argmin(x, axis=k), f32 input, i64 output.
- baracuda_kernels_arg_reduce_argmin_f32_u32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_f32_u32.
- baracuda_kernels_arg_reduce_argmin_f32_u32_run ⚠
  argmin(x, axis=k), f32 input, u32 output.
- baracuda_kernels_arg_reduce_argmin_f64_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_f64.
- baracuda_kernels_arg_reduce_argmin_f64_i32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_f64_i32.
- baracuda_kernels_arg_reduce_argmin_f64_i32_run ⚠
  argmin(x, axis=k), f64 input, i32 output.
- baracuda_kernels_arg_reduce_argmin_f64_run ⚠
  argmin(x, axis=k), f64 input, i64 output.
- baracuda_kernels_arg_reduce_argmin_f64_u32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_f64_u32.
- baracuda_kernels_arg_reduce_argmin_f64_u32_run ⚠
  argmin(x, axis=k), f64 input, u32 output.
- baracuda_kernels_arg_reduce_argmin_i8_i32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_i8_i32.
- baracuda_kernels_arg_reduce_argmin_i8_i32_run ⚠
  argmin(x, axis=k), i8 input, i32 idx output.
- baracuda_kernels_arg_reduce_argmin_i8_i64_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_i8_i64.
- baracuda_kernels_arg_reduce_argmin_i8_i64_run ⚠
  argmin(x, axis=k), i8 input, i64 idx output.
- baracuda_kernels_arg_reduce_argmin_i16_i32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_i16_i32.
- baracuda_kernels_arg_reduce_argmin_i16_i32_run ⚠
  argmin(x, axis=k), i16 input, i32 idx output.
- baracuda_kernels_arg_reduce_argmin_i16_i64_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_i16_i64.
- baracuda_kernels_arg_reduce_argmin_i16_i64_run ⚠
  argmin(x, axis=k), i16 input, i64 idx output.
- baracuda_kernels_arg_reduce_argmin_i32_i32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_i32_i32.
- baracuda_kernels_arg_reduce_argmin_i32_i32_run ⚠
  argmin(x, axis=k), i32 input, i32 idx output.
- baracuda_kernels_arg_reduce_argmin_i32_i64_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_i32_i64.
- baracuda_kernels_arg_reduce_argmin_i32_i64_run ⚠
  argmin(x, axis=k), i32 input, i64 idx output.
- baracuda_kernels_arg_reduce_argmin_i64_i32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_i64_i32.
- baracuda_kernels_arg_reduce_argmin_i64_i32_run ⚠
  argmin(x, axis=k), i64 input, i32 idx output.
- baracuda_kernels_arg_reduce_argmin_i64_i64_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_i64_i64.
- baracuda_kernels_arg_reduce_argmin_i64_i64_run ⚠
  argmin(x, axis=k), i64 input, i64 idx output.
- baracuda_kernels_arg_reduce_argmin_u8_i32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_u8_i32.
- baracuda_kernels_arg_reduce_argmin_u8_i32_run ⚠
  argmin(x, axis=k), u8 input, i32 idx output.
- baracuda_kernels_arg_reduce_argmin_u8_i64_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_u8_i64.
- baracuda_kernels_arg_reduce_argmin_u8_i64_run ⚠
  argmin(x, axis=k), u8 input, i64 idx output.
- baracuda_kernels_arg_reduce_argmin_u32_i32_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_u32_i32.
- baracuda_kernels_arg_reduce_argmin_u32_i32_run ⚠
  argmin(x, axis=k), u32 input, i32 idx output.
- baracuda_kernels_arg_reduce_argmin_u32_i64_can_implement ⚠
  Pre-launch implementability check for arg_reduce_argmin_u32_i64.
- baracuda_kernels_arg_reduce_argmin_u32_i64_run ⚠
  argmin(x, axis=k), u32 input, i64 idx output.
- baracuda_kernels_argsort_bf16_can_implement ⚠
  baracuda_kernels_argsort_bf16_can_implement (baracuda kernels argsort bf16 can implement).
- baracuda_kernels_argsort_bf16_run ⚠
  Block-bitonic argsort, bf16. Comparator uses native __nv_bfloat16 operator< (CUDA device-side intrinsics).
- baracuda_kernels_argsort_f16_can_implement ⚠
  baracuda_kernels_argsort_f16_can_implement (baracuda kernels argsort f16 can implement).
- baracuda_kernels_argsort_f16_run ⚠
  Block-bitonic argsort, f16. Comparator uses native __half operator<.
- baracuda_kernels_argsort_f32_big_can_implement ⚠
  baracuda_kernels_argsort_f32_big_can_implement (baracuda kernels argsort f32 big can implement).
- baracuda_kernels_argsort_f32_big_run ⚠
  Multi-block radix argsort, f32, for row_len > 1024.
- baracuda_kernels_argsort_f32_big_workspace_size ⚠
  baracuda_kernels_argsort_f32_big_workspace_size (baracuda kernels argsort f32 big workspace size).
- baracuda_kernels_argsort_f32_can_implement ⚠
  baracuda_kernels_argsort_f32_can_implement (baracuda kernels argsort f32 can implement).
- baracuda_kernels_argsort_f32_run ⚠
  Block-bitonic argsort, f32. Returns indices; values not written.
- baracuda_
kernels_ ⚠argsort_ f64_ big_ can_ implement - Pre-launch implementability check for argsort_f64_big. - baracuda_
kernels_ ⚠argsort_ f64_ big_ run - Multi-block radix argsort, f64.
- baracuda_
kernels_ ⚠argsort_ f64_ big_ workspace_ size - Workspace-size query for argsort_f64_big. - baracuda_
kernels_ ⚠argsort_ f64_ can_ implement - Pre-launch implementability check for argsort_f64. - baracuda_
kernels_ ⚠argsort_ f64_ run - Block-bitonic argsort, f64.
- baracuda_
kernels_ ⚠argsort_ fp8e4m3_ can_ implement - Pre-launch implementability check for argsort_fp8e4m3. - baracuda_
kernels_ ⚠argsort_ fp8e4m3_ run - Block-bitonic argsort, FP8 E4M3. Storage is byte-identical to raw u8; the kernel wraps it in an Fp8E4M3Sort struct that decodes to float in the comparator. Raw-byte buffer in, i32 index buffer out. - baracuda_
kernels_ ⚠argsort_ i8_ can_ implement - Pre-launch implementability check for argsort_i8. - baracuda_
kernels_ ⚠argsort_ i8_ run - Block-bitonic argsort, i8.
- baracuda_
kernels_ ⚠argsort_ i16_ can_ implement - Pre-launch implementability check for argsort_i16. - baracuda_
kernels_ ⚠argsort_ i16_ run - Block-bitonic argsort, i16.
- baracuda_
kernels_ ⚠argsort_ i32_ big_ can_ implement - Pre-launch implementability check for argsort_i32_big. - baracuda_
kernels_ ⚠argsort_ i32_ big_ run - Multi-block radix argsort, i32.
- baracuda_
kernels_ ⚠argsort_ i32_ big_ workspace_ size - Workspace-size query for argsort_i32_big. - baracuda_
kernels_ ⚠argsort_ i32_ can_ implement - Pre-launch implementability check for argsort_i32. - baracuda_
kernels_ ⚠argsort_ i32_ run - Block-bitonic argsort, i32.
- baracuda_
kernels_ ⚠argsort_ i64_ big_ can_ implement - Pre-launch implementability check for argsort_i64_big. - baracuda_
kernels_ ⚠argsort_ i64_ big_ run - Multi-block radix argsort, i64.
- baracuda_
kernels_ ⚠argsort_ i64_ big_ workspace_ size - Workspace-size query for argsort_i64_big. - baracuda_
kernels_ ⚠argsort_ i64_ can_ implement - Pre-launch implementability check for argsort_i64. - baracuda_
kernels_ ⚠argsort_ i64_ run - Block-bitonic argsort, i64.
- baracuda_
kernels_ ⚠argsort_ u8_ can_ implement - Pre-launch implementability check for argsort_u8. - baracuda_
kernels_ ⚠argsort_ u8_ run - Block-bitonic argsort, u8.
- baracuda_
kernels_ ⚠argsort_ u32_ can_ implement - Pre-launch implementability check for argsort_u32. - baracuda_
kernels_ ⚠argsort_ u32_ run - Block-bitonic argsort, u32.
- baracuda_
kernels_ ⚠batch_ norm_ backward_ bf16_ can_ implement - Pre-launch implementability check for batch_norm_backward_bf16. - baracuda_
kernels_ ⚠batch_ norm_ backward_ bf16_ run - BatchNorm BW, bf16.
- baracuda_
kernels_ ⚠batch_ norm_ backward_ f16_ can_ implement - Pre-launch implementability check for batch_norm_backward_f16. - baracuda_
kernels_ ⚠batch_ norm_ backward_ f16_ run - BatchNorm BW, f16.
- baracuda_
kernels_ ⚠batch_ norm_ backward_ f32_ can_ implement - Pre-launch implementability check for batch_norm_backward_f32. - baracuda_
kernels_ ⚠batch_ norm_ backward_ f32_ run - BatchNorm BW, f32. Three-stage: per-group sum_dxh / sum_dxhxh, per-cell dx, per-channel dgamma / dbeta. Requires a workspace of 2 * group_count * sizeof(float) bytes for the stage-1 partial sums (group_count = c_extent for BN). - baracuda_
kernels_ ⚠batch_ norm_ backward_ f64_ can_ implement - Pre-launch implementability check for batch_norm_backward_f64. - baracuda_
kernels_ ⚠batch_ norm_ backward_ f64_ run - BatchNorm BW, f64.
- baracuda_
kernels_ ⚠batch_ norm_ bf16_ can_ implement - Pre-launch implementability check for batch_norm_bf16. - baracuda_
kernels_ ⚠batch_ norm_ bf16_ run - BatchNorm FW, bf16.
- baracuda_
kernels_ ⚠batch_ norm_ f16_ can_ implement - Pre-launch implementability check for batch_norm_f16. - baracuda_
kernels_ ⚠batch_ norm_ f16_ run - BatchNorm FW, f16.
- baracuda_
kernels_ ⚠batch_ norm_ f32_ can_ implement - Pre-launch implementability check for batch_norm_f32. - baracuda_
kernels_ ⚠batch_ norm_ f32_ run - BatchNorm FW, f32. Training mode: computes per-channel (mean, inv_std) from the batch + spatial cells and writes them to saved_mean / saved_rstd for BW. gamma / beta are optional (both supplied together per PyTorch convention). - baracuda_
kernels_ ⚠batch_ norm_ f64_ can_ implement - Pre-launch implementability check for batch_norm_f64. - baracuda_
kernels_ ⚠batch_ norm_ f64_ run - BatchNorm FW, f64.
- baracuda_
kernels_ ⚠batched_ ormqr_ complex32_ can_ implement - Pre-launch implementability check for batched_ormqr_complex32. - baracuda_
kernels_ ⚠batched_ ormqr_ complex32_ run - Batched-unmqr, Complex32. Same shape/contract as the f32 variant but with cuFloatComplex storage. op = 2 (C, conjugate transpose) is supported; op = 1 (T, plain transpose) is rejected by the Rust safe layer for complex (mathematically unusual for Householder). - baracuda_
kernels_ ⚠batched_ ormqr_ complex64_ can_ implement - Pre-launch implementability check for batched_ormqr_complex64. - baracuda_
kernels_ ⚠batched_ ormqr_ complex64_ run - Batched-unmqr, Complex64. Same as the complex32 variant with cuDoubleComplex storage. - baracuda_
kernels_ ⚠batched_ ormqr_ f32_ can_ implement - Pre-launch implementability check for batched_ormqr_f32. - baracuda_
kernels_ ⚠batched_ ormqr_ f32_ run - Batched-ormqr, f32. Applies the implicit Q (or Q^T) from a BatchedQrPlan packed output (A_packed [B, M, K] column-major - baracuda_
kernels_ ⚠batched_ ormqr_ f64_ can_ implement - Pre-launch implementability check for batched_ormqr_f64. - baracuda_
kernels_ ⚠batched_ ormqr_ f64_ run - Batched-ormqr, f64. Same contract as the f32 variant. - baracuda_
kernels_ ⚠batched_ ormqr_ wy_ build_ t_ complex32_ can_ implement - Pre-launch implementability check for batched_ormqr_wy_build_t_complex32. - baracuda_
kernels_ ⚠batched_ ormqr_ wy_ build_ t_ complex32_ run - WY block T-build, Complex32. f32-complex analogue of the f32 variant. Storage is cuFloatComplex (== Complex32, ABI-compatible). - baracuda_
kernels_ ⚠batched_ ormqr_ wy_ build_ t_ complex64_ can_ implement - Pre-launch implementability check for batched_ormqr_wy_build_t_complex64. - baracuda_
kernels_ ⚠batched_ ormqr_ wy_ build_ t_ complex64_ run - WY block T-build, Complex64. f64-complex analogue. - baracuda_
kernels_ ⚠batched_ ormqr_ wy_ build_ t_ f32_ can_ implement - Pre-launch implementability check for batched_ormqr_wy_build_t_f32. - baracuda_
kernels_ ⚠batched_ ormqr_ wy_ build_ t_ f32_ run - WY block T-build, f32. For each (batch_slot, block_index), builds the [nb, nb] upper-triangular block-reflector matrix T such that H_0 · ... · H_{nb-1} = I - V·T·V^T. One CUDA block per (batch, num_blocks) cell. Status codes: 0 success, 2 invalid problem, 5 launch failure. - baracuda_
kernels_ ⚠batched_ ormqr_ wy_ build_ t_ f64_ can_ implement - Pre-launch implementability check for batched_ormqr_wy_build_t_f64. - baracuda_
kernels_ ⚠batched_ ormqr_ wy_ build_ t_ f64_ run - WY block T-build, f64 analogue. - baracuda_
kernels_ ⚠batched_ ormqr_ wy_ extract_ v_ complex32_ can_ implement - Pre-launch implementability check for batched_ormqr_wy_extract_v_complex32. - baracuda_
kernels_ ⚠batched_ ormqr_ wy_ extract_ v_ complex32_ run - WY V-extraction, Complex32. f32-complex analogue. Pure copy kernel: sets the implicit-1 (as (1, 0)), zeroes above the diagonal (as (0, 0)), copies the strict lower below. - baracuda_
kernels_ ⚠batched_ ormqr_ wy_ extract_ v_ complex64_ can_ implement - Pre-launch implementability check for batched_ormqr_wy_extract_v_complex64. - baracuda_
kernels_ ⚠batched_ ormqr_ wy_ extract_ v_ complex64_ run - WY V-extraction, Complex64. f64-complex analogue. - baracuda_
kernels_ ⚠batched_ ormqr_ wy_ extract_ v_ f32_ can_ implement - Pre-launch implementability check for batched_ormqr_wy_extract_v_f32. - baracuda_
kernels_ ⚠batched_ ormqr_ wy_ extract_ v_ f32_ run - WY V-extraction, f32. Materializes the dense V [B, M, nb] panel for one block of reflectors (starting at block_start, with block_k = min(nb, K - block_start)) into a contiguous workspace buffer. Sets the implicit-1 at each reflector’s diagonal, copies the packed-A strict lower below, zeros above the diagonal, and zeros entire columns past block_k (handles the partial-last-block case). - baracuda_
kernels_ ⚠batched_ ormqr_ wy_ extract_ v_ f64_ can_ implement - Pre-launch implementability check for batched_ormqr_wy_extract_v_f64. - baracuda_
kernels_ ⚠batched_ ormqr_ wy_ extract_ v_ f64_ run - WY V-extraction, f64 analogue. - baracuda_
kernels_ ⚠batched_ qr_ materialize_ identity_ f32_ can_ implement - Pre-launch implementability check for batched_qr_materialize_identity_f32. - baracuda_
kernels_ ⚠batched_ qr_ materialize_ identity_ f32_ run - Stage a column-major identity Q [B, M, M] (one identity per batch slot) into a freshly allocated buffer. Caller then chains baracuda_kernels_batched_ormqr_*_run with op = 0 (N) to overwrite Q in place with the dense Q matrix from the geqrf-packed input. f32. - baracuda_
kernels_ ⚠batched_ qr_ materialize_ identity_ f64_ can_ implement - Pre-launch implementability check for batched_qr_materialize_identity_f64. - baracuda_
kernels_ ⚠batched_ qr_ materialize_ identity_ f64_ run - Stage identity, f64 analogue. - baracuda_
kernels_ ⚠batched_ qr_ materialize_ r_ f32_ can_ implement - Pre-launch implementability check for batched_qr_materialize_r_f32. - baracuda_
kernels_ ⚠batched_ qr_ materialize_ r_ f32_ run - Materialize dense R [B, K, N] from a geqrf-packed A [B, M, N] (column-major). K = min(M, N). Cell R[b, i, j] = A[b, i, j] if i ≤ j, else 0. One CUDA block per (batch_slot, column). f32. - baracuda_
kernels_ ⚠batched_ qr_ materialize_ r_ f64_ can_ implement - Pre-launch implementability check for batched_qr_materialize_r_f64. - baracuda_
kernels_ ⚠batched_ qr_ materialize_ r_ f64_ run - Materialize dense R, f64 analogue. - baracuda_
kernels_ ⚠bernoulli_ can_ implement - Pre-launch implementability check for bernoulli. - baracuda_
kernels_ ⚠bernoulli_ run - bernoulli over a float uniform-rand buffer. - baracuda_
kernels_ ⚠binary_ add_ backward_ bf16_ can_ implement - Pre-launch implementability check for binary_add_backward_bf16. - baracuda_
kernels_ ⚠binary_ add_ backward_ bf16_ run - Add backward, bf16.
- baracuda_
kernels_ ⚠binary_ add_ backward_ f16_ can_ implement - Pre-launch implementability check for binary_add_backward_f16. - baracuda_
kernels_ ⚠binary_ add_ backward_ f16_ run - Add backward, f16.
- baracuda_
kernels_ ⚠binary_ add_ backward_ f32_ can_ implement - Pre-launch implementability check for binary_add_backward_f32. - baracuda_
kernels_ ⚠binary_ add_ backward_ f32_ run - Add backward, f32. Writes da = dy and db = dy. - baracuda_
kernels_ ⚠binary_ add_ backward_ f64_ can_ implement - Pre-launch implementability check for binary_add_backward_f64. - baracuda_
kernels_ ⚠binary_ add_ backward_ f64_ run - Add backward, f64.
- baracuda_
kernels_ ⚠binary_ add_ bf16_ can_ implement - Pre-launch implementability check for
binary_add_bf16. - baracuda_
kernels_ ⚠binary_ add_ bf16_ run - Binary elementwise
add, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠binary_ add_ bf16_ strided_ can_ implement - Pre-launch implementability check for
binary_add_bf16_strided. - baracuda_
kernels_ ⚠binary_ add_ bf16_ strided_ run - Binary elementwise
add, bf16 dtype, strided / broadcast path. - baracuda_
kernels_ ⚠binary_ add_ f16_ can_ implement - Pre-launch implementability check for
binary_add_f16. - baracuda_
kernels_ ⚠binary_ add_ f16_ run - Binary elementwise
add, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠binary_ add_ f16_ strided_ can_ implement - Pre-launch implementability check for
binary_add_f16_strided. - baracuda_
kernels_ ⚠binary_ add_ f16_ strided_ run - Binary elementwise
add, f16 dtype, strided / broadcast path. - baracuda_
kernels_ ⚠binary_ add_ f32_ can_ implement - Pre-launch implementability check for
binary_add_f32. Validates the problem size without launching a kernel. Returns the standard status code mapping. - baracuda_
kernels_ ⚠binary_ add_ f32_ run - Binary elementwise
add, f32 dtype, contiguous fast path. This is the binary-pointwise trailblazer: its safety contract carries over to every other binary contig launcher (add, sub, mul, div, min, max, pow, comparison ops, etc.) across all dtypes. - baracuda_
kernels_ ⚠binary_ add_ f32_ strided_ can_ implement - Pre-launch implementability check for
binary_add_f32_strided. - baracuda_
kernels_ ⚠binary_ add_ f32_ strided_ run - Binary elementwise
add, f32 dtype, strided / broadcast path. This is the binary-strided trailblazer — its safety contract (including aliasing) carries over to every other binary strided launcher across all dtypes. - baracuda_
kernels_ ⚠binary_ add_ f64_ can_ implement - Pre-launch implementability check for
binary_add_f64. - baracuda_
kernels_ ⚠binary_ add_ f64_ run - Binary elementwise
add, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠binary_ add_ f64_ strided_ can_ implement - Pre-launch implementability check for
binary_add_f64_strided. - baracuda_
kernels_ ⚠binary_ add_ f64_ strided_ run - Binary elementwise
add, f64 dtype, strided / broadcast path. - baracuda_
kernels_ ⚠binary_ atan2_ backward_ bf16_ can_ implement - Pre-launch implementability check for binary_atan2_backward_bf16. - baracuda_
kernels_ ⚠binary_ atan2_ backward_ bf16_ run - Atan2 backward, bf16.
- baracuda_
kernels_ ⚠binary_ atan2_ backward_ f16_ can_ implement - Pre-launch implementability check for binary_atan2_backward_f16. - baracuda_
kernels_ ⚠binary_ atan2_ backward_ f16_ run - Atan2 backward, f16.
- baracuda_
kernels_ ⚠binary_ atan2_ backward_ f32_ can_ implement - Pre-launch implementability check for binary_atan2_backward_f32. - baracuda_
kernels_ ⚠binary_ atan2_ backward_ f32_ run - Atan2 backward, f32. denom = a² + b², da = dy * b / denom, db = -dy * a / denom. The caller is responsible for guarding against a == 0 && b == 0 (denom == 0). - baracuda_
kernels_ ⚠binary_ atan2_ backward_ f64_ can_ implement - Pre-launch implementability check for binary_atan2_backward_f64. - baracuda_
kernels_ ⚠binary_ atan2_ backward_ f64_ run - Atan2 backward, f64.
- baracuda_
kernels_ ⚠binary_ atan2_ bf16_ can_ implement - Binary
atan2, bf16, can-implement. - baracuda_
kernels_ ⚠binary_ atan2_ bf16_ run - Binary
atan2, bf16, contig. - baracuda_
kernels_ ⚠binary_ atan2_ bf16_ strided_ can_ implement - Pre-launch implementability check for
binary_atan2_bf16_strided. - baracuda_
kernels_ ⚠binary_ atan2_ bf16_ strided_ run - Binary
atan2, bf16, strided. - baracuda_
kernels_ ⚠binary_ atan2_ f16_ can_ implement - Binary
atan2, f16, can-implement. - baracuda_
kernels_ ⚠binary_ atan2_ f16_ run - Binary
atan2, f16, contig. - baracuda_
kernels_ ⚠binary_ atan2_ f16_ strided_ can_ implement - Pre-launch implementability check for
binary_atan2_f16_strided. - baracuda_
kernels_ ⚠binary_ atan2_ f16_ strided_ run - Binary
atan2, f16, strided. - baracuda_
kernels_ ⚠binary_ atan2_ f32_ can_ implement - Binary
atan2, f32, can-implement. - baracuda_
kernels_ ⚠binary_ atan2_ f32_ run - Binary
atan2, f32, contig. - baracuda_
kernels_ ⚠binary_ atan2_ f32_ strided_ can_ implement - Pre-launch implementability check for
binary_atan2_f32_strided. - baracuda_
kernels_ ⚠binary_ atan2_ f32_ strided_ run - Binary
atan2, f32, strided. - baracuda_
kernels_ ⚠binary_ atan2_ f64_ can_ implement - Binary
atan2, f64, can-implement. - baracuda_
kernels_ ⚠binary_ atan2_ f64_ run - Binary
atan2, f64, contig. - baracuda_
kernels_ ⚠binary_ atan2_ f64_ strided_ can_ implement - Pre-launch implementability check for
binary_atan2_f64_strided. - baracuda_
kernels_ ⚠binary_ atan2_ f64_ strided_ run - Binary
atan2, f64, strided. - baracuda_
kernels_ ⚠binary_ bitwise_ and_ i32_ can_ implement - Binary bitwise
and, i32 dtype, can-implement. - baracuda_
kernels_ ⚠binary_ bitwise_ and_ i32_ run - Binary bitwise
and, i32 dtype, contig. - baracuda_
kernels_ ⚠binary_ bitwise_ and_ i64_ can_ implement - Binary bitwise
and, i64 dtype, can-implement. - baracuda_
kernels_ ⚠binary_ bitwise_ and_ i64_ run - Binary bitwise
and, i64 dtype, contig. - baracuda_
kernels_ ⚠binary_ bitwise_ left_ shift_ i32_ can_ implement - Binary bitwise
left_shift, i32 dtype, can-implement. - baracuda_
kernels_ ⚠binary_ bitwise_ left_ shift_ i32_ run - Binary bitwise
left_shift, i32 dtype, contig. - baracuda_
kernels_ ⚠binary_ bitwise_ left_ shift_ i64_ can_ implement - Binary bitwise
left_shift, i64 dtype, can-implement. - baracuda_
kernels_ ⚠binary_ bitwise_ left_ shift_ i64_ run - Binary bitwise
left_shift, i64 dtype, contig. - baracuda_
kernels_ ⚠binary_ bitwise_ or_ i32_ can_ implement - Binary bitwise
or, i32 dtype, can-implement. - baracuda_
kernels_ ⚠binary_ bitwise_ or_ i32_ run - Binary bitwise
or, i32 dtype, contig. - baracuda_
kernels_ ⚠binary_ bitwise_ or_ i64_ can_ implement - Binary bitwise
or, i64 dtype, can-implement. - baracuda_
kernels_ ⚠binary_ bitwise_ or_ i64_ run - Binary bitwise
or, i64 dtype, contig. - baracuda_
kernels_ ⚠binary_ bitwise_ right_ shift_ i32_ can_ implement - Binary bitwise
right_shift, i32 dtype, can-implement. - baracuda_
kernels_ ⚠binary_ bitwise_ right_ shift_ i32_ run - Binary bitwise
right_shift, i32 dtype, contig. Arithmetic shift (sign-extending), matching PyTorch. - baracuda_
kernels_ ⚠binary_ bitwise_ right_ shift_ i64_ can_ implement - Binary bitwise
right_shift, i64 dtype, can-implement. - baracuda_
kernels_ ⚠binary_ bitwise_ right_ shift_ i64_ run - Binary bitwise
right_shift, i64 dtype, contig. Arithmetic shift (sign-extending), matching PyTorch. - baracuda_
kernels_ ⚠binary_ bitwise_ xor_ i32_ can_ implement - Binary bitwise
xor, i32 dtype, can-implement. - baracuda_
kernels_ ⚠binary_ bitwise_ xor_ i32_ run - Binary bitwise
xor, i32 dtype, contig. - baracuda_
kernels_ ⚠binary_ bitwise_ xor_ i64_ can_ implement - Binary bitwise
xor, i64 dtype, can-implement. - baracuda_
kernels_ ⚠binary_ bitwise_ xor_ i64_ run - Binary bitwise
xor, i64 dtype, contig. - baracuda_
kernels_ ⚠binary_ cmp_ eq_ bf16_ can_ implement - Pre-launch implementability check for
binary_cmp_eq_bf16. - baracuda_
kernels_ ⚠binary_ cmp_ eq_ bf16_ run - Binary elementwise
eq, bf16 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ eq_ bf16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ eq_ bf16_ strided_ run - Binary elementwise
eq, bf16 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ eq_ f16_ can_ implement - Pre-launch implementability check for
binary_cmp_eq_f16. - baracuda_
kernels_ ⚠binary_ cmp_ eq_ f16_ run - Binary elementwise
eq, f16 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ eq_ f16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ eq_ f16_ strided_ run - Binary elementwise
eq, f16 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ eq_ f32_ can_ implement - Pre-launch implementability check for
binary_cmp_eq_f32. - baracuda_
kernels_ ⚠binary_ cmp_ eq_ f32_ run - Binary elementwise
eq, f32 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ eq_ f32_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ eq_ f32_ strided_ run - Binary elementwise
eq, f32 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ eq_ f64_ can_ implement - Pre-launch implementability check for
binary_cmp_eq_f64. - baracuda_
kernels_ ⚠binary_ cmp_ eq_ f64_ run - Binary elementwise
eq, f64 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ eq_ f64_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ eq_ f64_ strided_ run - Binary elementwise
eq, f64 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ ge_ bf16_ can_ implement - Pre-launch implementability check for
binary_cmp_ge_bf16. - baracuda_
kernels_ ⚠binary_ cmp_ ge_ bf16_ run - Binary elementwise
ge, bf16 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ ge_ bf16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ ge_ bf16_ strided_ run - Binary elementwise
ge, bf16 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ ge_ f16_ can_ implement - Pre-launch implementability check for
binary_cmp_ge_f16. - baracuda_
kernels_ ⚠binary_ cmp_ ge_ f16_ run - Binary elementwise
ge, f16 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ ge_ f16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ ge_ f16_ strided_ run - Binary elementwise
ge, f16 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ ge_ f32_ can_ implement - Pre-launch implementability check for
binary_cmp_ge_f32. - baracuda_
kernels_ ⚠binary_ cmp_ ge_ f32_ run - Binary elementwise
ge (a >= b), f32 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ ge_ f32_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ ge_ f32_ strided_ run - Binary elementwise
ge, f32 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ ge_ f64_ can_ implement - Pre-launch implementability check for
binary_cmp_ge_f64. - baracuda_
kernels_ ⚠binary_ cmp_ ge_ f64_ run - Binary elementwise
ge, f64 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ ge_ f64_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ ge_ f64_ strided_ run - Binary elementwise
ge, f64 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ gt_ bf16_ can_ implement - Pre-launch implementability check for
binary_cmp_gt_bf16. - baracuda_
kernels_ ⚠binary_ cmp_ gt_ bf16_ run - Binary elementwise
gt, bf16 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ gt_ bf16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ gt_ bf16_ strided_ run - Binary elementwise
gt, bf16 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ gt_ f16_ can_ implement - Pre-launch implementability check for
binary_cmp_gt_f16. - baracuda_
kernels_ ⚠binary_ cmp_ gt_ f16_ run - Binary elementwise
gt, f16 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ gt_ f16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ gt_ f16_ strided_ run - Binary elementwise
gt, f16 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ gt_ f32_ can_ implement - Pre-launch implementability check for
binary_cmp_gt_f32. - baracuda_
kernels_ ⚠binary_ cmp_ gt_ f32_ run - Binary elementwise
gt (a > b), f32 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ gt_ f32_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ gt_ f32_ strided_ run - Binary elementwise
gt, f32 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ gt_ f64_ can_ implement - Pre-launch implementability check for
binary_cmp_gt_f64. - baracuda_
kernels_ ⚠binary_ cmp_ gt_ f64_ run - Binary elementwise
gt, f64 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ gt_ f64_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ gt_ f64_ strided_ run - Binary elementwise
gt, f64 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ le_ bf16_ can_ implement - Pre-launch implementability check for
binary_cmp_le_bf16. - baracuda_
kernels_ ⚠binary_ cmp_ le_ bf16_ run - Binary elementwise
le, bf16 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ le_ bf16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ le_ bf16_ strided_ run - Binary elementwise
le, bf16 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ le_ f16_ can_ implement - Pre-launch implementability check for
binary_cmp_le_f16. - baracuda_
kernels_ ⚠binary_ cmp_ le_ f16_ run - Binary elementwise
le, f16 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ le_ f16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ le_ f16_ strided_ run - Binary elementwise
le, f16 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ le_ f32_ can_ implement - Pre-launch implementability check for
binary_cmp_le_f32. - baracuda_
kernels_ ⚠binary_ cmp_ le_ f32_ run - Binary elementwise
le (a <= b), f32 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ le_ f32_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ le_ f32_ strided_ run - Binary elementwise
le, f32 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ le_ f64_ can_ implement - Pre-launch implementability check for
binary_cmp_le_f64. - baracuda_
kernels_ ⚠binary_ cmp_ le_ f64_ run - Binary elementwise
le, f64 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ le_ f64_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ le_ f64_ strided_ run - Binary elementwise
le, f64 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ lt_ bf16_ can_ implement - Pre-launch implementability check for
binary_cmp_lt_bf16. - baracuda_
kernels_ ⚠binary_ cmp_ lt_ bf16_ run - Binary elementwise
lt, bf16 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ lt_ bf16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ lt_ bf16_ strided_ run - Binary elementwise
lt, bf16 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ lt_ f16_ can_ implement - Pre-launch implementability check for
binary_cmp_lt_f16. - baracuda_
kernels_ ⚠binary_ cmp_ lt_ f16_ run - Binary elementwise
lt, f16 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ lt_ f16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ lt_ f16_ strided_ run - Binary elementwise
lt, f16 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ lt_ f32_ can_ implement - Pre-launch implementability check for
binary_cmp_lt_f32. - baracuda_
kernels_ ⚠binary_ cmp_ lt_ f32_ run - Binary elementwise
lt (a < b), f32 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ lt_ f32_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ lt_ f32_ strided_ run - Binary elementwise
lt, f32 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ lt_ f64_ can_ implement - Pre-launch implementability check for
binary_cmp_lt_f64. - baracuda_
kernels_ ⚠binary_ cmp_ lt_ f64_ run - Binary elementwise
lt, f64 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ lt_ f64_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ lt_ f64_ strided_ run - Binary elementwise
lt, f64 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ ne_ bf16_ can_ implement - Pre-launch implementability check for
binary_cmp_ne_bf16. - baracuda_
kernels_ ⚠binary_ cmp_ ne_ bf16_ run - Binary elementwise
ne, bf16 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ ne_ bf16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ ne_ bf16_ strided_ run - Binary elementwise
ne, bf16 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ ne_ f16_ can_ implement - Pre-launch implementability check for
binary_cmp_ne_f16. - baracuda_
kernels_ ⚠binary_ cmp_ ne_ f16_ run - Binary elementwise
ne, f16 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ ne_ f16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ ne_ f16_ strided_ run - Binary elementwise
ne, f16 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ ne_ f32_ can_ implement - Pre-launch implementability check for
binary_cmp_ne_f32. - baracuda_
kernels_ ⚠binary_ cmp_ ne_ f32_ run - Binary elementwise
ne, f32 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ ne_ f32_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ ne_ f32_ strided_ run - Binary elementwise
ne, f32 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ cmp_ ne_ f64_ can_ implement - Pre-launch implementability check for
binary_cmp_ne_f64. - baracuda_
kernels_ ⚠binary_ cmp_ ne_ f64_ run - Binary elementwise
ne, f64 inputs, u8 output, contig fast path. - baracuda_
kernels_ ⚠binary_ cmp_ ne_ f64_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠binary_ cmp_ ne_ f64_ strided_ run - Binary elementwise
ne, f64 inputs, u8 output, strided path. - baracuda_
kernels_ ⚠binary_ copysign_ bf16_ can_ implement - Binary
copysign, bf16, can-implement. - baracuda_
kernels_ ⚠binary_ copysign_ bf16_ run - Binary
copysign, bf16, contig. - baracuda_
kernels_ ⚠binary_ copysign_ bf16_ strided_ can_ implement - Pre-launch implementability check for
binary_copysign_bf16_strided. - baracuda_
kernels_ ⚠binary_ copysign_ bf16_ strided_ run - Binary
copysign, bf16, strided. - baracuda_
kernels_ ⚠binary_ copysign_ f16_ can_ implement - Binary
copysign, f16, can-implement. - baracuda_
kernels_ ⚠binary_ copysign_ f16_ run - Binary
copysign, f16, contig. - baracuda_
kernels_ ⚠binary_ copysign_ f16_ strided_ can_ implement - Pre-launch implementability check for
binary_copysign_f16_strided. - baracuda_
kernels_ ⚠binary_ copysign_ f16_ strided_ run - Binary
copysign, f16, strided. - baracuda_
kernels_ ⚠binary_ copysign_ f32_ can_ implement - Binary
copysign, f32, can-implement. - baracuda_
kernels_ ⚠binary_ copysign_ f32_ run - Binary
copysign, f32, contig. - baracuda_
kernels_ ⚠binary_ copysign_ f32_ strided_ can_ implement - Pre-launch implementability check for
binary_copysign_f32_strided. - baracuda_
kernels_ ⚠binary_ copysign_ f32_ strided_ run - Binary
copysign, f32, strided. - baracuda_
kernels_ ⚠binary_ copysign_ f64_ can_ implement - Binary
copysign, f64, can-implement. - baracuda_
kernels_ ⚠binary_ copysign_ f64_ run - Binary
copysign, f64, contig. - baracuda_
kernels_ ⚠binary_ copysign_ f64_ strided_ can_ implement - Pre-launch implementability check for
binary_copysign_f64_strided. - baracuda_
kernels_ ⚠binary_ copysign_ f64_ strided_ run - Binary
copysign, f64, strided. - baracuda_
kernels_ ⚠binary_ div_ backward_ bf16_ can_ implement - Pre-launch implementability check for
binary_div_backward_bf16. - baracuda_
kernels_ ⚠binary_ div_ backward_ bf16_ run - Div backward, bf16.
- baracuda_
kernels_ ⚠binary_ div_ backward_ f16_ can_ implement - Pre-launch implementability check for
binary_div_backward_f16. - baracuda_
kernels_ ⚠binary_ div_ backward_ f16_ run - Div backward, f16.
- baracuda_
kernels_ ⚠binary_ div_ backward_ f32_ can_ implement - Pre-launch implementability check for
binary_div_backward_f32. - baracuda_
kernels_ ⚠binary_ div_ backward_ f32_ run - Div backward, f32. Writes
da = dy / b and db = -dy * a / b². Both saved tensors a and b must be non-null; callers must also ensure b[i] != 0 for every element. - baracuda_
kernels_ ⚠binary_ div_ backward_ f64_ can_ implement - Pre-launch implementability check for
binary_div_backward_f64. - baracuda_
kernels_ ⚠binary_ div_ backward_ f64_ run - Div backward, f64.
- baracuda_
kernels_ ⚠binary_ div_ bf16_ can_ implement - Pre-launch implementability check for
binary_div_bf16. - baracuda_
kernels_ ⚠binary_ div_ bf16_ run - Binary elementwise
div, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠binary_ div_ bf16_ strided_ can_ implement - Pre-launch implementability check for
binary_div_bf16_strided. - baracuda_
kernels_ ⚠binary_ div_ bf16_ strided_ run - Binary elementwise
div, bf16 dtype, strided / broadcast path. - baracuda_
kernels_ ⚠binary_ div_ f16_ can_ implement - Pre-launch implementability check for
binary_div_f16. - baracuda_
kernels_ ⚠binary_ div_ f16_ run - Binary elementwise
div, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠binary_ div_ f16_ strided_ can_ implement - Pre-launch implementability check for
binary_div_f16_strided. - baracuda_
kernels_ ⚠binary_ div_ f16_ strided_ run - Binary elementwise
div, f16 dtype, strided / broadcast path. - baracuda_
kernels_ ⚠binary_ div_ f32_ can_ implement - Pre-launch implementability check for
binary_div_f32. Validates the problem size without launching a kernel. Returns the standard status code mapping. - baracuda_
kernels_ ⚠binary_ div_ f32_ run - Binary elementwise
div, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠binary_ div_ f32_ strided_ can_ implement - Pre-launch implementability check for
binary_div_f32_strided. - baracuda_
kernels_ ⚠binary_ div_ f32_ strided_ run - Binary elementwise
div, f32 dtype, strided / broadcast path. - baracuda_
kernels_ ⚠binary_ div_ f64_ can_ implement - Pre-launch implementability check for
binary_div_f64. - baracuda_
kernels_ ⚠binary_ div_ f64_ run - Binary elementwise
div, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠binary_ div_ f64_ strided_ can_ implement - Pre-launch implementability check for
binary_div_f64_strided. - baracuda_
kernels_ ⚠binary_ div_ f64_ strided_ run - Binary elementwise
div, f64 dtype, strided / broadcast path. - baracuda_
kernels_ ⚠binary_ floor_ divide_ bf16_ can_ implement - Binary
floor_divide, bf16, can-implement. - baracuda_
kernels_ ⚠binary_ floor_ divide_ bf16_ run - Binary
floor_divide, bf16, contig. - baracuda_
kernels_ ⚠binary_ floor_ divide_ bf16_ strided_ can_ implement - Pre-launch implementability check for
binary_floor_divide_bf16_strided. - baracuda_
kernels_ ⚠binary_ floor_ divide_ bf16_ strided_ run - Binary
floor_divide, bf16, strided. - baracuda_
kernels_ ⚠binary_ floor_ divide_ f16_ can_ implement - Binary
floor_divide, f16, can-implement. - baracuda_
kernels_ ⚠binary_ floor_ divide_ f16_ run - Binary
floor_divide, f16, contig. - baracuda_
kernels_ ⚠binary_ floor_ divide_ f16_ strided_ can_ implement - Pre-launch implementability check for
binary_floor_divide_f16_strided. - baracuda_
kernels_ ⚠binary_ floor_ divide_ f16_ strided_ run - Binary
floor_divide, f16, strided. - baracuda_
kernels_ ⚠binary_ floor_ divide_ f32_ can_ implement - Binary
floor_divide, f32, can-implement. - baracuda_
kernels_ ⚠binary_ floor_ divide_ f32_ run - Binary
floor_divide, f32, contig. - baracuda_
kernels_ ⚠binary_ floor_ divide_ f32_ strided_ can_ implement - Pre-launch implementability check for
binary_floor_divide_f32_strided. - baracuda_
kernels_ ⚠binary_ floor_ divide_ f32_ strided_ run - Binary
floor_divide, f32, strided. - baracuda_
kernels_ ⚠binary_ floor_ divide_ f64_ can_ implement - Binary
floor_divide, f64, can-implement. - baracuda_
kernels_ ⚠binary_ floor_ divide_ f64_ run - Binary
floor_divide, f64, contig. - baracuda_
kernels_ ⚠binary_ floor_ divide_ f64_ strided_ can_ implement - Pre-launch implementability check for
binary_floor_divide_f64_strided. - baracuda_
kernels_ ⚠binary_ floor_ divide_ f64_ strided_ run - Binary
floor_divide, f64, strided. - baracuda_
kernels_ ⚠binary_ fmax_ bf16_ can_ implement - Binary
fmax, bf16, can-implement. - baracuda_
kernels_ ⚠binary_ fmax_ bf16_ run - Binary
fmax, bf16, contig. - baracuda_
kernels_ ⚠binary_ fmax_ bf16_ strided_ can_ implement - Pre-launch implementability check for
binary_fmax_bf16_strided. - baracuda_
kernels_ ⚠binary_ fmax_ bf16_ strided_ run - Binary
fmax, bf16, strided. - baracuda_
kernels_ ⚠binary_ fmax_ f16_ can_ implement - Binary
fmax, f16, can-implement. - baracuda_
kernels_ ⚠binary_ fmax_ f16_ run - Binary
fmax, f16, contig. - baracuda_
kernels_ ⚠binary_ fmax_ f16_ strided_ can_ implement - Pre-launch implementability check for
binary_fmax_f16_strided. - baracuda_
kernels_ ⚠binary_ fmax_ f16_ strided_ run - Binary
fmax, f16, strided. - baracuda_
kernels_ ⚠binary_ fmax_ f32_ can_ implement - Binary
fmax, f32, can-implement. - baracuda_
kernels_ ⚠binary_ fmax_ f32_ run - Binary
fmax, f32, contig. - baracuda_
kernels_ ⚠binary_ fmax_ f32_ strided_ can_ implement - Pre-launch implementability check for
binary_fmax_f32_strided. - baracuda_
kernels_ ⚠binary_ fmax_ f32_ strided_ run - Binary
fmax, f32, strided. - baracuda_
kernels_ ⚠binary_ fmax_ f64_ can_ implement - Binary
fmax, f64, can-implement. - baracuda_
kernels_ ⚠binary_ fmax_ f64_ run - Binary
fmax, f64, contig. - baracuda_
kernels_ ⚠binary_ fmax_ f64_ strided_ can_ implement - Pre-launch implementability check for
binary_fmax_f64_strided. - baracuda_
kernels_ ⚠binary_ fmax_ f64_ strided_ run - Binary
fmax, f64, strided. - baracuda_
kernels_ ⚠binary_ fmin_ bf16_ can_ implement - Binary
fmin, bf16, can-implement. - baracuda_
kernels_ ⚠binary_ fmin_ bf16_ run - Binary
fmin, bf16, contig. - baracuda_
kernels_ ⚠binary_ fmin_ bf16_ strided_ can_ implement - Pre-launch implementability check for
binary_fmin_bf16_strided. - baracuda_
kernels_ ⚠binary_ fmin_ bf16_ strided_ run - Binary
fmin, bf16, strided. - baracuda_
kernels_ ⚠binary_ fmin_ f16_ can_ implement - Binary
fmin, f16, can-implement. - baracuda_
kernels_ ⚠binary_ fmin_ f16_ run - Binary
fmin, f16, contig. - baracuda_
kernels_ ⚠binary_ fmin_ f16_ strided_ can_ implement - Pre-launch implementability check for
binary_fmin_f16_strided. - baracuda_
kernels_ ⚠binary_ fmin_ f16_ strided_ run - Binary
fmin, f16, strided. - baracuda_
kernels_ ⚠binary_ fmin_ f32_ can_ implement - Binary
fmin, f32, can-implement. - baracuda_
kernels_ ⚠binary_ fmin_ f32_ run - Binary
fmin, f32, contig. - baracuda_
kernels_ ⚠binary_ fmin_ f32_ strided_ can_ implement - Pre-launch implementability check for
binary_fmin_f32_strided. - baracuda_
kernels_ ⚠binary_ fmin_ f32_ strided_ run - Binary
fmin, f32, strided. - baracuda_
kernels_ ⚠binary_ fmin_ f64_ can_ implement - Binary
fmin, f64, can-implement. - baracuda_
kernels_ ⚠binary_ fmin_ f64_ run - Binary
fmin, f64, contig. - baracuda_
kernels_ ⚠binary_ fmin_ f64_ strided_ can_ implement - Pre-launch implementability check for
binary_fmin_f64_strided. - baracuda_
kernels_ ⚠binary_ fmin_ f64_ strided_ run - Binary
fmin, f64, strided. - baracuda_
kernels_ ⚠binary_ hypot_ backward_ bf16_ can_ implement - Pre-launch implementability check for
binary_hypot_backward_bf16. - baracuda_
kernels_ ⚠binary_ hypot_ backward_ bf16_ run - Hypot backward, bf16.
- baracuda_
kernels_ ⚠binary_ hypot_ backward_ f16_ can_ implement - Pre-launch implementability check for
binary_hypot_backward_f16. - baracuda_
kernels_ ⚠binary_ hypot_ backward_ f16_ run - Hypot backward, f16.
- baracuda_
kernels_ ⚠binary_ hypot_ backward_ f32_ can_ implement - Pre-launch implementability check for
binary_hypot_backward_f32. - baracuda_
kernels_ ⚠binary_ hypot_ backward_ f32_ run - Hypot backward, f32.
y = sqrt(a² + b²) is reconstructed inside the kernel from saved a and b (no saved-y slot in BinaryBackwardArgs); da = dy * a / y, db = dy * b / y. The caller is responsible for guarding against a == 0 && b == 0 (y == 0). - baracuda_
kernels_ ⚠binary_ hypot_ backward_ f64_ can_ implement - Pre-launch implementability check for
binary_hypot_backward_f64. - baracuda_
kernels_ ⚠binary_ hypot_ backward_ f64_ run - Hypot backward, f64.
- baracuda_
kernels_ ⚠binary_ hypot_ bf16_ can_ implement - Binary
hypot, bf16, can-implement. - baracuda_
kernels_ ⚠binary_ hypot_ bf16_ run - Binary
hypot, bf16, contig. - baracuda_
kernels_ ⚠binary_ hypot_ bf16_ strided_ can_ implement - Pre-launch implementability check for
binary_hypot_bf16_strided. - baracuda_
kernels_ ⚠binary_ hypot_ bf16_ strided_ run - Binary
hypot, bf16, strided. - baracuda_
kernels_ ⚠binary_ hypot_ f16_ can_ implement - Binary
hypot, f16, can-implement. - baracuda_
kernels_ ⚠binary_ hypot_ f16_ run - Binary
hypot, f16, contig. - baracuda_
kernels_ ⚠binary_ hypot_ f16_ strided_ can_ implement - Pre-launch implementability check for
binary_hypot_f16_strided. - baracuda_
kernels_ ⚠binary_ hypot_ f16_ strided_ run - Binary
hypot, f16, strided. - baracuda_
kernels_ ⚠binary_ hypot_ f32_ can_ implement - Binary
hypot, f32, can-implement. - baracuda_
kernels_ ⚠binary_ hypot_ f32_ run - Binary
hypot, f32, contig. - baracuda_
kernels_ ⚠binary_ hypot_ f32_ strided_ can_ implement - Pre-launch implementability check for
binary_hypot_f32_strided. - baracuda_
kernels_ ⚠binary_ hypot_ f32_ strided_ run - Binary
hypot, f32, strided. - baracuda_
kernels_ ⚠binary_ hypot_ f64_ can_ implement - Binary
hypot, f64, can-implement. - baracuda_
kernels_ ⚠binary_ hypot_ f64_ run - Binary
hypot, f64, contig. - baracuda_
kernels_ ⚠binary_ hypot_ f64_ strided_ can_ implement - Pre-launch implementability check for
binary_hypot_f64_strided. - baracuda_
kernels_ ⚠binary_ hypot_ f64_ strided_ run - Binary
hypot, f64, strided. - baracuda_
kernels_ ⚠binary_ lerp_ backward_ bf16_ can_ implement - Pre-launch implementability check for
binary_lerp_backward_bf16. - baracuda_
kernels_ ⚠binary_ lerp_ backward_ bf16_ run - lerp backward, bf16. - baracuda_
kernels_ ⚠binary_ lerp_ backward_ f16_ can_ implement - Pre-launch implementability check for
binary_lerp_backward_f16. - baracuda_
kernels_ ⚠binary_ lerp_ backward_ f16_ run - lerp backward, f16. - baracuda_
kernels_ ⚠binary_ lerp_ backward_ f32_ can_ implement - Pre-launch implementability check for
binary_lerp_backward_f32. - baracuda_
kernels_ ⚠binary_ lerp_ backward_ f32_ run - lerp backward: da = (1 - weight)·dy, db = weight·dy, f32. No saved tensors are required. - baracuda_
kernels_ ⚠binary_ lerp_ backward_ f64_ can_ implement - Pre-launch implementability check for
binary_lerp_backward_f64. - baracuda_
kernels_ ⚠binary_ lerp_ backward_ f64_ run - lerp backward, f64. - baracuda_
kernels_ ⚠binary_ lerp_ bf16_ can_ implement - Pre-launch implementability check for
binary_lerp_bf16. - baracuda_
kernels_ ⚠binary_ lerp_ bf16_ run - lerp forward, bf16. - baracuda_
kernels_ ⚠binary_ lerp_ f16_ can_ implement - Pre-launch implementability check for
binary_lerp_f16. - baracuda_
kernels_ ⚠binary_ lerp_ f16_ run - lerp forward, f16. - baracuda_
kernels_ ⚠binary_ lerp_ f32_ can_ implement - Pre-launch implementability check for
binary_lerp_f32. - baracuda_
kernels_ ⚠binary_ lerp_ f32_ run - Binary elementwise
lerp(a, b; weight) = a + weight·(b - a), f32, contig. - baracuda_
kernels_ ⚠binary_ lerp_ f64_ can_ implement - Pre-launch implementability check for
binary_lerp_f64. - baracuda_
kernels_ ⚠binary_ lerp_ f64_ run - lerp forward, f64. The f32 weight widens to f64 losslessly. - baracuda_
kernels_ ⚠binary_ logical_ and_ bool_ can_ implement - Binary logical
and, Bool dtype, can-implement. - baracuda_
kernels_ ⚠binary_ logical_ and_ bool_ run - Binary logical
and, Bool dtype (1-byte storage), contig. - baracuda_
kernels_ ⚠binary_ logical_ or_ bool_ can_ implement - Binary logical
or, Bool dtype, can-implement. - baracuda_
kernels_ ⚠binary_ logical_ or_ bool_ run - Binary logical
or, Bool dtype, contig. - baracuda_
kernels_ ⚠binary_ logical_ xor_ bool_ can_ implement - Binary logical
xor, Bool dtype, can-implement. - baracuda_
kernels_ ⚠binary_ logical_ xor_ bool_ run - Binary logical
xor, Bool dtype, contig. - baracuda_
kernels_ ⚠binary_ maximum_ backward_ bf16_ can_ implement - Pre-launch implementability check for
binary_maximum_backward_bf16. - baracuda_
kernels_ ⚠binary_ maximum_ backward_ bf16_ run - Maximum backward, bf16.
- baracuda_
kernels_ ⚠binary_ maximum_ backward_ f16_ can_ implement - Pre-launch implementability check for
binary_maximum_backward_f16. - baracuda_
kernels_ ⚠binary_ maximum_ backward_ f16_ run - Maximum backward, f16.
- baracuda_
kernels_ ⚠binary_ maximum_ backward_ f32_ can_ implement - Pre-launch implementability check for
binary_maximum_backward_f32. - baracuda_
kernels_ ⚠binary_ maximum_ backward_ f32_ run - Maximum backward, f32. Tie-break: split
dy evenly on a == b; NaN inputs propagate dy to both. Saved a and b are used purely as references for the comparison. - baracuda_
kernels_ ⚠binary_ maximum_ backward_ f64_ can_ implement - Pre-launch implementability check for
binary_maximum_backward_f64. - baracuda_
kernels_ ⚠binary_ maximum_ backward_ f64_ run - Maximum backward, f64.
- baracuda_
kernels_ ⚠binary_ maximum_ bf16_ can_ implement - Binary
maximum, bf16, can-implement. - baracuda_
kernels_ ⚠binary_ maximum_ bf16_ run - Binary
maximum, bf16, contig. - baracuda_
kernels_ ⚠binary_ maximum_ bf16_ strided_ can_ implement - Pre-launch implementability check for
binary_maximum_bf16_strided. - baracuda_
kernels_ ⚠binary_ maximum_ bf16_ strided_ run - Binary
maximum, bf16, strided. - baracuda_
kernels_ ⚠binary_ maximum_ f16_ can_ implement - Binary
maximum, f16, can-implement. - baracuda_
kernels_ ⚠binary_ maximum_ f16_ run - Binary
maximum, f16, contig. - baracuda_
kernels_ ⚠binary_ maximum_ f16_ strided_ can_ implement - Pre-launch implementability check for
binary_maximum_f16_strided. - baracuda_
kernels_ ⚠binary_ maximum_ f16_ strided_ run - Binary
maximum, f16, strided. - baracuda_
kernels_ ⚠binary_ maximum_ f32_ can_ implement - Binary
maximum, f32, can-implement. - baracuda_
kernels_ ⚠binary_ maximum_ f32_ run - Binary
maximum, f32, contig. - baracuda_
kernels_ ⚠binary_ maximum_ f32_ strided_ can_ implement - Pre-launch implementability check for
binary_maximum_f32_strided. - baracuda_
kernels_ ⚠binary_ maximum_ f32_ strided_ run - Binary
maximum, f32, strided. - baracuda_
kernels_ ⚠binary_ maximum_ f64_ can_ implement - Binary
maximum, f64, can-implement. - baracuda_
kernels_ ⚠binary_ maximum_ f64_ run - Binary
maximum, f64, contig. - baracuda_
kernels_ ⚠binary_ maximum_ f64_ strided_ can_ implement - Pre-launch implementability check for
binary_maximum_f64_strided. - baracuda_
kernels_ ⚠binary_ maximum_ f64_ strided_ run - Binary
maximum, f64, strided. - baracuda_
kernels_ ⚠binary_ minimum_ backward_ bf16_ can_ implement - Pre-launch implementability check for
binary_minimum_backward_bf16. - baracuda_
kernels_ ⚠binary_ minimum_ backward_ bf16_ run - Minimum backward, bf16.
- baracuda_
kernels_ ⚠binary_ minimum_ backward_ f16_ can_ implement - Pre-launch implementability check for
binary_minimum_backward_f16. - baracuda_
kernels_ ⚠binary_ minimum_ backward_ f16_ run - Minimum backward, f16.
- baracuda_
kernels_ ⚠binary_ minimum_ backward_ f32_ can_ implement - Pre-launch implementability check for
binary_minimum_backward_f32. - baracuda_
kernels_ ⚠binary_ minimum_ backward_ f32_ run - Minimum backward, f32. Tie-break: split
dy evenly on a == b; NaN inputs propagate dy to both. Saved a and b are used purely as references for the comparison. - baracuda_
kernels_ ⚠binary_ minimum_ backward_ f64_ can_ implement - Pre-launch implementability check for
binary_minimum_backward_f64. - baracuda_
kernels_ ⚠binary_ minimum_ backward_ f64_ run - Minimum backward, f64.
- baracuda_
kernels_ ⚠binary_ minimum_ bf16_ can_ implement - Binary
minimum, bf16, can-implement. - baracuda_
kernels_ ⚠binary_ minimum_ bf16_ run - Binary
minimum, bf16, contig. - baracuda_
kernels_ ⚠binary_ minimum_ bf16_ strided_ can_ implement - Pre-launch implementability check for
binary_minimum_bf16_strided. - baracuda_
kernels_ ⚠binary_ minimum_ bf16_ strided_ run - Binary
minimum, bf16, strided. - baracuda_
kernels_ ⚠binary_ minimum_ f16_ can_ implement - Binary
minimum, f16, can-implement. - baracuda_
kernels_ ⚠binary_ minimum_ f16_ run - Binary
minimum, f16, contig. - baracuda_
kernels_ ⚠binary_ minimum_ f16_ strided_ can_ implement - Pre-launch implementability check for
binary_minimum_f16_strided. - baracuda_
kernels_ ⚠binary_ minimum_ f16_ strided_ run - Binary
minimum, f16, strided. - baracuda_
kernels_ ⚠binary_ minimum_ f32_ can_ implement - Binary
minimum, f32, can-implement. - baracuda_
kernels_ ⚠binary_ minimum_ f32_ run - Binary
minimum, f32, contig. - baracuda_
kernels_ ⚠binary_ minimum_ f32_ strided_ can_ implement - Pre-launch implementability check for
binary_minimum_f32_strided. - baracuda_
kernels_ ⚠binary_ minimum_ f32_ strided_ run - Binary
minimum, f32, strided. - baracuda_
kernels_ ⚠binary_ minimum_ f64_ can_ implement - Binary
minimum, f64, can-implement. - baracuda_
kernels_ ⚠binary_ minimum_ f64_ run - Binary
minimum, f64, contig. - baracuda_
kernels_ ⚠binary_ minimum_ f64_ strided_ can_ implement - Pre-launch implementability check for
binary_minimum_f64_strided. - baracuda_
kernels_ ⚠binary_ minimum_ f64_ strided_ run - Binary
minimum, f64, strided. - baracuda_
kernels_ ⚠binary_ mod_ bf16_ can_ implement - Binary
mod, bf16, can-implement. - baracuda_
kernels_ ⚠binary_ mod_ bf16_ run - Binary
mod, bf16, contig. - baracuda_
kernels_ ⚠binary_ mod_ bf16_ strided_ can_ implement - Pre-launch implementability check for
binary_mod_bf16_strided. - baracuda_
kernels_ ⚠binary_ mod_ bf16_ strided_ run - Binary
mod, bf16, strided. - baracuda_
kernels_ ⚠binary_ mod_ f16_ can_ implement - Binary
mod, f16, can-implement. - baracuda_
kernels_ ⚠binary_ mod_ f16_ run - Binary
mod, f16, contig. - baracuda_
kernels_ ⚠binary_ mod_ f16_ strided_ can_ implement - Pre-launch implementability check for
binary_mod_f16_strided. - baracuda_
kernels_ ⚠binary_ mod_ f16_ strided_ run - Binary
mod, f16, strided. - baracuda_
kernels_ ⚠binary_ mod_ f32_ can_ implement - Binary
mod, f32, can-implement. - baracuda_
kernels_ ⚠binary_ mod_ f32_ run - Binary
mod, f32, contig. - baracuda_
kernels_ ⚠binary_ mod_ f32_ strided_ can_ implement - Pre-launch implementability check for
binary_mod_f32_strided. - baracuda_
kernels_ ⚠binary_ mod_ f32_ strided_ run - Binary
mod, f32, strided. - baracuda_
kernels_ ⚠binary_ mod_ f64_ can_ implement - Binary
mod, f64, can-implement. - baracuda_
kernels_ ⚠binary_ mod_ f64_ run - Binary
mod, f64, contig. - baracuda_
kernels_ ⚠binary_ mod_ f64_ strided_ can_ implement - Pre-launch implementability check for
binary_mod_f64_strided. - baracuda_
kernels_ ⚠binary_ mod_ f64_ strided_ run - Binary
mod, f64, strided. - baracuda_
kernels_ ⚠binary_ mul_ backward_ bf16_ can_ implement - Pre-launch implementability check for
binary_mul_backward_bf16. - baracuda_
kernels_ ⚠binary_ mul_ backward_ bf16_ run - Mul backward, bf16.
- baracuda_
kernels_ ⚠binary_ mul_ backward_ f16_ can_ implement - Pre-launch implementability check for
binary_mul_backward_f16. - baracuda_
kernels_ ⚠binary_ mul_ backward_ f16_ run - Mul backward, f16.
- baracuda_
kernels_ ⚠binary_ mul_ backward_ f32_ can_ implement - Pre-launch implementability check for
binary_mul_backward_f32. - baracuda_
kernels_ ⚠binary_ mul_ backward_ f32_ run - Mul backward, f32. Writes
da = dy * b and db = dy * a. Both saved tensors a and b must be non-null. - baracuda_
kernels_ ⚠binary_ mul_ backward_ f64_ can_ implement - Pre-launch implementability check for
binary_mul_backward_f64. - baracuda_
kernels_ ⚠binary_ mul_ backward_ f64_ run - Mul backward, f64.
- baracuda_
kernels_ ⚠binary_ mul_ bf16_ can_ implement - Pre-launch implementability check for
binary_mul_bf16. - baracuda_
kernels_ ⚠binary_ mul_ bf16_ run - Binary elementwise
mul, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠binary_ mul_ bf16_ strided_ can_ implement - Pre-launch implementability check for
binary_mul_bf16_strided. - baracuda_
kernels_ ⚠binary_ mul_ bf16_ strided_ run - Binary elementwise
mul, bf16 dtype, strided / broadcast path. - baracuda_
kernels_ ⚠binary_ mul_ f16_ can_ implement - Pre-launch implementability check for
binary_mul_f16. - baracuda_
kernels_ ⚠binary_ mul_ f16_ run - Binary elementwise
mul, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠binary_ mul_ f16_ strided_ can_ implement - Pre-launch implementability check for
binary_mul_f16_strided. - baracuda_
kernels_ ⚠binary_ mul_ f16_ strided_ run - Binary elementwise
mul, f16 dtype, strided / broadcast path. - baracuda_
kernels_ ⚠binary_ mul_ f32_ can_ implement - Pre-launch implementability check for
binary_mul_f32. Validates the problem size without launching a kernel. Returns the standard status code mapping. - baracuda_
kernels_ ⚠binary_ mul_ f32_ run - Binary elementwise
mul, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠binary_ mul_ f32_ strided_ can_ implement - Pre-launch implementability check for
binary_mul_f32_strided. - baracuda_
kernels_ ⚠binary_ mul_ f32_ strided_ run - Binary elementwise
mul, f32 dtype, strided / broadcast path. - baracuda_
kernels_ ⚠binary_ mul_ f64_ can_ implement - Pre-launch implementability check for
binary_mul_f64. - baracuda_
kernels_ ⚠binary_ mul_ f64_ run - Binary elementwise
mul, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠binary_ mul_ f64_ strided_ can_ implement - Pre-launch implementability check for
binary_mul_f64_strided. - baracuda_
kernels_ ⚠binary_ mul_ f64_ strided_ run - Binary elementwise
mul, f64 dtype, strided / broadcast path. - baracuda_
kernels_ ⚠binary_ nextafter_ bf16_ can_ implement - Binary
nextafter, bf16, can-implement. - baracuda_
kernels_ ⚠binary_ nextafter_ bf16_ run - Binary
nextafter, bf16, contig. - baracuda_
kernels_ ⚠binary_ nextafter_ bf16_ strided_ can_ implement - Pre-launch implementability check for
binary_nextafter_bf16_strided. - baracuda_
kernels_ ⚠binary_ nextafter_ bf16_ strided_ run - Binary
nextafter, bf16, strided. - baracuda_
kernels_ ⚠binary_ nextafter_ f16_ can_ implement - Binary
nextafter, f16, can-implement. - baracuda_
kernels_ ⚠binary_ nextafter_ f16_ run - Binary
nextafter, f16, contig. - baracuda_
kernels_ ⚠binary_ nextafter_ f16_ strided_ can_ implement - Pre-launch implementability check for
binary_nextafter_f16_strided. - baracuda_
kernels_ ⚠binary_ nextafter_ f16_ strided_ run - Binary
nextafter, f16, strided. - baracuda_
kernels_ ⚠binary_ nextafter_ f32_ can_ implement - Binary
nextafter, f32, can-implement. - baracuda_
kernels_ ⚠binary_ nextafter_ f32_ run - Binary
nextafter, f32, contig. - baracuda_
kernels_ ⚠binary_ nextafter_ f32_ strided_ can_ implement - Pre-launch implementability check for
binary_nextafter_f32_strided. - baracuda_
kernels_ ⚠binary_ nextafter_ f32_ strided_ run - Binary
nextafter, f32, strided. - baracuda_
kernels_ ⚠binary_ nextafter_ f64_ can_ implement - Binary
nextafter, f64, can-implement. - baracuda_
kernels_ ⚠binary_ nextafter_ f64_ run - Binary
nextafter, f64, contig. - baracuda_
kernels_ ⚠binary_ nextafter_ f64_ strided_ can_ implement - Pre-launch implementability check for
binary_nextafter_f64_strided. - baracuda_
kernels_ ⚠binary_ nextafter_ f64_ strided_ run - Binary
nextafter, f64, strided. - baracuda_
kernels_ ⚠binary_ pow_ backward_ bf16_ can_ implement - Pre-launch implementability check for
binary_pow_backward_bf16. - baracuda_
kernels_ ⚠binary_ pow_ backward_ bf16_ run - Pow backward, bf16.
- baracuda_
kernels_ ⚠binary_ pow_ backward_ f16_ can_ implement - Pre-launch implementability check for
binary_pow_backward_f16. - baracuda_
kernels_ ⚠binary_ pow_ backward_ f16_ run - Pow backward, f16.
- baracuda_
kernels_ ⚠binary_ pow_ backward_ f32_ can_ implement - Pre-launch implementability check for
binary_pow_backward_f32. - baracuda_
kernels_ ⚠binary_ pow_ backward_ f32_ run - Pow backward, f32.
da = dy * b * a^(b-1), db = dy * a^b * ln(a). The caller is responsible for guarding against undefined regions (a < 0 with non-integer b, or a == 0 with b < 1). - baracuda_
kernels_ ⚠binary_ pow_ backward_ f64_ can_ implement - Pre-launch implementability check for
binary_pow_backward_f64. - baracuda_
kernels_ ⚠binary_ pow_ backward_ f64_ run - Pow backward, f64.
- baracuda_
kernels_ ⚠binary_ pow_ bf16_ can_ implement - Binary
pow, bf16, can-implement. - baracuda_
kernels_ ⚠binary_ pow_ bf16_ run - Binary
pow, bf16, contig. - baracuda_
kernels_ ⚠binary_ pow_ bf16_ strided_ can_ implement - Pre-launch implementability check for
binary_pow_bf16_strided. - baracuda_
kernels_ ⚠binary_ pow_ bf16_ strided_ run - Binary
pow, bf16, strided. - baracuda_
kernels_ ⚠binary_ pow_ f16_ can_ implement - Binary
pow, f16, can-implement. - baracuda_
kernels_ ⚠binary_ pow_ f16_ run - Binary
pow, f16, contig. - baracuda_
kernels_ ⚠binary_ pow_ f16_ strided_ can_ implement - Pre-launch implementability check for
binary_pow_f16_strided. - baracuda_
kernels_ ⚠binary_ pow_ f16_ strided_ run - Binary
pow, f16, strided. - baracuda_
kernels_ ⚠binary_ pow_ f32_ can_ implement - Binary
pow, f32, can-implement. - baracuda_
kernels_ ⚠binary_ pow_ f32_ run - Binary
pow, f32, contig. - baracuda_
kernels_ ⚠binary_ pow_ f32_ strided_ can_ implement - Pre-launch implementability check for
binary_pow_f32_strided. - baracuda_
kernels_ ⚠binary_ pow_ f32_ strided_ run - Binary
pow, f32, strided. - baracuda_
kernels_ ⚠binary_ pow_ f64_ can_ implement - Binary
pow, f64, can-implement. - baracuda_
kernels_ ⚠binary_ pow_ f64_ run - Binary
pow, f64, contig. - baracuda_
kernels_ ⚠binary_ pow_ f64_ strided_ can_ implement - Pre-launch implementability check for
binary_pow_f64_strided. - baracuda_
kernels_ ⚠binary_ pow_ f64_ strided_ run - Binary
pow, f64, strided. - baracuda_
kernels_ ⚠binary_ remainder_ bf16_ can_ implement - Binary
remainder, bf16, can-implement. - baracuda_
kernels_ ⚠binary_ remainder_ bf16_ run - Binary
remainder, bf16, contig. - baracuda_
kernels_ ⚠binary_ remainder_ bf16_ strided_ can_ implement - Pre-launch implementability check for
binary_remainder_bf16_strided. - baracuda_
kernels_ ⚠binary_ remainder_ bf16_ strided_ run - Binary
remainder, bf16, strided. - baracuda_
kernels_ ⚠binary_ remainder_ f16_ can_ implement - Binary
remainder, f16, can-implement. - baracuda_
kernels_ ⚠binary_ remainder_ f16_ run - Binary
remainder, f16, contig. - baracuda_
kernels_ ⚠binary_ remainder_ f16_ strided_ can_ implement - Pre-launch implementability check for
binary_remainder_f16_strided. - baracuda_
kernels_ ⚠binary_ remainder_ f16_ strided_ run - Binary
remainder, f16, strided. - baracuda_
kernels_ ⚠binary_ remainder_ f32_ can_ implement - Binary
remainder, f32, can-implement. - baracuda_
kernels_ ⚠binary_ remainder_ f32_ run - Binary
remainder, f32, contig. - baracuda_
kernels_ ⚠binary_ remainder_ f32_ strided_ can_ implement - Pre-launch implementability check for
binary_remainder_f32_strided. - baracuda_
kernels_ ⚠binary_ remainder_ f32_ strided_ run - Binary
remainder, f32, strided. - baracuda_
kernels_ ⚠binary_ remainder_ f64_ can_ implement - Binary
remainder, f64, can-implement. - baracuda_
kernels_ ⚠binary_ remainder_ f64_ run - Binary
remainder, f64, contig. - baracuda_
kernels_ ⚠binary_ remainder_ f64_ strided_ can_ implement - Pre-launch implementability check for
binary_remainder_f64_strided. - baracuda_
kernels_ ⚠binary_ remainder_ f64_ strided_ run - Binary
remainder, f64, strided. - baracuda_
kernels_ ⚠binary_ sub_ backward_ bf16_ can_ implement - Pre-launch implementability check for
binary_sub_backward_bf16. - baracuda_
kernels_ ⚠binary_ sub_ backward_ bf16_ run - Sub backward, bf16.
- baracuda_
kernels_ ⚠binary_ sub_ backward_ f16_ can_ implement - Pre-launch implementability check for
binary_sub_backward_f16. - baracuda_
kernels_ ⚠binary_ sub_ backward_ f16_ run - Sub backward, f16.
- baracuda_
kernels_ ⚠binary_ sub_ backward_ f32_ can_ implement - Pre-launch implementability check for
binary_sub_backward_f32. - baracuda_
kernels_ ⚠binary_ sub_ backward_ f32_ run - Sub backward, f32. Writes
da = dy and db = -dy. - baracuda_
kernels_ ⚠binary_ sub_ backward_ f64_ can_ implement - Pre-launch implementability check for
binary_sub_backward_f64. - baracuda_
kernels_ ⚠binary_ sub_ backward_ f64_ run - Sub backward, f64.
- baracuda_
kernels_ ⚠binary_ sub_ bf16_ can_ implement - Pre-launch implementability check for
binary_sub_bf16. - baracuda_
kernels_ ⚠binary_ sub_ bf16_ run - Binary elementwise
sub, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠binary_ sub_ bf16_ strided_ can_ implement - Pre-launch implementability check for
binary_sub_bf16_strided. - baracuda_
kernels_ ⚠binary_ sub_ bf16_ strided_ run - Binary elementwise
sub, bf16 dtype, strided / broadcast path. - baracuda_
kernels_ ⚠binary_ sub_ f16_ can_ implement - Pre-launch implementability check for
binary_sub_f16. - baracuda_
kernels_ ⚠binary_ sub_ f16_ run - Binary elementwise
sub, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠binary_ sub_ f16_ strided_ can_ implement - Pre-launch implementability check for
binary_sub_f16_strided. - baracuda_
kernels_ ⚠binary_ sub_ f16_ strided_ run - Binary elementwise
sub, f16 dtype, strided / broadcast path. - baracuda_
kernels_ ⚠binary_ sub_ f32_ can_ implement - Pre-launch implementability check for
binary_sub_f32. Validates the problem size without launching a kernel. Returns the standard status code mapping. - baracuda_
kernels_ ⚠binary_ sub_ f32_ run - Binary elementwise
sub, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠binary_ sub_ f32_ strided_ can_ implement - Pre-launch implementability check for
binary_sub_f32_strided. - baracuda_
kernels_ ⚠binary_ sub_ f32_ strided_ run - Binary elementwise
sub, f32 dtype, strided / broadcast path. - baracuda_
kernels_ ⚠binary_ sub_ f64_ can_ implement - Pre-launch implementability check for
binary_sub_f64. - baracuda_
kernels_ ⚠binary_ sub_ f64_ run - Binary elementwise
sub, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠binary_ sub_ f64_ strided_ can_ implement - Pre-launch implementability check for
binary_sub_f64_strided. - baracuda_
kernels_ ⚠binary_ sub_ f64_ strided_ run - Binary elementwise
sub, f64 dtype, strided / broadcast path. - baracuda_
kernels_ ⚠bincount_ i32_ can_ implement baracuda_kernels_bincount_i32_can_implement(baracuda kernels bincount i32 can implement).- baracuda_
kernels_ ⚠bincount_ i32_ run - bincount, i32 input. Out-of-range values (< 0 or >= num_bins) are silently dropped. - baracuda_
kernels_ ⚠bincount_ i64_ can_ implement baracuda_kernels_bincount_i64_can_implement(baracuda kernels bincount i64 can implement).- baracuda_
kernels_ ⚠bincount_ i64_ run - bincount, i64 input. - baracuda_
kernels_ ⚠cast_ bf16_ bf16_ can_ implement - Implementability check for
cast_bf16_bf16. - baracuda_
kernels_ ⚠cast_ bf16_ bf16_ run - Cast
bf16 -> bf16. - baracuda_
kernels_ ⚠cast_ bf16_ bool_ can_ implement - Implementability check for
cast_bf16_bool. - baracuda_
kernels_ ⚠cast_ bf16_ bool_ run - Cast
bf16 -> Bool. - baracuda_
kernels_ ⚠cast_ bf16_ f16_ can_ implement - Implementability check for
cast_bf16_f16. - baracuda_
kernels_ ⚠cast_ bf16_ f16_ run - Cast
bf16 -> f16. - baracuda_
kernels_ ⚠cast_ bf16_ f32_ can_ implement - Implementability check for
cast_bf16_f32. - baracuda_
kernels_ ⚠cast_ bf16_ f32_ run - Cast
bf16 -> f32. - baracuda_
kernels_ ⚠cast_ bf16_ f64_ can_ implement - Implementability check for
cast_bf16_f64. - baracuda_
kernels_ ⚠cast_ bf16_ f64_ run - Cast
bf16 -> f64. - baracuda_
kernels_ ⚠cast_ bf16_ fp8e4m3_ can_ implement baracuda_kernels_cast_bf16_fp8e4m3_can_implement(baracuda kernels cast bf16 fp8e4m3 can implement).- baracuda_
kernels_ ⚠cast_ bf16_ fp8e4m3_ run - Cast
bf16 -> Fp8E4M3. - baracuda_
kernels_ ⚠cast_ bf16_ fp8e5m2_ can_ implement baracuda_kernels_cast_bf16_fp8e5m2_can_implement(baracuda kernels cast bf16 fp8e5m2 can implement).- baracuda_
kernels_ ⚠cast_ bf16_ fp8e5m2_ run - Cast
bf16 -> Fp8E5M2. - baracuda_
kernels_ ⚠cast_ bf16_ i8_ can_ implement - Implementability check for
cast_bf16_i8. - baracuda_
kernels_ ⚠cast_ bf16_ i8_ run - Cast
bf16 -> i8. - baracuda_
kernels_ ⚠cast_ bf16_ i16_ can_ implement baracuda_kernels_cast_bf16_i16_can_implement(baracuda kernels cast bf16 i16 can implement).- baracuda_
kernels_ ⚠cast_ bf16_ i16_ run - Cast
bf16 -> i16. Phase 31. - baracuda_
kernels_ ⚠cast_ bf16_ i32_ can_ implement - Implementability check for
cast_bf16_i32. - baracuda_
kernels_ ⚠cast_ bf16_ i32_ run - Cast
bf16 -> i32. - baracuda_
kernels_ ⚠cast_ bf16_ i64_ can_ implement - Implementability check for
cast_bf16_i64. - baracuda_
kernels_ ⚠cast_ bf16_ i64_ run - Cast
bf16 -> i64. - baracuda_
kernels_ ⚠cast_ bf16_ u8_ can_ implement - Implementability check for
cast_bf16_u8. - baracuda_
kernels_ ⚠cast_ bf16_ u8_ run - Cast
bf16 -> u8. - baracuda_
kernels_ ⚠cast_ bf16_ u32_ can_ implement baracuda_kernels_cast_bf16_u32_can_implement(baracuda kernels cast bf16 u32 can implement).- baracuda_
kernels_ ⚠cast_ bf16_ u32_ run - Cast
bf16 -> u32. Phase 31. - baracuda_
kernels_ ⚠cast_ bool_ bf16_ can_ implement - Implementability check for
cast_bool_bf16. - baracuda_
kernels_ ⚠cast_ bool_ bf16_ run - Cast
Bool -> bf16. - baracuda_
kernels_ ⚠cast_ bool_ f16_ can_ implement - Implementability check for
cast_bool_f16. - baracuda_
kernels_ ⚠cast_ bool_ f16_ run - Cast
Bool -> f16. - baracuda_
kernels_ ⚠cast_ bool_ f32_ can_ implement - Implementability check for
cast_bool_f32. - baracuda_
kernels_ ⚠cast_ bool_ f32_ run - Cast
Bool -> f32. - baracuda_
kernels_ ⚠cast_ bool_ i32_ can_ implement - Implementability check for
cast_bool_i32. - baracuda_
kernels_ ⚠cast_ bool_ i32_ run - Cast
Bool -> i32. x != 0 → 1. - baracuda_
kernels_ ⚠cast_ bool_ i64_ can_ implement - Implementability check for
cast_bool_i64. - baracuda_
kernels_ ⚠cast_ bool_ i64_ run - Cast
Bool -> i64. - baracuda_
kernels_ ⚠cast_ f16_ bf16_ can_ implement - Implementability check for
cast_f16_bf16. - baracuda_
kernels_ ⚠cast_ f16_ bf16_ run - Cast
f16 -> bf16. - baracuda_
kernels_ ⚠cast_ f16_ bool_ can_ implement - Implementability check for
cast_f16_bool. - baracuda_
kernels_ ⚠cast_ f16_ bool_ run - Cast
f16 -> Bool. - baracuda_
kernels_ ⚠cast_ f16_ f16_ can_ implement - Implementability check for
cast_f16_f16. - baracuda_
kernels_ ⚠cast_ f16_ f16_ run - Cast
f16 -> f16. - baracuda_
kernels_ ⚠cast_ f16_ f32_ can_ implement - Implementability check for
cast_f16_f32. - baracuda_
kernels_ ⚠cast_ f16_ f32_ run - Cast
f16 -> f32. - baracuda_
kernels_ ⚠cast_ f16_ f64_ can_ implement - Implementability check for
cast_f16_f64. - baracuda_
kernels_ ⚠cast_ f16_ f64_ run - Cast
f16 -> f64. - baracuda_
kernels_ ⚠cast_ f16_ fp8e4m3_ can_ implement baracuda_kernels_cast_f16_fp8e4m3_can_implement(baracuda kernels cast f16 fp8e4m3 can implement).- baracuda_
kernels_ ⚠cast_ f16_ fp8e4m3_ run - Cast
f16 -> Fp8E4M3. - baracuda_
kernels_ ⚠cast_ f16_ fp8e5m2_ can_ implement baracuda_kernels_cast_f16_fp8e5m2_can_implement(baracuda kernels cast f16 fp8e5m2 can implement).- baracuda_
kernels_ ⚠cast_ f16_ fp8e5m2_ run - Cast
f16 -> Fp8E5M2. - baracuda_
kernels_ ⚠cast_ f16_ i8_ can_ implement - Implementability check for
cast_f16_i8. - baracuda_
kernels_ ⚠cast_ f16_ i8_ run - Cast
f16 -> i8. - baracuda_
kernels_ ⚠cast_ f16_ i16_ can_ implement baracuda_kernels_cast_f16_i16_can_implement(baracuda kernels cast f16 i16 can implement).- baracuda_
kernels_ ⚠cast_ f16_ i16_ run - Cast
f16 -> i16. Phase 31. - baracuda_
kernels_ ⚠cast_ f16_ i32_ can_ implement - Implementability check for
cast_f16_i32. - baracuda_
kernels_ ⚠cast_ f16_ i32_ run - Cast
f16 -> i32. - baracuda_
kernels_ ⚠cast_ f16_ i64_ can_ implement - Implementability check for
cast_f16_i64. - baracuda_
kernels_ ⚠cast_ f16_ i64_ run - Cast
f16 -> i64. - baracuda_
kernels_ ⚠cast_ f16_ u8_ can_ implement - Implementability check for
cast_f16_u8. - baracuda_
kernels_ ⚠cast_ f16_ u8_ run - Cast
f16 -> u8. - baracuda_
kernels_ ⚠cast_ f16_ u32_ can_ implement baracuda_kernels_cast_f16_u32_can_implement(baracuda kernels cast f16 u32 can implement).- baracuda_
kernels_ ⚠cast_ f16_ u32_ run - Cast
f16 -> u32. Phase 31. - baracuda_
kernels_ ⚠cast_ f32_ bf16_ can_ implement - Implementability check for
cast_f32_bf16. - baracuda_
kernels_ ⚠cast_ f32_ bf16_ run - Cast
f32 -> bf16. - baracuda_
kernels_ ⚠cast_ f32_ bool_ can_ implement - Implementability check for
cast_f32_bool. - baracuda_
kernels_ ⚠cast_ f32_ bool_ run - Cast
f32 -> Bool. - baracuda_
kernels_ ⚠cast_ f32_ f16_ can_ implement - Implementability check for
cast_f32_f16. - baracuda_
kernels_ ⚠cast_ f32_ f16_ run - Cast
f32 -> f16. - baracuda_
kernels_ ⚠cast_ f32_ f32_ can_ implement - Implementability check for
cast_f32_f32. - baracuda_
kernels_ ⚠cast_ f32_ f32_ run - Cast
f32 -> f32. See LICENSE-thirdparty.md. - baracuda_
kernels_ ⚠cast_ f32_ f64_ can_ implement - Implementability check for
cast_f32_f64. - baracuda_
kernels_ ⚠cast_ f32_ f64_ run - Cast
f32 -> f64. - baracuda_
kernels_ ⚠cast_ f32_ fp8e4m3_ can_ implement baracuda_kernels_cast_f32_fp8e4m3_can_implement(baracuda kernels cast f32 fp8e4m3 can implement).- baracuda_
kernels_ ⚠cast_ f32_ fp8e4m3_ run - Cast
f32 -> Fp8E4M3 (saturates to ±448). - baracuda_
kernels_ ⚠cast_ f32_ fp8e5m2_ can_ implement baracuda_kernels_cast_f32_fp8e5m2_can_implement(baracuda kernels cast f32 fp8e5m2 can implement).- baracuda_
kernels_ ⚠cast_ f32_ fp8e5m2_ run - Cast
f32 -> Fp8E5M2 (saturates to ±57344). - baracuda_
kernels_ ⚠cast_ f32_ i8_ can_ implement - Implementability check for
cast_f32_i8. - baracuda_
kernels_ ⚠cast_ f32_ i8_ run - Cast
f32 -> i8. - baracuda_
kernels_ ⚠cast_ f32_ i16_ can_ implement baracuda_kernels_cast_f32_i16_can_implement(baracuda kernels cast f32 i16 can implement).- baracuda_
kernels_ ⚠cast_ f32_ i16_ run - Cast
f32 -> i16. Phase 31. - baracuda_
kernels_ ⚠cast_ f32_ i32_ can_ implement - Implementability check for
cast_f32_i32. - baracuda_
kernels_ ⚠cast_ f32_ i32_ run - Cast
f32 -> i32. - baracuda_
kernels_ ⚠cast_ f32_ i64_ can_ implement - Implementability check for
cast_f32_i64. - baracuda_
kernels_ ⚠cast_ f32_ i64_ run - Cast
f32 -> i64. - baracuda_
kernels_ ⚠cast_ f32_ s4_ can_ implement baracuda_kernels_cast_f32_s4_can_implement(baracuda kernels cast f32 s4 can implement).- baracuda_
kernels_ ⚠cast_ f32_ s4_ run - Cast
f32 -> S4 (round-to-nearest then saturate). - baracuda_
kernels_ ⚠cast_ f32_ u4_ can_ implement baracuda_kernels_cast_f32_u4_can_implement(baracuda kernels cast f32 u4 can implement).- baracuda_
kernels_ ⚠cast_ f32_ u4_ run - Cast
f32 -> U4 (round-to-nearest then saturate). - baracuda_
kernels_ ⚠cast_ f32_ u8_ can_ implement - Implementability check for
cast_f32_u8. - baracuda_
kernels_ ⚠cast_ f32_ u8_ run - Cast
f32 -> u8. - baracuda_
kernels_ ⚠cast_ f32_ u32_ can_ implement baracuda_kernels_cast_f32_u32_can_implement(baracuda kernels cast f32 u32 can implement).- baracuda_
kernels_ ⚠cast_ f32_ u32_ run - Cast
f32 -> u32. Negative inputs are undefined per C++ rules (typical NVCC behaviour: saturates toward 0). Phase 31. - baracuda_
kernels_ ⚠cast_ f64_ bf16_ can_ implement - Implementability check for
cast_f64_bf16. - baracuda_
kernels_ ⚠cast_ f64_ bf16_ run - Cast
f64 -> bf16. - baracuda_
kernels_ ⚠cast_ f64_ f16_ can_ implement - Implementability check for
cast_f64_f16. - baracuda_
kernels_ ⚠cast_ f64_ f16_ run - Cast
f64 -> f16. - baracuda_
kernels_ ⚠cast_ f64_ f32_ can_ implement - Implementability check for
cast_f64_f32. - baracuda_
kernels_ ⚠cast_ f64_ f32_ run - Cast
f64 -> f32. - baracuda_
kernels_ ⚠cast_ f64_ f64_ can_ implement - Implementability check for
cast_f64_f64. - baracuda_
kernels_ ⚠cast_ f64_ f64_ run - Cast
f64 -> f64. - baracuda_
kernels_ ⚠cast_ f64_ i8_ can_ implement - Implementability check for
cast_f64_i8. - baracuda_
kernels_ ⚠cast_ f64_ i8_ run - Cast
f64 -> i8. - baracuda_
kernels_ ⚠cast_ f64_ i16_ can_ implement baracuda_kernels_cast_f64_i16_can_implement(baracuda kernels cast f64 i16 can implement).- baracuda_
kernels_ ⚠cast_ f64_ i16_ run - Cast
f64 -> i16. Phase 31. - baracuda_
kernels_ ⚠cast_ f64_ i32_ can_ implement - Implementability check for
cast_f64_i32. - baracuda_
kernels_ ⚠cast_ f64_ i32_ run - Cast
f64 -> i32. - baracuda_
kernels_ ⚠cast_ f64_ i64_ can_ implement - Implementability check for
cast_f64_i64. - baracuda_
kernels_ ⚠cast_ f64_ i64_ run - Cast
f64 -> i64. - baracuda_
kernels_ ⚠cast_ f64_ u8_ can_ implement - Implementability check for
cast_f64_u8. - baracuda_
kernels_ ⚠cast_ f64_ u8_ run - Cast
f64 -> u8. - baracuda_
kernels_ ⚠cast_ f64_ u32_ can_ implement baracuda_kernels_cast_f64_u32_can_implement(baracuda kernels cast f64 u32 can implement).- baracuda_
kernels_ ⚠cast_ f64_ u32_ run - Cast
f64 -> u32. Phase 31. - baracuda_
kernels_ ⚠cast_ fp8e4m3_ bf16_ can_ implement baracuda_kernels_cast_fp8e4m3_bf16_can_implement(baracuda kernels cast fp8e4m3 bf16 can implement).- baracuda_
kernels_ ⚠cast_ fp8e4m3_ bf16_ run - Cast
Fp8E4M3 -> bf16. - baracuda_
kernels_ ⚠cast_ fp8e4m3_ f16_ can_ implement baracuda_kernels_cast_fp8e4m3_f16_can_implement(baracuda kernels cast fp8e4m3 f16 can implement).- baracuda_
kernels_ ⚠cast_ fp8e4m3_ f16_ run - Cast
Fp8E4M3 -> f16. - baracuda_
kernels_ ⚠cast_ fp8e4m3_ f32_ can_ implement baracuda_kernels_cast_fp8e4m3_f32_can_implement(baracuda kernels cast fp8e4m3 f32 can implement).- baracuda_
kernels_ ⚠cast_ fp8e4m3_ f32_ run - Cast
Fp8E4M3 -> f32. - baracuda_
kernels_ ⚠cast_ fp8e5m2_ bf16_ can_ implement baracuda_kernels_cast_fp8e5m2_bf16_can_implement(baracuda kernels cast fp8e5m2 bf16 can implement).- baracuda_
kernels_ ⚠cast_ fp8e5m2_ bf16_ run - Cast
Fp8E5M2 -> bf16. - baracuda_
kernels_ ⚠cast_ fp8e5m2_ f16_ can_ implement baracuda_kernels_cast_fp8e5m2_f16_can_implement(baracuda kernels cast fp8e5m2 f16 can implement).- baracuda_
kernels_ ⚠cast_ fp8e5m2_ f16_ run - Cast
Fp8E5M2 -> f16. - baracuda_
kernels_ ⚠cast_ fp8e5m2_ f32_ can_ implement baracuda_kernels_cast_fp8e5m2_f32_can_implement(baracuda kernels cast fp8e5m2 f32 can implement).- baracuda_
kernels_ ⚠cast_ fp8e5m2_ f32_ run - Cast
Fp8E5M2 -> f32. - baracuda_
kernels_ ⚠cast_ i8_ bf16_ can_ implement - Implementability check for
cast_i8_bf16. - baracuda_
kernels_ ⚠cast_ i8_ bf16_ run - Cast
i8 -> bf16. - baracuda_
kernels_ ⚠cast_ i8_ f16_ can_ implement - Implementability check for
cast_i8_f16. - baracuda_
kernels_ ⚠cast_ i8_ f16_ run - Cast
i8 -> f16. - baracuda_
kernels_ ⚠cast_ i8_ f32_ can_ implement - Implementability check for
cast_i8_f32. - baracuda_
kernels_ ⚠cast_ i8_ f32_ run - Cast
i8 -> f32. - baracuda_
kernels_ ⚠cast_ i8_ f64_ can_ implement - Implementability check for
cast_i8_f64. - baracuda_
kernels_ ⚠cast_ i8_ f64_ run - Cast
i8 -> f64. - baracuda_
kernels_ ⚠cast_ i8_ i8_ can_ implement - Implementability check for
cast_i8_i8. - baracuda_
kernels_ ⚠cast_ i8_ i8_ run - Cast
i8 -> i8. - baracuda_
kernels_ ⚠cast_ i8_ i16_ can_ implement baracuda_kernels_cast_i8_i16_can_implement(baracuda kernels cast i8 i16 can implement).- baracuda_
kernels_ ⚠cast_ i8_ i16_ run - Cast
i8 -> i16. Sign-extends. Phase 31. - baracuda_
kernels_ ⚠cast_ i8_ i32_ can_ implement - Implementability check for
cast_i8_i32. - baracuda_
kernels_ ⚠cast_ i8_ i32_ run - Cast
i8 -> i32. - baracuda_
kernels_ ⚠cast_ i8_ i64_ can_ implement - Implementability check for
cast_i8_i64. - baracuda_
kernels_ ⚠cast_ i8_ i64_ run - Cast
i8 -> i64. - baracuda_
kernels_ ⚠cast_ i8_ u8_ can_ implement - Implementability check for
cast_i8_u8. - baracuda_
kernels_ ⚠cast_ i8_ u8_ run - Cast
i8 -> u8. - baracuda_
kernels_ ⚠cast_ i8_ u32_ can_ implement baracuda_kernels_cast_i8_u32_can_implement(baracuda kernels cast i8 u32 can implement).- baracuda_
kernels_ ⚠cast_ i8_ u32_ run - Cast
i8 -> u32. Sign-extends then reinterprets. Phase 31. - baracuda_
kernels_ ⚠cast_ i16_ bf16_ can_ implement baracuda_kernels_cast_i16_bf16_can_implement(baracuda kernels cast i16 bf16 can implement).- baracuda_
kernels_ ⚠cast_ i16_ bf16_ run - Cast
i16 -> bf16. Phase 31. - baracuda_
kernels_ ⚠cast_ i16_ f16_ can_ implement baracuda_kernels_cast_i16_f16_can_implement(baracuda kernels cast i16 f16 can implement).- baracuda_
kernels_ ⚠cast_ i16_ f16_ run - Cast
i16 -> f16. Phase 31. - baracuda_
kernels_ ⚠cast_ i16_ f32_ can_ implement baracuda_kernels_cast_i16_f32_can_implement(baracuda kernels cast i16 f32 can implement).- baracuda_
kernels_ ⚠cast_ i16_ f32_ run - Cast
i16 -> f32. Phase 31. - baracuda_
kernels_ ⚠cast_ i16_ f64_ can_ implement baracuda_kernels_cast_i16_f64_can_implement(baracuda kernels cast i16 f64 can implement).- baracuda_
kernels_ ⚠cast_ i16_ f64_ run - Cast
i16 -> f64. Phase 31. - baracuda_
kernels_ ⚠cast_ i16_ i8_ can_ implement baracuda_kernels_cast_i16_i8_can_implement(baracuda kernels cast i16 i8 can implement).- baracuda_
kernels_ ⚠cast_ i16_ i8_ run - Cast
i16 -> i8. Truncates to low byte. Phase 31. - baracuda_
kernels_ ⚠cast_ i16_ i16_ can_ implement baracuda_kernels_cast_i16_i16_can_implement(baracuda kernels cast i16 i16 can implement).- baracuda_
kernels_ ⚠cast_ i16_ i16_ run - Cast
i16 -> i16 (identity). Phase 31. - baracuda_
kernels_ ⚠cast_ i16_ i32_ can_ implement baracuda_kernels_cast_i16_i32_can_implement(baracuda kernels cast i16 i32 can implement).- baracuda_
kernels_ ⚠cast_ i16_ i32_ run - Cast
i16 -> i32. Sign-extends. Phase 31. - baracuda_
kernels_ ⚠cast_ i16_ i64_ can_ implement baracuda_kernels_cast_i16_i64_can_implement(baracuda kernels cast i16 i64 can implement).- baracuda_
kernels_ ⚠cast_ i16_ i64_ run - Cast
i16 -> i64. Sign-extends. Phase 31. - baracuda_
kernels_ ⚠cast_ i16_ u8_ can_ implement baracuda_kernels_cast_i16_u8_can_implement(baracuda kernels cast i16 u8 can implement).- baracuda_
kernels_ ⚠cast_ i16_ u8_ run - Cast
i16 -> u8. Truncates to low byte then reinterprets. Phase 31. - baracuda_
kernels_ ⚠cast_ i16_ u32_ can_ implement baracuda_kernels_cast_i16_u32_can_implement(baracuda kernels cast i16 u32 can implement).- baracuda_
kernels_ ⚠cast_ i16_ u32_ run - Cast
i16 -> u32. Sign-extends to i32 then reinterprets. Phase 31. - baracuda_
kernels_ ⚠cast_ i32_ bf16_ can_ implement - Implementability check for
cast_i32_bf16. - baracuda_
kernels_ ⚠cast_ i32_ bf16_ run - Cast
i32 -> bf16. - baracuda_
kernels_ ⚠cast_ i32_ bool_ can_ implement - Implementability check for
cast_i32_bool. - baracuda_
kernels_ ⚠cast_ i32_ bool_ run - Cast
i32 -> Bool. x != 0 → 1. - baracuda_
kernels_ ⚠cast_ i32_ f16_ can_ implement - Implementability check for
cast_i32_f16. - baracuda_
kernels_ ⚠cast_ i32_ f16_ run - Cast
i32 -> f16. - baracuda_
kernels_ ⚠cast_ i32_ f32_ can_ implement - Implementability check for
cast_i32_f32. - baracuda_
kernels_ ⚠cast_ i32_ f32_ run - Cast
i32 -> f32. - baracuda_
kernels_ ⚠cast_ i32_ f64_ can_ implement - Implementability check for
cast_i32_f64. - baracuda_
kernels_ ⚠cast_ i32_ f64_ run - Cast
i32 -> f64. - baracuda_
kernels_ ⚠cast_ i32_ i8_ can_ implement - Implementability check for
cast_i32_i8. - baracuda_
kernels_ ⚠cast_ i32_ i8_ run - Cast
i32 -> i8. - baracuda_
kernels_ ⚠cast_ i32_ i16_ can_ implement baracuda_kernels_cast_i32_i16_can_implement(baracuda kernels cast i32 i16 can implement).- baracuda_
kernels_ ⚠cast_ i32_ i16_ run - Cast
i32 -> i16. Truncates to low 16 bits. Phase 31. - baracuda_
kernels_ ⚠cast_ i32_ i32_ can_ implement - Implementability check for
cast_i32_i32. - baracuda_
kernels_ ⚠cast_ i32_ i32_ run - Cast
i32 -> i32. - baracuda_
kernels_ ⚠cast_ i32_ i64_ can_ implement - Implementability check for
cast_i32_i64. - baracuda_
kernels_ ⚠cast_ i32_ i64_ run - Cast
i32 -> i64. - baracuda_
kernels_ ⚠cast_ i32_ s4_ can_ implement baracuda_kernels_cast_i32_s4_can_implement(baracuda kernels cast i32 s4 can implement).- baracuda_
kernels_ ⚠cast_ i32_ s4_ run - Cast
i32 -> S4 (pack: saturate to [-8, +7] then nibble-mask). - baracuda_
kernels_ ⚠cast_ i32_ u4_ can_ implement baracuda_kernels_cast_i32_u4_can_implement(baracuda kernels cast i32 u4 can implement).- baracuda_
kernels_ ⚠cast_ i32_ u4_ run - Cast
i32 -> U4 (pack: saturate to [0, 15] then nibble-mask). - baracuda_
kernels_ ⚠cast_ i32_ u8_ can_ implement - Implementability check for
cast_i32_u8. - baracuda_
kernels_ ⚠cast_ i32_ u8_ run - Cast
i32 -> u8. - baracuda_
kernels_ ⚠cast_ i32_ u32_ can_ implement baracuda_kernels_cast_i32_u32_can_implement(baracuda kernels cast i32 u32 can implement).- baracuda_
kernels_ ⚠cast_ i32_ u32_ run - Cast
i32 -> u32. Bitwise reinterpret: value-preserving for x >= 0; negative inputs wrap modulo 2^32 (two’s-complement). Phase 31. - baracuda_
kernels_ ⚠cast_ i64_ bf16_ can_ implement - Implementability check for
cast_i64_bf16. - baracuda_
kernels_ ⚠cast_ i64_ bf16_ run - Cast
i64 -> bf16. - baracuda_
kernels_ ⚠cast_ i64_ bool_ can_ implement - Implementability check for
cast_i64_bool. - baracuda_
kernels_ ⚠cast_ i64_ bool_ run - Cast
i64 -> Bool. - baracuda_
kernels_ ⚠cast_ i64_ f16_ can_ implement - Implementability check for
cast_i64_f16. - baracuda_
kernels_ ⚠cast_ i64_ f16_ run - Cast
i64 -> f16. - baracuda_
kernels_ ⚠cast_ i64_ f32_ can_ implement - Implementability check for
cast_i64_f32. - baracuda_
kernels_ ⚠cast_ i64_ f32_ run - Cast
i64 -> f32. - baracuda_
kernels_ ⚠cast_ i64_ f64_ can_ implement - Implementability check for
cast_i64_f64. - baracuda_
kernels_ ⚠cast_ i64_ f64_ run - Cast
i64 -> f64. - baracuda_
kernels_ ⚠cast_ i64_ i8_ can_ implement - Implementability check for
cast_i64_i8. - baracuda_
kernels_ ⚠cast_ i64_ i8_ run - Cast
i64 -> i8. - baracuda_
kernels_ ⚠cast_ i64_ i16_ can_ implement baracuda_kernels_cast_i64_i16_can_implement(baracuda kernels cast i64 i16 can implement).- baracuda_
kernels_ ⚠cast_ i64_ i16_ run - Cast
i64 -> i16. Truncates to low 16 bits. Phase 31. - baracuda_
kernels_ ⚠cast_ i64_ i32_ can_ implement - Implementability check for
cast_i64_i32. - baracuda_
kernels_ ⚠cast_ i64_ i32_ run - Cast
i64 -> i32. - baracuda_
kernels_ ⚠cast_ i64_ i64_ can_ implement - Implementability check for
cast_i64_i64. - baracuda_
kernels_ ⚠cast_ i64_ i64_ run - Cast
i64 -> i64. - baracuda_
kernels_ ⚠cast_ i64_ s4_ can_ implement baracuda_kernels_cast_i64_s4_can_implement(baracuda kernels cast i64 s4 can implement).- baracuda_
kernels_ ⚠cast_ i64_ s4_ run - Cast
i64 -> S4. - baracuda_
kernels_ ⚠cast_ i64_ u4_ can_ implement baracuda_kernels_cast_i64_u4_can_implement(baracuda kernels cast i64 u4 can implement).- baracuda_
kernels_ ⚠cast_ i64_ u4_ run - Cast
i64 -> U4. - baracuda_
kernels_ ⚠cast_ i64_ u8_ can_ implement - Implementability check for
cast_i64_u8. - baracuda_
kernels_ ⚠cast_ i64_ u8_ run - Cast
i64 -> u8. - baracuda_
kernels_ ⚠cast_ i64_ u32_ can_ implement baracuda_kernels_cast_i64_u32_can_implement(baracuda kernels cast i64 u32 can implement).- baracuda_
kernels_ ⚠cast_ i64_ u32_ run - Cast
i64 -> u32. Truncates to the low 32 bits. Phase 31. - baracuda_
kernels_ ⚠cast_ s4_ f32_ can_ implement baracuda_kernels_cast_s4_f32_can_implement(baracuda kernels cast s4 f32 can implement).- baracuda_
kernels_ ⚠cast_ s4_ f32_ run - Cast
S4 -> f32. - baracuda_
kernels_ ⚠cast_ s4_ i32_ can_ implement baracuda_kernels_cast_s4_i32_can_implement(baracuda kernels cast s4 i32 can implement).- baracuda_
kernels_ ⚠cast_ s4_ i32_ run - Cast
S4 -> i32 (unpack: sign-extend nibble to int32). - baracuda_
kernels_ ⚠cast_ s4_ i64_ can_ implement baracuda_kernels_cast_s4_i64_can_implement(baracuda kernels cast s4 i64 can implement).- baracuda_
kernels_ ⚠cast_ s4_ i64_ run - Cast
S4 -> i64. - baracuda_
kernels_ ⚠cast_ u4_ f32_ can_ implement baracuda_kernels_cast_u4_f32_can_implement(baracuda kernels cast u4 f32 can implement).- baracuda_
kernels_ ⚠cast_ u4_ f32_ run - Cast
U4 -> f32. - baracuda_
kernels_ ⚠cast_ u4_ i32_ can_ implement baracuda_kernels_cast_u4_i32_can_implement(baracuda kernels cast u4 i32 can implement).- baracuda_
kernels_ ⚠cast_ u4_ i32_ run - Cast
U4 -> i32 (unpack: zero-extend nibble to int32). - baracuda_
kernels_ ⚠cast_ u4_ i64_ can_ implement baracuda_kernels_cast_u4_i64_can_implement(baracuda kernels cast u4 i64 can implement).- baracuda_
kernels_ ⚠cast_ u4_ i64_ run - Cast
U4 -> i64. - baracuda_
kernels_ ⚠cast_ u8_ bf16_ can_ implement - Implementability check for
cast_u8_bf16. - baracuda_
kernels_ ⚠cast_ u8_ bf16_ run - Cast
u8 -> bf16. - baracuda_
kernels_ ⚠cast_ u8_ f16_ can_ implement - Implementability check for
cast_u8_f16. - baracuda_
kernels_ ⚠cast_ u8_ f16_ run - Cast
u8 -> f16. - baracuda_
kernels_ ⚠cast_ u8_ f32_ can_ implement - Implementability check for
cast_u8_f32. - baracuda_
kernels_ ⚠cast_ u8_ f32_ run - Cast
u8 -> f32. - baracuda_
kernels_ ⚠cast_ u8_ f64_ can_ implement - Implementability check for
cast_u8_f64. - baracuda_
kernels_ ⚠cast_ u8_ f64_ run - Cast
u8 -> f64. - baracuda_
kernels_ ⚠cast_ u8_ i8_ can_ implement - Implementability check for
cast_u8_i8. - baracuda_
kernels_ ⚠cast_ u8_ i8_ run - Cast
u8 -> i8. - baracuda_
kernels_ ⚠cast_ u8_ i16_ can_ implement baracuda_kernels_cast_u8_i16_can_implement(baracuda kernels cast u8 i16 can implement).- baracuda_
kernels_ ⚠cast_ u8_ i16_ run - Cast
u8 -> i16. Zero-extends. Phase 31. - baracuda_
kernels_ ⚠cast_ u8_ i32_ can_ implement - Implementability check for
cast_u8_i32. - baracuda_
kernels_ ⚠cast_ u8_ i32_ run - Cast
u8 -> i32. - baracuda_
kernels_ ⚠cast_ u8_ i64_ can_ implement - Implementability check for
cast_u8_i64. - baracuda_
kernels_ ⚠cast_ u8_ i64_ run - Cast
u8 -> i64. - baracuda_
kernels_ ⚠cast_ u8_ u8_ can_ implement - Implementability check for
cast_u8_u8. - baracuda_
kernels_ ⚠cast_ u8_ u8_ run - Cast
u8 -> u8. - baracuda_
kernels_ ⚠cast_ u8_ u32_ can_ implement baracuda_kernels_cast_u8_u32_can_implement(baracuda kernels cast u8 u32 can implement).- baracuda_
kernels_ ⚠cast_ u8_ u32_ run - Cast
u8 -> u32. Zero-extends. Phase 31. - baracuda_
kernels_ ⚠cast_ u32_ bf16_ can_ implement baracuda_kernels_cast_u32_bf16_can_implement(baracuda kernels cast u32 bf16 can implement).- baracuda_
kernels_ ⚠cast_ u32_ bf16_ run - Cast
u32 -> bf16. Phase 31. - baracuda_
kernels_ ⚠cast_ u32_ f16_ can_ implement baracuda_kernels_cast_u32_f16_can_implement(baracuda kernels cast u32 f16 can implement).- baracuda_
kernels_ ⚠cast_ u32_ f16_ run - Cast
u32 -> f16. Phase 31. - baracuda_
kernels_ ⚠cast_ u32_ f32_ can_ implement baracuda_kernels_cast_u32_f32_can_implement(baracuda kernels cast u32 f32 can implement).- baracuda_
kernels_ ⚠cast_ u32_ f32_ run - Cast
u32 -> f32. Phase 31. - baracuda_
kernels_ ⚠cast_ u32_ f64_ can_ implement baracuda_kernels_cast_u32_f64_can_implement(baracuda kernels cast u32 f64 can implement).- baracuda_
kernels_ ⚠cast_ u32_ f64_ run - Cast
u32 -> f64. Phase 31. - baracuda_
kernels_ ⚠cast_ u32_ i8_ can_ implement baracuda_kernels_cast_u32_i8_can_implement(baracuda kernels cast u32 i8 can implement).- baracuda_
kernels_ ⚠cast_ u32_ i8_ run - Cast
u32 -> i8. Truncates to low byte then reinterprets. Phase 31. - baracuda_
kernels_ ⚠cast_ u32_ i16_ can_ implement baracuda_kernels_cast_u32_i16_can_implement(baracuda kernels cast u32 i16 can implement).- baracuda_
kernels_ ⚠cast_ u32_ i16_ run - Cast
u32 -> i16. Truncates to low 16 bits then reinterprets. Phase 31. - baracuda_
kernels_ ⚠cast_ u32_ i32_ can_ implement baracuda_kernels_cast_u32_i32_can_implement(baracuda kernels cast u32 i32 can implement).- baracuda_
kernels_ ⚠cast_ u32_ i32_ run - Cast
u32 -> i32. Bitwise reinterpret. Phase 31. - baracuda_
kernels_ ⚠cast_ u32_ i64_ can_ implement baracuda_kernels_cast_u32_i64_can_implement(baracuda kernels cast u32 i64 can implement).- baracuda_
kernels_ ⚠cast_ u32_ i64_ run - Cast
u32 -> i64. Zero-extends. Phase 31. - baracuda_
kernels_ ⚠cast_ u32_ u8_ can_ implement baracuda_kernels_cast_u32_u8_can_implement(baracuda kernels cast u32 u8 can implement).- baracuda_
kernels_ ⚠cast_ u32_ u8_ run - Cast
u32 -> u8. Truncates to low byte. Phase 31. - baracuda_
kernels_ ⚠cast_ u32_ u32_ can_ implement baracuda_kernels_cast_u32_u32_can_implement(baracuda kernels cast u32 u32 can implement).- baracuda_
kernels_ ⚠cast_ u32_ u32_ run - Cast
u32 -> u32 (identity). Phase 31. - baracuda_
kernels_ ⚠cholesky_ batched_ f32_ run - Cholesky factorization (batched). Each
a_array[b] is overwritten with the requested triangular factor. cuSOLVER’s potrfBatched is workspace-free internally but needs a device-resident array of device pointers, which the caller must provide. - baracuda_
kernels_ ⚠cholesky_ batched_ f64_ run - Cholesky factorization (batched). Each
a_array[b] is overwritten with the requested triangular factor. cuSOLVER’s potrfBatched is workspace-free internally but needs a device-resident array of device pointers, which the caller must provide. - baracuda_
kernels_ ⚠cholesky_ f32_ run - Cholesky factorization (non-batched). Overwrites
a_inout in place with the requested triangular factor. uplo is 0 (lower, CUBLAS_FILL_MODE_LOWER) or 1 (upper, CUBLAS_FILL_MODE_UPPER). - baracuda_
kernels_ ⚠cholesky_ f32_ workspace_ size - Cholesky factorization workspace size in bytes for the
non-batched
potrf path. Returns 0 on success and writes the byte count to *out_bytes; returns a non-zero status on cuSOLVER failure (handle allocation / bufferSize query). Batched potrfBatched is workspace-free and has no equivalent query. - baracuda_
kernels_ ⚠cholesky_ f64_ run - Cholesky factorization (non-batched). Overwrites
a_inout in place with the requested triangular factor. uplo is 0 (lower, CUBLAS_FILL_MODE_LOWER) or 1 (upper, CUBLAS_FILL_MODE_UPPER). - baracuda_
kernels_ ⚠cholesky_ f64_ workspace_ size - Cholesky factorization workspace size in bytes for the
non-batched
potrf path. Returns 0 on success and writes the byte count to *out_bytes; returns a non-zero status on cuSOLVER failure (handle allocation / bufferSize query). Batched potrfBatched is workspace-free and has no equivalent query. - baracuda_
kernels_ ⚠col2im_ 1d_ bf16_ can_ implement baracuda_kernels_col2im_1d_bf16_can_implement(baracuda kernels col2im 1d bf16 can implement).- baracuda_
kernels_ ⚠col2im_ 1d_ bf16_ run - col2im 1-D, bf16. Caller must zero
output first. - baracuda_
kernels_ ⚠col2im_ 1d_ f16_ can_ implement baracuda_kernels_col2im_1d_f16_can_implement(baracuda kernels col2im 1d f16 can implement).- baracuda_
kernels_ ⚠col2im_ 1d_ f16_ run - col2im 1-D, f16. Caller must zero
output first. - baracuda_
kernels_ ⚠col2im_ 1d_ f32_ can_ implement baracuda_kernels_col2im_1d_f32_can_implement(baracuda kernels col2im 1d f32 can implement).- baracuda_
kernels_ ⚠col2im_ 1d_ f32_ run - col2im 1-D, f32. Caller must zero
output first. - baracuda_
kernels_ ⚠col2im_ 1d_ f64_ can_ implement baracuda_kernels_col2im_1d_f64_can_implement(baracuda kernels col2im 1d f64 can implement).- baracuda_
kernels_ ⚠col2im_ 1d_ f64_ run - col2im 1-D, f64. Caller must zero
output first. - baracuda_
kernels_ ⚠concat2_ backward_ bf16_ can_ implement baracuda_kernels_concat2_backward_bf16_can_implement(baracuda kernels concat2 backward bf16 can implement).- baracuda_
kernels_ ⚠concat2_ backward_ bf16_ run - Concat2 backward (slice-split), bf16. See f32 variant.
- baracuda_
kernels_ ⚠concat2_ backward_ f16_ can_ implement baracuda_kernels_concat2_backward_f16_can_implement(baracuda kernels concat2 backward f16 can implement).- baracuda_
kernels_ ⚠concat2_ backward_ f16_ run - Concat2 backward (slice-split), f16. See f32 variant.
- baracuda_
kernels_ ⚠concat2_ backward_ f32_ can_ implement baracuda_kernels_concat2_backward_f32_can_implement(baracuda kernels concat2 backward f32 can implement).- baracuda_
kernels_ ⚠concat2_ backward_ f32_ run - Concat2 backward (slice-split), f32. Bit-exact, no arithmetic.
- baracuda_
kernels_ ⚠concat2_ backward_ f64_ can_ implement baracuda_kernels_concat2_backward_f64_can_implement(baracuda kernels concat2 backward f64 can implement).- baracuda_
kernels_ ⚠concat2_ backward_ f64_ run - Concat2 backward (slice-split), f64. See f32 variant.
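The concat2 backward entries above describe a slice-split with no arithmetic: the upstream gradient is cut along the concatenation dimension and each piece is copied out bit-for-bit. A minimal host-side sketch follows, assuming a contiguous row-major layout collapsed to `[outer, a_dim + b_dim, inner]`. The function name and parameter list are hypothetical and do not reproduce the real entry points' signatures.

```rust
/// Host-side reference for the backward pass of `cat(a, b, dim)`.
/// Hypothetical helper: `dy` is assumed contiguous with logical shape
/// `[outer, a_dim + b_dim, inner]`, where `outer` and `inner` are the
/// products of the dimensions before and after `dim`.
fn concat2_backward_reference(
    dy: &[f32],
    outer: usize,
    a_dim: usize,
    b_dim: usize,
    inner: usize,
) -> (Vec<f32>, Vec<f32>) {
    let row = (a_dim + b_dim) * inner;
    assert_eq!(dy.len(), outer * row, "dy length does not match the shape");
    let mut da = Vec::with_capacity(outer * a_dim * inner);
    let mut db = Vec::with_capacity(outer * b_dim * inner);
    if row == 0 {
        return (da, db); // nothing to split
    }
    for chunk in dy.chunks_exact(row) {
        // The first `a_dim * inner` elements of each outer row belong to `a`,
        // the remainder to `b`. Values are copied, never recomputed.
        let (a_part, b_part) = chunk.split_at(a_dim * inner);
        da.extend_from_slice(a_part);
        db.extend_from_slice(b_part);
    }
    (da, db)
}
```

Because the split only copies elements, the result matches the input values exactly, which is consistent with the "bit-exact" note on the f32 variant.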
- baracuda_
kernels_ ⚠concat2_ bf16_ can_ implement baracuda_kernels_concat2_bf16_can_implement(baracuda kernels concat2 bf16 can implement).- baracuda_
kernels_ ⚠concat2_ bf16_ run - cat(a, b, dim), bf16, contiguous output. See f32 variant. - baracuda_
kernels_ ⚠concat2_ f16_ can_ implement baracuda_kernels_concat2_f16_can_implement(baracuda kernels concat2 f16 can implement).- baracuda_
kernels_ ⚠concat2_ f16_ run - cat(a, b, dim), f16, contiguous output. See f32 variant. - baracuda_
kernels_ ⚠concat2_ f32_ can_ implement baracuda_kernels_concat2_f32_can_implement(baracuda kernels concat2 f32 can implement).- baracuda_
kernels_ ⚠concat2_ f32_ run - cat(a, b, dim), f32, contiguous output. - baracuda_
kernels_ ⚠concat2_ f64_ can_ implement baracuda_kernels_concat2_f64_can_implement(baracuda kernels concat2 f64 can implement).- baracuda_
kernels_ ⚠concat2_ f64_ run - cat(a, b, dim), f64, contiguous output. See f32 variant. - baracuda_
kernels_ ⚠contiguize_ b1_ can_ implement baracuda_kernels_contiguize_b1_can_implement(baracuda kernels contiguize b1 can implement).- baracuda_
kernels_ ⚠contiguize_ b1_ run - Contiguize, 1-byte element (Bool, S8, U8, Fp8E4M3, Fp8E5M2).
- baracuda_
kernels_ ⚠contiguize_ b2_ can_ implement baracuda_kernels_contiguize_b2_can_implement(baracuda kernels contiguize b2 can implement).- baracuda_
kernels_ ⚠contiguize_ b2_ run - Contiguize, 2-byte element (f16, bf16).
- baracuda_
kernels_ ⚠contiguize_ b4_ can_ implement baracuda_kernels_contiguize_b4_can_implement(baracuda kernels contiguize b4 can implement).- baracuda_
kernels_ ⚠contiguize_ b4_ run - Contiguize, 4-byte element (f32, F32Strict, i32).
- baracuda_
kernels_ ⚠contiguize_ b8_ can_ implement baracuda_kernels_contiguize_b8_can_implement(baracuda kernels contiguize b8 can implement).- baracuda_
kernels_ ⚠contiguize_ b8_ run - Contiguize, 8-byte element (f64, i64, Complex32).
- baracuda_
kernels_ ⚠contiguize_ b16_ can_ implement baracuda_kernels_contiguize_b16_can_implement(baracuda kernels contiguize b16 can implement).- baracuda_
kernels_ ⚠contiguize_ b16_ run - Contiguize, 16-byte element (Complex64).
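
The byte-width entries above are keyed by element size rather than by dtype. A caller-side lookup might look like the sketch below; both helper functions are hypothetical and only assemble the symbol names that appear in this listing.

```rust
/// Hypothetical helper: map an element width in bytes to the suffix of the
/// matching contiguize entry point (b1, b2, b4, b8, b16).
pub fn contiguize_suffix(elem_bytes: usize) -> Option<&'static str> {
    match elem_bytes {
        1 => Some("b1"),   // Bool, S8, U8, Fp8E4M3, Fp8E5M2
        2 => Some("b2"),   // f16, bf16
        4 => Some("b4"),   // f32, F32Strict, i32
        8 => Some("b8"),   // f64, i64, Complex32
        16 => Some("b16"), // Complex64
        _ => None,         // nibble-packed S4 / U4 go through the nibble entry
    }
}

/// Hypothetical helper: build the name of the `_run` symbol for a width.
pub fn contiguize_run_symbol(elem_bytes: usize) -> Option<String> {
    contiguize_suffix(elem_bytes)
        .map(|suffix| format!("baracuda_kernels_contiguize_{}_run", suffix))
}
```

Dispatching on width keeps the number of compiled variants small, since a contiguize only moves bytes and never interprets them.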
- baracuda_
kernels_ ⚠contiguize_ nibble_ can_ implement baracuda_kernels_contiguize_nibble_can_implement(baracuda kernels contiguize nibble can implement).- baracuda_
kernels_ ⚠contiguize_ nibble_ run - Contiguize, nibble-packed (S4 / U4). Returns status 3
(Unsupported) when the source’s innermost stride is not one of
{1, -1, 2}, i.e. when the source layout breaks nibble alignment. - baracuda_
kernels_ ⚠curand_ normal_ f32_ run - Sample
numel f32 cells from Normal(mean, stddev). - baracuda_
kernels_ ⚠curand_ normal_ f32_ workspace_ size - Normal-sampler workspace size in bytes for
f32; always 0. - baracuda_
kernels_ ⚠curand_ normal_ f64_ run - Sample
numel f64 cells from Normal(mean, stddev). - baracuda_
kernels_ ⚠curand_ normal_ f64_ workspace_ size - Normal-sampler workspace size in bytes for
f64; always 0. - baracuda_
kernels_ ⚠curand_ uniform_ f32_ run - Sample
numel f32 cells from Uniform(low, high]. - baracuda_
kernels_ ⚠curand_ uniform_ f32_ workspace_ size - Uniform-sampler workspace size in bytes for
f32; always 0. - baracuda_
kernels_ ⚠curand_ uniform_ f64_ run - Sample
numel f64 cells from Uniform(low, high]. - baracuda_
kernels_ ⚠curand_ uniform_ f64_ workspace_ size - Uniform-sampler workspace size in bytes for
f64; always 0. - baracuda_
kernels_ ⚠dequantize_ per_ channel_ backward_ bf16_ can_ implement - Implementability check for
dequantize_per_channel_backward_bf16. - baracuda_
kernels_ ⚠dequantize_ per_ channel_ backward_ bf16_ run dequantize_per_channel_backward— bf16.- baracuda_
kernels_ ⚠dequantize_ per_ channel_ backward_ f16_ can_ implement - Implementability check for
dequantize_per_channel_backward_f16. - baracuda_
kernels_ ⚠dequantize_ per_ channel_ backward_ f16_ run dequantize_per_channel_backward— f16.- baracuda_
kernels_ ⚠dequantize_ per_ channel_ backward_ f32_ can_ implement - Implementability check for
dequantize_per_channel_backward_f32. - baracuda_
kernels_ ⚠dequantize_ per_ channel_ backward_ f32_ run dq[i] = dy[i] * scale[c]. f32.- baracuda_
kernels_ ⚠dequantize_ per_ channel_ backward_ f64_ can_ implement - Implementability check for
dequantize_per_channel_backward_f64. - baracuda_
kernels_ ⚠dequantize_ per_ channel_ backward_ f64_ run dequantize_per_channel_backward— f64.- baracuda_
kernels_ ⚠dequantize_ per_ channel_ bf16_ s8_ can_ implement - Implementability check for
dequantize_per_channel_bf16_s8. - baracuda_
kernels_ ⚠dequantize_ per_ channel_ bf16_ s8_ run dequantize_per_channel— s8 → bf16.- baracuda_
kernels_ ⚠dequantize_ per_ channel_ bf16_ u8_ can_ implement - Implementability check for
dequantize_per_channel_bf16_u8. - baracuda_
kernels_ ⚠dequantize_ per_ channel_ bf16_ u8_ run dequantize_per_channel— u8 → bf16.- baracuda_
kernels_ ⚠dequantize_ per_ channel_ f16_ s8_ can_ implement - Implementability check for
dequantize_per_channel_f16_s8. - baracuda_
kernels_ ⚠dequantize_ per_ channel_ f16_ s8_ run dequantize_per_channel— s8 → f16.- baracuda_
kernels_ ⚠dequantize_ per_ channel_ f16_ u8_ can_ implement - Implementability check for
dequantize_per_channel_f16_u8. - baracuda_
kernels_ ⚠dequantize_ per_ channel_ f16_ u8_ run dequantize_per_channel— u8 → f16.- baracuda_
kernels_ ⚠dequantize_ per_ channel_ f32_ s8_ can_ implement - Implementability check for
dequantize_per_channel_f32_s8. - baracuda_
kernels_ ⚠dequantize_ per_ channel_ f32_ s8_ run x[i] = scale[c] * (q[i] - zp[c]). s8 → f32.- baracuda_
kernels_ ⚠dequantize_ per_ channel_ f32_ u8_ can_ implement - Implementability check for
dequantize_per_channel_f32_u8. - baracuda_
kernels_ ⚠dequantize_ per_ channel_ f32_ u8_ run dequantize_per_channel— u8 → f32.- baracuda_
kernels_ ⚠dequantize_ per_ channel_ f64_ s8_ can_ implement - Implementability check for
dequantize_per_channel_f64_s8. - baracuda_
kernels_ ⚠dequantize_ per_ channel_ f64_ s8_ run dequantize_per_channel— s8 → f64.- baracuda_
kernels_ ⚠dequantize_ per_ channel_ f64_ u8_ can_ implement - Implementability check for
dequantize_per_channel_f64_u8. - baracuda_
kernels_ ⚠dequantize_ per_ channel_ f64_ u8_ run dequantize_per_channel— u8 → f64.- baracuda_
kernels_ ⚠dequantize_ per_ group_ backward_ bf16_ can_ implement - Implementability check for
dequantize_per_group_backward_bf16. - baracuda_
kernels_ ⚠dequantize_ per_ group_ backward_ bf16_ run - Dequant BW — bf16.
- baracuda_
kernels_ ⚠dequantize_ per_ group_ backward_ f16_ can_ implement - Implementability check for
dequantize_per_group_backward_f16. - baracuda_
kernels_ ⚠dequantize_ per_ group_ backward_ f16_ run - Dequant BW — f16.
- baracuda_
kernels_ ⚠dequantize_ per_ group_ backward_ f32_ can_ implement - Implementability check for
dequantize_per_group_backward_f32. - baracuda_
kernels_ ⚠dequantize_ per_ group_ backward_ f32_ run - Dequant BW — f32.
- baracuda_
kernels_ ⚠dequantize_ per_ group_ backward_ f64_ can_ implement - Implementability check for
dequantize_per_group_backward_f64. - baracuda_
kernels_ ⚠dequantize_ per_ group_ backward_ f64_ run - Dequant BW — f64.
- baracuda_
kernels_ ⚠dequantize_ per_ group_ bf16_ s8_ can_ implement - Implementability check for
dequantize_per_group_bf16_s8. - baracuda_
kernels_ ⚠dequantize_ per_ group_ bf16_ s8_ run - Dequant — bf16, s8.
- baracuda_
kernels_ ⚠dequantize_ per_ group_ bf16_ u8_ can_ implement - Implementability check for
dequantize_per_group_bf16_u8. - baracuda_
kernels_ ⚠dequantize_ per_ group_ bf16_ u8_ run - Dequant — bf16, u8.
- baracuda_
kernels_ ⚠dequantize_ per_ group_ f16_ s8_ can_ implement - Implementability check for
dequantize_per_group_f16_s8. - baracuda_
kernels_ ⚠dequantize_ per_ group_ f16_ s8_ run - Dequant — f16, s8.
- baracuda_
kernels_ ⚠dequantize_ per_ group_ f16_ u8_ can_ implement - Implementability check for
dequantize_per_group_f16_u8. - baracuda_
kernels_ ⚠dequantize_ per_ group_ f16_ u8_ run - Dequant — f16, u8.
- baracuda_
kernels_ ⚠dequantize_ per_ group_ f32_ s8_ can_ implement - Implementability check for
dequantize_per_group_f32_s8. - baracuda_
kernels_ ⚠dequantize_ per_ group_ f32_ s8_ run - Dequant — f32, s8.
- baracuda_
kernels_ ⚠dequantize_ per_ group_ f32_ u8_ can_ implement - Implementability check for
dequantize_per_group_f32_u8. - baracuda_
kernels_ ⚠dequantize_ per_ group_ f32_ u8_ run - Dequant — f32, u8.
- baracuda_
kernels_ ⚠dequantize_ per_ group_ f64_ s8_ can_ implement - Implementability check for
dequantize_per_group_f64_s8. - baracuda_
kernels_ ⚠dequantize_ per_ group_ f64_ s8_ run - Dequant — f64, s8.
- baracuda_
kernels_ ⚠dequantize_ per_ group_ f64_ u8_ can_ implement - Implementability check for
dequantize_per_group_f64_u8. - baracuda_
kernels_ ⚠dequantize_ per_ group_ f64_ u8_ run - Dequant — f64, u8.
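
The per-channel entries in this listing give the forward formula x[i] = scale[c] * (q[i] - zp[c]) and the backward formula dq[i] = dy[i] * scale[c]; the per-tensor entries use the same expressions with a single scale and zero point. The CPU reference below is a sketch under stated assumptions: the [channels, inner] layout, the i32 zero-point type, and the function names are not taken from the crate.

```rust
/// CPU reference for x[i] = scale[c] * (q[i] - zp[c]), s8 -> f32.
/// ASSUMPTION: q is contiguous with shape [channels, inner], so c = i / inner.
/// ASSUMPTION: zero points are stored as i32.
pub fn dequantize_per_channel_s8_f32(
    q: &[i8],
    scale: &[f32],
    zp: &[i32],
    inner: usize,
) -> Vec<f32> {
    assert!(inner > 0 && q.len() % inner == 0);
    let channels = q.len() / inner;
    assert_eq!(scale.len(), channels);
    assert_eq!(zp.len(), channels);
    q.iter()
        .enumerate()
        .map(|(i, &qi)| {
            let c = i / inner;
            // Subtract in i32 so the difference cannot overflow i8.
            scale[c] * (i32::from(qi) - zp[c]) as f32
        })
        .collect()
}

/// CPU reference for the backward formula dq[i] = dy[i] * scale[c].
pub fn dequantize_per_channel_backward_f32(
    dy: &[f32],
    scale: &[f32],
    inner: usize,
) -> Vec<f32> {
    assert!(inner > 0 && dy.len() % inner == 0);
    assert_eq!(scale.len(), dy.len() / inner);
    dy.iter()
        .enumerate()
        .map(|(i, &g)| g * scale[i / inner])
        .collect()
}
```

Passing a single channel with inner equal to the element count reproduces the per-tensor case x = scale * (q - zp).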
- baracuda_
kernels_ ⚠dequantize_ per_ tensor_ backward_ bf16_ can_ implement - Implementability check for
dequantize_per_tensor_backward_bf16. - baracuda_
kernels_ ⚠dequantize_ per_ tensor_ backward_ bf16_ run dequantize_per_tensor_backward— bf16.- baracuda_
kernels_ ⚠dequantize_ per_ tensor_ backward_ f16_ can_ implement - Implementability check for
dequantize_per_tensor_backward_f16. - baracuda_
kernels_ ⚠dequantize_ per_ tensor_ backward_ f16_ run dequantize_per_tensor_backward— f16.- baracuda_
kernels_ ⚠dequantize_ per_ tensor_ backward_ f32_ can_ implement - Implementability check for
dequantize_per_tensor_backward_f32. - baracuda_
kernels_ ⚠dequantize_ per_ tensor_ backward_ f32_ run dq = dy * scale. f32.- baracuda_
kernels_ ⚠dequantize_ per_ tensor_ backward_ f64_ can_ implement - Implementability check for
dequantize_per_tensor_backward_f64. - baracuda_
kernels_ ⚠dequantize_ per_ tensor_ backward_ f64_ run dequantize_per_tensor_backward— f64.- baracuda_
kernels_ ⚠dequantize_ per_ tensor_ bf16_ s8_ can_ implement - Implementability check for
dequantize_per_tensor_bf16_s8. - baracuda_
kernels_ ⚠dequantize_ per_ tensor_ bf16_ s8_ run dequantize_per_tensor— s8 → bf16.- baracuda_
kernels_ ⚠dequantize_ per_ tensor_ bf16_ u8_ can_ implement - Implementability check for
dequantize_per_tensor_bf16_u8. - baracuda_
kernels_ ⚠dequantize_ per_ tensor_ bf16_ u8_ run dequantize_per_tensor— u8 → bf16.- baracuda_
kernels_ ⚠dequantize_ per_ tensor_ f16_ s8_ can_ implement - Implementability check for
dequantize_per_tensor_f16_s8. - baracuda_
kernels_ ⚠dequantize_ per_ tensor_ f16_ s8_ run dequantize_per_tensor— s8 → f16.- baracuda_
kernels_ ⚠dequantize_ per_ tensor_ f16_ u8_ can_ implement - Implementability check for
dequantize_per_tensor_f16_u8. - baracuda_
kernels_ ⚠dequantize_ per_ tensor_ f16_ u8_ run dequantize_per_tensor— u8 → f16.- baracuda_
kernels_ ⚠dequantize_ per_ tensor_ f32_ s8_ can_ implement - Implementability check for
dequantize_per_tensor_f32_s8. - baracuda_
kernels_ ⚠dequantize_ per_ tensor_ f32_ s8_ run x = scale * (q - zp). s8 → f32.- baracuda_
kernels_ ⚠dequantize_ per_ tensor_ f32_ u8_ can_ implement - Implementability check for
dequantize_per_tensor_f32_u8. - baracuda_
kernels_ ⚠dequantize_ per_ tensor_ f32_ u8_ run dequantize_per_tensor— u8 → f32.- baracuda_
kernels_ ⚠dequantize_ per_ tensor_ f64_ s8_ can_ implement - Implementability check for
dequantize_per_tensor_f64_s8. - baracuda_
kernels_ ⚠dequantize_ per_ tensor_ f64_ s8_ run dequantize_per_tensor— s8 → f64.- baracuda_
kernels_ ⚠dequantize_ per_ tensor_ f64_ u8_ can_ implement - Implementability check for
dequantize_per_tensor_f64_u8. - baracuda_
kernels_ ⚠dequantize_ per_ tensor_ f64_ u8_ run dequantize_per_tensor— u8 → f64.- baracuda_
kernels_ ⚠dequantize_ per_ token_ backward_ bf16_ can_ implement - Implementability check for
dequantize_per_token_backward_bf16. - baracuda_
kernels_ ⚠dequantize_ per_ token_ backward_ bf16_ run - Dequant BW — bf16.
- baracuda_
kernels_ ⚠dequantize_ per_ token_ backward_ f16_ can_ implement - Implementability check for
dequantize_per_token_backward_f16. - baracuda_
kernels_ ⚠dequantize_ per_ token_ backward_ f16_ run - Dequant BW — f16.
- baracuda_
kernels_ ⚠dequantize_ per_ token_ backward_ f32_ can_ implement - Implementability check for
dequantize_per_token_backward_f32. - baracuda_
kernels_ ⚠dequantize_ per_ token_ backward_ f32_ run - Dequant BW — f32.
- baracuda_
kernels_ ⚠dequantize_ per_ token_ backward_ f64_ can_ implement - Implementability check for
dequantize_per_token_backward_f64. - baracuda_
kernels_ ⚠dequantize_ per_ token_ backward_ f64_ run - Dequant BW — f64.
- baracuda_
kernels_ ⚠dequantize_ per_ token_ bf16_ s8_ can_ implement - Implementability check for
dequantize_per_token_bf16_s8. - baracuda_
kernels_ ⚠dequantize_ per_ token_ bf16_ s8_ run dequantize_per_token— q s8 → y bf16.- baracuda_
kernels_ ⚠dequantize_ per_ token_ bf16_ u8_ can_ implement - Implementability check for
dequantize_per_token_bf16_u8. - baracuda_
kernels_ ⚠dequantize_ per_ token_ bf16_ u8_ run dequantize_per_token— q u8 → y bf16.- baracuda_
kernels_ ⚠dequantize_ per_ token_ f16_ s8_ can_ implement - Implementability check for
dequantize_per_token_f16_s8. - baracuda_
kernels_ ⚠dequantize_ per_ token_ f16_ s8_ run dequantize_per_token— q s8 → y f16.- baracuda_
kernels_ ⚠dequantize_ per_ token_ f16_ u8_ can_ implement - Implementability check for
dequantize_per_token_f16_u8. - baracuda_
kernels_ ⚠dequantize_ per_ token_ f16_ u8_ run dequantize_per_token— q u8 → y f16.- baracuda_
kernels_ ⚠dequantize_ per_ token_ f32_ s8_ can_ implement - Implementability check for
dequantize_per_token_f32_s8. - baracuda_
kernels_ ⚠dequantize_ per_ token_ f32_ s8_ run dequantize_per_token— q s8 → y f32.- baracuda_
kernels_ ⚠dequantize_ per_ token_ f32_ u8_ can_ implement - Implementability check for
dequantize_per_token_f32_u8. - baracuda_
kernels_ ⚠dequantize_ per_ token_ f32_ u8_ run dequantize_per_token— q u8 → y f32.- baracuda_
kernels_ ⚠dequantize_ per_ token_ f64_ s8_ can_ implement - Implementability check for
dequantize_per_token_f64_s8. - baracuda_
kernels_ ⚠dequantize_ per_ token_ f64_ s8_ run dequantize_per_token— q s8 → y f64.- baracuda_
kernels_ ⚠dequantize_ per_ token_ f64_ u8_ can_ implement - Implementability check for
dequantize_per_token_f64_u8. - baracuda_
kernels_ ⚠dequantize_ per_ token_ f64_ u8_ run dequantize_per_token— q u8 → y f64.- baracuda_
kernels_ ⚠dequantize_ q2_ K_ can_ implement baracuda_kernels_dequantize_q2_K_can_implement(baracuda kernels dequantize q2 k can implement).- baracuda_
kernels_ ⚠dequantize_ q2_ K_ run - GGUF
Q2_K dequantize → f32. numel must be a multiple of 256. - baracuda_
kernels_ ⚠dequantize_ q3_ K_ can_ implement baracuda_kernels_dequantize_q3_K_can_implement(baracuda kernels dequantize q3 k can implement).- baracuda_
kernels_ ⚠dequantize_ q3_ K_ run - GGUF
Q3_K dequantize → f32. Safety: as Q2_K. - baracuda_
kernels_ ⚠dequantize_ q4_ 0_ can_ implement baracuda_kernels_dequantize_q4_0_can_implement(baracuda kernels dequantize q4 0 can implement).- baracuda_
kernels_ ⚠dequantize_ q4_ 0_ run - GGUF
Q4_0 block-format dequantize → f32. numel must be a multiple of 32. Safety: device-resident x, y; valid stream. - baracuda_
kernels_ ⚠dequantize_ q4_ 1_ can_ implement baracuda_kernels_dequantize_q4_1_can_implement(baracuda kernels dequantize q4 1 can implement).- baracuda_
kernels_ ⚠dequantize_ q4_ 1_ run - GGUF
Q4_1 dequantize → f32. Safety: as Q4_0. - baracuda_
kernels_ ⚠dequantize_ q4_ K_ can_ implement baracuda_kernels_dequantize_q4_K_can_implement(baracuda kernels dequantize q4 k can implement).- baracuda_
kernels_ ⚠dequantize_ q4_ K_ run - GGUF
Q4_K dequantize → f32. Safety: as Q2_K. - baracuda_
kernels_ ⚠dequantize_ q5_ 0_ can_ implement baracuda_kernels_dequantize_q5_0_can_implement(baracuda kernels dequantize q5 0 can implement).- baracuda_
kernels_ ⚠dequantize_ q5_ 0_ run - GGUF
Q5_0 dequantize → f32. Safety: as Q4_0. - baracuda_
kernels_ ⚠dequantize_ q5_ 1_ can_ implement baracuda_kernels_dequantize_q5_1_can_implement(baracuda kernels dequantize q5 1 can implement).- baracuda_
kernels_ ⚠dequantize_ q5_ 1_ run - GGUF
Q5_1 dequantize → f32. Safety: as Q4_0. - baracuda_
kernels_ ⚠dequantize_ q5_ K_ can_ implement baracuda_kernels_dequantize_q5_K_can_implement(baracuda kernels dequantize q5 k can implement).- baracuda_
kernels_ ⚠dequantize_ q5_ K_ run - GGUF
Q5_K dequantize → f32. Safety: as Q2_K. - baracuda_
kernels_ ⚠dequantize_ q6_ K_ can_ implement baracuda_kernels_dequantize_q6_K_can_implement(baracuda kernels dequantize q6 k can implement).- baracuda_
kernels_ ⚠dequantize_ q6_ K_ run - GGUF
Q6_K dequantize → f32. Safety: as Q2_K. - baracuda_
kernels_ ⚠dequantize_ q8_ 0_ can_ implement baracuda_kernels_dequantize_q8_0_can_implement(baracuda kernels dequantize q8 0 can implement).- baracuda_
kernels_ ⚠dequantize_ q8_ 0_ run - GGUF
Q8_0 dequantize → f32. Safety: as Q4_0. - baracuda_
kernels_ ⚠dequantize_ q8_ K_ can_ implement baracuda_kernels_dequantize_q8_K_can_implement(baracuda kernels dequantize q8 k can implement).- baracuda_
kernels_ ⚠dequantize_ q8_ K_ run - GGUF
Q8_K dequantize → f32. Safety: as Q2_K. - baracuda_
kernels_ ⚠dropout_ backward_ f32_ can_ implement baracuda_kernels_dropout_backward_f32_can_implement(baracuda kernels dropout backward f32 can implement).- baracuda_
kernels_ ⚠dropout_ backward_ f32_ run - Dropout backward (f32). Writes
dx[i] = dy[i] * mask[i] * scale, where scale = 1 / (1 - p). - baracuda_
kernels_ ⚠dropout_ backward_ f64_ can_ implement baracuda_kernels_dropout_backward_f64_can_implement(baracuda kernels dropout backward f64 can implement).- baracuda_
kernels_ ⚠dropout_ backward_ f64_ run - Dropout backward (f64).
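
The backward formula given for the f32 variant, dx[i] = dy[i] * mask[i] * scale with scale = 1 / (1 - p), can be checked on the host with a few lines. This is a sketch, not the kernel; the u8 mask encoding (1 = kept, 0 = dropped) and the function name are assumptions.

```rust
/// CPU reference: dx[i] = dy[i] * mask[i] * scale, scale = 1 / (1 - p).
/// ASSUMPTION: the mask is stored as u8 with 1 = kept and 0 = dropped.
pub fn dropout_backward_f32(dy: &[f32], mask: &[u8], p: f32) -> Vec<f32> {
    assert_eq!(dy.len(), mask.len());
    assert!((0.0..1.0).contains(&p), "p must lie in [0, 1)");
    let scale = 1.0 / (1.0 - p);
    dy.iter()
        .zip(mask)
        .map(|(&g, &m)| g * f32::from(m) * scale)
        .collect()
}
```

The 1 / (1 - p) factor keeps the expected gradient magnitude unchanged: with p = 0.5, surviving elements are doubled and dropped elements receive zero.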
- baracuda_
kernels_ ⚠dropout_ f32_ can_ implement baracuda_kernels_dropout_f32_can_implement(baracuda kernels dropout f32 can implement).- baracuda_
kernels_ ⚠dropout_ f32_ run - Dropout forward (f32). Writes:
- baracuda_
kernels_ ⚠dropout_ f64_ can_ implement baracuda_kernels_dropout_f64_can_implement(baracuda kernels dropout f64 can implement).- baracuda_
kernels_ ⚠dropout_ f64_ run - Dropout forward (f64). Same shape as the f32 variant.
- baracuda_
kernels_ ⚠dynamic_ range_ quantize_ per_ token_ sym_ f32_ s8_ can_ implement - Implementability check for
dynamic_range_quantize_per_token_sym_f32_s8. - baracuda_
kernels_ ⚠dynamic_ range_ quantize_ per_ token_ sym_ f32_ s8_ run dynamic_range_quantize_per_token_sym— f32 → s8.- baracuda_
kernels_ ⚠dynamic_ range_ quantize_ per_ token_ sym_ f64_ s8_ can_ implement - Implementability check for
dynamic_range_quantize_per_token_sym_f64_s8. - baracuda_
kernels_ ⚠dynamic_ range_ quantize_ per_ token_ sym_ f64_ s8_ run dynamic_range_quantize_per_token_sym— f64 → s8.- baracuda_
kernels_ ⚠eig_ run - General eigendecomposition via
Xgeev. a_inout is destroyed in place. dtype_tag selects between f32 / f64 / Complex32 / Complex64 (matches the input dtype; outputs use the same dtype). For real input, w_out is [2 * n] (packed wr/wi); for complex input, [n]. Workspace is split host + device per cuSOLVER’s 64-bit API convention. - baracuda_
kernels_ ⚠eig_ workspace_ size - Eig workspace sizes (Xgeev). Writes two byte counts — device + host. Caller must size both.
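
The eig_run entry states that w_out holds 2 * n elements for real input and n elements for complex input. A small sizing helper illustrates the rule; the EigDtype enum below is a hypothetical stand-in for dtype_tag, whose actual integer values are not shown in this listing.

```rust
/// Hypothetical stand-in for `dtype_tag`; the real tag values are not
/// documented in this listing.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum EigDtype {
    F32,
    F64,
    Complex32,
    Complex64,
}

impl EigDtype {
    pub fn is_complex(self) -> bool {
        matches!(self, EigDtype::Complex32 | EigDtype::Complex64)
    }

    /// Size in bytes of one element of this dtype.
    pub fn elem_bytes(self) -> usize {
        match self {
            EigDtype::F32 => 4,
            EigDtype::F64 | EigDtype::Complex32 => 8,
            EigDtype::Complex64 => 16,
        }
    }
}

/// Element count of `w_out`: [2 * n] for real input, [n] for complex input.
pub fn w_out_len(dtype: EigDtype, n: usize) -> usize {
    if dtype.is_complex() { n } else { 2 * n }
}

/// Byte size of `w_out`, using the same dtype as the input.
pub fn w_out_bytes(dtype: EigDtype, n: usize) -> usize {
    w_out_len(dtype, n) * dtype.elem_bytes()
}
```

Under these rules a real buffer and the complex buffer of matching precision occupy the same number of bytes, since two real scalars take the space of one complex value.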
- baracuda_
kernels_ ⚠eigh_ c32_ run - Hermitian eigendecomposition (Complex32). Eigenvalues are real
f32 (the Hermitian eigenvalue spectrum is always real); the eigenvalues_out buffer is f32[n], not Complex32[n]. - baracuda_
kernels_ ⚠eigh_ c32_ workspace_ size - Hermitian eigendecomposition workspace size (Complex32).
- baracuda_
kernels_ ⚠eigh_ c64_ run - Hermitian eigendecomposition (Complex64). Eigenvalues are real
f64; eigenvalues_out is f64[n], not Complex64[n]. - baracuda_
kernels_ ⚠eigh_ c64_ workspace_ size - Hermitian eigendecomposition workspace size (Complex64).
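
The Hermitian entries store eigenvalues in a real buffer because a Hermitian spectrum is always real. A worked 2x2 case shows why; this is a closed-form illustration with a hypothetical function name, not the cuSOLVER path these entries use.

```rust
/// Eigenvalues of the Hermitian matrix [[a, b], [conj(b), d]] with real
/// diagonal entries a, d and complex off-diagonal b = (re, im).
/// Closed form: (a + d) / 2 -/+ sqrt(((a - d) / 2)^2 + |b|^2).
/// The value under the square root is a sum of squares, so both
/// eigenvalues are real. Returned as (smaller, larger).
pub fn hermitian_2x2_eigenvalues(a: f64, d: f64, b: (f64, f64)) -> (f64, f64) {
    let mean = 0.5 * (a + d);
    let half_diff = 0.5 * (a - d);
    let radius = (half_diff * half_diff + b.0 * b.0 + b.1 * b.1).sqrt();
    (mean - radius, mean + radius)
}
```

For a = d = 2 and b = i the function returns 1 and 3: the matrix entries are complex, yet the output needs only two real scalars.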
- baracuda_
kernels_ ⚠eigh_ f32_ run - Symmetric eigendecomposition
A · v = λ · v. a_inout is overwritten with the eigenvector matrix (column-major); eigenvalues_out receives the n eigenvalues sorted ascending. - baracuda_
kernels_ ⚠eigh_ f32_ workspace_ size - Eigh workspace size in bytes for the real symmetric
syevd path. - baracuda_
kernels_ ⚠eigh_ f64_ run - Symmetric eigendecomposition
A · v = λ · v. a_inout is overwritten with the eigenvector matrix (column-major); eigenvalues_out receives the n eigenvalues sorted ascending. - baracuda_
kernels_ ⚠eigh_ f64_ workspace_ size - Eigh workspace size in bytes for the real symmetric
syevd path. - baracuda_
kernels_ ⚠embedding_ backward_ f32_ can_ implement - Implementability check for
embedding_backward_f32. - baracuda_
kernels_ ⚠embedding_ backward_ f32_ run embedding BW: dweight[indices[n], :] += dout[n, :] (atomicAdd), skipping rows where indices[n] == padding_idx. f32.- baracuda_
kernels_ ⚠embedding_ backward_ f64_ can_ implement - Implementability check for
embedding_backward_f64. - baracuda_
kernels_ ⚠embedding_ backward_ f64_ run embeddingBW — f64.- baracuda_
kernels_ ⚠embedding_ backward_ i64idx_ f32_ can_ implement - Implementability check for
embedding_backward_i64idx_f32. - baracuda_
kernels_ ⚠embedding_ backward_ i64idx_ f32_ run embeddingBW — f32, i64 indices.- baracuda_
kernels_ ⚠embedding_ backward_ i64idx_ f64_ can_ implement - Implementability check for
embedding_backward_i64idx_f64. - baracuda_
kernels_ ⚠embedding_ backward_ i64idx_ f64_ run embeddingBW — f64, i64 indices.- baracuda_
kernels_ ⚠embedding_ bag_ backward_ f32_ can_ implement - Implementability check for
embedding_bag_backward_f32. - baracuda_
kernels_ ⚠embedding_ bag_ backward_ f32_ run embedding_bag BW: atomicAdd into dweight. f32.- baracuda_
kernels_ ⚠embedding_ bag_ backward_ f64_ can_ implement - Implementability check for
embedding_bag_backward_f64. - baracuda_
kernels_ ⚠embedding_ bag_ backward_ f64_ run embedding_bagBW — f64.- baracuda_
kernels_ ⚠embedding_ bag_ backward_ i64idx_ f32_ can_ implement - Implementability check for
embedding_bag_backward_i64idx_f32. - baracuda_
kernels_ ⚠embedding_ bag_ backward_ i64idx_ f32_ run embedding_bagBW — f32, i64 indices.- baracuda_
kernels_ ⚠embedding_ bag_ backward_ i64idx_ f64_ can_ implement - Implementability check for
embedding_bag_backward_i64idx_f64. - baracuda_
kernels_ ⚠embedding_ bag_ backward_ i64idx_ f64_ run embedding_bagBW — f64, i64 indices.- baracuda_
kernels_ ⚠embedding_ bag_ bf16_ can_ implement - Implementability check for
embedding_bag_bf16. - baracuda_
kernels_ ⚠embedding_ bag_ bf16_ run embedding_bagFW — bf16.- baracuda_
kernels_ ⚠embedding_ bag_ f16_ can_ implement - Implementability check for
embedding_bag_f16. - baracuda_
kernels_ ⚠embedding_ bag_ f16_ run embedding_bagFW — f16.- baracuda_
kernels_ ⚠embedding_ bag_ f32_ can_ implement - Implementability check for
embedding_bag_f32. - baracuda_
kernels_ ⚠embedding_ bag_ f32_ run embedding_bagFW — f32.- baracuda_
kernels_ ⚠embedding_ bag_ f64_ can_ implement - Implementability check for
embedding_bag_f64. - baracuda_
kernels_ ⚠embedding_ bag_ f64_ run embedding_bagFW — f64.- baracuda_
kernels_ ⚠embedding_ bag_ i64idx_ bf16_ can_ implement - Implementability check for
embedding_bag_i64idx_bf16. - baracuda_
kernels_ ⚠embedding_ bag_ i64idx_ bf16_ run embedding_bagFW — bf16, i64 indices.- baracuda_
kernels_ ⚠embedding_ bag_ i64idx_ f16_ can_ implement - Implementability check for
embedding_bag_i64idx_f16. - baracuda_
kernels_ ⚠embedding_ bag_ i64idx_ f16_ run embedding_bagFW — f16, i64 indices.- baracuda_
kernels_ ⚠embedding_ bag_ i64idx_ f32_ can_ implement - Implementability check for
embedding_bag_i64idx_f32. - baracuda_
kernels_ ⚠embedding_ bag_ i64idx_ f32_ run embedding_bagFW — f32, i64 indices.- baracuda_
kernels_ ⚠embedding_ bag_ i64idx_ f64_ can_ implement - Implementability check for
embedding_bag_i64idx_f64. - baracuda_
kernels_ ⚠embedding_ bag_ i64idx_ f64_ run embedding_bagFW — f64, i64 indices.- baracuda_
kernels_ ⚠embedding_ bag_ max_ backward_ f32_ can_ implement - Implementability check for
embedding_bag_max_backward_f32. - baracuda_
kernels_ ⚠embedding_ bag_ max_ backward_ f32_ run embedding_bag_max BW: f32. Index dtype is fixed at i32 (set by the FW’s out_index output).- baracuda_
kernels_ ⚠embedding_ bag_ max_ backward_ f64_ can_ implement - Implementability check for
embedding_bag_max_backward_f64. - baracuda_
kernels_ ⚠embedding_ bag_ max_ backward_ f64_ run embedding_bag_maxBW — f64.- baracuda_
kernels_ ⚠embedding_ bag_ max_ bf16_ can_ implement - Implementability check for
embedding_bag_max_bf16. - baracuda_
kernels_ ⚠embedding_ bag_ max_ bf16_ run embedding_bag_maxFW — bf16.- baracuda_
kernels_ ⚠embedding_ bag_ max_ f16_ can_ implement - Implementability check for
embedding_bag_max_f16. - baracuda_
kernels_ ⚠embedding_ bag_ max_ f16_ run embedding_bag_maxFW — f16.- baracuda_
kernels_ ⚠embedding_ bag_ max_ f32_ can_ implement - Implementability check for
embedding_bag_max_f32. - baracuda_
kernels_ ⚠embedding_ bag_ max_ f32_ run embedding_bag Max-mode FW: f32 (i32 indices).- baracuda_
kernels_ ⚠embedding_ bag_ max_ f64_ can_ implement - Implementability check for
embedding_bag_max_f64. - baracuda_
kernels_ ⚠embedding_ bag_ max_ f64_ run embedding_bag_maxFW — f64.- baracuda_
kernels_ ⚠embedding_ bag_ max_ i64idx_ bf16_ can_ implement - Implementability check for
embedding_bag_max_i64idx_bf16. - baracuda_
kernels_ ⚠embedding_ bag_ max_ i64idx_ bf16_ run embedding_bag_maxFW — bf16, i64 indices.- baracuda_
kernels_ ⚠embedding_ bag_ max_ i64idx_ f16_ can_ implement - Implementability check for
embedding_bag_max_i64idx_f16. - baracuda_
kernels_ ⚠embedding_ bag_ max_ i64idx_ f16_ run embedding_bag_maxFW — f16, i64 indices.- baracuda_
kernels_ ⚠embedding_ bag_ max_ i64idx_ f32_ can_ implement - Implementability check for
embedding_bag_max_i64idx_f32. - baracuda_
kernels_ ⚠embedding_ bag_ max_ i64idx_ f32_ run embedding_bag_maxFW — f32, i64 indices.- baracuda_
kernels_ ⚠embedding_ bag_ max_ i64idx_ f64_ can_ implement - Implementability check for
embedding_bag_max_i64idx_f64. - baracuda_
kernels_ ⚠embedding_ bag_ max_ i64idx_ f64_ run embedding_bag_maxFW — f64, i64 indices.- baracuda_
kernels_ ⚠embedding_ bf16_ can_ implement - Implementability check for
embedding_bf16. - baracuda_
kernels_ ⚠embedding_ bf16_ run embeddingFW — bf16.- baracuda_
kernels_ ⚠embedding_ f16_ can_ implement - Implementability check for
embedding_f16. - baracuda_
kernels_ ⚠embedding_ f16_ run embeddingFW — f16.- baracuda_
kernels_ ⚠embedding_ f32_ can_ implement - Implementability check for
embedding_f32. - baracuda_
kernels_ ⚠embedding_ f32_ run embeddingFW — f32 (pure copy).- baracuda_
kernels_ ⚠embedding_ f64_ can_ implement - Implementability check for
embedding_f64. - baracuda_
kernels_ ⚠embedding_ f64_ run embeddingFW — f64.- baracuda_
kernels_ ⚠embedding_ i64idx_ bf16_ can_ implement - Implementability check for
embedding_i64idx_bf16. - baracuda_
kernels_ ⚠embedding_ i64idx_ bf16_ run embeddingFW — bf16, i64 indices.- baracuda_
kernels_ ⚠embedding_ i64idx_ f16_ can_ implement - Implementability check for
embedding_i64idx_f16. - baracuda_
kernels_ ⚠embedding_ i64idx_ f16_ run embeddingFW — f16, i64 indices.- baracuda_
kernels_ ⚠embedding_ i64idx_ f32_ can_ implement - Implementability check for
embedding_i64idx_f32. - baracuda_
kernels_ ⚠embedding_ i64idx_ f32_ run embeddingFW — f32, i64 indices.- baracuda_
kernels_ ⚠embedding_ i64idx_ f64_ can_ implement - Implementability check for
embedding_i64idx_f64. - baracuda_
kernels_ ⚠embedding_ i64idx_ f64_ run embeddingFW — f64, i64 indices.- baracuda_
kernels_ ⚠fake_ quantize_ backward_ bf16_ can_ implement - Implementability check for
fake_quantize_backward_bf16. - baracuda_
kernels_ ⚠fake_ quantize_ backward_ bf16_ run fake_quantize_backward— bf16.- baracuda_
kernels_ ⚠fake_ quantize_ backward_ f16_ can_ implement - Implementability check for
fake_quantize_backward_f16. - baracuda_
kernels_ ⚠fake_ quantize_ backward_ f16_ run fake_quantize_backward— f16.- baracuda_
kernels_ ⚠fake_ quantize_ backward_ f32_ can_ implement - Implementability check for
fake_quantize_backward_f32. - baracuda_
kernels_ ⚠fake_ quantize_ backward_ f32_ run dx = dy * in_range_mask(x). Straight-through estimator (STE), no 1/scale factor. f32.- baracuda_
kernels_ ⚠fake_ quantize_ backward_ f64_ can_ implement - Implementability check for
fake_quantize_backward_f64. - baracuda_
kernels_ ⚠fake_ quantize_ backward_ f64_ run fake_quantize_backward— f64.- baracuda_
kernels_ ⚠fake_ quantize_ bf16_ can_ implement - Implementability check for
fake_quantize_bf16. - baracuda_
kernels_ ⚠fake_ quantize_ bf16_ run fake_quantize— bf16.- baracuda_
kernels_ ⚠fake_ quantize_ f16_ can_ implement - Implementability check for
fake_quantize_f16. - baracuda_
kernels_ ⚠fake_ quantize_ f16_ run fake_quantize— f16.- baracuda_
kernels_ ⚠fake_ quantize_ f32_ can_ implement - Implementability check for
fake_quantize_f32. - baracuda_
kernels_ ⚠fake_ quantize_ f32_ run y = scale * (clamp(round(x/scale)+zp, qmin, qmax) - zp). f32.- baracuda_
kernels_ ⚠fake_ quantize_ f64_ can_ implement - Implementability check for
fake_quantize_f64. - baracuda_
kernels_ ⚠fake_ quantize_ f64_ run fake_quantize— f64 (f64 scale).- baracuda_
kernels_ ⚠fft_ 1d_ c32_ run - 1-D C2C FFT (forward + inverse via flag). Wraps cuFFT’s
cufftExecC2C (c32) / cufftExecZ2Z (c64). For the inverse transform, applies 1/n normalization in place after execution. - baracuda_
kernels_ ⚠fft_ 1d_ c32_ workspace_ size - 1-D C2C FFT workspace size in bytes. cuFFT manages its own
internal workspace; this entry always writes
0. - baracuda_
kernels_ ⚠fft_ 1d_ c64_ run - 1-D C2C FFT (forward + inverse via flag). Wraps cuFFT’s
cufftExecC2C (c32) / cufftExecZ2Z (c64). For the inverse transform, applies 1/n normalization in place after execution. - baracuda_
kernels_ ⚠fft_ 1d_ c64_ workspace_ size - 1-D C2C FFT workspace size in bytes. cuFFT manages its own
internal workspace; this entry always writes
0. - baracuda_
kernels_ ⚠fft_ nd_ c32_ run - ND C2C FFT (forward + inverse via flag).
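
The 1-D entries state that the inverse path applies 1/n normalization after execution, which is needed because cuFFT transforms are unnormalized. The naive O(n^2) DFT below sketches that convention on the host with (re, im) tuples; it is an illustration with a hypothetical function name, not the cuFFT code path.

```rust
use std::f64::consts::PI;

/// Naive O(n^2) DFT on (re, im) pairs. `inverse` flips the sign of the
/// exponent and then applies the 1/n scale, so that a forward transform
/// followed by an inverse transform returns the input.
pub fn dft(x: &[(f64, f64)], inverse: bool) -> Vec<(f64, f64)> {
    let n = x.len();
    let sign = if inverse { 1.0 } else { -1.0 };
    let mut out = vec![(0.0, 0.0); n];
    for (k, o) in out.iter_mut().enumerate() {
        for (j, &(re, im)) in x.iter().enumerate() {
            // Reduce k * j modulo n to keep the angle small and accurate.
            let theta = sign * 2.0 * PI * ((k * j) % n) as f64 / n as f64;
            let (s, c) = theta.sin_cos();
            // Complex multiply (re + i*im) * (c + i*s).
            o.0 += re * c - im * s;
            o.1 += re * s + im * c;
        }
    }
    if inverse {
        let scale = 1.0 / n as f64;
        for o in out.iter_mut() {
            o.0 *= scale;
            o.1 *= scale;
        }
    }
    out
}
```

Without the final scale, a forward transform followed by an inverse one would return the input multiplied by n.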
- baracuda_
kernels_ ⚠fft_ nd_ c32_ workspace_ size - ND C2C FFT workspace size in bytes — always
0. - baracuda_
kernels_ ⚠fft_ nd_ c64_ run - ND C2C FFT (forward + inverse via flag).
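
The fftshift_4_run entry in this listing defines the last-axis shift as y[b, i] = x[b, (i + n/2) % n] on a [batch, n] tensor. The host-side sketch below implements exactly that stated formula; the function name is hypothetical and the generic element type stands in for the 4 / 8 / 16-byte specializations.

```rust
/// CPU reference for y[b, i] = x[b, (i + n/2) % n] on a contiguous
/// [batch, n] buffer. Generic over the element type because the shift
/// only moves elements and never interprets them.
pub fn fftshift_last_axis<T: Copy>(x: &[T], batch: usize, n: usize) -> Vec<T> {
    assert_eq!(x.len(), batch * n);
    let mut y = Vec::with_capacity(x.len());
    for b in 0..batch {
        let row = &x[b * n..(b + 1) * n];
        for i in 0..n {
            y.push(row[(i + n / 2) % n]);
        }
    }
    y
}
```

For even n the two halves of each row simply swap, so applying the function twice returns the original data.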
- baracuda_
kernels_ ⚠fft_ nd_ c64_ workspace_ size - ND C2C FFT workspace size in bytes — always
0. - baracuda_
kernels_ ⚠fftshift_ 4_ can_ implement baracuda_kernels_fftshift_4_can_implement(baracuda kernels fftshift 4 can implement).- baracuda_
kernels_ ⚠fftshift_ 4_ run fftshift along the last axis of a [batch, n] tensor: y[b, i] = x[b, (i + n/2) % n]. Element-width specialization (4 bytes per element), used for Bool / f32 / packed-Bool shifts; the same kernel re-instantiated at 8 / 16 bytes covers f64 / Complex32 and Complex64.- baracuda_
kernels_ ⚠fftshift_ 8_ can_ implement baracuda_kernels_fftshift_8_can_implement(baracuda kernels fftshift 8 can implement).- baracuda_
kernels_ ⚠fftshift_ 8_ run - 8-byte-element
fftshift (covers f64 and Complex32). - baracuda_
kernels_ ⚠fftshift_ 16_ can_ implement baracuda_kernels_fftshift_16_can_implement(baracuda kernels fftshift 16 can implement).- baracuda_
kernels_ ⚠fftshift_ 16_ run - 16-byte-element
fftshift (covers Complex64). - baracuda_
kernels_ ⚠fftshift_ nd_ 4_ can_ implement baracuda_kernels_fftshift_nd_4_can_implement(baracuda kernels fftshift nd 4 can implement).- baracuda_
kernels_ ⚠fftshift_ nd_ 4_ run - N-D
fftshift / ifftshift: single-pass general-permutation kernel covering up to rank-8 tensors. The caller passes a per-axis shape, per-axis shift_amt (0 for pass-through axes; n/2 for fftshift / n - n/2 for ifftshift on shifted axes), and per-axis contiguous stride (in elements). The same kernel covers both directions; the direction lives entirely in the shift_amt array. - baracuda_
kernels_ ⚠fftshift_ nd_ 8_ can_ implement baracuda_kernels_fftshift_nd_8_can_implement(baracuda kernels fftshift nd 8 can implement).- baracuda_
kernels_ ⚠fftshift_ nd_ 8_ run - 8-byte-cell N-D fftshift (covers
f64 and Complex32). - baracuda_
kernels_ ⚠fftshift_ nd_ 16_ can_ implement baracuda_kernels_fftshift_nd_16_can_implement(baracuda kernels fftshift nd 16 can implement).- baracuda_
kernels_ ⚠fftshift_ nd_ 16_ run - 16-byte-cell N-D fftshift (covers
Complex64). - baracuda_
kernels_ ⚠fill_ bf16_ can_ implement - Implementability check for
fill_bf16. Host-side only. - baracuda_
kernels_ ⚠fill_ bf16_ run - Fill
y with value, bf16 dtype. value_bits is the raw 16-bit pattern of a bf16 value. - baracuda_
kernels_ ⚠fill_ bf16_ strided_ can_ implement baracuda_kernels_fill_bf16_strided_can_implement(baracuda kernels fill bf16 strided can implement).- baracuda_
kernels_ ⚠fill_ bf16_ strided_ run - Strided fill, bf16.
value_bits is the raw 16-bit pattern of a bf16 value. - baracuda_
kernels_ ⚠fill_ f16_ can_ implement - Implementability check for
fill_f16. Host-side only. - baracuda_
kernels_ ⚠fill_ f16_ run - Fill
y with value, f16 dtype. value_bits is the raw 16-bit pattern of an f16 value (transport convention shared with the Pad-constant family). - baracuda_
kernels_ ⚠fill_ f16_ strided_ can_ implement baracuda_kernels_fill_f16_strided_can_implement(baracuda kernels fill f16 strided can implement).- baracuda_
kernels_ ⚠fill_ f16_ strided_ run - Strided fill, f16.
value_bits is the raw 16-bit pattern of an f16 value. - baracuda_
kernels_ ⚠fill_ f32_ can_ implement - Implementability check for
fill_f32. Host-side only. - baracuda_
kernels_ ⚠fill_ f32_ run - Fill
y with value, f32 dtype. This is the fill trailblazer: every fill_<dt>_run (and _strided_run) variant follows the same write-only contract. - baracuda_
kernels_ ⚠fill_ f32_ strided_ can_ implement baracuda_kernels_fill_f32_strided_can_implement(baracuda kernels fill f32 strided can implement).- baracuda_
kernels_ ⚠fill_ f32_ strided_ run baracuda_kernels_fill_f32_strided_run(baracuda kernels fill f32 strided run).- baracuda_
kernels_ ⚠fill_ f64_ can_ implement - Implementability check for
fill_f64. Host-side only. - baracuda_
kernels_ ⚠fill_ f64_ run - Fill
y with value, f64 dtype. - baracuda_
kernels_ ⚠fill_ f64_ strided_ can_ implement baracuda_kernels_fill_f64_strided_can_implement(baracuda kernels fill f64 strided can implement).- baracuda_
kernels_ ⚠fill_ f64_ strided_ run baracuda_kernels_fill_f64_strided_run(baracuda kernels fill f64 strided run).- baracuda_
kernels_ ⚠fill_ fp8e4m3_ can_ implement baracuda_kernels_fill_fp8e4m3_can_implement(baracuda kernels fill fp8e4m3 can implement).- baracuda_
kernels_ ⚠fill_ fp8e4m3_ run - Fill
y with value, FP8 E4M3 dtype. value is the raw 8-bit E4M3 encoding (storage is byte-identical to u8); callers compute the encoding via the cast family or __nv_cvt_float_to_fp8. - baracuda_
kernels_ ⚠fill_ fp8e4m3_ strided_ can_ implement baracuda_kernels_fill_fp8e4m3_strided_can_implement(baracuda kernels fill fp8e4m3 strided can implement).- baracuda_
kernels_ ⚠fill_ fp8e4m3_ strided_ run baracuda_kernels_fill_fp8e4m3_strided_run(baracuda kernels fill fp8e4m3 strided run).- baracuda_
kernels_ ⚠fill_ i8_ can_ implement - Implementability check for
fill_i8. Host-side only. - baracuda_
kernels_ ⚠fill_ i8_ run - Fill
y with value, i8 dtype. - baracuda_
kernels_ ⚠fill_ i8_ strided_ can_ implement baracuda_kernels_fill_i8_strided_can_implement(baracuda kernels fill i8 strided can implement).- baracuda_
kernels_ ⚠fill_ i8_ strided_ run baracuda_kernels_fill_i8_strided_run(baracuda kernels fill i8 strided run).- baracuda_
kernels_ ⚠fill_ i16_ can_ implement baracuda_kernels_fill_i16_can_implement(baracuda kernels fill i16 can implement).- baracuda_
kernels_ ⚠fill_ i16_ run - Fill
y with value, i16 dtype. - baracuda_
kernels_ ⚠fill_ i16_ strided_ can_ implement baracuda_kernels_fill_i16_strided_can_implement(baracuda kernels fill i16 strided can implement).- baracuda_
kernels_ ⚠fill_ i16_ strided_ run baracuda_kernels_fill_i16_strided_run(baracuda kernels fill i16 strided run).- baracuda_
kernels_ ⚠fill_ i32_ can_ implement - Implementability check for
fill_i32. Host-side only. - baracuda_
kernels_ ⚠fill_ i32_ run - Fill
y with value, i32 dtype. - baracuda_
kernels_ ⚠fill_ i32_ strided_ can_ implement baracuda_kernels_fill_i32_strided_can_implement(baracuda kernels fill i32 strided can implement).- baracuda_
kernels_ ⚠fill_ i32_ strided_ run baracuda_kernels_fill_i32_strided_run(baracuda kernels fill i32 strided run).- baracuda_
kernels_ ⚠fill_ i64_ can_ implement - Implementability check for
fill_i64. Host-side only. - baracuda_
kernels_ ⚠fill_ i64_ run - Fill
y with value, i64 dtype. - baracuda_
kernels_ ⚠fill_ i64_ strided_ can_ implement baracuda_kernels_fill_i64_strided_can_implement(baracuda kernels fill i64 strided can implement).- baracuda_
kernels_ ⚠fill_ i64_ strided_ run baracuda_kernels_fill_i64_strided_run(baracuda kernels fill i64 strided run).- baracuda_
kernels_ ⚠fill_ u8_ can_ implement - Implementability check for
fill_u8. Host-side only. - baracuda_
kernels_ ⚠fill_ u8_ run - Fill
y with value, u8 dtype. - baracuda_
kernels_ ⚠fill_ u8_ strided_ can_ implement baracuda_kernels_fill_u8_strided_can_implement(baracuda kernels fill u8 strided can implement).- baracuda_
kernels_ ⚠fill_ u8_ strided_ run baracuda_kernels_fill_u8_strided_run(baracuda kernels fill u8 strided run).- baracuda_
kernels_ ⚠fill_ u32_ can_ implement baracuda_kernels_fill_u32_can_implement(baracuda kernels fill u32 can implement).- baracuda_
kernels_ ⚠fill_ u32_ run - Fill
y with value, u32 dtype. - baracuda_
kernels_ ⚠fill_ u32_ strided_ can_ implement baracuda_kernels_fill_u32_strided_can_implement(baracuda kernels fill u32 strided can implement).- baracuda_
kernels_ ⚠fill_ u32_ strided_ run baracuda_kernels_fill_u32_strided_run(baracuda kernels fill u32 strided run).- baracuda_
kernels_ ⚠flash_ decoding_ bf16_ can_ implement - Implementability check for
flash_decoding_bf16. Host-side only. - baracuda_
kernels_ ⚠flash_ decoding_ bf16_ run - FlashDecoding FW, bf16 (f32 accumulators).
- baracuda_
kernels_ ⚠flash_ decoding_ bf16_ workspace_ bytes - Workspace requirement for
flash_decoding_bf16 in bytes. - baracuda_
kernels_ ⚠flash_ decoding_ f16_ can_ implement - Implementability check for
flash_decoding_f16. Host-side only. - baracuda_
kernels_ ⚠flash_ decoding_ f16_ run - FlashDecoding FW, f16 (f32 accumulators). seq_q = 1; split-K over chunks of 256 K-rows each, combined via a second kernel.
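The split-K scheme named in the entry above (seq_q = 1, K/V rows processed in chunks, partial results merged in a second pass) can be sketched on the host as follows. This is a CPU reference under stated assumptions, not the kernel's code: the names Partial, chunk_partial, combine, and decode_split_k are hypothetical, the chunk size is a parameter rather than the fixed 256 rows, and everything is kept in f32.

```rust
/// Partial result for one chunk of K/V rows (hypothetical helper type).
struct Partial {
    max: f32,      // largest scaled score seen in the chunk
    sum: f32,      // sum of exp(score - max) over the chunk
    acc: Vec<f32>, // unnormalised weighted sum of V rows
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// First pass: reduce one non-empty chunk of keys/values against the single query.
fn chunk_partial(q: &[f32], k: &[Vec<f32>], v: &[Vec<f32>], scale: f32) -> Partial {
    let scores: Vec<f32> = k.iter().map(|row| dot(q, row) * scale).collect();
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0f32;
    let mut acc = vec![0.0f32; v[0].len()];
    for (s, row) in scores.iter().zip(v) {
        let w = (s - max).exp();
        sum += w;
        for (a, x) in acc.iter_mut().zip(row) {
            *a += w * x;
        }
    }
    Partial { max, sum, acc }
}

/// Second pass: rescale every partial to the global max, then normalise.
fn combine(parts: &[Partial]) -> Vec<f32> {
    let gmax = parts.iter().map(|p| p.max).fold(f32::NEG_INFINITY, f32::max);
    let mut total = 0.0f32;
    let mut out = vec![0.0f32; parts[0].acc.len()];
    for p in parts {
        let r = (p.max - gmax).exp();
        total += p.sum * r;
        for (o, a) in out.iter_mut().zip(&p.acc) {
            *o += a * r;
        }
    }
    out.iter().map(|o| o / total).collect()
}

/// Single-query attention; requires a non-empty `k`/`v` and `chunk > 0`.
pub fn decode_split_k(
    q: &[f32],
    k: &[Vec<f32>],
    v: &[Vec<f32>],
    scale: f32,
    chunk: usize,
) -> Vec<f32> {
    let parts: Vec<Partial> = k
        .chunks(chunk)
        .zip(v.chunks(chunk))
        .map(|(kc, vc)| chunk_partial(q, kc, vc, scale))
        .collect();
    combine(&parts)
}
```

Because each partial carries its own max and sum, the merged result should match a single-pass softmax up to rounding, whatever the chunk size.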
- baracuda_
kernels_ ⚠flash_ decoding_ f16_ workspace_ bytes - Workspace requirement for
flash_decoding_f16 in bytes. - baracuda_
kernels_ ⚠flash_ sdpa_ backward_ bf16_ can_ implement - Implementability check for
flash_sdpa_backward_bf16. Host-side only. - baracuda_
kernels_ ⚠flash_ sdpa_ backward_ bf16_ run - Flash SDPA BW, bf16.
- baracuda_
kernels_ ⚠flash_ sdpa_ backward_ f16_ can_ implement - Implementability check for
flash_sdpa_backward_f16. Host-side only. - baracuda_
kernels_ ⚠flash_ sdpa_ backward_ f16_ run - Flash SDPA BW, f16.
- baracuda_
kernels_ ⚠flash_ sdpa_ backward_ f32_ can_ implement - Implementability check for
flash_sdpa_backward_f32. Host-side only. - baracuda_
kernels_ ⚠flash_ sdpa_ backward_ f32_ run - Flash SDPA BW, f32. Given the FW-saved
y, lse, plus upstream dy, computes dQ, dK, dV. The d_ws argument is a caller-allocated [B, H, Q] scratch buffer (overwritten with the per-row D = rowsum(y ⊙ dy) intermediate; element type matches T). - baracuda_
kernels_ ⚠flash_ sdpa_ backward_ f64_ can_ implement - Implementability check for
flash_sdpa_backward_f64. Host-side only. - baracuda_
kernels_ ⚠flash_ sdpa_ backward_ f64_ run - Flash SDPA BW, f64.
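The per-row intermediate D = rowsum(y ⊙ dy) that the backward entries write into d_ws can be illustrated with a small host-side reference. The function name rowsum_y_dy is hypothetical, and the sketch covers a single [Q, D_v] slice rather than the full [B, H, Q] buffer.

```rust
/// For each query row, sum the elementwise product of the forward output `y`
/// and the upstream gradient `dy`. Both inputs are [Q][D_v]; the result is [Q].
pub fn rowsum_y_dy(y: &[Vec<f32>], dy: &[Vec<f32>]) -> Vec<f32> {
    assert_eq!(y.len(), dy.len(), "y and dy must have the same row count");
    y.iter()
        .zip(dy)
        .map(|(y_row, dy_row)| {
            assert_eq!(y_row.len(), dy_row.len(), "row widths must match");
            y_row.iter().zip(dy_row).map(|(a, b)| a * b).sum::<f32>()
        })
        .collect()
}
```

A caller sizing d_ws would therefore need one element per (batch, head, query row), which is consistent with the [B, H, Q] shape stated in the entry.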
- baracuda_
kernels_ ⚠flash_ sdpa_ bf16_ can_ implement - Implementability check for
flash_sdpa_bf16. Host-side only. - baracuda_
kernels_ ⚠flash_ sdpa_ bf16_ run - Flash SDPA FW, bf16 (f32 accumulators).
- baracuda_
kernels_ ⚠flash_ sdpa_ f16_ can_ implement - Implementability check for
flash_sdpa_f16. Host-side only. - baracuda_
kernels_ ⚠flash_ sdpa_ f16_ run - Flash SDPA FW, f16 (f32 accumulators).
- baracuda_
kernels_ ⚠flash_ sdpa_ f32_ can_ implement - Implementability check for
flash_sdpa_f32. Host-side only. - baracuda_
kernels_ ⚠flash_ sdpa_ f32_ run - Flash SDPA FW, f32. Computes
y = softmax(Q·K^T·scale) · V via tiled fused online softmax. Optional upper-triangular causal mask (is_causal = 1); explicit additive mask is not supported in the trailblazer. Writes y: [B, H, Q, D_v] and the saved lse: [B, H, Q] log-sum-exp tensor that BW consumes. - baracuda_
kernels_ ⚠flash_ sdpa_ f64_ can_ implement - Implementability check for
flash_sdpa_f64. Host-side only. - baracuda_
kernels_ ⚠flash_ sdpa_ f64_ run - Flash SDPA FW, f64.
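The forward entries describe y = softmax(Q·K^T·scale) · V computed with a tiled online softmax that also saves lse. A minimal host-side sketch of that recurrence for one (batch, head) slice is shown below. The function name sdpa_online and the tile parameter are hypothetical, and the causal convention used here (query i sees keys 0..=i) is an assumption, since the entry only says the mask is upper-triangular.

```rust
/// Reference attention for one slice: q is [Q][D], k is [K][D], v is [K][D_v].
/// Returns (y, lse) with y: [Q][D_v] and lse: [Q].
/// Requires non-empty k/v and tile > 0.
pub fn sdpa_online(
    q: &[Vec<f32>],
    k: &[Vec<f32>],
    v: &[Vec<f32>],
    scale: f32,
    is_causal: bool,
    tile: usize,
) -> (Vec<Vec<f32>>, Vec<f32>) {
    let d_v = v[0].len();
    let mut y = Vec::with_capacity(q.len());
    let mut lse = Vec::with_capacity(q.len());
    for (i, qi) in q.iter().enumerate() {
        let mut m = f32::NEG_INFINITY; // running max of the scaled scores
        let mut l = 0.0f32; // running sum of exp(score - m)
        let mut acc = vec![0.0f32; d_v]; // running unnormalised output
        // Assumed causal convention: query i attends to keys 0..=i.
        let visible = if is_causal { (i + 1).min(k.len()) } else { k.len() };
        for start in (0..visible).step_by(tile) {
            let end = (start + tile).min(visible);
            let scores: Vec<f32> = (start..end)
                .map(|j| scale * qi.iter().zip(&k[j]).map(|(a, b)| a * b).sum::<f32>())
                .collect();
            let tile_max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let new_m = m.max(tile_max);
            // Rescale what was accumulated under the old max (0 on the first tile).
            let rescale = (m - new_m).exp();
            l *= rescale;
            for a in acc.iter_mut() {
                *a *= rescale;
            }
            for (s, j) in scores.iter().zip(start..end) {
                let w = (s - new_m).exp();
                l += w;
                for (a, x) in acc.iter_mut().zip(&v[j]) {
                    *a += w * x;
                }
            }
            m = new_m;
        }
        y.push(acc.iter().map(|a| a / l).collect());
        lse.push(m + l.ln());
    }
    (y, lse)
}
```

Saving lse = m + ln(l) per row is what lets a backward pass rebuild the softmax probabilities as exp(score - lse) without storing the full [Q, K] matrix.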
- baracuda_
kernels_ ⚠flip_ bf16_ can_ implement - Pre-launch implementability check for
flip_bf16. - baracuda_
kernels_ ⚠flip_ bf16_ run - Flip, bf16. Pure element copy — no math.
- baracuda_
kernels_ ⚠flip_ bf16_ strided_ can_ implement - flip_bf16_strided_can_implement companion. - baracuda_
kernels_ ⚠flip_ bf16_ strided_ run - Flip strided sibling, bf16.
- baracuda_
kernels_ ⚠flip_ f16_ can_ implement - Pre-launch implementability check for
flip_f16. - baracuda_
kernels_ ⚠flip_ f16_ run - Flip, f16. Pure element copy — no math.
- baracuda_
kernels_ ⚠flip_ f16_ strided_ can_ implement - flip_f16_strided_can_implement companion. - baracuda_
kernels_ ⚠flip_ f16_ strided_ run - Flip strided sibling, f16.
- baracuda_
kernels_ ⚠flip_ f32_ can_ implement - Pre-launch implementability check for
flip_f32. - baracuda_
kernels_ ⚠flip_ f32_ run - Flip (reverse along selected axes), f32.
flip_axes[d] = 1 reverses axis d; 0 leaves it unchanged. - baracuda_
kernels_ ⚠flip_ f32_ strided_ can_ implement - flip_f32_strided_can_implement companion. - baracuda_
kernels_ ⚠flip_ f32_ strided_ run - Flip strided sibling, f32.
- baracuda_
kernels_ ⚠flip_ f64_ can_ implement - Pre-launch implementability check for
flip_f64. - baracuda_
kernels_ ⚠flip_ f64_ run - Flip, f64. Pure element copy — no math.
- baracuda_
kernels_ ⚠flip_ f64_ strided_ can_ implement - flip_f64_strided_can_implement companion. - baracuda_
kernels_ ⚠flip_ f64_ strided_ run - Flip strided sibling, f64.
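The flip entries describe a pure element copy in which each axis flagged in flip_axes is reversed. A host-side reference for the contiguous case might look like the sketch below. The function name flip, the row-major layout, and the u8 element type for flip_axes are assumptions made for illustration; the actual argument types are not shown in this listing.

```rust
/// Reverse the selected axes of a contiguous row-major tensor.
/// `flip_axes[d] != 0` means axis `d` is reversed.
pub fn flip(src: &[f32], shape: &[usize], flip_axes: &[u8]) -> Vec<f32> {
    assert_eq!(shape.len(), flip_axes.len(), "one flag per axis");
    let total: usize = shape.iter().product();
    assert_eq!(src.len(), total, "src must hold exactly prod(shape) elements");
    let mut dst = vec![0.0f32; total];
    for out_idx in 0..total {
        // Decompose the flat output index, innermost axis first, and
        // rebuild the matching source index with flagged axes mirrored.
        let mut rem = out_idx;
        let mut in_idx = 0;
        let mut stride = 1;
        for d in (0..shape.len()).rev() {
            let coord = rem % shape[d];
            rem /= shape[d];
            let src_coord = if flip_axes[d] != 0 {
                shape[d] - 1 - coord
            } else {
                coord
            };
            in_idx += src_coord * stride;
            stride *= shape[d];
        }
        dst[out_idx] = src[in_idx];
    }
    dst
}
```

Flipping the same axes twice should return the original data, which makes a convenient round-trip check when validating a device result against this reference.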
- baracuda_
kernels_ ⚠fractional_ max_ pool_ 2d_ bw_ bf16_ can_ implement baracuda_kernels_fractional_max_pool_2d_bw_bf16_can_implement(baracuda kernels fractional max pool 2d bw bf16 can implement).- baracuda_
kernels_ ⚠fractional_ max_ pool_ 2d_ bw_ bf16_ run - FractionalMaxPool2d BW, bf16.
- baracuda_
kernels_ ⚠fractional_ max_ pool_ 2d_ bw_ f16_ can_ implement baracuda_kernels_fractional_max_pool_2d_bw_f16_can_implement(baracuda kernels fractional max pool 2d bw f16 can implement).- baracuda_
kernels_ ⚠fractional_ max_ pool_ 2d_ bw_ f16_ run - FractionalMaxPool2d BW, f16.
- baracuda_
kernels_ ⚠fractional_ max_ pool_ 2d_ bw_ f32_ can_ implement baracuda_kernels_fractional_max_pool_2d_bw_f32_can_implement(baracuda kernels fractional max pool 2d bw f32 can implement).- baracuda_
kernels_ ⚠fractional_ max_ pool_ 2d_ bw_ f32_ run - FractionalMaxPool2d BW, f32.
- baracuda_
kernels_ ⚠fractional_ max_ pool_ 2d_ bw_ f64_ can_ implement baracuda_kernels_fractional_max_pool_2d_bw_f64_can_implement(baracuda kernels fractional max pool 2d bw f64 can implement).- baracuda_
kernels_ ⚠fractional_ max_ pool_ 2d_ bw_ f64_ run - FractionalMaxPool2d BW, f64.
- baracuda_
kernels_ ⚠fractional_ max_ pool_ 2d_ fw_ bf16_ can_ implement baracuda_kernels_fractional_max_pool_2d_fw_bf16_can_implement(baracuda kernels fractional max pool 2d fw bf16 can implement).- baracuda_
kernels_ ⚠fractional_ max_ pool_ 2d_ fw_ bf16_ run - FractionalMaxPool2d FW, bf16.
- baracuda_
kernels_ ⚠fractional_ max_ pool_ 2d_ fw_ f16_ can_ implement baracuda_kernels_fractional_max_pool_2d_fw_f16_can_implement(baracuda kernels fractional max pool 2d fw f16 can implement).- baracuda_
kernels_ ⚠fractional_ max_ pool_ 2d_ fw_ f16_ run - FractionalMaxPool2d FW, f16.
- baracuda_
kernels_ ⚠fractional_ max_ pool_ 2d_ fw_ f32_ can_ implement baracuda_kernels_fractional_max_pool_2d_fw_f32_can_implement(baracuda kernels fractional max pool 2d fw f32 can implement).- baracuda_
kernels_ ⚠fractional_ max_ pool_ 2d_ fw_ f32_ run - FractionalMaxPool2d FW, f32.
- baracuda_
kernels_ ⚠fractional_ max_ pool_ 2d_ fw_ f64_ can_ implement baracuda_kernels_fractional_max_pool_2d_fw_f64_can_implement(baracuda kernels fractional max pool 2d fw f64 can implement).- baracuda_
kernels_ ⚠fractional_ max_ pool_ 2d_ fw_ f64_ run - FractionalMaxPool2d FW, f64.
- baracuda_
kernels_ ⚠fractional_ max_ pool_ 3d_ bw_ bf16_ can_ implement baracuda_kernels_fractional_max_pool_3d_bw_bf16_can_implement(baracuda kernels fractional max pool 3d bw bf16 can implement).- baracuda_
kernels_ ⚠fractional_ max_ pool_ 3d_ bw_ bf16_ run - FractionalMaxPool3d BW, bf16.
- baracuda_
kernels_ ⚠fractional_ max_ pool_ 3d_ bw_ f16_ can_ implement baracuda_kernels_fractional_max_pool_3d_bw_f16_can_implement(baracuda kernels fractional max pool 3d bw f16 can implement).- baracuda_
kernels_ ⚠fractional_ max_ pool_ 3d_ bw_ f16_ run - FractionalMaxPool3d BW, f16.
- baracuda_
kernels_ ⚠fractional_ max_ pool_ 3d_ bw_ f32_ can_ implement baracuda_kernels_fractional_max_pool_3d_bw_f32_can_implement(baracuda kernels fractional max pool 3d bw f32 can implement).- baracuda_
kernels_ ⚠fractional_ max_ pool_ 3d_ bw_ f32_ run - FractionalMaxPool3d BW, f32.
- baracuda_
kernels_ ⚠fractional_ max_ pool_ 3d_ bw_ f64_ can_ implement baracuda_kernels_fractional_max_pool_3d_bw_f64_can_implement(baracuda kernels fractional max pool 3d bw f64 can implement).- baracuda_
kernels_ ⚠fractional_ max_ pool_ 3d_ bw_ f64_ run - FractionalMaxPool3d BW, f64.
- baracuda_
kernels_ ⚠fractional_ max_ pool_ 3d_ fw_ bf16_ can_ implement baracuda_kernels_fractional_max_pool_3d_fw_bf16_can_implement(baracuda kernels fractional max pool 3d fw bf16 can implement).- baracuda_
kernels_ ⚠fractional_ max_ pool_ 3d_ fw_ bf16_ run - FractionalMaxPool3d FW, bf16.
- baracuda_
kernels_ ⚠fractional_ max_ pool_ 3d_ fw_ f16_ can_ implement baracuda_kernels_fractional_max_pool_3d_fw_f16_can_implement(baracuda kernels fractional max pool 3d fw f16 can implement).- baracuda_
kernels_ ⚠fractional_ max_ pool_ 3d_ fw_ f16_ run - FractionalMaxPool3d FW, f16.
- baracuda_
kernels_ ⚠fractional_ max_ pool_ 3d_ fw_ f32_ can_ implement baracuda_kernels_fractional_max_pool_3d_fw_f32_can_implement(baracuda kernels fractional max pool 3d fw f32 can implement).- baracuda_
kernels_ ⚠fractional_ max_ pool_ 3d_ fw_ f32_ run - FractionalMaxPool3d FW, f32.
- baracuda_
kernels_ ⚠fractional_ max_ pool_ 3d_ fw_ f64_ can_ implement baracuda_kernels_fractional_max_pool_3d_fw_f64_can_implement(baracuda kernels fractional max pool 3d fw f64 can implement).- baracuda_
kernels_ ⚠fractional_ max_ pool_ 3d_ fw_ f64_ run - FractionalMaxPool3d FW, f64.
- baracuda_
kernels_ ⚠gated_ geglu_ backward_ bf16_ can_ implement baracuda_kernels_gated_geglu_backward_bf16_can_implement(baracuda kernels gated geglu backward bf16 can implement).- baracuda_
kernels_ ⚠gated_ geglu_ backward_ bf16_ run - GeGLU backward, bf16.
- baracuda_
kernels_ ⚠gated_ geglu_ backward_ f16_ can_ implement baracuda_kernels_gated_geglu_backward_f16_can_implement(baracuda kernels gated geglu backward f16 can implement).- baracuda_
kernels_ ⚠gated_ geglu_ backward_ f16_ run - GeGLU backward, f16.
- baracuda_
kernels_ ⚠gated_ geglu_ backward_ f32_ can_ implement baracuda_kernels_gated_geglu_backward_f32_can_implement(baracuda kernels gated geglu backward f32 can implement).- baracuda_
kernels_ ⚠gated_ geglu_ backward_ f32_ run - GeGLU backward, f32.
da = dy·gelu(b), db = dy·a·gelu'(b). - baracuda_
kernels_ ⚠gated_ geglu_ backward_ f64_ can_ implement baracuda_kernels_gated_geglu_backward_f64_can_implement(baracuda kernels gated geglu backward f64 can implement).- baracuda_
kernels_ ⚠gated_ geglu_ backward_ f64_ run - GeGLU backward, f64.
- baracuda_
kernels_ ⚠gated_ geglu_ bf16_ can_ implement baracuda_kernels_gated_geglu_bf16_can_implement(baracuda kernels gated geglu bf16 can implement).- baracuda_
kernels_ ⚠gated_ geglu_ bf16_ run - GeGLU forward, bf16.
- baracuda_
kernels_ ⚠gated_ geglu_ f16_ can_ implement baracuda_kernels_gated_geglu_f16_can_implement(baracuda kernels gated geglu f16 can implement).- baracuda_
kernels_ ⚠gated_ geglu_ f16_ run - GeGLU forward, f16.
- baracuda_
kernels_ ⚠gated_ geglu_ f32_ can_ implement baracuda_kernels_gated_geglu_f32_can_implement(baracuda kernels gated geglu f32 can implement).- baracuda_
kernels_ ⚠gated_ geglu_ f32_ run - GeGLU forward, f32.
y = a · gelu(b), exact erf-based. - baracuda_
kernels_ ⚠gated_ geglu_ f64_ can_ implement baracuda_kernels_gated_geglu_f64_can_implement(baracuda kernels gated geglu f64 can implement).- baracuda_
kernels_ ⚠gated_ geglu_ f64_ run - GeGLU forward, f64.
- baracuda_
kernels_ ⚠gated_ glu_ backward_ bf16_ can_ implement baracuda_kernels_gated_glu_backward_bf16_can_implement(baracuda kernels gated glu backward bf16 can implement).- baracuda_
kernels_ ⚠gated_ glu_ backward_ bf16_ run - GLU backward, bf16.
- baracuda_
kernels_ ⚠gated_ glu_ backward_ f16_ can_ implement baracuda_kernels_gated_glu_backward_f16_can_implement(baracuda kernels gated glu backward f16 can implement).- baracuda_
kernels_ ⚠gated_ glu_ backward_ f16_ run - GLU backward, f16.
- baracuda_
kernels_ ⚠gated_ glu_ backward_ f32_ can_ implement baracuda_kernels_gated_glu_backward_f32_can_implement(baracuda kernels gated glu backward f32 can implement).- baracuda_
kernels_ ⚠gated_ glu_ backward_ f32_ run - GLU backward, f32.
da = dy·sigmoid(b), db = dy·a·sigmoid(b)·(1-sigmoid(b)). - baracuda_
kernels_ ⚠gated_ glu_ backward_ f64_ can_ implement baracuda_kernels_gated_glu_backward_f64_can_implement(baracuda kernels gated glu backward f64 can implement).- baracuda_
kernels_ ⚠gated_ glu_ backward_ f64_ run - GLU backward, f64.
- baracuda_
kernels_ ⚠gated_ glu_ bf16_ can_ implement baracuda_kernels_gated_glu_bf16_can_implement(baracuda kernels gated glu bf16 can implement).- baracuda_
kernels_ ⚠gated_ glu_ bf16_ run - GLU forward, bf16.
- baracuda_
kernels_ ⚠gated_ glu_ f16_ can_ implement baracuda_kernels_gated_glu_f16_can_implement(baracuda kernels gated glu f16 can implement).- baracuda_
kernels_ ⚠gated_ glu_ f16_ run - GLU forward, f16.
- baracuda_
kernels_ ⚠gated_ glu_ f32_ can_ implement baracuda_kernels_gated_glu_f32_can_implement(baracuda kernels gated glu f32 can implement).- baracuda_
kernels_ ⚠gated_ glu_ f32_ run - GLU forward, f32.
y = a · sigmoid(b). - baracuda_
kernels_ ⚠gated_ glu_ f64_ can_ implement baracuda_kernels_gated_glu_f64_can_implement(baracuda kernels gated glu f64 can implement).- baracuda_
kernels_ ⚠gated_ glu_ f64_ run - GLU forward, f64.
- baracuda_
kernels_ ⚠gated_ reglu_ backward_ bf16_ can_ implement baracuda_kernels_gated_reglu_backward_bf16_can_implement(baracuda kernels gated reglu backward bf16 can implement).- baracuda_
kernels_ ⚠gated_ reglu_ backward_ bf16_ run - ReGLU backward, bf16.
- baracuda_
kernels_ ⚠gated_ reglu_ backward_ f16_ can_ implement baracuda_kernels_gated_reglu_backward_f16_can_implement(baracuda kernels gated reglu backward f16 can implement).- baracuda_
kernels_ ⚠gated_ reglu_ backward_ f16_ run - ReGLU backward, f16.
- baracuda_
kernels_ ⚠gated_ reglu_ backward_ f32_ can_ implement baracuda_kernels_gated_reglu_backward_f32_can_implement(baracuda kernels gated reglu backward f32 can implement).- baracuda_
kernels_ ⚠gated_ reglu_ backward_ f32_ run - ReGLU backward, f32.
da = (b > 0) ? dy·b : 0, db = (b > 0) ? dy·a : 0. - baracuda_
kernels_ ⚠gated_ reglu_ backward_ f64_ can_ implement baracuda_kernels_gated_reglu_backward_f64_can_implement(baracuda kernels gated reglu backward f64 can implement).- baracuda_
kernels_ ⚠gated_ reglu_ backward_ f64_ run - ReGLU backward, f64.
- baracuda_
kernels_ ⚠gated_ reglu_ bf16_ can_ implement baracuda_kernels_gated_reglu_bf16_can_implement(baracuda kernels gated reglu bf16 can implement).- baracuda_
kernels_ ⚠gated_ reglu_ bf16_ run - ReGLU forward, bf16.
- baracuda_
kernels_ ⚠gated_ reglu_ f16_ can_ implement baracuda_kernels_gated_reglu_f16_can_implement(baracuda kernels gated reglu f16 can implement).- baracuda_
kernels_ ⚠gated_ reglu_ f16_ run - ReGLU forward, f16.
- baracuda_
kernels_ ⚠gated_ reglu_ f32_ can_ implement baracuda_kernels_gated_reglu_f32_can_implement(baracuda kernels gated reglu f32 can implement).- baracuda_
kernels_ ⚠gated_ reglu_ f32_ run - ReGLU forward, f32.
y = a · relu(b) = a · max(b, 0). - baracuda_
kernels_ ⚠gated_ reglu_ f64_ can_ implement baracuda_kernels_gated_reglu_f64_can_implement(baracuda kernels gated reglu f64 can implement).- baracuda_
kernels_ ⚠gated_ reglu_ f64_ run - ReGLU forward, f64.
- baracuda_
kernels_ ⚠gated_ swiglu_ backward_ bf16_ can_ implement baracuda_kernels_gated_swiglu_backward_bf16_can_implement(baracuda kernels gated swiglu backward bf16 can implement).- baracuda_
kernels_ ⚠gated_ swiglu_ backward_ bf16_ run - SwiGLU backward, bf16.
- baracuda_
kernels_ ⚠gated_ swiglu_ backward_ f16_ can_ implement baracuda_kernels_gated_swiglu_backward_f16_can_implement(baracuda kernels gated swiglu backward f16 can implement).- baracuda_
kernels_ ⚠gated_ swiglu_ backward_ f16_ run - SwiGLU backward, f16.
- baracuda_
kernels_ ⚠gated_ swiglu_ backward_ f32_ can_ implement baracuda_kernels_gated_swiglu_backward_f32_can_implement(baracuda kernels gated swiglu backward f32 can implement).- baracuda_
kernels_ ⚠gated_ swiglu_ backward_ f32_ run - SwiGLU backward, f32.
da = dy·silu(b), db = dy·a·silu'(b). - baracuda_
kernels_ ⚠gated_ swiglu_ backward_ f64_ can_ implement baracuda_kernels_gated_swiglu_backward_f64_can_implement(baracuda kernels gated swiglu backward f64 can implement).- baracuda_
kernels_ ⚠gated_ swiglu_ backward_ f64_ run - SwiGLU backward, f64.
- baracuda_
kernels_ ⚠gated_ swiglu_ bf16_ can_ implement baracuda_kernels_gated_swiglu_bf16_can_implement(baracuda kernels gated swiglu bf16 can implement).- baracuda_
kernels_ ⚠gated_ swiglu_ bf16_ run - SwiGLU forward, bf16.
- baracuda_
kernels_ ⚠gated_ swiglu_ f16_ can_ implement baracuda_kernels_gated_swiglu_f16_can_implement(baracuda kernels gated swiglu f16 can implement).- baracuda_
kernels_ ⚠gated_ swiglu_ f16_ run - SwiGLU forward, f16.
- baracuda_
kernels_ ⚠gated_ swiglu_ f32_ can_ implement baracuda_kernels_gated_swiglu_f32_can_implement(baracuda kernels gated swiglu f32 can implement).- baracuda_
kernels_ ⚠gated_ swiglu_ f32_ run - SwiGLU forward, f32.
y = a · b · sigmoid(b). - baracuda_
kernels_ ⚠gated_ swiglu_ f64_ can_ implement baracuda_kernels_gated_swiglu_f64_can_implement(baracuda kernels gated swiglu f64 can implement).- baracuda_
kernels_ ⚠gated_ swiglu_ f64_ run - SwiGLU forward, f64.
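The gated entries above give closed-form forward and backward expressions for GLU, ReGLU, and SwiGLU. The scalar sketch below restates them in one place so they can be checked against finite differences. The Gate enum and the function names are hypothetical; silu'(b) is expanded here as s·(1 + b·(1 - s)) with s = sigmoid(b), which follows from silu(b) = b·sigmoid(b). GeGLU is left out because the Rust standard library has no stable erf.

```rust
fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

/// Which gate activation to apply (hypothetical selector, not a crate type).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Gate {
    Glu,
    Reglu,
    Swiglu,
}

/// Activation applied to the gate input `b`.
fn act(g: Gate, b: f64) -> f64 {
    match g {
        Gate::Glu => sigmoid(b),
        Gate::Reglu => b.max(0.0),
        Gate::Swiglu => b * sigmoid(b),
    }
}

/// Derivative of `act` with respect to `b`.
fn act_grad(g: Gate, b: f64) -> f64 {
    let s = sigmoid(b);
    match g {
        Gate::Glu => s * (1.0 - s),
        Gate::Reglu => {
            if b > 0.0 {
                1.0
            } else {
                0.0
            }
        }
        Gate::Swiglu => s * (1.0 + b * (1.0 - s)),
    }
}

/// y = a * act(b)
pub fn forward(g: Gate, a: f64, b: f64) -> f64 {
    a * act(g, b)
}

/// Returns (da, db) = (dy * act(b), dy * a * act'(b)).
pub fn backward(g: Gate, a: f64, b: f64, dy: f64) -> (f64, f64) {
    (dy * act(g, b), dy * a * act_grad(g, b))
}
```

A reference like this is mainly useful for validating the low-precision variants: evaluate in f64 on the host, then compare against the device result with a tolerance suited to f16 or bf16.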
- baracuda_
kernels_ ⚠gather_ backward_ f32_ can_ implement - Implementability check for
gather_backward_f32. - baracuda_
kernels_ ⚠gather_ backward_ f32_ run - dsrc[..., index[..., j, ...], ...] += dout[..., j, ...] along gather_dim. f32 (atomicAdd). - baracuda_
kernels_ ⚠gather_ backward_ f64_ can_ implement - Implementability check for
gather_backward_f64. - baracuda_
kernels_ ⚠gather_ backward_ f64_ run - gather_backward, f64 (atomicAdd). - baracuda_
kernels_ ⚠gather_ backward_ i64idx_ f32_ can_ implement - Implementability check for
gather_backward_i64idx_f32. - baracuda_
kernels_ ⚠gather_ backward_ i64idx_ f32_ run - gather BW, f32, i64 indices (atomicAdd). - baracuda_
kernels_ ⚠gather_ backward_ i64idx_ f64_ can_ implement - Implementability check for
gather_backward_i64idx_f64. - baracuda_
kernels_ ⚠gather_ backward_ i64idx_ f64_ run - gather BW, f64, i64 indices (atomicAdd). - baracuda_
kernels_ ⚠gather_ f32_ can_ implement - Implementability check for
gather_f32. - baracuda_
kernels_ ⚠gather_ f32_ run - out[..., j, ...] = src[..., index[..., j, ...], ...] along gather_dim. f32. - baracuda_
kernels_ ⚠gather_ f64_ can_ implement - Implementability check for
gather_f64. - baracuda_
kernels_ ⚠gather_ f64_ run - gather along gather_dim. f64. - baracuda_
kernels_ ⚠gather_ i8_ can_ implement baracuda_kernels_gather_i8_can_implement(baracuda kernels gather i8 can implement).- baracuda_
kernels_ ⚠gather_ i8_ run baracuda_kernels_gather_i8_run(baracuda kernels gather i8 run).- baracuda_
kernels_ ⚠gather_ i16_ can_ implement baracuda_kernels_gather_i16_can_implement(baracuda kernels gather i16 can implement).- baracuda_
kernels_ ⚠gather_ i16_ run baracuda_kernels_gather_i16_run(baracuda kernels gather i16 run).- baracuda_
kernels_ ⚠gather_ i32_ can_ implement - Implementability check for
gather_i32. - baracuda_
kernels_ ⚠gather_ i32_ run - gather along gather_dim. i32. - baracuda_
kernels_ ⚠gather_ i64_ can_ implement baracuda_kernels_gather_i64_can_implement(baracuda kernels gather i64 can implement).- baracuda_
kernels_ ⚠gather_ i64_ run baracuda_kernels_gather_i64_run(baracuda kernels gather i64 run).- baracuda_
kernels_ ⚠gather_ i64idx_ f32_ can_ implement - Implementability check for
gather_i64idx_f32. - baracuda_
kernels_ ⚠gather_ i64idx_ f32_ run - gather FW, f32, i64 indices. - baracuda_
kernels_ ⚠gather_ i64idx_ f64_ can_ implement - Implementability check for
gather_i64idx_f64. - baracuda_
kernels_ ⚠gather_ i64idx_ f64_ run - gather FW, f64, i64 indices. - baracuda_
kernels_ ⚠gather_ i64idx_ i8_ can_ implement baracuda_kernels_gather_i64idx_i8_can_implement(baracuda kernels gather i64idx i8 can implement).- baracuda_
kernels_ ⚠gather_ i64idx_ i8_ run baracuda_kernels_gather_i64idx_i8_run(baracuda kernels gather i64idx i8 run).- baracuda_
kernels_ ⚠gather_ i64idx_ i16_ can_ implement baracuda_kernels_gather_i64idx_i16_can_implement(baracuda kernels gather i64idx i16 can implement).- baracuda_
kernels_ ⚠gather_ i64idx_ i16_ run baracuda_kernels_gather_i64idx_i16_run(baracuda kernels gather i64idx i16 run).- baracuda_
kernels_ ⚠gather_ i64idx_ i32_ can_ implement - Implementability check for
gather_i64idx_i32. - baracuda_
kernels_ ⚠gather_ i64idx_ i32_ run - gather FW, i32 values, i64 indices. - baracuda_
kernels_ ⚠gather_ i64idx_ i64_ can_ implement baracuda_kernels_gather_i64idx_i64_can_implement(baracuda kernels gather i64idx i64 can implement).- baracuda_
kernels_ ⚠gather_ i64idx_ i64_ run baracuda_kernels_gather_i64idx_i64_run(baracuda kernels gather i64idx i64 run).- baracuda_
kernels_ ⚠gather_ i64idx_ u8_ can_ implement baracuda_kernels_gather_i64idx_u8_can_implement(baracuda kernels gather i64idx u8 can implement).- baracuda_
kernels_ ⚠gather_ i64idx_ u8_ run baracuda_kernels_gather_i64idx_u8_run(baracuda kernels gather i64idx u8 run).- baracuda_
kernels_ ⚠gather_ i64idx_ u16_ can_ implement baracuda_kernels_gather_i64idx_u16_can_implement(baracuda kernels gather i64idx u16 can implement).- baracuda_
kernels_ ⚠gather_ i64idx_ u16_ run baracuda_kernels_gather_i64idx_u16_run(baracuda kernels gather i64idx u16 run).- baracuda_
kernels_ ⚠gather_ i64idx_ u32_ can_ implement baracuda_kernels_gather_i64idx_u32_can_implement(baracuda kernels gather i64idx u32 can implement).- baracuda_
kernels_ ⚠gather_ i64idx_ u32_ run baracuda_kernels_gather_i64idx_u32_run(baracuda kernels gather i64idx u32 run).- baracuda_
kernels_ ⚠gather_ u8_ can_ implement baracuda_kernels_gather_u8_can_implement(baracuda kernels gather u8 can implement).- baracuda_
kernels_ ⚠gather_ u8_ run baracuda_kernels_gather_u8_run(baracuda kernels gather u8 run).- baracuda_
kernels_ ⚠gather_ u8idx_ f32_ can_ implement - Implementability check for
gather_u8idx_f32. - baracuda_
kernels_ ⚠gather_ u8idx_ f32_ run - gather FW, f32, u8 indices. - baracuda_
kernels_ ⚠gather_ u8idx_ f64_ can_ implement - Implementability check for
gather_u8idx_f64. - baracuda_
kernels_ ⚠gather_ u8idx_ f64_ run - gather FW, f64, u8 indices. - baracuda_
kernels_ ⚠gather_ u16_ can_ implement baracuda_kernels_gather_u16_can_implement(baracuda kernels gather u16 can implement).- baracuda_
kernels_ ⚠gather_ u16_ run baracuda_kernels_gather_u16_run(baracuda kernels gather u16 run).- baracuda_
kernels_ ⚠gather_ u32_ can_ implement baracuda_kernels_gather_u32_can_implement(baracuda kernels gather u32 can implement).- baracuda_
kernels_ ⚠gather_ u32_ run baracuda_kernels_gather_u32_run(baracuda kernels gather u32 run).- baracuda_
kernels_ ⚠gemm_ batched_ bf16_ rcr_ sm80_ can_ implement - Pre-launch implementability check (batched).
- baracuda_
kernels_ ⚠gemm_ batched_ bf16_ rcr_ sm80_ run - Launch strided-batched Cutlass GEMM. Batch
i operates on A + i * stride_a, B + i * stride_b, etc. (strides in elements, not bytes). - baracuda_
kernels_ ⚠gemm_ batched_ bf16_ rcr_ sm80_ workspace_ size - Workspace bytes required for a
batch_count-deep batched launch. - baracuda_
kernels_ ⚠gemm_ batched_ f16_ rcr_ sm80_ can_ implement - Pre-launch implementability check (batched).
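The strided-batched run entries state that batch i operates on A + i * stride_a and so on, with strides counted in elements rather than bytes. The helpers below sketch that arithmetic on the host. All three function names are hypothetical, and u16 stands in for a 2-byte f16/bf16 element in the usage.

```rust
/// Element offset of batch `i` for a per-batch stride given in elements.
pub fn batch_element_offset(i: usize, stride_elems: usize) -> usize {
    i * stride_elems
}

/// The same offset in bytes, for callers that do raw byte-pointer arithmetic.
pub fn batch_byte_offset<T>(i: usize, stride_elems: usize) -> usize {
    batch_element_offset(i, stride_elems) * std::mem::size_of::<T>()
}

/// Borrow the `len` elements belonging to batch `i` from a packed host buffer.
pub fn batch_slice<T>(buf: &[T], i: usize, stride_elems: usize, len: usize) -> &[T] {
    let start = batch_element_offset(i, stride_elems);
    &buf[start..start + len]
}
```

Passing a byte stride where an element stride is expected would step two times too far per batch for 2-byte types, so it may be worth asserting the unit at the call site.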
- baracuda_
kernels_ ⚠gemm_ batched_ f16_ rcr_ sm80_ run - Launch strided-batched Cutlass GEMM. Batch
i operates on A + i * stride_a, B + i * stride_b, etc. (strides in elements, not bytes). - baracuda_
kernels_ ⚠gemm_ batched_ f16_ rcr_ sm80_ workspace_ size - Workspace bytes required for a
batch_count-deep batched launch. - baracuda_
kernels_ ⚠gemm_ bf16_ rcr_ sm80_ can_ implement - Pre-launch implementability check for this Cutlass GEMM SKU.
- baracuda_
kernels_ ⚠gemm_ bf16_ rcr_ sm80_ run - Launch this Cutlass GEMM SKU on
stream. - baracuda_
kernels_ ⚠gemm_ bf16_ rcr_ sm80_ workspace_ size - Workspace bytes required by this Cutlass GEMM SKU.
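Every *_can_implement and *_run entry in this list returns the i32 status documented at the top of the crate (0 through 5). A minimal sketch of decoding that value on the Rust side is shown below. KernelStatus, decode_status, and check are hypothetical names for illustration, not items exported by this crate.

```rust
/// Hypothetical decoded form of the i32 status returned by
/// `*_can_implement` and `*_run`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KernelStatus {
    Success,           // 0
    MisalignedOperand, // 1
    InvalidProblem,    // 2: e.g. M, N, or K is non-positive
    NotSupported,      // 3: shape not implemented by this kernel
    WorkspaceTooSmall, // 4: also covers a null workspace when one is required
    InternalError,     // 5: typically a launch failure
    /// Any value outside the documented 0..=5 range.
    Unknown(i32),
}

/// Map a raw status code onto the documented meanings.
pub fn decode_status(code: i32) -> KernelStatus {
    match code {
        0 => KernelStatus::Success,
        1 => KernelStatus::MisalignedOperand,
        2 => KernelStatus::InvalidProblem,
        3 => KernelStatus::NotSupported,
        4 => KernelStatus::WorkspaceTooSmall,
        5 => KernelStatus::InternalError,
        other => KernelStatus::Unknown(other),
    }
}

/// Turn a raw status into a `Result` so callers can use `?`.
pub fn check(code: i32) -> Result<(), KernelStatus> {
    match decode_status(code) {
        KernelStatus::Success => Ok(()),
        err => Err(err),
    }
}
```

A caller would typically run the can_implement check first, query the workspace size, allocate, and only then call run, wrapping each returned status in check.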
- baracuda_
kernels_ ⚠gemm_ bf16_ rrr_ sm80_ can_ implement - Pre-launch implementability check for this Cutlass GEMM SKU.
- baracuda_
kernels_ ⚠gemm_ bf16_ rrr_ sm80_ run - Launch this Cutlass GEMM SKU on
stream. - baracuda_
kernels_ ⚠gemm_ bf16_ rrr_ sm80_ workspace_ size - Workspace bytes required by this Cutlass GEMM SKU.
- baracuda_
kernels_ ⚠gemm_ bias_ bf16_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ bf16_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ bf16_ rcr_ sm80_ workspace_ size - Workspace bytes required.
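The bias-fused entries describe bias as an [N] vector broadcast across the rows of D. A host-side reference for that epilogue is sketched below. Several points are assumptions rather than facts from this listing: that rcr means row-major A, column-major B, and row-major D; that alpha = 1 and beta = 0; and the function name gemm_bias_rcr itself.

```rust
/// D[m][n] = sum_k A[m][k] * B[k][n] + bias[n]
/// Assumed layouts: A row-major [M, K], B column-major [K, N], D row-major [M, N].
pub fn gemm_bias_rcr(
    a: &[f32],
    b: &[f32],
    bias: &[f32],
    m: usize,
    n: usize,
    k: usize,
) -> Vec<f32> {
    assert_eq!(a.len(), m * k, "A must hold M*K elements");
    assert_eq!(b.len(), k * n, "B must hold K*N elements");
    assert_eq!(bias.len(), n, "bias must hold N elements");
    let mut d = vec![0.0f32; m * n];
    for row in 0..m {
        for col in 0..n {
            let mut acc = 0.0f32;
            for kk in 0..k {
                // Row-major A: element (row, kk). Column-major B: element (kk, col).
                acc += a[row * k + kk] * b[col * k + kk];
            }
            // The same bias[col] is added to every row of D.
            d[row * n + col] = acc + bias[col];
        }
    }
    d
}
```

With a column-major B, each output element reads two contiguous length-K runs, so the inner loop is a plain dot product.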
- baracuda_
kernels_ ⚠gemm_ bias_ bf16_ rrr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ bf16_ rrr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ bf16_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ f16_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ f16_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ f16_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ f16_ rrr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ f16_ rrr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ f16_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ f32_ simt_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ f32_ simt_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ f32_ simt_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ f32_ simt_ rrr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ f32_ simt_ rrr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ f32_ simt_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ f32bias_ s8_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ f32bias_ s8_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ f32bias_ s8_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ f32bias_ u8_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ f32bias_ u8_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ f32bias_ u8_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ f64_ rcr_ sm80_ can_ implement - Pre-launch implementability check (DGEMM bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ f64_ rcr_ sm80_ run - Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
- baracuda_
kernels_ ⚠gemm_ bias_ f64_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ f64_ rrr_ sm80_ can_ implement - Pre-launch implementability check (DGEMM bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ f64_ rrr_ sm80_ run - Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
- baracuda_
kernels_ ⚠gemm_ bias_ f64_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ bf16_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ bf16_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ gelu_ bf16_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ bf16_ rrr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ bf16_ rrr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ gelu_ bf16_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f16_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f16_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f16_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f16_ rrr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f16_ rrr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f16_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f32_ simt_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f32_ simt_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f32_ simt_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f32_ simt_ rrr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f32_ simt_ rrr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f32_ simt_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f32bias_ s8_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f32bias_ s8_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f32bias_ s8_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f32bias_ u8_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f32bias_ u8_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f32bias_ u8_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f64_ rcr_ sm80_ can_ implement - Pre-launch implementability check (DGEMM bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f64_ rcr_ sm80_ run - Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f64_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f64_ rrr_ sm80_ can_ implement - Pre-launch implementability check (DGEMM bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f64_ rrr_ sm80_ run - Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ f64_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ i32bias_ s8_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ i32bias_ s8_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ gelu_ i32bias_ s8_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ i32bias_ u8_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ i32bias_ u8_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ gelu_ i32bias_ u8_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ tf32_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ tf32_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ gelu_ tf32_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ tf32_ rrr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ gelu_ tf32_ rrr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ gelu_ tf32_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ i32bias_ s8_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ i32bias_ s8_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ i32bias_ s8_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ i32bias_ u8_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ i32bias_ u8_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ i32bias_ u8_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ bf16_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ bf16_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ relu_ bf16_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ bf16_ rrr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ bf16_ rrr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ relu_ bf16_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ f16_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ f16_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ relu_ f16_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ f16_ rrr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ f16_ rrr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ relu_ f16_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ f32_ simt_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ f32_ simt_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ relu_ f32_ simt_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ f32_ simt_ rrr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ f32_ simt_ rrr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ relu_ f32_ simt_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ f32bias_ s8_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ f32bias_ s8_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ relu_ f32bias_ s8_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ f32bias_ u8_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ f32bias_ u8_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ relu_ f32bias_ u8_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ f64_ rcr_ sm80_ can_ implement - Pre-launch implementability check (DGEMM bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ f64_ rcr_ sm80_ run - Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ f64_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ f64_ rrr_ sm80_ can_ implement - Pre-launch implementability check (DGEMM bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ f64_ rrr_ sm80_ run - Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ f64_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ i32bias_ s8_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ i32bias_ s8_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ relu_ i32bias_ s8_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ i32bias_ u8_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ i32bias_ u8_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ relu_ i32bias_ u8_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ tf32_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ tf32_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ relu_ tf32_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ tf32_ rrr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ relu_ tf32_ rrr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ relu_ tf32_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ bf16_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ bf16_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ silu_ bf16_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ bf16_ rrr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ bf16_ rrr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ silu_ bf16_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ f16_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ f16_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ silu_ f16_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ f16_ rrr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ f16_ rrr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ silu_ f16_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ f32_ simt_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ f32_ simt_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ silu_ f32_ simt_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ f32_ simt_ rrr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ f32_ simt_ rrr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ silu_ f32_ simt_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ f32bias_ s8_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ f32bias_ s8_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ silu_ f32bias_ s8_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ f32bias_ u8_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ f32bias_ u8_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ silu_ f32bias_ u8_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ f64_ rcr_ sm80_ can_ implement - Pre-launch implementability check (DGEMM bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ f64_ rcr_ sm80_ run - Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ f64_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ f64_ rrr_ sm80_ can_ implement - Pre-launch implementability check (DGEMM bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ f64_ rrr_ sm80_ run - Launch bias-fused DGEMM. f64 alpha/beta + f64 bias vector.
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ f64_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ i32bias_ s8_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ i32bias_ s8_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ silu_ i32bias_ s8_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ i32bias_ u8_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ i32bias_ u8_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ silu_ i32bias_ u8_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ tf32_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ tf32_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ silu_ tf32_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ tf32_ rrr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ silu_ tf32_ rrr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ silu_ tf32_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ tf32_ rcr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ tf32_ rcr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ tf32_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ bias_ tf32_ rrr_ sm80_ can_ implement - Pre-launch implementability check (bias variant).
- baracuda_
kernels_ ⚠gemm_ bias_ tf32_ rrr_ sm80_ run - Launch bias-fused Cutlass GEMM on
stream. bias is an [N] device vector broadcast across rows of D. - baracuda_
kernels_ ⚠gemm_ bias_ tf32_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ gemm_ dense_ bf16_ can_ implement - Host-side validity check for
baracuda_kernels_gemm_dense_bf16_run. Validates extents, the layout tag, leading-dim minimums, i32-fit of leading dims, and stride_d != 0 at batch > 1. stride_a/stride_b are accepted unconditionally (any value, including 0-broadcast). - baracuda_
kernels_ ⚠gemm_ dense_ bf16_ run - Dense bf16 GEMM (cuBLAS-backed):
D[g] = α · A[g] · B[g] + β · D[g] for g ∈ [0, batch), accumulating in f32. Row-major problem; see the module docs for the layout tag (0 = RRR, 1 = RCR, 2 = CRR), leading-dim minimums, and the batch-stride contract (element strides; stride_a/stride_b may be 0 to broadcast; strides ignored at batch == 1). - baracuda_
kernels_ gemm_ dense_ bf16_ workspace_ size - Workspace query for
baracuda_kernels_gemm_dense_bf16_run. Always 0: cuBLAS allocates its workspace internally per handle. - baracuda_
kernels_ gemm_ dense_ f16_ can_ implement - Host-side validity check for
baracuda_kernels_gemm_dense_f16_run. Validates extents, the layout tag, leading-dim minimums, i32-fit of leading dims, and stride_d != 0 at batch > 1. stride_a/stride_b are accepted unconditionally (any value, including 0-broadcast). - baracuda_
kernels_ ⚠gemm_ dense_ f16_ run - Dense f16 GEMM (cuBLAS-backed):
D[g] = α · A[g] · B[g] + β · D[g] for g ∈ [0, batch), accumulating in f32. Row-major problem; see the module docs for the layout tag (0 = RRR, 1 = RCR, 2 = CRR), leading-dim minimums, and the batch-stride contract (element strides; stride_a/stride_b may be 0 to broadcast; strides ignored at batch == 1). - baracuda_
kernels_ gemm_ dense_ f16_ workspace_ size - Workspace query for
baracuda_kernels_gemm_dense_f16_run. Always 0: cuBLAS allocates its workspace internally per handle. - baracuda_
kernels_ gemm_ dense_ f32_ can_ implement - Host-side validity check for
baracuda_kernels_gemm_dense_f32_run. Validates extents, the layout tag, leading-dim minimums, i32-fit of leading dims, and stride_d != 0 at batch > 1. stride_a/stride_b are accepted unconditionally (any value, including 0-broadcast). - baracuda_
kernels_ ⚠gemm_ dense_ f32_ run - Dense f32 GEMM (cuBLAS-backed):
D[g] = α · A[g] · B[g] + β · D[g] for g ∈ [0, batch), accumulating in IEEE binary32 (default math mode, NOT TF32). Row-major problem; see the module docs for the layout tag (0 = RRR, 1 = RCR, 2 = CRR), leading-dim minimums, and the batch-stride contract (element strides; stride_a/stride_b may be 0 to broadcast; strides ignored at batch == 1). - baracuda_
kernels_ gemm_ dense_ f32_ workspace_ size - Workspace query for
baracuda_kernels_gemm_dense_f32_run. Always 0: cuBLAS allocates its workspace internally per handle. - baracuda_
kernels_ gemm_ dense_ f64_ can_ implement - Host-side validity check for
baracuda_kernels_gemm_dense_f64_run. Validates extents, the layout tag, leading-dim minimums, i32-fit of leading dims, and stride_d != 0 at batch > 1. stride_a/stride_b are accepted unconditionally (any value, including 0-broadcast). - baracuda_
kernels_ ⚠gemm_ dense_ f64_ run - Dense f64 GEMM (cuBLAS-backed):
D[g] = α · A[g] · B[g] + β · D[g] for g ∈ [0, batch), accumulating in f64. Row-major problem; see the module docs for the layout tag (0 = RRR, 1 = RCR, 2 = CRR), leading-dim minimums, and the batch-stride contract (element strides; stride_a/stride_b may be 0 to broadcast; strides ignored at batch == 1). - baracuda_
kernels_ gemm_ dense_ f64_ workspace_ size - Workspace query for
baracuda_kernels_gemm_dense_f64_run. Always 0: cuBLAS allocates its workspace internally per handle. - baracuda_
kernels_ ⚠gemm_ f16_ rcr_ sm80_ can_ implement - Pre-launch implementability check for this Cutlass GEMM SKU.
- baracuda_
kernels_ ⚠gemm_ f16_ rcr_ sm80_ run - Launch this Cutlass GEMM SKU on
stream. - baracuda_
kernels_ ⚠gemm_ f16_ rcr_ sm80_ workspace_ size - Workspace bytes required by this Cutlass GEMM SKU.
- baracuda_
kernels_ ⚠gemm_ f16_ rrr_ sm80_ can_ implement - Pre-launch implementability check for this Cutlass GEMM SKU.
- baracuda_
kernels_ ⚠gemm_ f16_ rrr_ sm80_ run - Launch this Cutlass GEMM SKU on
stream. - baracuda_
kernels_ ⚠gemm_ f16_ rrr_ sm80_ workspace_ size - Workspace bytes required by this Cutlass GEMM SKU.
- baracuda_
kernels_ ⚠gemm_ f32_ simt_ rcr_ sm80_ can_ implement - Pre-launch implementability check for this Cutlass GEMM SKU.
- baracuda_
kernels_ ⚠gemm_ f32_ simt_ rcr_ sm80_ run - Launch this Cutlass GEMM SKU on
stream. - baracuda_
kernels_ ⚠gemm_ f32_ simt_ rcr_ sm80_ workspace_ size - Workspace bytes required by this Cutlass GEMM SKU.
- baracuda_
kernels_ ⚠gemm_ f32_ simt_ rrr_ sm80_ can_ implement - Pre-launch implementability check for this Cutlass GEMM SKU.
- baracuda_
kernels_ ⚠gemm_ f32_ simt_ rrr_ sm80_ run - Launch this Cutlass GEMM SKU on
stream. - baracuda_
kernels_ ⚠gemm_ f32_ simt_ rrr_ sm80_ workspace_ size - Workspace bytes required by this Cutlass GEMM SKU.
- baracuda_
kernels_ ⚠gemm_ f64_ rcr_ sm80_ can_ implement - Pre-launch implementability check.
- baracuda_
kernels_ ⚠gemm_ f64_ rcr_ sm80_ run - Launch DGEMM. f64 alpha/beta.
- baracuda_
kernels_ ⚠gemm_ f64_ rcr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ f64_ rrr_ sm80_ can_ implement - Pre-launch implementability check.
- baracuda_
kernels_ ⚠gemm_ f64_ rrr_ sm80_ run - Launch DGEMM. f64 alpha/beta.
- baracuda_
kernels_ ⚠gemm_ f64_ rrr_ sm80_ workspace_ size - Workspace bytes required.
- baracuda_
kernels_ ⚠gemm_ s8_ rcr_ sm80_ can_ implement - Pre-launch implementability check for this Cutlass GEMM SKU.
- baracuda_
kernels_ ⚠gemm_ s8_ rcr_ sm80_ run - Launch this Cutlass GEMM SKU on
stream. - baracuda_
kernels_ ⚠gemm_ s8_ rcr_ sm80_ workspace_ size - Workspace bytes required by this Cutlass GEMM SKU.
- baracuda_
kernels_ ⚠gemm_ s8_ rrr_ sm80_ bias_ f32_ can_ implement - baracuda_kernels_gemm_s8_rrr_sm80_bias_f32_can_implement (baracuda kernels gemm s8 rrr sm80 bias f32 can implement). - baracuda_
kernels_ ⚠gemm_ s8_ rrr_ sm80_ bias_ f32_ run - baracuda_kernels_gemm_s8_rrr_sm80_bias_f32_run (baracuda kernels gemm s8 rrr sm80 bias f32 run). - baracuda_
kernels_ ⚠gemm_ s8_ rrr_ sm80_ bias_ gelu_ f32_ can_ implement - baracuda_kernels_gemm_s8_rrr_sm80_bias_gelu_f32_can_implement (baracuda kernels gemm s8 rrr sm80 bias gelu f32 can implement). - baracuda_
kernels_ ⚠gemm_ s8_ rrr_ sm80_ bias_ gelu_ f32_ run - baracuda_kernels_gemm_s8_rrr_sm80_bias_gelu_f32_run (baracuda kernels gemm s8 rrr sm80 bias gelu f32 run). - baracuda_
kernels_ ⚠gemm_ s8_ rrr_ sm80_ bias_ gelu_ i32_ can_ implement - baracuda_kernels_gemm_s8_rrr_sm80_bias_gelu_i32_can_implement (baracuda kernels gemm s8 rrr sm80 bias gelu i32 can implement). - baracuda_
kernels_ ⚠gemm_ s8_ rrr_ sm80_ bias_ gelu_ i32_ run - baracuda_kernels_gemm_s8_rrr_sm80_bias_gelu_i32_run (baracuda kernels gemm s8 rrr sm80 bias gelu i32 run). - baracuda_
kernels_ ⚠gemm_ s8_ rrr_ sm80_ bias_ i32_ can_ implement - baracuda_kernels_gemm_s8_rrr_sm80_bias_i32_can_implement (baracuda kernels gemm s8 rrr sm80 bias i32 can implement). - baracuda_
kernels_ ⚠gemm_ s8_ rrr_ sm80_ bias_ i32_ run - baracuda_kernels_gemm_s8_rrr_sm80_bias_i32_run (baracuda kernels gemm s8 rrr sm80 bias i32 run). - baracuda_
kernels_ ⚠gemm_ s8_ rrr_ sm80_ bias_ relu_ f32_ can_ implement - baracuda_kernels_gemm_s8_rrr_sm80_bias_relu_f32_can_implement (baracuda kernels gemm s8 rrr sm80 bias relu f32 can implement). - baracuda_
kernels_ ⚠gemm_ s8_ rrr_ sm80_ bias_ relu_ f32_ run - baracuda_kernels_gemm_s8_rrr_sm80_bias_relu_f32_run (baracuda kernels gemm s8 rrr sm80 bias relu f32 run). - baracuda_
kernels_ ⚠gemm_ s8_ rrr_ sm80_ bias_ relu_ i32_ can_ implement - baracuda_kernels_gemm_s8_rrr_sm80_bias_relu_i32_can_implement (baracuda kernels gemm s8 rrr sm80 bias relu i32 can implement). - baracuda_
kernels_ ⚠gemm_ s8_ rrr_ sm80_ bias_ relu_ i32_ run - baracuda_kernels_gemm_s8_rrr_sm80_bias_relu_i32_run (baracuda kernels gemm s8 rrr sm80 bias relu i32 run). - baracuda_
kernels_ ⚠gemm_ s8_ rrr_ sm80_ bias_ silu_ f32_ can_ implement - baracuda_kernels_gemm_s8_rrr_sm80_bias_silu_f32_can_implement (baracuda kernels gemm s8 rrr sm80 bias silu f32 can implement). - baracuda_
kernels_ ⚠gemm_ s8_ rrr_ sm80_ bias_ silu_ f32_ run - baracuda_kernels_gemm_s8_rrr_sm80_bias_silu_f32_run (baracuda kernels gemm s8 rrr sm80 bias silu f32 run). - baracuda_
kernels_ ⚠gemm_ s8_ rrr_ sm80_ bias_ silu_ i32_ can_ implement - baracuda_kernels_gemm_s8_rrr_sm80_bias_silu_i32_can_implement (baracuda kernels gemm s8 rrr sm80 bias silu i32 can implement). - baracuda_
kernels_ ⚠gemm_ s8_ rrr_ sm80_ bias_ silu_ i32_ run - baracuda_kernels_gemm_s8_rrr_sm80_bias_silu_i32_run (baracuda kernels gemm s8 rrr sm80 bias silu i32 run). - baracuda_
kernels_ ⚠gemm_ s8_ rrr_ sm80_ can_ implement - Pre-launch implementability check for the
S8 RRR sm_80 Identity SKU. - baracuda_
kernels_ ⚠gemm_ s8_ rrr_ sm80_ run - S8 GEMM, RRR layout, Identity epilogue, sm_80. - baracuda_
kernels_ ⚠gemm_ s8_ rrr_ sm80_ workspace_ size - Workspace size in bytes for the
S8 RRR sm_80 Identity SKU at the given problem size. Always returns zero today; reserved for future SKUs that need scratch. - baracuda_
kernels_ ⚠gemm_ tf32_ rcr_ sm80_ can_ implement - Pre-launch implementability check for this Cutlass GEMM SKU.
- baracuda_
kernels_ ⚠gemm_ tf32_ rcr_ sm80_ run - Launch this Cutlass GEMM SKU on
stream. - baracuda_
kernels_ ⚠gemm_ tf32_ rcr_ sm80_ workspace_ size - Workspace bytes required by this Cutlass GEMM SKU.
- baracuda_
kernels_ ⚠gemm_ tf32_ rrr_ sm80_ can_ implement - Pre-launch implementability check for this Cutlass GEMM SKU.
- baracuda_
kernels_ ⚠gemm_ tf32_ rrr_ sm80_ run - Launch this Cutlass GEMM SKU on
stream. - baracuda_
kernels_ ⚠gemm_ tf32_ rrr_ sm80_ workspace_ size - Workspace bytes required by this Cutlass GEMM SKU.
- baracuda_
kernels_ ⚠gemm_ u8_ rcr_ sm80_ can_ implement - Pre-launch implementability check for this Cutlass GEMM SKU.
- baracuda_
kernels_ ⚠gemm_ u8_ rcr_ sm80_ run - Launch this Cutlass GEMM SKU on
stream. - baracuda_
kernels_ ⚠gemm_ u8_ rcr_ sm80_ workspace_ size - Workspace bytes required by this Cutlass GEMM SKU.
- baracuda_
kernels_ ⚠gemm_ u8_ rrr_ sm80_ bias_ f32_ can_ implement - baracuda_kernels_gemm_u8_rrr_sm80_bias_f32_can_implement (baracuda kernels gemm u8 rrr sm80 bias f32 can implement). - baracuda_
kernels_ ⚠gemm_ u8_ rrr_ sm80_ bias_ f32_ run - baracuda_kernels_gemm_u8_rrr_sm80_bias_f32_run (baracuda kernels gemm u8 rrr sm80 bias f32 run). - baracuda_
kernels_ ⚠gemm_ u8_ rrr_ sm80_ bias_ gelu_ f32_ can_ implement - baracuda_kernels_gemm_u8_rrr_sm80_bias_gelu_f32_can_implement (baracuda kernels gemm u8 rrr sm80 bias gelu f32 can implement). - baracuda_
kernels_ ⚠gemm_ u8_ rrr_ sm80_ bias_ gelu_ f32_ run - baracuda_kernels_gemm_u8_rrr_sm80_bias_gelu_f32_run (baracuda kernels gemm u8 rrr sm80 bias gelu f32 run). - baracuda_
kernels_ ⚠gemm_ u8_ rrr_ sm80_ bias_ gelu_ i32_ can_ implement - baracuda_kernels_gemm_u8_rrr_sm80_bias_gelu_i32_can_implement (baracuda kernels gemm u8 rrr sm80 bias gelu i32 can implement). - baracuda_
kernels_ ⚠gemm_ u8_ rrr_ sm80_ bias_ gelu_ i32_ run - baracuda_kernels_gemm_u8_rrr_sm80_bias_gelu_i32_run (baracuda kernels gemm u8 rrr sm80 bias gelu i32 run). - baracuda_
kernels_ ⚠gemm_ u8_ rrr_ sm80_ bias_ i32_ can_ implement - baracuda_kernels_gemm_u8_rrr_sm80_bias_i32_can_implement (baracuda kernels gemm u8 rrr sm80 bias i32 can implement). - baracuda_
kernels_ ⚠gemm_ u8_ rrr_ sm80_ bias_ i32_ run - baracuda_kernels_gemm_u8_rrr_sm80_bias_i32_run (baracuda kernels gemm u8 rrr sm80 bias i32 run). - baracuda_
kernels_ ⚠gemm_ u8_ rrr_ sm80_ bias_ relu_ f32_ can_ implement - baracuda_kernels_gemm_u8_rrr_sm80_bias_relu_f32_can_implement (baracuda kernels gemm u8 rrr sm80 bias relu f32 can implement). - baracuda_
kernels_ ⚠gemm_ u8_ rrr_ sm80_ bias_ relu_ f32_ run - baracuda_kernels_gemm_u8_rrr_sm80_bias_relu_f32_run (baracuda kernels gemm u8 rrr sm80 bias relu f32 run). - baracuda_
kernels_ ⚠gemm_ u8_ rrr_ sm80_ bias_ relu_ i32_ can_ implement - baracuda_kernels_gemm_u8_rrr_sm80_bias_relu_i32_can_implement (baracuda kernels gemm u8 rrr sm80 bias relu i32 can implement). - baracuda_
kernels_ ⚠gemm_ u8_ rrr_ sm80_ bias_ relu_ i32_ run - baracuda_kernels_gemm_u8_rrr_sm80_bias_relu_i32_run (baracuda kernels gemm u8 rrr sm80 bias relu i32 run). - baracuda_
kernels_ ⚠gemm_ u8_ rrr_ sm80_ bias_ silu_ f32_ can_ implement - baracuda_kernels_gemm_u8_rrr_sm80_bias_silu_f32_can_implement (baracuda kernels gemm u8 rrr sm80 bias silu f32 can implement). - baracuda_
kernels_ ⚠gemm_ u8_ rrr_ sm80_ bias_ silu_ f32_ run - baracuda_kernels_gemm_u8_rrr_sm80_bias_silu_f32_run (baracuda kernels gemm u8 rrr sm80 bias silu f32 run). - baracuda_
kernels_ ⚠gemm_ u8_ rrr_ sm80_ bias_ silu_ i32_ can_ implement - baracuda_kernels_gemm_u8_rrr_sm80_bias_silu_i32_can_implement (baracuda kernels gemm u8 rrr sm80 bias silu i32 can implement). - baracuda_
kernels_ ⚠gemm_ u8_ rrr_ sm80_ bias_ silu_ i32_ run - baracuda_kernels_gemm_u8_rrr_sm80_bias_silu_i32_run (baracuda kernels gemm u8 rrr sm80 bias silu i32 run). - baracuda_
kernels_ ⚠gemm_ u8_ rrr_ sm80_ can_ implement - baracuda_kernels_gemm_u8_rrr_sm80_can_implement (baracuda kernels gemm u8 rrr sm80 can implement). - baracuda_
kernels_ ⚠gemm_ u8_ rrr_ sm80_ run - U8 GEMM, RRR layout, Identity epilogue, sm_80. - baracuda_
kernels_ ⚠gemm_ u8_ rrr_ sm80_ workspace_ size - baracuda_kernels_gemm_u8_rrr_sm80_workspace_size (baracuda kernels gemm u8 rrr sm80 workspace size). - baracuda_
kernels_ ⚠grid_ sample_ 2d_ backward_ f32_ can_ implement - baracuda_kernels_grid_sample_2d_backward_f32_can_implement (baracuda kernels grid sample 2d backward f32 can implement). - baracuda_
kernels_ ⚠grid_ sample_ 2d_ backward_ f32_ run - grid_sample_2d BW, f32. Caller pre-zeros dinput and dgrid. dgrid: [N, OH, OW, 2]. Safety: as FW. - baracuda_
kernels_ ⚠grid_ sample_ 2d_ backward_ f64_ can_ implement - baracuda_kernels_grid_sample_2d_backward_f64_can_implement (baracuda kernels grid sample 2d backward f64 can implement). - baracuda_
kernels_ ⚠grid_ sample_ 2d_ backward_ f64_ run - grid_sample_2d BW, f64. Safety: as f32 BW. - baracuda_
kernels_ ⚠grid_ sample_ 2d_ f32_ can_ implement - baracuda_kernels_grid_sample_2d_f32_can_implement (baracuda kernels grid sample 2d f32 can implement). - baracuda_
kernels_ ⚠grid_ sample_ 2d_ f32_ run - grid_sample(input, grid) FW, f32. grid: [N, OH, OW, 2] with (x, y) normalized in [-1, 1]. Safety: as interpolate_*. - baracuda_
kernels_ ⚠grid_ sample_ 2d_ f64_ can_ implement - baracuda_kernels_grid_sample_2d_f64_can_implement (baracuda kernels grid sample 2d f64 can implement). - baracuda_
kernels_ ⚠grid_ sample_ 2d_ f64_ run - grid_sample_2d FW, f64. Safety: as f32. - baracuda_
kernels_ ⚠group_ norm_ backward_ bf16_ can_ implement - baracuda_kernels_group_norm_backward_bf16_can_implement (baracuda kernels group norm backward bf16 can implement). - baracuda_
kernels_ ⚠group_ norm_ backward_ bf16_ run - GroupNorm BW, bf16.
- baracuda_
kernels_ ⚠group_ norm_ backward_ f16_ can_ implement - baracuda_kernels_group_norm_backward_f16_can_implement (baracuda kernels group norm backward f16 can implement). - baracuda_
kernels_ ⚠group_ norm_ backward_ f16_ run - GroupNorm BW, f16.
- baracuda_
kernels_ ⚠group_ norm_ backward_ f32_ can_ implement - baracuda_kernels_group_norm_backward_f32_can_implement (baracuda kernels group norm backward f32 can implement). - baracuda_
kernels_ ⚠group_ norm_ backward_ f32_ run - GroupNorm BW, f32. Workspace size:
2 * (n_extent * num_groups) * sizeof(float) bytes for the stage-1 partial sums. - baracuda_
kernels_ ⚠group_ norm_ backward_ f64_ can_ implement - baracuda_kernels_group_norm_backward_f64_can_implement (baracuda kernels group norm backward f64 can implement). - baracuda_
kernels_ ⚠group_ norm_ backward_ f64_ run - GroupNorm BW, f64.
- baracuda_
kernels_ ⚠group_ norm_ bf16_ can_ implement baracuda_kernels_group_norm_bf16_can_implement(baracuda kernels group norm bf16 can implement).- baracuda_
kernels_ ⚠group_ norm_ bf16_ run - GroupNorm FW, bf16.
- baracuda_
kernels_ ⚠group_ norm_ f16_ can_ implement baracuda_kernels_group_norm_f16_can_implement(baracuda kernels group norm f16 can implement).- baracuda_
kernels_ ⚠group_ norm_ f16_ run - GroupNorm FW, f16.
- baracuda_
kernels_ ⚠group_ norm_ f32_ can_ implement baracuda_kernels_group_norm_f32_can_implement(baracuda kernels group norm f32 can implement).- baracuda_
kernels_ ⚠group_ norm_ f32_ run - GroupNorm FW, f32. Per
(sample, group) mean / inv_std, per-channel affine. num_groups must divide c_extent. group_kind = 1 selects the GN dispatch (also used by InstanceNorm with num_groups == c_extent). - baracuda_
kernels_ ⚠group_ norm_ f64_ can_ implement baracuda_kernels_group_norm_f64_can_implement(baracuda kernels group norm f64 can implement).- baracuda_
kernels_ ⚠group_ norm_ f64_ run - GroupNorm FW, f64.
- baracuda_
kernels_ ⚠gumbel_ softmax_ bf16_ can_ implement baracuda_kernels_gumbel_softmax_bf16_can_implement(baracuda kernels gumbel softmax bf16 can implement).- baracuda_
kernels_ ⚠gumbel_ softmax_ bf16_ run - GumbelSoftmax FW, bf16.
- baracuda_
kernels_ ⚠gumbel_ softmax_ f16_ can_ implement baracuda_kernels_gumbel_softmax_f16_can_implement(baracuda kernels gumbel softmax f16 can implement).- baracuda_
kernels_ ⚠gumbel_ softmax_ f16_ run - GumbelSoftmax FW, f16.
- baracuda_
kernels_ ⚠gumbel_ softmax_ f32_ can_ implement baracuda_kernels_gumbel_softmax_f32_can_implement(baracuda kernels gumbel softmax f32 can implement).- baracuda_
kernels_ ⚠gumbel_ softmax_ f32_ run - GumbelSoftmax FW, f32.
y = softmax((x + g) / τ) where g = -log(-log(u)) and u is a caller-supplied cuRAND uniform buffer (one f32 per output cell, dense / contiguous layout). inv_tau = 1/τ. hard != 0 → one-hot at the noisy argmax. - baracuda_
kernels_ ⚠gumbel_ softmax_ f64_ can_ implement baracuda_kernels_gumbel_softmax_f64_can_implement(baracuda kernels gumbel softmax f64 can implement).- baracuda_
kernels_ ⚠gumbel_ softmax_ f64_ run - GumbelSoftmax FW, f64.
- baracuda_
kernels_ ⚠histogram_ f32_ can_ implement baracuda_kernels_histogram_f32_can_implement(baracuda kernels histogram f32 can implement).- baracuda_
kernels_ ⚠histogram_ f32_ run - 1-D histogram, f32 input.
lo/hi passed as double; the kernel casts to T (keeps the FFI shape uniform across dtypes). - baracuda_
kernels_ ⚠histogram_ f64_ can_ implement baracuda_kernels_histogram_f64_can_implement(baracuda kernels histogram f64 can implement).- baracuda_
kernels_ ⚠histogram_ f64_ run - 1-D histogram, f64 input.
- baracuda_
kernels_ ⚠ifftshift_ 4_ can_ implement baracuda_kernels_ifftshift_4_can_implement(baracuda kernels ifftshift 4 can implement).- baracuda_
kernels_ ⚠ifftshift_ 4_ run - Inverse
fftshift along the last axis of a [batch, n] tensor: y[b, i] = x[b, (i + (n + 1) / 2) % n]. Differs from fftshift only for odd n; for even n the two are identical (each permutation is self-inverse). 4-byte cells. - baracuda_
kernels_ ⚠ifftshift_ 8_ can_ implement baracuda_kernels_ifftshift_8_can_implement(baracuda kernels ifftshift 8 can implement).- baracuda_
kernels_ ⚠ifftshift_ 8_ run - 8-byte-element inverse
fftshift. - baracuda_
kernels_ ⚠ifftshift_ 16_ can_ implement baracuda_kernels_ifftshift_16_can_implement(baracuda kernels ifftshift 16 can implement).- baracuda_
kernels_ ⚠ifftshift_ 16_ run - 16-byte-element inverse
fftshift. - baracuda_
kernels_ ⚠im2col_ 1d_ bf16_ can_ implement baracuda_kernels_im2col_1d_bf16_can_implement(baracuda kernels im2col 1d bf16 can implement).- baracuda_
kernels_ ⚠im2col_ 1d_ bf16_ run - im2col 1-D, bf16.
- baracuda_
kernels_ ⚠im2col_ 1d_ f16_ can_ implement baracuda_kernels_im2col_1d_f16_can_implement(baracuda kernels im2col 1d f16 can implement).- baracuda_
kernels_ ⚠im2col_ 1d_ f16_ run - im2col 1-D, f16.
- baracuda_
kernels_ ⚠im2col_ 1d_ f32_ can_ implement baracuda_kernels_im2col_1d_f32_can_implement(baracuda kernels im2col 1d f32 can implement).- baracuda_
kernels_ ⚠im2col_ 1d_ f32_ run - im2col 1-D, f32.
- baracuda_
kernels_ ⚠im2col_ 1d_ f64_ can_ implement baracuda_kernels_im2col_1d_f64_can_implement(baracuda kernels im2col 1d f64 can implement).- baracuda_
kernels_ ⚠im2col_ 1d_ f64_ run - im2col 1-D, f64.
- baracuda_
kernels_ ⚠im2col_ 2d_ bf16_ can_ implement baracuda_kernels_im2col_2d_bf16_can_implement(baracuda kernels im2col 2d bf16 can implement).- baracuda_
kernels_ ⚠im2col_ 2d_ bf16_ run - im2col 2-D, bf16.
- baracuda_
kernels_ ⚠im2col_ 2d_ f16_ can_ implement baracuda_kernels_im2col_2d_f16_can_implement(baracuda kernels im2col 2d f16 can implement).- baracuda_
kernels_ ⚠im2col_ 2d_ f16_ run - im2col 2-D, f16.
- baracuda_
kernels_ ⚠im2col_ 2d_ f32_ can_ implement baracuda_kernels_im2col_2d_f32_can_implement(baracuda kernels im2col 2d f32 can implement).- baracuda_
kernels_ ⚠im2col_ 2d_ f32_ run - im2col 2-D, f32.
- baracuda_
kernels_ ⚠im2col_ 2d_ f64_ can_ implement baracuda_kernels_im2col_2d_f64_can_implement(baracuda kernels im2col 2d f64 can implement).- baracuda_
kernels_ ⚠im2col_ 2d_ f64_ run - im2col 2-D, f64.
- baracuda_
kernels_ ⚠index_ add_ bf16_ can_ implement - Implementability check for
index_add_bf16. - baracuda_
kernels_ ⚠index_ add_ bf16_ run index_add: bf16, i32 idx. atomicCAS-via-baracuda::atomic::add<__nv_bfloat16> (same caveats as f16).- baracuda_
kernels_ ⚠index_ add_ f16_ can_ implement - Implementability check for
index_add_f16. - baracuda_
kernels_ ⚠index_ add_ f16_ run index_add: f16, i32 idx. Uses atomicCAS-via-baracuda::atomic::add<__half> (deterministic per-thread arithmetic regardless of CUDA toolkit; non-deterministic accumulation order).- baracuda_
kernels_ ⚠index_ add_ f32_ can_ implement - Implementability check for
index_add_f32. - baracuda_
kernels_ ⚠index_ add_ f32_ run index_add: dst[idx[i], ...] += src[i, ...], f32, i32 idx.- baracuda_
kernels_ ⚠index_ add_ f64_ can_ implement - Implementability check for
index_add_f64. - baracuda_
kernels_ ⚠index_ add_ f64_ run index_add: f64, i32 idx.- baracuda_
kernels_ ⚠index_ add_ i32_ can_ implement baracuda_kernels_index_add_i32_can_implement(baracuda kernels index add i32 can implement).- baracuda_
kernels_ ⚠index_ add_ i32_ run baracuda_kernels_index_add_i32_run(baracuda kernels index add i32 run).- baracuda_
kernels_ ⚠index_ add_ i64_ can_ implement baracuda_kernels_index_add_i64_can_implement(baracuda kernels index add i64 can implement).- baracuda_
kernels_ ⚠index_ add_ i64_ run baracuda_kernels_index_add_i64_run(baracuda kernels index add i64 run).- baracuda_
kernels_ ⚠index_ add_ i64idx_ bf16_ can_ implement - Implementability check for
index_add_i64idx_bf16. - baracuda_
kernels_ ⚠index_ add_ i64idx_ bf16_ run index_add: bf16, i64 idx.- baracuda_
kernels_ ⚠index_ add_ i64idx_ f16_ can_ implement - Implementability check for
index_add_i64idx_f16. - baracuda_
kernels_ ⚠index_ add_ i64idx_ f16_ run index_add: f16, i64 idx.- baracuda_
kernels_ ⚠index_ add_ i64idx_ f32_ can_ implement - Implementability check for
index_add_i64idx_f32. - baracuda_
kernels_ ⚠index_ add_ i64idx_ f32_ run index_add: f32, i64 idx.- baracuda_
kernels_ ⚠index_ add_ i64idx_ f64_ can_ implement - Implementability check for
index_add_i64idx_f64. - baracuda_
kernels_ ⚠index_ add_ i64idx_ f64_ run index_add: f64, i64 idx.- baracuda_
kernels_ ⚠index_ add_ i64idx_ i32_ can_ implement baracuda_kernels_index_add_i64idx_i32_can_implement(baracuda kernels index add i64idx i32 can implement).- baracuda_
kernels_ ⚠index_ add_ i64idx_ i32_ run baracuda_kernels_index_add_i64idx_i32_run(baracuda kernels index add i64idx i32 run).- baracuda_
kernels_ ⚠index_ add_ i64idx_ i64_ can_ implement baracuda_kernels_index_add_i64idx_i64_can_implement(baracuda kernels index add i64idx i64 can implement).- baracuda_
kernels_ ⚠index_ add_ i64idx_ i64_ run baracuda_kernels_index_add_i64idx_i64_run(baracuda kernels index add i64idx i64 run).- baracuda_
kernels_ ⚠index_ add_ i64idx_ u32_ can_ implement baracuda_kernels_index_add_i64idx_u32_can_implement(baracuda kernels index add i64idx u32 can implement).- baracuda_
kernels_ ⚠index_ add_ i64idx_ u32_ run baracuda_kernels_index_add_i64idx_u32_run(baracuda kernels index add i64idx u32 run).- baracuda_
kernels_ ⚠index_ add_ u32_ can_ implement baracuda_kernels_index_add_u32_can_implement(baracuda kernels index add u32 can implement).- baracuda_
kernels_ ⚠index_ add_ u32_ run baracuda_kernels_index_add_u32_run(baracuda kernels index add u32 run).- baracuda_
kernels_ ⚠index_ select_ backward_ f32_ can_ implement - Implementability check for
index_select_backward_f32. - baracuda_
kernels_ ⚠index_ select_ backward_ f32_ run dsrc[..., idx[j], ...] += dout[..., j, ...] along select_dim. f32 (atomicAdd).- baracuda_
kernels_ ⚠index_ select_ backward_ f64_ can_ implement - Implementability check for
index_select_backward_f64. - baracuda_
kernels_ ⚠index_ select_ backward_ f64_ run index_select_backward: f64 (atomicAdd).- baracuda_
kernels_ ⚠index_ select_ backward_ i64idx_ f32_ can_ implement - Implementability check for
index_select_backward_i64idx_f32. - baracuda_
kernels_ ⚠index_ select_ backward_ i64idx_ f32_ run index_select BW: f32, i64 indices.- baracuda_
kernels_ ⚠index_ select_ backward_ i64idx_ f64_ can_ implement - Implementability check for
index_select_backward_i64idx_f64. - baracuda_
kernels_ ⚠index_ select_ backward_ i64idx_ f64_ run index_select BW: f64, i64 indices.- baracuda_
kernels_ ⚠index_ select_ f32_ can_ implement - Implementability check for
index_select_f32. - baracuda_
kernels_ ⚠index_ select_ f32_ run out[..., j, ...] = src[..., idx[j], ...] along select_dim. idx is 1-D i32. f32.- baracuda_
kernels_ ⚠index_ select_ f64_ can_ implement - Implementability check for
index_select_f64. - baracuda_
kernels_ ⚠index_ select_ f64_ run index_select: f64.- baracuda_
kernels_ ⚠index_ select_ i8_ can_ implement baracuda_kernels_index_select_i8_can_implement(baracuda kernels index select i8 can implement).- baracuda_
kernels_ ⚠index_ select_ i8_ run baracuda_kernels_index_select_i8_run(baracuda kernels index select i8 run).- baracuda_
kernels_ ⚠index_ select_ i16_ can_ implement baracuda_kernels_index_select_i16_can_implement(baracuda kernels index select i16 can implement).- baracuda_
kernels_ ⚠index_ select_ i16_ run baracuda_kernels_index_select_i16_run(baracuda kernels index select i16 run).- baracuda_
kernels_ ⚠index_ select_ i32_ can_ implement - Implementability check for
index_select_i32. - baracuda_
kernels_ ⚠index_ select_ i32_ run index_select: i32.- baracuda_
kernels_ ⚠index_ select_ i64_ can_ implement baracuda_kernels_index_select_i64_can_implement(baracuda kernels index select i64 can implement).- baracuda_
kernels_ ⚠index_ select_ i64_ run baracuda_kernels_index_select_i64_run(baracuda kernels index select i64 run).- baracuda_
kernels_ ⚠index_ select_ i64idx_ f32_ can_ implement - Implementability check for
index_select_i64idx_f32. - baracuda_
kernels_ ⚠index_ select_ i64idx_ f32_ run index_select: f32, i64 indices.- baracuda_
kernels_ ⚠index_ select_ i64idx_ f64_ can_ implement - Implementability check for
index_select_i64idx_f64. - baracuda_
kernels_ ⚠index_ select_ i64idx_ f64_ run index_select: f64, i64 indices.- baracuda_
kernels_ ⚠index_ select_ i64idx_ i8_ can_ implement baracuda_kernels_index_select_i64idx_i8_can_implement(baracuda kernels index select i64idx i8 can implement).- baracuda_
kernels_ ⚠index_ select_ i64idx_ i8_ run baracuda_kernels_index_select_i64idx_i8_run(baracuda kernels index select i64idx i8 run).- baracuda_
kernels_ ⚠index_ select_ i64idx_ i16_ can_ implement baracuda_kernels_index_select_i64idx_i16_can_implement(baracuda kernels index select i64idx i16 can implement).- baracuda_
kernels_ ⚠index_ select_ i64idx_ i16_ run baracuda_kernels_index_select_i64idx_i16_run(baracuda kernels index select i64idx i16 run).- baracuda_
kernels_ ⚠index_ select_ i64idx_ i32_ can_ implement - Implementability check for
index_select_i64idx_i32. - baracuda_
kernels_ ⚠index_ select_ i64idx_ i32_ run index_select: i32 values, i64 indices.- baracuda_
kernels_ ⚠index_ select_ i64idx_ i64_ can_ implement baracuda_kernels_index_select_i64idx_i64_can_implement(baracuda kernels index select i64idx i64 can implement).- baracuda_
kernels_ ⚠index_ select_ i64idx_ i64_ run baracuda_kernels_index_select_i64idx_i64_run(baracuda kernels index select i64idx i64 run).- baracuda_
kernels_ ⚠index_ select_ i64idx_ u8_ can_ implement baracuda_kernels_index_select_i64idx_u8_can_implement(baracuda kernels index select i64idx u8 can implement).- baracuda_
kernels_ ⚠index_ select_ i64idx_ u8_ run baracuda_kernels_index_select_i64idx_u8_run(baracuda kernels index select i64idx u8 run).- baracuda_
kernels_ ⚠index_ select_ i64idx_ u16_ can_ implement baracuda_kernels_index_select_i64idx_u16_can_implement(baracuda kernels index select i64idx u16 can implement).- baracuda_
kernels_ ⚠index_ select_ i64idx_ u16_ run baracuda_kernels_index_select_i64idx_u16_run(baracuda kernels index select i64idx u16 run).- baracuda_
kernels_ ⚠index_ select_ i64idx_ u32_ can_ implement baracuda_kernels_index_select_i64idx_u32_can_implement(baracuda kernels index select i64idx u32 can implement).- baracuda_
kernels_ ⚠index_ select_ i64idx_ u32_ run baracuda_kernels_index_select_i64idx_u32_run(baracuda kernels index select i64idx u32 run).- baracuda_
kernels_ ⚠index_ select_ u8_ can_ implement baracuda_kernels_index_select_u8_can_implement(baracuda kernels index select u8 can implement).- baracuda_
kernels_ ⚠index_ select_ u8_ run baracuda_kernels_index_select_u8_run(baracuda kernels index select u8 run).- baracuda_
kernels_ ⚠index_ select_ u16_ can_ implement baracuda_kernels_index_select_u16_can_implement(baracuda kernels index select u16 can implement).- baracuda_
kernels_ ⚠index_ select_ u16_ run baracuda_kernels_index_select_u16_run(baracuda kernels index select u16 run).- baracuda_
kernels_ ⚠index_ select_ u32_ can_ implement baracuda_kernels_index_select_u32_can_implement(baracuda kernels index select u32 can implement).- baracuda_
kernels_ ⚠index_ select_ u32_ run baracuda_kernels_index_select_u32_run(baracuda kernels index select u32 run).- baracuda_
kernels_ ⚠interpolate_ bilinear_ 2d_ backward_ bf16_ can_ implement baracuda_kernels_interpolate_bilinear_2d_backward_bf16_can_implement(baracuda kernels interpolate bilinear 2d backward bf16 can implement).- baracuda_
kernels_ ⚠interpolate_ bilinear_ 2d_ backward_ bf16_ run interpolate_bilinear_2d BW, bf16. Caller pre-zeros dinput. atomicCAS-based bf16 atomic add. Safety: as f32 BW.- baracuda_
kernels_ ⚠interpolate_ bilinear_ 2d_ backward_ f16_ can_ implement baracuda_kernels_interpolate_bilinear_2d_backward_f16_can_implement(baracuda kernels interpolate bilinear 2d backward f16 can implement).- baracuda_
kernels_ ⚠interpolate_ bilinear_ 2d_ backward_ f16_ run interpolate_bilinear_2d BW, f16. Caller pre-zeros dinput. atomicCAS-based half atomic add. Safety: as f32 BW.- baracuda_
kernels_ ⚠interpolate_ bilinear_ 2d_ backward_ f32_ can_ implement baracuda_kernels_interpolate_bilinear_2d_backward_f32_can_implement(baracuda kernels interpolate bilinear 2d backward f32 can implement).- baracuda_
kernels_ ⚠interpolate_ bilinear_ 2d_ backward_ f32_ run interpolate_bilinear_2d BW, f32. Caller pre-zeros dinput.- baracuda_
kernels_ ⚠interpolate_ bilinear_ 2d_ backward_ f64_ can_ implement baracuda_kernels_interpolate_bilinear_2d_backward_f64_can_implement(baracuda kernels interpolate bilinear 2d backward f64 can implement).- baracuda_
kernels_ ⚠interpolate_ bilinear_ 2d_ backward_ f64_ run interpolate_bilinear_2d BW, f64. Safety: as f32 BW.- baracuda_
kernels_ ⚠interpolate_ bilinear_ 2d_ bf16_ can_ implement baracuda_kernels_interpolate_bilinear_2d_bf16_can_implement(baracuda kernels interpolate bilinear 2d bf16 can implement).- baracuda_
kernels_ ⚠interpolate_ bilinear_ 2d_ bf16_ run interpolate_bilinear_2d FW, bf16. Cast-at-read / f32 accumulator / cast-at-write. Safety: as f32.- baracuda_
kernels_ ⚠interpolate_ bilinear_ 2d_ f16_ can_ implement baracuda_kernels_interpolate_bilinear_2d_f16_can_implement(baracuda kernels interpolate bilinear 2d f16 can implement).- baracuda_
kernels_ ⚠interpolate_ bilinear_ 2d_ f16_ run interpolate_bilinear_2d FW, f16 (half). Cast-at-read / f32 accumulator / cast-at-write. Safety: as f32.- baracuda_
kernels_ ⚠interpolate_ bilinear_ 2d_ f32_ can_ implement baracuda_kernels_interpolate_bilinear_2d_f32_can_implement(baracuda kernels interpolate bilinear 2d f32 can implement).- baracuda_
kernels_ ⚠interpolate_ bilinear_ 2d_ f32_ run interpolate(x, mode='bilinear') FW, f32. input: [N, C, IH, IW]; output: [N, C, OH, OW]. NCHW. align_corners: 0 = false (PyTorch default), nonzero = true. scale_h_factor / scale_w_factor: 0.0 = derive from sizes; nonzero = use as SCALE override.- baracuda_
kernels_ ⚠interpolate_ bilinear_ 2d_ f64_ can_ implement baracuda_kernels_interpolate_bilinear_2d_f64_can_implement(baracuda kernels interpolate bilinear 2d f64 can implement).- baracuda_
kernels_ ⚠interpolate_ bilinear_ 2d_ f64_ run interpolate_bilinear_2d FW, f64. Safety: as f32.- baracuda_
kernels_ ⚠inverse_ f32_ run - Matrix inverse via
getrf + getrs over caller-staged identity in inv_inout. The caller MUST pre-stage an n × n identity in inv_inout (column-major) before invoking. After the call, inv_inout holds A^{-1} and a_inout holds the packed LU factors. - baracuda_
kernels_ ⚠inverse_ f32_ workspace_ size - Inverse workspace size (==
getrf workspace). - baracuda_
kernels_ ⚠inverse_ f64_ run - Matrix inverse via
getrf + getrs over caller-staged identity in inv_inout. The caller MUST pre-stage an n × n identity in inv_inout (column-major) before invoking. After the call, inv_inout holds A^{-1} and a_inout holds the packed LU factors. - baracuda_
kernels_ ⚠inverse_ f64_ workspace_ size - Inverse workspace size (==
getrf workspace). - baracuda_
kernels_ ⚠irfft_ 1d_ f32_ run - 1-D C2R FFT (Hermitian-half complex → real). Applies
1/n normalization in-place (PyTorch norm="backward"). n is the real-side output length; complex input shape is [batch, n/2 + 1]. - baracuda_
kernels_ ⚠irfft_ 1d_ f32_ workspace_ size - 1-D C2R FFT workspace size in bytes — always
0. - baracuda_
kernels_ ⚠irfft_ 1d_ f64_ run - 1-D C2R FFT (Hermitian-half complex → real). Applies
1/n normalization in-place (PyTorch norm="backward"). n is the real-side output length; complex input shape is [batch, n/2 + 1]. - baracuda_
kernels_ ⚠irfft_ 1d_ f64_ workspace_ size - 1-D C2R FFT workspace size in bytes — always
0. - baracuda_
kernels_ ⚠irfft_ nd_ f32_ run - ND C2R FFT (Hermitian-half complex → real). Applies
1/product(dims[..rank]) normalization in-place. dims carries the real-side extents. - baracuda_
kernels_ ⚠irfft_ nd_ f32_ workspace_ size - ND C2R FFT workspace size in bytes — always
0. - baracuda_
kernels_ ⚠irfft_ nd_ f64_ run - ND C2R FFT (Hermitian-half complex → real). Applies
1/product(dims[..rank]) normalization in-place. dims carries the real-side extents. - baracuda_
kernels_ ⚠irfft_ nd_ f64_ workspace_ size - ND C2R FFT workspace size in bytes — always
0. - baracuda_
kernels_ ⚠kv_ cache_ append_ bf16_ can_ implement - Implementability check for
kv_cache_append_bf16. Host-side only. - baracuda_
kernels_ ⚠kv_ cache_ append_ bf16_ run - KV-cache append, bf16.
- baracuda_
kernels_ ⚠kv_ cache_ append_ f16_ can_ implement - Implementability check for
kv_cache_append_f16. Host-side only. - baracuda_
kernels_ ⚠kv_ cache_ append_ f16_ run - KV-cache append, f16.
- baracuda_
kernels_ ⚠kv_ cache_ append_ f32_ can_ implement - Implementability check for
kv_cache_append_f32. Host-side only. - baracuda_
kernels_ ⚠kv_ cache_ append_ f32_ run - KV-cache append, f32.
- baracuda_
kernels_ ⚠kv_ cache_ append_ f64_ can_ implement - Implementability check for
kv_cache_append_f64. Host-side only. - baracuda_
kernels_ ⚠kv_ cache_ append_ f64_ run - KV-cache append, f64.
- baracuda_
kernels_ ⚠layer_ norm_ backward_ bf16_ can_ implement baracuda_kernels_layer_norm_backward_bf16_can_implement(baracuda kernels layer norm backward bf16 can implement).- baracuda_
kernels_ ⚠layer_ norm_ backward_ bf16_ run - LayerNorm BW, bf16.
- baracuda_
kernels_ ⚠layer_ norm_ backward_ bf16_ strided_ can_ implement layer_norm_backward_bf16_strided_can_implement companion.- baracuda_
kernels_ ⚠layer_ norm_ backward_ bf16_ strided_ run - LayerNorm BW strided sibling, bf16.
- baracuda_
kernels_ ⚠layer_ norm_ backward_ f16_ can_ implement baracuda_kernels_layer_norm_backward_f16_can_implement(baracuda kernels layer norm backward f16 can implement).- baracuda_
kernels_ ⚠layer_ norm_ backward_ f16_ run - LayerNorm BW, f16.
- baracuda_
kernels_ ⚠layer_ norm_ backward_ f16_ strided_ can_ implement layer_norm_backward_f16_strided_can_implement companion.- baracuda_
kernels_ ⚠layer_ norm_ backward_ f16_ strided_ run - LayerNorm BW strided sibling, f16.
- baracuda_
kernels_ ⚠layer_ norm_ backward_ f32_ can_ implement baracuda_kernels_layer_norm_backward_f32_can_implement(baracuda kernels layer norm backward f32 can implement).- baracuda_
kernels_ ⚠layer_ norm_ backward_ f32_ run - LayerNorm BW, f32. Computes
dx and (when non-null) dgamma / dbeta reductions. Caller passes saved mean + inv_std from FW. - baracuda_
kernels_ ⚠layer_ norm_ backward_ f32_ strided_ can_ implement layer_norm_backward_f32_strided_can_implement companion.- baracuda_
kernels_ ⚠layer_ norm_ backward_ f32_ strided_ run - LayerNorm BW strided sibling, f32. Same contract as
baracuda_kernels_layer_norm_backward_f32_run; identical launcher. - baracuda_
kernels_ ⚠layer_ norm_ backward_ f64_ can_ implement baracuda_kernels_layer_norm_backward_f64_can_implement(baracuda kernels layer norm backward f64 can implement).- baracuda_
kernels_ ⚠layer_ norm_ backward_ f64_ run - LayerNorm BW, f64.
- baracuda_
kernels_ ⚠layer_ norm_ backward_ f64_ strided_ can_ implement layer_norm_backward_f64_strided_can_implement companion.- baracuda_
kernels_ ⚠layer_ norm_ backward_ f64_ strided_ run - LayerNorm BW strided sibling, f64.
- baracuda_
kernels_ ⚠layer_ norm_ bf16_ can_ implement baracuda_kernels_layer_norm_bf16_can_implement(baracuda kernels layer norm bf16 can implement).- baracuda_
kernels_ ⚠layer_ norm_ bf16_ run - LayerNorm FW, bf16. f32 accumulator inside the kernel.
- baracuda_
kernels_ ⚠layer_ norm_ bf16_ strided_ can_ implement layer_norm_bf16_strided_can_implement companion.- baracuda_
kernels_ ⚠layer_ norm_ bf16_ strided_ run - LayerNorm FW strided sibling, bf16.
- baracuda_
kernels_ ⚠layer_ norm_ f16_ can_ implement baracuda_kernels_layer_norm_f16_can_implement(baracuda kernels layer norm f16 can implement).- baracuda_
kernels_ ⚠layer_ norm_ f16_ run - LayerNorm FW, f16. f32 accumulator inside the kernel.
- baracuda_
kernels_ ⚠layer_ norm_ f16_ strided_ can_ implement layer_norm_f16_strided_can_implement companion.- baracuda_
kernels_ ⚠layer_ norm_ f16_ strided_ run - LayerNorm FW strided sibling, f16.
- baracuda_
kernels_ ⚠layer_ norm_ f32_ can_ implement baracuda_kernels_layer_norm_f32_can_implement(baracuda kernels layer norm f32 can implement).- baracuda_
kernels_ ⚠layer_ norm_ f32_ run - LayerNorm FW, f32.
y = (x - mean) / sqrt(var + eps) * gamma + beta. gamma / beta independently optional. Biased (“population”) variance. Save buffers mean_out / inv_std_out share stride_save, each shape == input with norm axes collapsed to 1. - baracuda_
kernels_ ⚠layer_ norm_ f32_ strided_ can_ implement layer_norm_f32_strided_can_implement companion.- baracuda_
kernels_ ⚠layer_ norm_ f32_ strided_ run - LayerNorm FW strided sibling, f32. Same contract as
baracuda_kernels_layer_norm_f32_run; identical underlying launcher. - baracuda_
kernels_ ⚠layer_ norm_ f64_ can_ implement baracuda_kernels_layer_norm_f64_can_implement(baracuda kernels layer norm f64 can implement).- baracuda_
kernels_ ⚠layer_ norm_ f64_ run - LayerNorm FW, f64.
- baracuda_
kernels_ ⚠layer_ norm_ f64_ strided_ can_ implement layer_norm_f64_strided_can_implement companion.- baracuda_
kernels_ ⚠layer_ norm_ f64_ strided_ run - LayerNorm FW strided sibling, f64.
- baracuda_
kernels_ ⚠log_ softmax_ backward_ bf16_ can_ implement baracuda_kernels_log_softmax_backward_bf16_can_implement(baracuda kernels log softmax backward bf16 can implement).- baracuda_
kernels_ ⚠log_ softmax_ backward_ bf16_ run - LogSoftmax BW, bf16.
- baracuda_
kernels_ ⚠log_ softmax_ backward_ bf16_ strided_ can_ implement log_softmax_backward_bf16_strided_can_implement companion.- baracuda_
kernels_ ⚠log_ softmax_ backward_ bf16_ strided_ run - LogSoftmax BW strided sibling, bf16.
- baracuda_
kernels_ ⚠log_ softmax_ backward_ f16_ can_ implement baracuda_kernels_log_softmax_backward_f16_can_implement(baracuda kernels log softmax backward f16 can implement).- baracuda_
kernels_ ⚠log_ softmax_ backward_ f16_ run - LogSoftmax BW, f16.
- baracuda_
kernels_ ⚠log_ softmax_ backward_ f16_ strided_ can_ implement log_softmax_backward_f16_strided_can_implement companion.- baracuda_
kernels_ ⚠log_ softmax_ backward_ f16_ strided_ run - LogSoftmax BW strided sibling, f16.
- baracuda_
kernels_ ⚠log_ softmax_ backward_ f32_ can_ implement baracuda_kernels_log_softmax_backward_f32_can_implement(baracuda kernels log softmax backward f32 can implement).- baracuda_
kernels_ ⚠log_ softmax_ backward_ f32_ run - LogSoftmax BW, f32.
dx[k] = dy[k] - exp(y[k]) * Σ_j dy[j]. Caller passes the saved forward output y (log-softmax values). - baracuda_
kernels_ ⚠log_ softmax_ backward_ f32_ strided_ can_ implement log_softmax_backward_f32_strided_can_implement companion.- baracuda_
kernels_ ⚠log_ softmax_ backward_ f32_ strided_ run - LogSoftmax BW strided sibling, f32. ABI identical to softmax BW.
- baracuda_
kernels_ ⚠log_ softmax_ backward_ f64_ can_ implement baracuda_kernels_log_softmax_backward_f64_can_implement(baracuda kernels log softmax backward f64 can implement).- baracuda_
kernels_ ⚠log_ softmax_ backward_ f64_ run - LogSoftmax BW, f64.
- baracuda_
kernels_ ⚠log_ softmax_ backward_ f64_ strided_ can_ implement log_softmax_backward_f64_strided_can_implement companion.- baracuda_
kernels_ ⚠log_ softmax_ backward_ f64_ strided_ run - LogSoftmax BW strided sibling, f64.
- baracuda_
kernels_ ⚠log_ softmax_ bf16_ can_ implement baracuda_kernels_log_softmax_bf16_can_implement(baracuda kernels log softmax bf16 can implement).- baracuda_
kernels_ ⚠log_ softmax_ bf16_ run - LogSoftmax FW, bf16.
- baracuda_
kernels_ ⚠log_ softmax_ bf16_ strided_ can_ implement log_softmax_bf16_strided_can_implement companion.- baracuda_
kernels_ ⚠log_ softmax_ bf16_ strided_ run - LogSoftmax FW strided sibling, bf16.
- baracuda_
kernels_ ⚠log_ softmax_ f16_ can_ implement baracuda_kernels_log_softmax_f16_can_implement(baracuda kernels log softmax f16 can implement).- baracuda_
kernels_ ⚠log_ softmax_ f16_ run - LogSoftmax FW, f16. f32 accumulator inside the kernel.
- baracuda_
kernels_ ⚠log_ softmax_ f16_ strided_ can_ implement log_softmax_f16_strided_can_implement companion.- baracuda_
kernels_ ⚠log_ softmax_ f16_ strided_ run - LogSoftmax FW strided sibling, f16.
- baracuda_
kernels_ ⚠log_ softmax_ f32_ can_ implement baracuda_kernels_log_softmax_f32_can_implement(baracuda kernels log softmax f32 can implement).- baracuda_
kernels_ ⚠log_ softmax_ f32_ run - LogSoftmax FW, f32.
y[k] = (x[k] - max) - log(Σ exp(x[j] - max)) along softmax_axis. Numerically stable. - baracuda_
kernels_ ⚠log_ softmax_ f32_ strided_ can_ implement log_softmax_f32_strided_can_implement companion.- baracuda_
kernels_ ⚠log_ softmax_ f32_ strided_ run - LogSoftmax FW strided sibling, f32. ABI identical to softmax FW.
- baracuda_
kernels_ ⚠log_ softmax_ f64_ can_ implement baracuda_kernels_log_softmax_f64_can_implement(baracuda kernels log softmax f64 can implement).- baracuda_
kernels_ ⚠log_ softmax_ f64_ run - LogSoftmax FW, f64.
- baracuda_
kernels_ ⚠log_ softmax_ f64_ strided_ can_ implement log_softmax_f64_strided_can_implement companion.- baracuda_
kernels_ ⚠log_ softmax_ f64_ strided_ run - LogSoftmax FW strided sibling, f64.
- baracuda_
kernels_ ⚠loss_ bce_ backward_ bf16_ can_ implement - BCE BW
_can_implement, bf16. - baracuda_
kernels_ ⚠loss_ bce_ backward_ bf16_ run - BCE BW, bf16.
- baracuda_
kernels_ ⚠loss_ bce_ backward_ f16_ can_ implement - BCE BW
_can_implement, f16. - baracuda_
kernels_ ⚠loss_ bce_ backward_ f16_ run - BCE BW, f16.
- baracuda_
kernels_ ⚠loss_ bce_ backward_ f32_ can_ implement - BCE BW
_can_implement, f32. - baracuda_
kernels_ ⚠loss_ bce_ backward_ f32_ run - BCE BW, f32.
dpred = (pred - target) / (pred·(1-pred)) · scale. - baracuda_
kernels_ ⚠loss_ bce_ backward_ f64_ can_ implement - BCE BW
_can_implement, f64. - baracuda_
kernels_ ⚠loss_ bce_ backward_ f64_ run - BCE BW, f64.
- baracuda_
kernels_ ⚠loss_ bce_ bf16_ can_ implement - BCE FW
_can_implement, bf16. - baracuda_
kernels_ ⚠loss_ bce_ bf16_ run - BCE FW, bf16.
- baracuda_
kernels_ ⚠loss_ bce_ f16_ can_ implement - BCE FW
_can_implement, f16. - baracuda_
kernels_ ⚠loss_ bce_ f16_ run - BCE FW, f16.
- baracuda_
kernels_ ⚠loss_ bce_ f32_ can_ implement - BCE FW
_can_implement, f32. - baracuda_
kernels_ ⚠loss_ bce_ f32_ run - BCE FW, f32.
-(t·log(p) + (1-t)·log(1-p)) per-cell, then reduce. Caller ensures pred ∈ (0, 1). - baracuda_
kernels_ ⚠loss_ bce_ f64_ can_ implement - BCE FW
_can_implement, f64. - baracuda_
kernels_ ⚠loss_ bce_ f64_ run - BCE FW, f64.
- baracuda_
kernels_ ⚠loss_ bce_ with_ logits_ backward_ bf16_ can_ implement baracuda_kernels_loss_bce_with_logits_backward_bf16_can_implement(baracuda kernels loss bce with logits backward bf16 can implement).- baracuda_
kernels_ ⚠loss_ bce_ with_ logits_ backward_ bf16_ run - BCEWithLogits BW, bf16.
- baracuda_
kernels_ ⚠loss_ bce_ with_ logits_ backward_ f16_ can_ implement baracuda_kernels_loss_bce_with_logits_backward_f16_can_implement(baracuda kernels loss bce with logits backward f16 can implement).- baracuda_
kernels_ ⚠loss_ bce_ with_ logits_ backward_ f16_ run - BCEWithLogits BW, f16.
- baracuda_
kernels_ ⚠loss_ bce_ with_ logits_ backward_ f32_ can_ implement baracuda_kernels_loss_bce_with_logits_backward_f32_can_implement(baracuda kernels loss bce with logits backward f32 can implement).- baracuda_
kernels_ ⚠loss_ bce_ with_ logits_ backward_ f32_ run - BCEWithLogits BW, f32.
dlogits = (sigmoid(x) - target) · scale. - baracuda_
kernels_ ⚠loss_ bce_ with_ logits_ backward_ f64_ can_ implement baracuda_kernels_loss_bce_with_logits_backward_f64_can_implement(baracuda kernels loss bce with logits backward f64 can implement).- baracuda_
kernels_ ⚠loss_ bce_ with_ logits_ backward_ f64_ run - BCEWithLogits BW, f64.
- baracuda_
kernels_ ⚠loss_ bce_ with_ logits_ bf16_ can_ implement baracuda_kernels_loss_bce_with_logits_bf16_can_implement(baracuda kernels loss bce with logits bf16 can implement).- baracuda_
kernels_ ⚠loss_ bce_ with_ logits_ bf16_ run - BCEWithLogits FW, bf16.
- baracuda_
kernels_ ⚠loss_ bce_ with_ logits_ f16_ can_ implement baracuda_kernels_loss_bce_with_logits_f16_can_implement(baracuda kernels loss bce with logits f16 can implement).- baracuda_
kernels_ ⚠loss_ bce_ with_ logits_ f16_ run - BCEWithLogits FW, f16.
- baracuda_
kernels_ ⚠loss_ bce_ with_ logits_ f32_ can_ implement baracuda_kernels_loss_bce_with_logits_f32_can_implement(baracuda kernels loss bce with logits f32 can implement).- baracuda_
kernels_ ⚠loss_ bce_ with_ logits_ f32_ run - BCEWithLogits FW, f32. Stable BCE for raw logits.
- baracuda_
kernels_ ⚠loss_ bce_ with_ logits_ f64_ can_ implement baracuda_kernels_loss_bce_with_logits_f64_can_implement(baracuda kernels loss bce with logits f64 can implement).- baracuda_
kernels_ ⚠loss_ bce_ with_ logits_ f64_ run - BCEWithLogits FW, f64.
- baracuda_
kernels_ ⚠loss_ cosine_ embedding_ backward_ bf16_ can_ implement baracuda_kernels_loss_cosine_embedding_backward_bf16_can_implement(baracuda kernels loss cosine embedding backward bf16 can implement).- baracuda_
kernels_ ⚠loss_ cosine_ embedding_ backward_ bf16_ run - CosineEmbedding BW, bf16.
- baracuda_
kernels_ ⚠loss_ cosine_ embedding_ backward_ f16_ can_ implement baracuda_kernels_loss_cosine_embedding_backward_f16_can_implement(baracuda kernels loss cosine embedding backward f16 can implement).- baracuda_
kernels_ ⚠loss_ cosine_ embedding_ backward_ f16_ run - CosineEmbedding BW, f16.
- baracuda_
kernels_ ⚠loss_ cosine_ embedding_ backward_ f32_ can_ implement baracuda_kernels_loss_cosine_embedding_backward_f32_can_implement(baracuda kernels loss cosine embedding backward f32 can implement).- baracuda_
kernels_ ⚠loss_ cosine_ embedding_ backward_ f32_ run - CosineEmbedding BW.
- baracuda_
kernels_ ⚠loss_ cosine_ embedding_ backward_ f64_ can_ implement baracuda_kernels_loss_cosine_embedding_backward_f64_can_implement(baracuda kernels loss cosine embedding backward f64 can implement).- baracuda_
kernels_ ⚠loss_ cosine_ embedding_ backward_ f64_ run - CosineEmbedding BW, f64.
- baracuda_
kernels_ ⚠loss_ cosine_ embedding_ bf16_ can_ implement baracuda_kernels_loss_cosine_embedding_bf16_can_implement(baracuda kernels loss cosine embedding bf16 can implement).- baracuda_
kernels_ ⚠loss_ cosine_ embedding_ bf16_ run - CosineEmbedding FW, bf16.
- baracuda_
kernels_ ⚠loss_ cosine_ embedding_ f16_ can_ implement baracuda_kernels_loss_cosine_embedding_f16_can_implement(baracuda kernels loss cosine embedding f16 can implement).- baracuda_
kernels_ ⚠loss_ cosine_ embedding_ f16_ run - CosineEmbedding FW, f16.
- baracuda_
kernels_ ⚠loss_ cosine_ embedding_ f32_ can_ implement baracuda_kernels_loss_cosine_embedding_f32_can_implement(baracuda kernels loss cosine embedding f32 can implement).- baracuda_
kernels_ ⚠loss_ cosine_ embedding_ f32_ run - CosineEmbedding FW (per-row). ABI:
(n_rows, d_extent, row_stride_x, reduction_mode, margin, x1, x2, t, out, workspace, workspace_bytes, stream). - baracuda_
kernels_ ⚠loss_ cosine_ embedding_ f64_ can_ implement baracuda_kernels_loss_cosine_embedding_f64_can_implement(baracuda kernels loss cosine embedding f64 can implement).- baracuda_
kernels_ ⚠loss_ cosine_ embedding_ f64_ run - CosineEmbedding FW, f64.
- baracuda_
kernels_ ⚠loss_ cross_ entropy_ backward_ bf16_ can_ implement baracuda_kernels_loss_cross_entropy_backward_bf16_can_implement(baracuda kernels loss cross entropy backward bf16 can implement).- baracuda_
kernels_ ⚠loss_ cross_ entropy_ backward_ bf16_ run - CrossEntropy BW, bf16.
- baracuda_
kernels_ ⚠loss_ cross_ entropy_ backward_ f16_ can_ implement baracuda_kernels_loss_cross_entropy_backward_f16_can_implement(baracuda kernels loss cross entropy backward f16 can implement).- baracuda_
kernels_ ⚠loss_ cross_ entropy_ backward_ f16_ run - CrossEntropy BW, f16.
- baracuda_
kernels_ ⚠loss_ cross_ entropy_ backward_ f32_ can_ implement - CrossEntropy BW
_can_implement, f32. - baracuda_
kernels_ ⚠loss_ cross_ entropy_ backward_ f32_ run - CrossEntropy BW, f32.
dinput[i, c] = (softmax(input)[i, c] - 1{c==t[i]}) · scale. - baracuda_
kernels_ ⚠loss_ cross_ entropy_ backward_ f64_ can_ implement baracuda_kernels_loss_cross_entropy_backward_f64_can_implement(baracuda kernels loss cross entropy backward f64 can implement).- baracuda_
kernels_ ⚠loss_ cross_ entropy_ backward_ f64_ run - CrossEntropy BW, f64.
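The f32 entry above gives the gradient as dinput[i, c] = (softmax(input)[i, c] - 1{c==t[i]}) · scale. A host-side sketch of that formula follows. The function name is hypothetical, the input is assumed row-major and contiguous, and ignore-index handling, if the kernel has any, is not modeled.

```rust
/// Hypothetical CPU reference for the cross-entropy gradient.
/// `input` is row-major [rows, classes]; `targets[i]` is the class of row i.
fn cross_entropy_backward_ref(
    input: &[f32],
    targets: &[i64],
    classes: usize,
    scale: f32,
) -> Vec<f32> {
    assert!(classes > 0 && input.len() == targets.len() * classes);
    let mut dinput = Vec::with_capacity(input.len());
    for (row, &t) in input.chunks_exact(classes).zip(targets) {
        // Shift by the row maximum so exp() cannot overflow.
        let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let sum: f32 = row.iter().map(|&x| (x - max).exp()).sum();
        for (c, &x) in row.iter().enumerate() {
            let softmax = (x - max).exp() / sum;
            let one_hot = if c as i64 == t { 1.0 } else { 0.0 };
            dinput.push((softmax - one_hot) * scale);
        }
    }
    dinput
}
```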
- baracuda_
kernels_ ⚠loss_ cross_ entropy_ bf16_ can_ implement - CrossEntropy FW
_can_implement, bf16. - baracuda_
kernels_ ⚠loss_ cross_ entropy_ bf16_ run - CrossEntropy FW, bf16.
- baracuda_
kernels_ ⚠loss_ cross_ entropy_ f16_ can_ implement - CrossEntropy FW
_can_implement, f16. - baracuda_
kernels_ ⚠loss_ cross_ entropy_ f16_ run - CrossEntropy FW, f16.
- baracuda_
kernels_ ⚠loss_ cross_ entropy_ f32_ can_ implement - CrossEntropy FW
_can_implement, f32. - baracuda_
kernels_ ⚠loss_ cross_ entropy_ f32_ run - CrossEntropy FW, f32. Fused LogSoftmax + NLL. Numerically stable per-row two-pass max subtraction.
- baracuda_
kernels_ ⚠loss_ cross_ entropy_ f64_ can_ implement - CrossEntropy FW
_can_implement, f64. - baracuda_
kernels_ ⚠loss_ cross_ entropy_ f64_ run - CrossEntropy FW, f64.
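The forward kernel is described as fused LogSoftmax + NLL with a per-row two-pass max subtraction. The sketch below shows that technique on the host: pass one finds the row maximum, pass two sums the shifted exponentials. The function name is hypothetical and the layout is assumed row-major and contiguous; reduction to a scalar is left out.

```rust
/// Hypothetical CPU reference: per-row cross entropy as fused log-softmax + NLL.
fn cross_entropy_rows_ref(input: &[f32], targets: &[i64], classes: usize) -> Vec<f32> {
    assert!(classes > 0 && input.len() == targets.len() * classes);
    input
        .chunks_exact(classes)
        .zip(targets)
        .map(|(row, &t)| {
            // Pass 1: row maximum.
            let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
            // Pass 2: sum of exponentials of the shifted row.
            let sum: f32 = row.iter().map(|&x| (x - max).exp()).sum();
            // -log_softmax(row)[t] = ln(sum) + max - row[t]
            sum.ln() + max - row[t as usize]
        })
        .collect()
}
```

Without the max subtraction, a logit of 1000 would overflow exp() in f32; with it, the row below evaluates to a finite loss.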
- baracuda_
kernels_ ⚠loss_ cross_ entropy_ soft_ backward_ bf16_ can_ implement baracuda_kernels_loss_cross_entropy_soft_backward_bf16_can_implement(baracuda kernels loss cross entropy soft backward bf16 can implement).- baracuda_
kernels_ ⚠loss_ cross_ entropy_ soft_ backward_ bf16_ run - Soft-target CrossEntropy BW, bf16.
- baracuda_
kernels_ ⚠loss_ cross_ entropy_ soft_ backward_ f16_ can_ implement baracuda_kernels_loss_cross_entropy_soft_backward_f16_can_implement(baracuda kernels loss cross entropy soft backward f16 can implement).- baracuda_
kernels_ ⚠loss_ cross_ entropy_ soft_ backward_ f16_ run - Soft-target CrossEntropy BW, f16.
- baracuda_
kernels_ ⚠loss_ cross_ entropy_ soft_ backward_ f32_ can_ implement baracuda_kernels_loss_cross_entropy_soft_backward_f32_can_implement(baracuda kernels loss cross entropy soft backward f32 can implement).- baracuda_
kernels_ ⚠loss_ cross_ entropy_ soft_ backward_ f32_ run - Soft-target CrossEntropy BW, f32.
- baracuda_
kernels_ ⚠loss_ cross_ entropy_ soft_ backward_ f64_ can_ implement baracuda_kernels_loss_cross_entropy_soft_backward_f64_can_implement(baracuda kernels loss cross entropy soft backward f64 can implement).- baracuda_
kernels_ ⚠loss_ cross_ entropy_ soft_ backward_ f64_ run - Soft-target CrossEntropy BW, f64.
- baracuda_
kernels_ ⚠loss_ cross_ entropy_ soft_ bf16_ can_ implement baracuda_kernels_loss_cross_entropy_soft_bf16_can_implement(baracuda kernels loss cross entropy soft bf16 can implement).- baracuda_
kernels_ ⚠loss_ cross_ entropy_ soft_ bf16_ run - Soft-target CrossEntropy FW, bf16.
- baracuda_
kernels_ ⚠loss_ cross_ entropy_ soft_ f16_ can_ implement baracuda_kernels_loss_cross_entropy_soft_f16_can_implement(baracuda kernels loss cross entropy soft f16 can implement).- baracuda_
kernels_ ⚠loss_ cross_ entropy_ soft_ f16_ run - Soft-target CrossEntropy FW, f16.
- baracuda_
kernels_ ⚠loss_ cross_ entropy_ soft_ f32_ can_ implement baracuda_kernels_loss_cross_entropy_soft_f32_can_implement(baracuda kernels loss cross entropy soft f32 can implement).- baracuda_
kernels_ ⚠loss_ cross_ entropy_ soft_ f32_ run - Soft-target CrossEntropy FW, f32. Target is
T-typed prob tensor. - baracuda_
kernels_ ⚠loss_ cross_ entropy_ soft_ f64_ can_ implement baracuda_kernels_loss_cross_entropy_soft_f64_can_implement(baracuda kernels loss cross entropy soft f64 can implement).- baracuda_
kernels_ ⚠loss_ cross_ entropy_ soft_ f64_ run - Soft-target CrossEntropy FW, f64.
- baracuda_
kernels_ ⚠loss_ ctc_ backward_ bf16_ can_ implement - CTCLoss BW
_can_implement, bf16. - baracuda_
kernels_ ⚠loss_ ctc_ backward_ bf16_ run - CTCLoss BW, bf16.
- baracuda_
kernels_ ⚠loss_ ctc_ backward_ f16_ can_ implement - CTCLoss BW
_can_implement, f16. - baracuda_
kernels_ ⚠loss_ ctc_ backward_ f16_ run - CTCLoss BW, f16.
- baracuda_
kernels_ ⚠loss_ ctc_ backward_ f32_ can_ implement - CTCLoss BW
_can_implement, f32. - baracuda_
kernels_ ⚠loss_ ctc_ backward_ f32_ run - CTCLoss BW, f32.
- baracuda_
kernels_ ⚠loss_ ctc_ backward_ f64_ can_ implement - CTCLoss BW
_can_implement, f64 (F64_ACC). - baracuda_
kernels_ ⚠loss_ ctc_ backward_ f64_ run - CTCLoss BW, f64.
- baracuda_
kernels_ ⚠loss_ ctc_ bf16_ can_ implement - CTCLoss FW
_can_implement, bf16 (F32_ACC). - baracuda_
kernels_ ⚠loss_ ctc_ bf16_ run - CTCLoss FW, bf16.
- baracuda_
kernels_ ⚠loss_ ctc_ f16_ can_ implement - CTCLoss FW
_can_implement, f16 (F32_ACC). - baracuda_
kernels_ ⚠loss_ ctc_ f16_ run - CTCLoss FW, f16.
- baracuda_
kernels_ ⚠loss_ ctc_ f32_ can_ implement - CTCLoss FW
_can_implement, f32. Validates num_classes <= 32, max_target_len <= 256, blank ∈ [0, num_classes), reduction_mode ∈ [0,2]. - baracuda_
kernels_ ⚠loss_ ctc_ f32_ run - CTCLoss FW, f32.
- baracuda_
kernels_ ⚠loss_ ctc_ f64_ can_ implement - CTCLoss FW
_can_implement, f64 (F64_ACC). - baracuda_
kernels_ ⚠loss_ ctc_ f64_ run - CTCLoss FW, f64.
- baracuda_
kernels_ ⚠loss_ flce_ count_ non_ ignore_ can_ implement - Implementability check for
baracuda_kernels_loss_flce_count_non_ignore. Host-side only. - baracuda_
kernels_ ⚠loss_ flce_ count_ non_ ignore_ run - FLCE count-non-ignore. Single-block tree reduction; writes the
target[i] != ignore_index count into count_out[0] (i64). - baracuda_
kernels_ ⚠loss_ flce_ inplace_ scale_ bf16_ can_ implement - Implementability check for
baracuda_kernels_loss_flce_inplace_scale_bf16. Host-side only. - baracuda_
kernels_ ⚠loss_ flce_ inplace_ scale_ bf16_ run - FLCE in-place scale, bf16.
- baracuda_
kernels_ ⚠loss_ flce_ inplace_ scale_ f16_ can_ implement - Implementability check for
baracuda_kernels_loss_flce_inplace_scale_f16. Host-side only. - baracuda_
kernels_ ⚠loss_ flce_ inplace_ scale_ f16_ run - FLCE in-place scale, f16.
- baracuda_
kernels_ ⚠loss_ flce_ inplace_ scale_ f32_ can_ implement - Implementability check for
baracuda_kernels_loss_flce_inplace_scale_f32. Host-side only. - baracuda_
kernels_ ⚠loss_ flce_ inplace_ scale_ f32_ run - FLCE in-place scale, f32. Multiplies
buf in place by scalar. - baracuda_
kernels_ ⚠loss_ flce_ inplace_ scale_ f64_ can_ implement - Implementability check for
baracuda_kernels_loss_flce_inplace_scale_f64. Host-side only. - baracuda_
kernels_ ⚠loss_ flce_ inplace_ scale_ f64_ run - FLCE in-place scale, f64.
- baracuda_
kernels_ ⚠loss_ flce_ per_ row_ bf16_ can_ implement - Implementability check for
baracuda_kernels_loss_flce_per_row_bf16. Host-side only. - baracuda_
kernels_ ⚠loss_ flce_ per_ row_ bf16_ run - FLCE per-row fused step, bf16.
- baracuda_
kernels_ ⚠loss_ flce_ per_ row_ cast_ bf16_ can_ implement - Implementability check for
baracuda_kernels_loss_flce_per_row_cast_bf16. Host-side only. - baracuda_
kernels_ ⚠loss_ flce_ per_ row_ cast_ bf16_ run - FLCE per-row cast, f32 → bf16.
- baracuda_
kernels_ ⚠loss_ flce_ per_ row_ cast_ f16_ can_ implement - Implementability check for
baracuda_kernels_loss_flce_per_row_cast_f16. Host-side only. - baracuda_
kernels_ ⚠loss_ flce_ per_ row_ cast_ f16_ run - FLCE per-row cast, f32 → f16.
- baracuda_
kernels_ ⚠loss_ flce_ per_ row_ cast_ f32_ can_ implement - Implementability check for
baracuda_kernels_loss_flce_per_row_cast_f32. Host-side only. - baracuda_
kernels_ ⚠loss_ flce_ per_ row_ cast_ f32_ run - FLCE per-row cast (None mode finalizer), f32 → f32.
- baracuda_
kernels_ ⚠loss_ flce_ per_ row_ cast_ f64_ can_ implement - Implementability check for
baracuda_kernels_loss_flce_per_row_cast_f64. Host-side only. - baracuda_
kernels_ ⚠loss_ flce_ per_ row_ cast_ f64_ run - FLCE per-row cast, f32 → f64.
- baracuda_
kernels_ ⚠loss_ flce_ per_ row_ f16_ can_ implement - Implementability check for
baracuda_kernels_loss_flce_per_row_f16. Host-side only. - baracuda_
kernels_ ⚠loss_ flce_ per_ row_ f16_ run - FLCE per-row fused step, f16.
- baracuda_
kernels_ ⚠loss_ flce_ per_ row_ f32_ can_ implement - Implementability check for
baracuda_kernels_loss_flce_per_row_f32. Host-side only. - baracuda_
kernels_ ⚠loss_ flce_ per_ row_ f32_ run - FLCE per-row fused step, f32. Mutates
logits in place to grad_logits = (softmax - one_hot) · scale_per_row; writes per-row -log_softmax[target] into loss_1d (f32 accumulator). - baracuda_
kernels_ ⚠loss_ flce_ per_ row_ f64_ can_ implement - Implementability check for
baracuda_kernels_loss_flce_per_row_f64. Host-side only. - baracuda_
kernels_ ⚠loss_ flce_ per_ row_ f64_ run - FLCE per-row fused step, f64.
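The per-row fused step overwrites the logits with their gradient and emits the row loss in the same pass. A single-row host sketch of that behavior is shown below. The function name is hypothetical, and ignored rows (targets equal to the ignore index) are not modeled.

```rust
/// Hypothetical CPU reference for one row of the fused step.
/// Overwrites `logits` with (softmax - one_hot) * scale_per_row and
/// returns -log_softmax(logits)[target] computed from the original values.
fn flce_per_row_ref(logits: &mut [f32], target: usize, scale_per_row: f32) -> f32 {
    assert!(target < logits.len());
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let sum: f32 = logits.iter().map(|&x| (x - max).exp()).sum();
    // The loss must be read before the buffer is overwritten.
    let loss = sum.ln() + max - logits[target];
    for (c, x) in logits.iter_mut().enumerate() {
        let softmax = (*x - max).exp() / sum;
        let one_hot = if c == target { 1.0 } else { 0.0 };
        *x = (softmax - one_hot) * scale_per_row;
    }
    loss
}
```

Because the input buffer is destroyed, callers that still need the raw logits have to keep their own copy.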
- baracuda_
kernels_ ⚠loss_ flce_ scalar_ finalize_ bf16_ can_ implement - Implementability check for
baracuda_kernels_loss_flce_scalar_finalize_bf16. Host-side only. - baracuda_
kernels_ ⚠loss_ flce_ scalar_ finalize_ bf16_ run - FLCE scalar finalize, f32 → bf16.
- baracuda_
kernels_ ⚠loss_ flce_ scalar_ finalize_ f16_ can_ implement - Implementability check for
baracuda_kernels_loss_flce_scalar_finalize_f16. Host-side only. - baracuda_
kernels_ ⚠loss_ flce_ scalar_ finalize_ f16_ run - FLCE scalar finalize, f32 → f16.
- baracuda_
kernels_ ⚠loss_ flce_ scalar_ finalize_ f32_ can_ implement - Implementability check for
baracuda_kernels_loss_flce_scalar_finalize_f32. Host-side only. - baracuda_
kernels_ ⚠loss_ flce_ scalar_ finalize_ f32_ run - FLCE scalar finalize (Mean/Sum), f32 → f32.
- baracuda_
kernels_ ⚠loss_ flce_ scalar_ finalize_ f64_ can_ implement - Implementability check for
baracuda_kernels_loss_flce_scalar_finalize_f64. Host-side only. - baracuda_
kernels_ ⚠loss_ flce_ scalar_ finalize_ f64_ run - FLCE scalar finalize, f32 → f64.
- baracuda_
kernels_ ⚠loss_ gaussian_ nll_ backward_ bf16_ can_ implement baracuda_kernels_loss_gaussian_nll_backward_bf16_can_implement(baracuda kernels loss gaussian nll backward bf16 can implement).- baracuda_
kernels_ ⚠loss_ gaussian_ nll_ backward_ bf16_ run - GaussianNLL BW, bf16.
- baracuda_
kernels_ ⚠loss_ gaussian_ nll_ backward_ f16_ can_ implement baracuda_kernels_loss_gaussian_nll_backward_f16_can_implement(baracuda kernels loss gaussian nll backward f16 can implement).- baracuda_
kernels_ ⚠loss_ gaussian_ nll_ backward_ f16_ run - GaussianNLL BW, f16.
- baracuda_
kernels_ ⚠loss_ gaussian_ nll_ backward_ f32_ can_ implement baracuda_kernels_loss_gaussian_nll_backward_f32_can_implement(baracuda kernels loss gaussian nll backward f32 can implement).- baracuda_
kernels_ ⚠loss_ gaussian_ nll_ backward_ f32_ run - GaussianNLL BW, f32.
- baracuda_
kernels_ ⚠loss_ gaussian_ nll_ backward_ f64_ can_ implement baracuda_kernels_loss_gaussian_nll_backward_f64_can_implement(baracuda kernels loss gaussian nll backward f64 can implement).- baracuda_
kernels_ ⚠loss_ gaussian_ nll_ backward_ f64_ run - GaussianNLL BW, f64.
- baracuda_
kernels_ ⚠loss_ gaussian_ nll_ bf16_ can_ implement baracuda_kernels_loss_gaussian_nll_bf16_can_implement(baracuda kernels loss gaussian nll bf16 can implement).- baracuda_
kernels_ ⚠loss_ gaussian_ nll_ bf16_ run - GaussianNLL FW, bf16.
- baracuda_
kernels_ ⚠loss_ gaussian_ nll_ f16_ can_ implement baracuda_kernels_loss_gaussian_nll_f16_can_implement(baracuda kernels loss gaussian nll f16 can implement).- baracuda_
kernels_ ⚠loss_ gaussian_ nll_ f16_ run - GaussianNLL FW, f16.
- baracuda_
kernels_ ⚠loss_ gaussian_ nll_ f32_ can_ implement baracuda_kernels_loss_gaussian_nll_f32_can_implement(baracuda kernels loss gaussian nll f32 can implement).- baracuda_
kernels_ ⚠loss_ gaussian_ nll_ f32_ run - GaussianNLL FW, f32. 3-tensor input (input, target, var).
- baracuda_
kernels_ ⚠loss_ gaussian_ nll_ f64_ can_ implement baracuda_kernels_loss_gaussian_nll_f64_can_implement(baracuda kernels loss gaussian nll f64 can implement).- baracuda_
kernels_ ⚠loss_ gaussian_ nll_ f64_ run - GaussianNLL FW, f64.
- baracuda_
kernels_ ⚠loss_ hinge_ embedding_ backward_ bf16_ can_ implement baracuda_kernels_loss_hinge_embedding_backward_bf16_can_implement(baracuda kernels loss hinge embedding backward bf16 can implement).- baracuda_
kernels_ ⚠loss_ hinge_ embedding_ backward_ bf16_ run - HingeEmbedding BW, bf16.
- baracuda_
kernels_ ⚠loss_ hinge_ embedding_ backward_ f16_ can_ implement baracuda_kernels_loss_hinge_embedding_backward_f16_can_implement(baracuda kernels loss hinge embedding backward f16 can implement).- baracuda_
kernels_ ⚠loss_ hinge_ embedding_ backward_ f16_ run - HingeEmbedding BW, f16.
- baracuda_
kernels_ ⚠loss_ hinge_ embedding_ backward_ f32_ can_ implement baracuda_kernels_loss_hinge_embedding_backward_f32_can_implement(baracuda kernels loss hinge embedding backward f32 can implement).- baracuda_
kernels_ ⚠loss_ hinge_ embedding_ backward_ f32_ run - HingeEmbedding BW, f32.
- baracuda_
kernels_ ⚠loss_ hinge_ embedding_ backward_ f64_ can_ implement baracuda_kernels_loss_hinge_embedding_backward_f64_can_implement(baracuda kernels loss hinge embedding backward f64 can implement).- baracuda_
kernels_ ⚠loss_ hinge_ embedding_ backward_ f64_ run - HingeEmbedding BW, f64.
- baracuda_
kernels_ ⚠loss_ hinge_ embedding_ bf16_ can_ implement baracuda_kernels_loss_hinge_embedding_bf16_can_implement(baracuda kernels loss hinge embedding bf16 can implement).- baracuda_
kernels_ ⚠loss_ hinge_ embedding_ bf16_ run - HingeEmbedding FW, bf16.
- baracuda_
kernels_ ⚠loss_ hinge_ embedding_ f16_ can_ implement baracuda_kernels_loss_hinge_embedding_f16_can_implement(baracuda kernels loss hinge embedding f16 can implement).- baracuda_
kernels_ ⚠loss_ hinge_ embedding_ f16_ run - HingeEmbedding FW, f16.
- baracuda_
kernels_ ⚠loss_ hinge_ embedding_ f32_ can_ implement baracuda_kernels_loss_hinge_embedding_f32_can_implement(baracuda kernels loss hinge embedding f32 can implement).- baracuda_
kernels_ ⚠loss_ hinge_ embedding_ f32_ run - HingeEmbedding FW, f32. ABI:
(numel, reduction_mode, margin, input, target_i64, out, workspace, workspace_bytes, stream). - baracuda_
kernels_ ⚠loss_ hinge_ embedding_ f64_ can_ implement baracuda_kernels_loss_hinge_embedding_f64_can_implement(baracuda kernels loss hinge embedding f64 can implement).- baracuda_
kernels_ ⚠loss_ hinge_ embedding_ f64_ run - HingeEmbedding FW, f64.
- baracuda_
kernels_ ⚠loss_ huber_ backward_ bf16_ can_ implement baracuda_kernels_loss_huber_backward_bf16_can_implement(baracuda kernels loss huber backward bf16 can implement).- baracuda_
kernels_ ⚠loss_ huber_ backward_ bf16_ run - Huber BW, bf16.
- baracuda_
kernels_ ⚠loss_ huber_ backward_ f16_ can_ implement baracuda_kernels_loss_huber_backward_f16_can_implement(baracuda kernels loss huber backward f16 can implement).- baracuda_
kernels_ ⚠loss_ huber_ backward_ f16_ run - Huber BW, f16.
- baracuda_
kernels_ ⚠loss_ huber_ backward_ f32_ can_ implement baracuda_kernels_loss_huber_backward_f32_can_implement(baracuda kernels loss huber backward f32 can implement).- baracuda_
kernels_ ⚠loss_ huber_ backward_ f32_ run - Huber BW, f32.
- baracuda_
kernels_ ⚠loss_ huber_ backward_ f64_ can_ implement baracuda_kernels_loss_huber_backward_f64_can_implement(baracuda kernels loss huber backward f64 can implement).- baracuda_
kernels_ ⚠loss_ huber_ backward_ f64_ run - Huber BW, f64.
- baracuda_
kernels_ ⚠loss_ huber_ bf16_ can_ implement baracuda_kernels_loss_huber_bf16_can_implement(baracuda kernels loss huber bf16 can implement).- baracuda_
kernels_ ⚠loss_ huber_ bf16_ run - Huber FW, bf16.
- baracuda_
kernels_ ⚠loss_ huber_ f16_ can_ implement baracuda_kernels_loss_huber_f16_can_implement(baracuda kernels loss huber f16 can implement).- baracuda_
kernels_ ⚠loss_ huber_ f16_ run - Huber FW, f16.
- baracuda_
kernels_ ⚠loss_ huber_ f32_ can_ implement baracuda_kernels_loss_huber_f32_can_implement(baracuda kernels loss huber f32 can implement).- baracuda_
kernels_ ⚠loss_ huber_ f32_ run - Huber FW, f32.
param = δ. - baracuda_
kernels_ ⚠loss_ huber_ f64_ can_ implement baracuda_kernels_loss_huber_f64_can_implement(baracuda kernels loss huber f64 can implement).- baracuda_
kernels_ ⚠loss_ huber_ f64_ run - Huber FW, f64.
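The Huber entry only states that param = δ. Assuming the usual Huber definition (quadratic inside δ, linear outside, as in PyTorch's HuberLoss), a per-cell host reference could look like the following. The function name is hypothetical.

```rust
/// Hypothetical per-cell Huber reference with threshold `delta`.
/// Assumed definition: 0.5 * d^2 if |d| <= delta, else delta * (|d| - 0.5 * delta).
fn huber_cell_ref(pred: f32, target: f32, delta: f32) -> f32 {
    let d = (pred - target).abs();
    if d <= delta {
        0.5 * d * d
    } else {
        delta * (d - 0.5 * delta)
    }
}
```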
- baracuda_
kernels_ ⚠loss_ kl_ div_ backward_ bf16_ can_ implement - KLDiv BW
_can_implement, bf16. - baracuda_
kernels_ ⚠loss_ kl_ div_ backward_ bf16_ run - KLDiv BW, bf16.
- baracuda_
kernels_ ⚠loss_ kl_ div_ backward_ f16_ can_ implement - KLDiv BW
_can_implement, f16. - baracuda_
kernels_ ⚠loss_ kl_ div_ backward_ f16_ run - KLDiv BW, f16.
- baracuda_
kernels_ ⚠loss_ kl_ div_ backward_ f32_ can_ implement - KLDiv BW
_can_implement, f32. - baracuda_
kernels_ ⚠loss_ kl_ div_ backward_ f32_ run - KLDiv BW, f32.
dinput = -target · scale. - baracuda_
kernels_ ⚠loss_ kl_ div_ backward_ f64_ can_ implement - KLDiv BW
_can_implement, f64. - baracuda_
kernels_ ⚠loss_ kl_ div_ backward_ f64_ run - KLDiv BW, f64.
- baracuda_
kernels_ ⚠loss_ kl_ div_ bf16_ can_ implement - KLDiv FW
_can_implement, bf16. - baracuda_
kernels_ ⚠loss_ kl_ div_ bf16_ run - KLDiv FW, bf16.
- baracuda_
kernels_ ⚠loss_ kl_ div_ f16_ can_ implement - KLDiv FW
_can_implement, f16. - baracuda_
kernels_ ⚠loss_ kl_ div_ f16_ run - KLDiv FW, f16.
- baracuda_
kernels_ ⚠loss_ kl_ div_ f32_ can_ implement - KLDiv FW
_can_implement, f32. - baracuda_
kernels_ ⚠loss_ kl_ div_ f32_ run - KLDiv FW, f32.
target·(log(target) - input) per-cell. PyTorch convention: input is already log-prob. - baracuda_
kernels_ ⚠loss_ kl_ div_ f64_ can_ implement - KLDiv FW
_can_implement, f64. - baracuda_
kernels_ ⚠loss_ kl_ div_ f64_ run - KLDiv FW, f64.
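The KLDiv entries give target·(log(target) - input) per cell for the forward pass and dinput = -target · scale for the backward pass, with the input already in log-probability space. A host sketch of both follows. The function names are hypothetical, and treating cells with target = 0 as contributing zero is an assumption borrowed from the PyTorch convention, not something the index states.

```rust
/// Hypothetical per-cell forward reference; `input_log_prob` is already a log-probability.
fn kl_div_cell_ref(input_log_prob: f32, target: f32) -> f32 {
    if target > 0.0 {
        target * (target.ln() - input_log_prob)
    } else {
        // Assumed convention: 0 * ln(0) is taken as 0.
        0.0
    }
}

/// Hypothetical per-cell backward reference: dinput = -target * scale.
fn kl_div_backward_cell_ref(target: f32, scale: f32) -> f32 {
    -target * scale
}
```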
- baracuda_
kernels_ ⚠loss_ l1_ backward_ bf16_ can_ implement baracuda_kernels_loss_l1_backward_bf16_can_implement(baracuda kernels loss l1 backward bf16 can implement).- baracuda_
kernels_ ⚠loss_ l1_ backward_ bf16_ run - L1 BW, bf16.
- baracuda_
kernels_ ⚠loss_ l1_ backward_ f16_ can_ implement baracuda_kernels_loss_l1_backward_f16_can_implement(baracuda kernels loss l1 backward f16 can implement).- baracuda_
kernels_ ⚠loss_ l1_ backward_ f16_ run - L1 BW, f16.
- baracuda_
kernels_ ⚠loss_ l1_ backward_ f32_ can_ implement - L1 BW
_can_implement, f32. - baracuda_
kernels_ ⚠loss_ l1_ backward_ f32_ run - L1 BW, f32.
dpred = sign(pred - target) · scale. - baracuda_
kernels_ ⚠loss_ l1_ backward_ f64_ can_ implement baracuda_kernels_loss_l1_backward_f64_can_implement(baracuda kernels loss l1 backward f64 can implement).- baracuda_
kernels_ ⚠loss_ l1_ backward_ f64_ run - L1 BW, f64.
- baracuda_
kernels_ ⚠loss_ l1_ bf16_ can_ implement - L1 FW
_can_implement, bf16. - baracuda_
kernels_ ⚠loss_ l1_ bf16_ run - L1 FW, bf16.
- baracuda_
kernels_ ⚠loss_ l1_ f16_ can_ implement - L1 FW
_can_implement, f16. - baracuda_
kernels_ ⚠loss_ l1_ f16_ run - L1 FW, f16.
- baracuda_
kernels_ ⚠loss_ l1_ f32_ can_ implement - L1 FW
_can_implement, f32. - baracuda_
kernels_ ⚠loss_ l1_ f32_ run - L1 FW, f32.
y = |pred - target| per-cell; mean/sum reduce to scalar. - baracuda_
kernels_ ⚠loss_ l1_ f64_ can_ implement - L1 FW
_can_implement, f64. - baracuda_
kernels_ ⚠loss_ l1_ f64_ run - L1 FW, f64.
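For L1 the index lists y = |pred - target| per cell and dpred = sign(pred - target) · scale. The sketch below mirrors both on the host. The function names are hypothetical, and using a gradient of 0 when pred equals target is an assumption; note that Rust's `f32::signum` returns 1.0 for +0.0, so the sign is written out by hand.

```rust
/// Hypothetical per-cell L1 forward reference.
fn l1_cell_ref(pred: f32, target: f32) -> f32 {
    (pred - target).abs()
}

/// Hypothetical per-cell L1 backward reference: sign(pred - target) * scale.
fn l1_backward_cell_ref(pred: f32, target: f32, scale: f32) -> f32 {
    let d = pred - target;
    // Explicit three-way sign; assumed to be 0 at d == 0.
    let sign = if d > 0.0 {
        1.0
    } else if d < 0.0 {
        -1.0
    } else {
        0.0
    };
    sign * scale
}
```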
- baracuda_
kernels_ ⚠loss_ margin_ ranking_ backward_ bf16_ can_ implement baracuda_kernels_loss_margin_ranking_backward_bf16_can_implement(baracuda kernels loss margin ranking backward bf16 can implement).- baracuda_
kernels_ ⚠loss_ margin_ ranking_ backward_ bf16_ run - MarginRanking BW, bf16.
- baracuda_
kernels_ ⚠loss_ margin_ ranking_ backward_ f16_ can_ implement baracuda_kernels_loss_margin_ranking_backward_f16_can_implement(baracuda kernels loss margin ranking backward f16 can implement).- baracuda_
kernels_ ⚠loss_ margin_ ranking_ backward_ f16_ run - MarginRanking BW, f16.
- baracuda_
kernels_ ⚠loss_ margin_ ranking_ backward_ f32_ can_ implement baracuda_kernels_loss_margin_ranking_backward_f32_can_implement(baracuda kernels loss margin ranking backward f32 can implement).- baracuda_
kernels_ ⚠loss_ margin_ ranking_ backward_ f32_ run - MarginRanking BW, f32. ABI:
(numel, reduction_mode, scale, margin, x1, x2, t, dy, dx1, dx2, workspace, workspace_bytes, stream). - baracuda_
kernels_ ⚠loss_ margin_ ranking_ backward_ f64_ can_ implement baracuda_kernels_loss_margin_ranking_backward_f64_can_implement(baracuda kernels loss margin ranking backward f64 can implement).- baracuda_
kernels_ ⚠loss_ margin_ ranking_ backward_ f64_ run - MarginRanking BW, f64.
- baracuda_
kernels_ ⚠loss_ margin_ ranking_ bf16_ can_ implement baracuda_kernels_loss_margin_ranking_bf16_can_implement(baracuda kernels loss margin ranking bf16 can implement).- baracuda_
kernels_ ⚠loss_ margin_ ranking_ bf16_ run - MarginRanking FW, bf16.
- baracuda_
kernels_ ⚠loss_ margin_ ranking_ f16_ can_ implement baracuda_kernels_loss_margin_ranking_f16_can_implement(baracuda kernels loss margin ranking f16 can implement).- baracuda_
kernels_ ⚠loss_ margin_ ranking_ f16_ run - MarginRanking FW, f16.
- baracuda_
kernels_ ⚠loss_ margin_ ranking_ f32_ can_ implement baracuda_kernels_loss_margin_ranking_f32_can_implement(baracuda kernels loss margin ranking f32 can implement).- baracuda_
kernels_ ⚠loss_ margin_ ranking_ f32_ run - MarginRanking FW, f32. ABI:
(numel, reduction_mode, margin, x1, x2, t, out, workspace, workspace_bytes, stream). - baracuda_
kernels_ ⚠loss_ margin_ ranking_ f64_ can_ implement baracuda_kernels_loss_margin_ranking_f64_can_implement(baracuda kernels loss margin ranking f64 can implement).- baracuda_
kernels_ ⚠loss_ margin_ ranking_ f64_ run - MarginRanking FW, f64.
- baracuda_
kernels_ ⚠loss_ mse_ backward_ bf16_ can_ implement - MSE BW
_can_implement, bf16. - baracuda_
kernels_ ⚠loss_ mse_ backward_ bf16_ run - MSE BW, bf16.
- baracuda_
kernels_ ⚠loss_ mse_ backward_ f16_ can_ implement - MSE BW
_can_implement, f16. - baracuda_
kernels_ ⚠loss_ mse_ backward_ f16_ run - MSE BW, f16.
- baracuda_
kernels_ ⚠loss_ mse_ backward_ f32_ can_ implement - MSE BW
_can_implement, f32. - baracuda_
kernels_ ⚠loss_ mse_ backward_ f32_ run - MSE BW, f32.
dpred = 2·(pred - target) · scale. - baracuda_
kernels_ ⚠loss_ mse_ backward_ f64_ can_ implement - MSE BW
_can_implement, f64. - baracuda_
kernels_ ⚠loss_ mse_ backward_ f64_ run - MSE BW, f64.
- baracuda_
kernels_ ⚠loss_ mse_ bf16_ can_ implement - MSE FW
_can_implement, bf16. - baracuda_
kernels_ ⚠loss_ mse_ bf16_ run - MSE FW, bf16.
- baracuda_
kernels_ ⚠loss_ mse_ f16_ can_ implement - MSE FW
_can_implement, f16. - baracuda_
kernels_ ⚠loss_ mse_ f16_ run - MSE FW, f16.
- baracuda_
kernels_ ⚠loss_ mse_ f32_ can_ implement - MSE FW
_can_implement, f32. Host-side validator (no launch). - baracuda_
kernels_ ⚠loss_ mse_ f32_ run - MSE FW, f32.
(pred - target)² per-cell; mean/sum reduce to scalar. Workspace: numel * sizeof(T) bytes for Mean/Sum; unused for None. - baracuda_
kernels_ ⚠loss_ mse_ f64_ can_ implement - MSE FW
_can_implement, f64. - baracuda_
kernels_ ⚠loss_ mse_ f64_ run - MSE FW, f64.
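The MSE entries describe three things: the per-cell value (pred - target)², a Mean/Sum reduction to a scalar, and a workspace of numel * sizeof(T) bytes that is only needed for Mean/Sum. A host sketch tying these together follows. The `Reduction` enum and all function names are hypothetical; the integer encoding of `reduction_mode` in the real ABI is not given in this index, so an enum stands in for it.

```rust
/// Hypothetical stand-in for the ABI's integer reduction mode.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Reduction {
    None,
    Mean,
    Sum,
}

/// Hypothetical CPU reference: per-cell squared error, optionally reduced.
/// Returns `numel` values for None and a single value for Mean/Sum.
fn mse_ref(pred: &[f32], target: &[f32], reduction: Reduction) -> Vec<f32> {
    assert_eq!(pred.len(), target.len());
    let cells: Vec<f32> = pred
        .iter()
        .zip(target)
        .map(|(&p, &t)| (p - t) * (p - t))
        .collect();
    match reduction {
        Reduction::None => cells,
        Reduction::Sum => vec![cells.iter().sum::<f32>()],
        Reduction::Mean => vec![cells.iter().sum::<f32>() / cells.len() as f32],
    }
}

/// Workspace size as documented: numel * sizeof(T) for Mean/Sum, unused for None.
fn mse_workspace_bytes(numel: usize, elem_size: usize, reduction: Reduction) -> usize {
    match reduction {
        Reduction::None => 0,
        Reduction::Mean | Reduction::Sum => numel * elem_size,
    }
}

/// Per-cell backward as documented: dpred = 2 * (pred - target) * scale.
fn mse_backward_cell_ref(pred: f32, target: f32, scale: f32) -> f32 {
    2.0 * (pred - target) * scale
}
```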
- baracuda_
kernels_ ⚠loss_ multi_ margin_ backward_ bf16_ can_ implement baracuda_kernels_loss_multi_margin_backward_bf16_can_implement(baracuda kernels loss multi margin backward bf16 can implement).- baracuda_
kernels_ ⚠loss_ multi_ margin_ backward_ bf16_ run - MultiMargin BW, bf16.
- baracuda_
kernels_ ⚠loss_ multi_ margin_ backward_ f16_ can_ implement baracuda_kernels_loss_multi_margin_backward_f16_can_implement(baracuda kernels loss multi margin backward f16 can implement).- baracuda_
kernels_ ⚠loss_ multi_ margin_ backward_ f16_ run - MultiMargin BW, f16.
- baracuda_
kernels_ ⚠loss_ multi_ margin_ backward_ f32_ can_ implement baracuda_kernels_loss_multi_margin_backward_f32_can_implement(baracuda kernels loss multi margin backward f32 can implement).- baracuda_
kernels_ ⚠loss_ multi_ margin_ backward_ f32_ run - MultiMargin BW.
- baracuda_
kernels_ ⚠loss_ multi_ margin_ backward_ f64_ can_ implement baracuda_kernels_loss_multi_margin_backward_f64_can_implement(baracuda kernels loss multi margin backward f64 can implement).- baracuda_
kernels_ ⚠loss_ multi_ margin_ backward_ f64_ run - MultiMargin BW, f64.
- baracuda_
kernels_ ⚠loss_ multi_ margin_ bf16_ can_ implement baracuda_kernels_loss_multi_margin_bf16_can_implement(baracuda kernels loss multi margin bf16 can implement).- baracuda_
kernels_ ⚠loss_ multi_ margin_ bf16_ run - MultiMargin FW, bf16.
- baracuda_
kernels_ ⚠loss_ multi_ margin_ f16_ can_ implement baracuda_kernels_loss_multi_margin_f16_can_implement(baracuda kernels loss multi margin f16 can implement).- baracuda_
kernels_ ⚠loss_ multi_ margin_ f16_ run - MultiMargin FW, f16.
- baracuda_
kernels_ ⚠loss_ multi_ margin_ f32_ can_ implement baracuda_kernels_loss_multi_margin_f32_can_implement(baracuda kernels loss multi margin f32 can implement).- baracuda_
kernels_ ⚠loss_ multi_ margin_ f32_ run - MultiMargin FW (per-row). ABI:
(n_rows, class_extent, row_stride, reduction_mode, margin, p_norm, input, target_i64, out, workspace, workspace_bytes, stream). - baracuda_
kernels_ ⚠loss_ multi_ margin_ f64_ can_ implement baracuda_kernels_loss_multi_margin_f64_can_implement(baracuda kernels loss multi margin f64 can implement).- baracuda_
kernels_ ⚠loss_ multi_ margin_ f64_ run - MultiMargin FW, f64.
- baracuda_
kernels_ ⚠loss_ multilabel_ margin_ backward_ bf16_ can_ implement baracuda_kernels_loss_multilabel_margin_backward_bf16_can_implement(baracuda kernels loss multilabel margin backward bf16 can implement).- baracuda_
kernels_ ⚠loss_ multilabel_ margin_ backward_ bf16_ run - MultilabelMargin BW, bf16.
- baracuda_
kernels_ ⚠loss_ multilabel_ margin_ backward_ f16_ can_ implement baracuda_kernels_loss_multilabel_margin_backward_f16_can_implement(baracuda kernels loss multilabel margin backward f16 can implement).- baracuda_
kernels_ ⚠loss_ multilabel_ margin_ backward_ f16_ run - MultilabelMargin BW, f16.
- baracuda_
kernels_ ⚠loss_ multilabel_ margin_ backward_ f32_ can_ implement baracuda_kernels_loss_multilabel_margin_backward_f32_can_implement(baracuda kernels loss multilabel margin backward f32 can implement).- baracuda_
kernels_ ⚠loss_ multilabel_ margin_ backward_ f32_ run - MultilabelMargin BW.
- baracuda_
kernels_ ⚠loss_ multilabel_ margin_ backward_ f64_ can_ implement baracuda_kernels_loss_multilabel_margin_backward_f64_can_implement(baracuda kernels loss multilabel margin backward f64 can implement).- baracuda_
kernels_ ⚠loss_ multilabel_ margin_ backward_ f64_ run - MultilabelMargin BW, f64.
- baracuda_
kernels_ ⚠loss_ multilabel_ margin_ bf16_ can_ implement baracuda_kernels_loss_multilabel_margin_bf16_can_implement(baracuda kernels loss multilabel margin bf16 can implement).- baracuda_
kernels_ ⚠loss_ multilabel_ margin_ bf16_ run - MultilabelMargin FW, bf16.
- baracuda_
kernels_ ⚠loss_ multilabel_ margin_ f16_ can_ implement baracuda_kernels_loss_multilabel_margin_f16_can_implement(baracuda kernels loss multilabel margin f16 can implement).- baracuda_
kernels_ ⚠loss_ multilabel_ margin_ f16_ run - MultilabelMargin FW, f16.
- baracuda_
kernels_ ⚠loss_ multilabel_ margin_ f32_ can_ implement baracuda_kernels_loss_multilabel_margin_f32_can_implement(baracuda kernels loss multilabel margin f32 can implement).- baracuda_
kernels_ ⚠loss_ multilabel_ margin_ f32_ run - MultilabelMargin FW (per-row). ABI:
(n_rows, class_extent, row_stride_in, row_stride_tgt, reduction_mode, input, target_i64, out, workspace, workspace_bytes, stream). - baracuda_
kernels_ ⚠loss_ multilabel_ margin_ f64_ can_ implement baracuda_kernels_loss_multilabel_margin_f64_can_implement(baracuda kernels loss multilabel margin f64 can implement).- baracuda_
kernels_ ⚠loss_ multilabel_ margin_ f64_ run - MultilabelMargin FW, f64.
- baracuda_
kernels_ ⚠loss_ multilabel_ soft_ margin_ backward_ bf16_ can_ implement baracuda_kernels_loss_multilabel_soft_margin_backward_bf16_can_implement(baracuda kernels loss multilabel soft margin backward bf16 can implement).- baracuda_
kernels_ ⚠loss_ multilabel_ soft_ margin_ backward_ bf16_ run - MultilabelSoftMargin BW, bf16.
- baracuda_
kernels_ ⚠loss_ multilabel_ soft_ margin_ backward_ f16_ can_ implement baracuda_kernels_loss_multilabel_soft_margin_backward_f16_can_implement(baracuda kernels loss multilabel soft margin backward f16 can implement).- baracuda_
kernels_ ⚠loss_ multilabel_ soft_ margin_ backward_ f16_ run - MultilabelSoftMargin BW, f16.
- baracuda_
kernels_ ⚠loss_ multilabel_ soft_ margin_ backward_ f32_ can_ implement baracuda_kernels_loss_multilabel_soft_margin_backward_f32_can_implement(baracuda kernels loss multilabel soft margin backward f32 can implement).- baracuda_
kernels_ ⚠loss_ multilabel_ soft_ margin_ backward_ f32_ run - MultilabelSoftMargin BW.
- baracuda_
kernels_ ⚠loss_ multilabel_ soft_ margin_ backward_ f64_ can_ implement baracuda_kernels_loss_multilabel_soft_margin_backward_f64_can_implement(baracuda kernels loss multilabel soft margin backward f64 can implement).- baracuda_
kernels_ ⚠loss_ multilabel_ soft_ margin_ backward_ f64_ run - MultilabelSoftMargin BW, f64.
- baracuda_
kernels_ ⚠loss_ multilabel_ soft_ margin_ bf16_ can_ implement baracuda_kernels_loss_multilabel_soft_margin_bf16_can_implement(baracuda kernels loss multilabel soft margin bf16 can implement).- baracuda_
kernels_ ⚠loss_ multilabel_ soft_ margin_ bf16_ run - MultilabelSoftMargin FW, bf16.
- baracuda_
kernels_ ⚠loss_ multilabel_ soft_ margin_ f16_ can_ implement baracuda_kernels_loss_multilabel_soft_margin_f16_can_implement(baracuda kernels loss multilabel soft margin f16 can implement).- baracuda_
kernels_ ⚠loss_ multilabel_ soft_ margin_ f16_ run - MultilabelSoftMargin FW, f16.
- baracuda_
kernels_ ⚠loss_ multilabel_ soft_ margin_ f32_ can_ implement baracuda_kernels_loss_multilabel_soft_margin_f32_can_implement(baracuda kernels loss multilabel soft margin f32 can implement).- baracuda_
kernels_ ⚠loss_ multilabel_ soft_ margin_ f32_ run - MultilabelSoftMargin FW.
- baracuda_
kernels_ ⚠loss_ multilabel_ soft_ margin_ f64_ can_ implement baracuda_kernels_loss_multilabel_soft_margin_f64_can_implement(baracuda kernels loss multilabel soft margin f64 can implement).- baracuda_
kernels_ ⚠loss_ multilabel_ soft_ margin_ f64_ run - MultilabelSoftMargin FW, f64.
- baracuda_
kernels_ ⚠loss_ nll_ backward_ bf16_ can_ implement - NLL BW
_can_implement, bf16. - baracuda_
kernels_ ⚠loss_ nll_ backward_ bf16_ run - NLL BW, bf16.
- baracuda_
kernels_ ⚠loss_ nll_ backward_ f16_ can_ implement - NLL BW
_can_implement, f16. - baracuda_
kernels_ ⚠loss_ nll_ backward_ f16_ run - NLL BW, f16.
- baracuda_
kernels_ ⚠loss_ nll_ backward_ f32_ can_ implement - NLL BW
_can_implement, f32. - baracuda_
kernels_ ⚠loss_ nll_ backward_ f32_ run - NLL BW, f32. Pre-zeros
dinput (size dinput_numel · sizeof(T)), then writes dinput[i, target[i]] = -dy_or_scale. - baracuda_
kernels_ ⚠loss_ nll_ backward_ f64_ can_ implement - NLL BW
_can_implement, f64. - baracuda_
kernels_ ⚠loss_ nll_ backward_ f64_ run - NLL BW, f64.
- baracuda_
kernels_ ⚠loss_ nll_ bf16_ can_ implement - NLL FW
_can_implement, bf16. - baracuda_
kernels_ ⚠loss_ nll_ bf16_ run - NLL FW, bf16.
- baracuda_
kernels_ ⚠loss_ nll_ f16_ can_ implement - NLL FW
_can_implement, f16. - baracuda_
kernels_ ⚠loss_ nll_ f16_ run - NLL FW, f16.
- baracuda_
kernels_ ⚠loss_ nll_ f32_ can_ implement - NLL FW
_can_implement, f32. - baracuda_
kernels_ ⚠loss_ nll_ f32_ run - NLL FW, f32.
Computes -input[i, target[i]] per row. Heterogeneous-dtype: input is T, target is i64. row_stride_input is the i64 stride between adjacent rows of input (must equal class_extent for contiguous input). - baracuda_
kernels_ ⚠loss_ nll_ f64_ can_ implement - NLL FW
_can_implement, f64. - baracuda_
kernels_ ⚠loss_ nll_ f64_ run - NLL FW, f64.
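The NLL forward and backward summaries above describe a gather and a scatter along the class axis. As a rough host-side illustration (not the kernel itself), the two operations can be sketched as below. The function names `nll_forward_ref` and `nll_backward_ref` are hypothetical, reduction handling is left out, and every target index is assumed to be a valid class index.

```rust
/// Hypothetical host-side reference for the per-row NLL forward value:
/// loss[i] = -input[i, target[i]].
/// `row_stride_input` is the element stride between adjacent rows of `input`.
/// Assumes every target is a valid, non-negative class index.
pub fn nll_forward_ref(input: &[f32], target: &[i64], row_stride_input: usize) -> Vec<f32> {
    target
        .iter()
        .enumerate()
        .map(|(i, &t)| -input[i * row_stride_input + t as usize])
        .collect()
}

/// Hypothetical host-side reference for the backward pass: start from an
/// all-zero `dinput`, then write -dy_or_scale into the target column of each row.
/// Assumes a contiguous `dinput` with `class_extent` columns.
pub fn nll_backward_ref(target: &[i64], class_extent: usize, dy_or_scale: f32) -> Vec<f32> {
    let mut dinput = vec![0.0f32; target.len() * class_extent];
    for (i, &t) in target.iter().enumerate() {
        dinput[i * class_extent + t as usize] = -dy_or_scale;
    }
    dinput
}
```

A reference like this is mainly useful for checking device output in tests; the real entry points additionally take raw device pointers, a workspace, and a stream.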
- baracuda_
kernels_ ⚠loss_ poisson_ nll_ backward_ bf16_ can_ implement baracuda_kernels_loss_poisson_nll_backward_bf16_can_implement(baracuda kernels loss poisson nll backward bf16 can implement).- baracuda_
kernels_ ⚠loss_ poisson_ nll_ backward_ bf16_ run - PoissonNLL BW, bf16.
- baracuda_
kernels_ ⚠loss_ poisson_ nll_ backward_ f16_ can_ implement baracuda_kernels_loss_poisson_nll_backward_f16_can_implement(baracuda kernels loss poisson nll backward f16 can implement).- baracuda_
kernels_ ⚠loss_ poisson_ nll_ backward_ f16_ run - PoissonNLL BW, f16.
- baracuda_
kernels_ ⚠loss_ poisson_ nll_ backward_ f32_ can_ implement baracuda_kernels_loss_poisson_nll_backward_f32_can_implement(baracuda kernels loss poisson nll backward f32 can implement).- baracuda_
kernels_ ⚠loss_ poisson_ nll_ backward_ f32_ run - PoissonNLL BW, f32.
- baracuda_
kernels_ ⚠loss_ poisson_ nll_ backward_ f64_ can_ implement baracuda_kernels_loss_poisson_nll_backward_f64_can_implement(baracuda kernels loss poisson nll backward f64 can implement).- baracuda_
kernels_ ⚠loss_ poisson_ nll_ backward_ f64_ run - PoissonNLL BW, f64.
- baracuda_
kernels_ ⚠loss_ poisson_ nll_ bf16_ can_ implement baracuda_kernels_loss_poisson_nll_bf16_can_implement(baracuda kernels loss poisson nll bf16 can implement).- baracuda_
kernels_ ⚠loss_ poisson_ nll_ bf16_ run - PoissonNLL FW, bf16.
- baracuda_
kernels_ ⚠loss_ poisson_ nll_ f16_ can_ implement baracuda_kernels_loss_poisson_nll_f16_can_implement(baracuda kernels loss poisson nll f16 can implement).- baracuda_
kernels_ ⚠loss_ poisson_ nll_ f16_ run - PoissonNLL FW, f16.
- baracuda_
kernels_ ⚠loss_ poisson_ nll_ f32_ can_ implement baracuda_kernels_loss_poisson_nll_f32_can_implement(baracuda kernels loss poisson nll f32 can implement).- baracuda_
kernels_ ⚠loss_ poisson_ nll_ f32_ run - PoissonNLL FW, f32.
log_input_flag is 0 or 1. - baracuda_
kernels_ ⚠loss_ poisson_ nll_ f64_ can_ implement baracuda_kernels_loss_poisson_nll_f64_can_implement(baracuda kernels loss poisson nll f64 can implement).- baracuda_
kernels_ ⚠loss_ poisson_ nll_ f64_ run - PoissonNLL FW, f64.
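The PoissonNLL forward summary above only names a `log_input_flag`. The sketch below shows what such a flag conventionally selects, assuming the kernel follows the PyTorch `PoissonNLLLoss` definition (without the optional Stirling term). That assumption is not confirmed by this page, and `poisson_nll_elem` and its `eps` parameter are hypothetical names.

```rust
/// Hypothetical per-element Poisson NLL, following the PyTorch convention:
///   log_input_flag != 0: exp(input) - target * input
///   log_input_flag == 0: input - target * ln(input + eps)
/// `eps` guards ln(0) in the second branch; it is an illustrative parameter.
pub fn poisson_nll_elem(input: f32, target: f32, log_input_flag: i32, eps: f32) -> f32 {
    if log_input_flag != 0 {
        input.exp() - target * input
    } else {
        input - target * (input + eps).ln()
    }
}
```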
- baracuda_
kernels_ ⚠loss_ smooth_ l1_ backward_ bf16_ can_ implement baracuda_kernels_loss_smooth_l1_backward_bf16_can_implement(baracuda kernels loss smooth l1 backward bf16 can implement).- baracuda_
kernels_ ⚠loss_ smooth_ l1_ backward_ bf16_ run - SmoothL1 BW, bf16.
- baracuda_
kernels_ ⚠loss_ smooth_ l1_ backward_ f16_ can_ implement baracuda_kernels_loss_smooth_l1_backward_f16_can_implement(baracuda kernels loss smooth l1 backward f16 can implement).- baracuda_
kernels_ ⚠loss_ smooth_ l1_ backward_ f16_ run - SmoothL1 BW, f16.
- baracuda_
kernels_ ⚠loss_ smooth_ l1_ backward_ f32_ can_ implement baracuda_kernels_loss_smooth_l1_backward_f32_can_implement(baracuda kernels loss smooth l1 backward f32 can implement).- baracuda_
kernels_ ⚠loss_ smooth_ l1_ backward_ f32_ run - SmoothL1 BW, f32.
- baracuda_
kernels_ ⚠loss_ smooth_ l1_ backward_ f64_ can_ implement baracuda_kernels_loss_smooth_l1_backward_f64_can_implement(baracuda kernels loss smooth l1 backward f64 can implement).- baracuda_
kernels_ ⚠loss_ smooth_ l1_ backward_ f64_ run - SmoothL1 BW, f64.
- baracuda_
kernels_ ⚠loss_ smooth_ l1_ bf16_ can_ implement baracuda_kernels_loss_smooth_l1_bf16_can_implement(baracuda kernels loss smooth l1 bf16 can implement).- baracuda_
kernels_ ⚠loss_ smooth_ l1_ bf16_ run - SmoothL1 FW, bf16.
- baracuda_
kernels_ ⚠loss_ smooth_ l1_ f16_ can_ implement baracuda_kernels_loss_smooth_l1_f16_can_implement(baracuda kernels loss smooth l1 f16 can implement).- baracuda_
kernels_ ⚠loss_ smooth_ l1_ f16_ run - SmoothL1 FW, f16.
- baracuda_
kernels_ ⚠loss_ smooth_ l1_ f32_ can_ implement baracuda_kernels_loss_smooth_l1_f32_can_implement(baracuda kernels loss smooth l1 f32 can implement).- baracuda_
kernels_ ⚠loss_ smooth_ l1_ f32_ run - SmoothL1 FW, f32.
param = β. - baracuda_
kernels_ ⚠loss_ smooth_ l1_ f64_ can_ implement baracuda_kernels_loss_smooth_l1_f64_can_implement(baracuda kernels loss smooth l1 f64 can implement).- baracuda_
kernels_ ⚠loss_ smooth_ l1_ f64_ run - SmoothL1 FW, f64.
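The SmoothL1 forward summary above states that `param` is β. Assuming the kernels use the common (PyTorch-style) SmoothL1 definition with β > 0, the per-element value and gradient would look like the sketch below; `smooth_l1` and `smooth_l1_grad` are hypothetical helper names, not part of this crate.

```rust
/// Hypothetical per-element SmoothL1 value for beta > 0:
///   0.5 * d^2 / beta   if |d| < beta
///   |d| - 0.5 * beta   otherwise, where d = x - y.
pub fn smooth_l1(x: f32, y: f32, beta: f32) -> f32 {
    let d = (x - y).abs();
    if d < beta {
        0.5 * d * d / beta
    } else {
        d - 0.5 * beta
    }
}

/// Hypothetical per-element gradient with respect to x for beta > 0:
///   d / beta   if |d| < beta
///   sign(d)    otherwise.
pub fn smooth_l1_grad(x: f32, y: f32, beta: f32) -> f32 {
    let d = x - y;
    if d.abs() < beta {
        d / beta
    } else {
        d.signum()
    }
}
```

The quadratic branch keeps the gradient continuous at |d| = β, which is the usual reason for exposing β as a tunable parameter.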
- baracuda_
kernels_ ⚠loss_ triplet_ margin_ backward_ bf16_ can_ implement baracuda_kernels_loss_triplet_margin_backward_bf16_can_implement(baracuda kernels loss triplet margin backward bf16 can implement).- baracuda_
kernels_ ⚠loss_ triplet_ margin_ backward_ bf16_ run - TripletMargin BW, bf16.
- baracuda_
kernels_ ⚠loss_ triplet_ margin_ backward_ f16_ can_ implement baracuda_kernels_loss_triplet_margin_backward_f16_can_implement(baracuda kernels loss triplet margin backward f16 can implement).- baracuda_
kernels_ ⚠loss_ triplet_ margin_ backward_ f16_ run - TripletMargin BW, f16.
- baracuda_
kernels_ ⚠loss_ triplet_ margin_ backward_ f32_ can_ implement baracuda_kernels_loss_triplet_margin_backward_f32_can_implement(baracuda kernels loss triplet margin backward f32 can implement).- baracuda_
kernels_ ⚠loss_ triplet_ margin_ backward_ f32_ run - TripletMargin BW.
- baracuda_
kernels_ ⚠loss_ triplet_ margin_ backward_ f64_ can_ implement baracuda_kernels_loss_triplet_margin_backward_f64_can_implement(baracuda kernels loss triplet margin backward f64 can implement).- baracuda_
kernels_ ⚠loss_ triplet_ margin_ backward_ f64_ run - TripletMargin BW, f64.
- baracuda_
kernels_ ⚠loss_ triplet_ margin_ bf16_ can_ implement baracuda_kernels_loss_triplet_margin_bf16_can_implement(baracuda kernels loss triplet margin bf16 can implement).- baracuda_
kernels_ ⚠loss_ triplet_ margin_ bf16_ run - TripletMargin FW, bf16.
- baracuda_
kernels_ ⚠loss_ triplet_ margin_ f16_ can_ implement baracuda_kernels_loss_triplet_margin_f16_can_implement(baracuda kernels loss triplet margin f16 can implement).- baracuda_
kernels_ ⚠loss_ triplet_ margin_ f16_ run - TripletMargin FW, f16.
- baracuda_
kernels_ ⚠loss_ triplet_ margin_ f32_ can_ implement baracuda_kernels_loss_triplet_margin_f32_can_implement(baracuda kernels loss triplet margin f32 can implement).- baracuda_
kernels_ ⚠loss_ triplet_ margin_ f32_ run - TripletMargin FW (per-row). ABI:
(n_rows, d_extent, row_stride, reduction_mode, margin, p_norm, a, p, n, out, workspace, workspace_bytes, stream). - baracuda_
kernels_ ⚠loss_ triplet_ margin_ f64_ can_ implement baracuda_kernels_loss_triplet_margin_f64_can_implement(baracuda kernels loss triplet margin f64 can implement).- baracuda_
kernels_ ⚠loss_ triplet_ margin_ f64_ run - TripletMargin FW, f64.
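The TripletMargin forward ABI above lists `n_rows`, `d_extent`, `row_stride`, `margin`, and `p_norm` alongside the anchor, positive, and negative operands. A host-side sketch of the per-row quantity those arguments suggest, max(0, margin + ‖a − p‖ − ‖a − n‖) under the p-norm, follows. It assumes all three operands share `row_stride`, ignores `reduction_mode`, and adds no epsilon inside the distance; the function names are hypothetical.

```rust
/// p-norm distance between two equal-length slices (illustrative helper).
fn p_norm_dist(u: &[f32], v: &[f32], p_norm: f32) -> f32 {
    u.iter()
        .zip(v)
        .map(|(x, y)| (x - y).abs().powf(p_norm))
        .sum::<f32>()
        .powf(1.0 / p_norm)
}

/// Hypothetical per-row triplet margin reference:
/// out[i] = max(0, margin + dist(a_i, p_i) - dist(a_i, n_i)).
/// Assumes `a`, `p`, and `n` all use the same `row_stride` (in elements).
pub fn triplet_margin_rows_ref(
    a: &[f32],
    p: &[f32],
    n: &[f32],
    n_rows: usize,
    d_extent: usize,
    row_stride: usize,
    margin: f32,
    p_norm: f32,
) -> Vec<f32> {
    (0..n_rows)
        .map(|i| {
            let lo = i * row_stride;
            let hi = lo + d_extent;
            let d_ap = p_norm_dist(&a[lo..hi], &p[lo..hi], p_norm);
            let d_an = p_norm_dist(&a[lo..hi], &n[lo..hi], p_norm);
            (margin + d_ap - d_an).max(0.0)
        })
        .collect()
}
```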
- baracuda_
kernels_ ⚠lp_ pool_ 1d_ bf16_ backward_ can_ implement baracuda_kernels_lp_pool_1d_bf16_backward_can_implement(baracuda kernels lp pool 1d bf16 backward can implement).- baracuda_
kernels_ ⚠lp_ pool_ 1d_ bf16_ backward_ run - LpPool 1d BW, bf16.
- baracuda_
kernels_ ⚠lp_ pool_ 1d_ bf16_ can_ implement baracuda_kernels_lp_pool_1d_bf16_can_implement(baracuda kernels lp pool 1d bf16 can implement).- baracuda_
kernels_ ⚠lp_ pool_ 1d_ bf16_ run - LpPool 1d FW, bf16.
- baracuda_
kernels_ ⚠lp_ pool_ 1d_ f16_ backward_ can_ implement baracuda_kernels_lp_pool_1d_f16_backward_can_implement(baracuda kernels lp pool 1d f16 backward can implement).- baracuda_
kernels_ ⚠lp_ pool_ 1d_ f16_ backward_ run - LpPool 1d BW, f16.
- baracuda_
kernels_ ⚠lp_ pool_ 1d_ f16_ can_ implement baracuda_kernels_lp_pool_1d_f16_can_implement(baracuda kernels lp pool 1d f16 can implement).- baracuda_
kernels_ ⚠lp_ pool_ 1d_ f16_ run - LpPool 1d FW, f16.
- baracuda_
kernels_ ⚠lp_ pool_ 1d_ f32_ backward_ can_ implement baracuda_kernels_lp_pool_1d_f32_backward_can_implement(baracuda kernels lp pool 1d f32 backward can implement).- baracuda_
kernels_ ⚠lp_ pool_ 1d_ f32_ backward_ run - LpPool 1d BW, f32. Caller must zero
dx first. - baracuda_
kernels_ ⚠lp_ pool_ 1d_ f32_ can_ implement baracuda_kernels_lp_pool_1d_f32_can_implement(baracuda kernels lp pool 1d f32 can implement).- baracuda_
kernels_ ⚠lp_ pool_ 1d_ f32_ run - LpPool 1d FW, f32.
- baracuda_
kernels_ ⚠lp_ pool_ 1d_ f64_ backward_ can_ implement baracuda_kernels_lp_pool_1d_f64_backward_can_implement(baracuda kernels lp pool 1d f64 backward can implement).- baracuda_
kernels_ ⚠lp_ pool_ 1d_ f64_ backward_ run - LpPool 1d BW, f64.
- baracuda_
kernels_ ⚠lp_ pool_ 1d_ f64_ can_ implement baracuda_kernels_lp_pool_1d_f64_can_implement(baracuda kernels lp pool 1d f64 can implement).- baracuda_
kernels_ ⚠lp_ pool_ 1d_ f64_ run - LpPool 1d FW, f64.
- baracuda_
kernels_ ⚠lp_ pool_ 2d_ bf16_ backward_ can_ implement baracuda_kernels_lp_pool_2d_bf16_backward_can_implement(baracuda kernels lp pool 2d bf16 backward can implement).- baracuda_
kernels_ ⚠lp_ pool_ 2d_ bf16_ backward_ run - LpPool 2d BW, bf16.
- baracuda_
kernels_ ⚠lp_ pool_ 2d_ bf16_ can_ implement baracuda_kernels_lp_pool_2d_bf16_can_implement(baracuda kernels lp pool 2d bf16 can implement).- baracuda_
kernels_ ⚠lp_ pool_ 2d_ bf16_ run - LpPool 2d FW, bf16.
- baracuda_
kernels_ ⚠lp_ pool_ 2d_ f16_ backward_ can_ implement baracuda_kernels_lp_pool_2d_f16_backward_can_implement(baracuda kernels lp pool 2d f16 backward can implement).- baracuda_
kernels_ ⚠lp_ pool_ 2d_ f16_ backward_ run - LpPool 2d BW, f16.
- baracuda_
kernels_ ⚠lp_ pool_ 2d_ f16_ can_ implement baracuda_kernels_lp_pool_2d_f16_can_implement(baracuda kernels lp pool 2d f16 can implement).- baracuda_
kernels_ ⚠lp_ pool_ 2d_ f16_ run - LpPool 2d FW, f16.
- baracuda_
kernels_ ⚠lp_ pool_ 2d_ f32_ backward_ can_ implement baracuda_kernels_lp_pool_2d_f32_backward_can_implement(baracuda kernels lp pool 2d f32 backward can implement).- baracuda_
kernels_ ⚠lp_ pool_ 2d_ f32_ backward_ run - LpPool 2d BW, f32. Caller must zero
dx first. - baracuda_
kernels_ ⚠lp_ pool_ 2d_ f32_ can_ implement baracuda_kernels_lp_pool_2d_f32_can_implement(baracuda kernels lp pool 2d f32 can implement).- baracuda_
kernels_ ⚠lp_ pool_ 2d_ f32_ run - LpPool 2d FW, f32.
- baracuda_
kernels_ ⚠lp_ pool_ 2d_ f64_ backward_ can_ implement baracuda_kernels_lp_pool_2d_f64_backward_can_implement(baracuda kernels lp pool 2d f64 backward can implement).- baracuda_
kernels_ ⚠lp_ pool_ 2d_ f64_ backward_ run - LpPool 2d BW, f64.
- baracuda_
kernels_ ⚠lp_ pool_ 2d_ f64_ can_ implement baracuda_kernels_lp_pool_2d_f64_can_implement(baracuda kernels lp pool 2d f64 can implement).- baracuda_
kernels_ ⚠lp_ pool_ 2d_ f64_ run - LpPool 2d FW, f64.
- baracuda_
kernels_ ⚠lstsq_ f32_ run - Least-squares solve via iterative
_gels (no QR fallback). On convergence, niters_out >= 0. On non-convergence, niters_out < 0 and the caller should retry via the Rust plan layer (which holds the QR fallback path). - baracuda_
kernels_ ⚠lstsq_ f32_ workspace_ size - LstSq workspace size in BYTES (not elements — cuSOLVER’s
_gels API differs from the others on this point). - baracuda_
kernels_ ⚠lstsq_ f64_ run - Least-squares solve via iterative
_gels (no QR fallback). On convergence, niters_out >= 0. On non-convergence, niters_out < 0 and the caller should retry via the Rust plan layer (which holds the QR fallback path). - baracuda_
kernels_ ⚠lstsq_ f64_ workspace_ size - LstSq workspace size in BYTES (not elements — cuSOLVER’s
_gels API differs from the others on this point). - baracuda_
kernels_ ⚠lu_ f32_ run - LU factorization with partial pivoting (non-batched).
a_inout is overwritten with the packed LU factors; pivots_out receives the 1-based row swaps (length min(m, n)); info_out is a single i32. - baracuda_
kernels_ ⚠lu_ f32_ workspace_ size - LU factorization workspace size in bytes for
getrf. - baracuda_
kernels_ ⚠lu_ f64_ run - LU factorization with partial pivoting (non-batched).
a_inout is overwritten with the packed LU factors; pivots_out receives the 1-based row swaps (length min(m, n)); info_out is a single i32. - baracuda_
kernels_ ⚠lu_ f64_ workspace_ size - LU factorization workspace size in bytes for
getrf. - baracuda_
kernels_ ⚠masked_ fill_ backward_ bool_ can_ implement - Implementability check for
masked_fill_backward_bool. - baracuda_
kernels_ ⚠masked_ fill_ backward_ bool_ run - masked_fill_backward — bool (u8 storage). - baracuda_
kernels_ ⚠masked_ fill_ backward_ f32_ can_ implement - Implementability check for
masked_fill_backward_f32. - baracuda_
kernels_ ⚠masked_ fill_ backward_ f32_ run - dsrc[i] = mask[i] ? 0 : dout[i]. f32. - baracuda_
kernels_ ⚠masked_ fill_ backward_ f64_ can_ implement - Implementability check for
masked_fill_backward_f64. - baracuda_
kernels_ ⚠masked_ fill_ backward_ f64_ run - masked_fill_backward — f64. - baracuda_
kernels_ ⚠masked_ fill_ backward_ i32_ can_ implement - Implementability check for
masked_fill_backward_i32. - baracuda_
kernels_ ⚠masked_ fill_ backward_ i32_ run - masked_fill_backward — i32. - baracuda_
kernels_ ⚠masked_ fill_ bool_ can_ implement - Implementability check for
masked_fill_bool. - baracuda_
kernels_ ⚠masked_ fill_ bool_ run - masked_fill — bool (u8 storage). - baracuda_
kernels_ ⚠masked_ fill_ f32_ can_ implement - Implementability check for
masked_fill_f32. - baracuda_
kernels_ ⚠masked_ fill_ f32_ run - out[i] = mask[i] ? fill_value : src[i]. f32 (caller passes fill_value.to_bits() as i64). - baracuda_
kernels_ ⚠masked_ fill_ f64_ can_ implement - Implementability check for
masked_fill_f64. - baracuda_
kernels_ ⚠masked_ fill_ f64_ run - masked_fill — f64. - baracuda_
kernels_ ⚠masked_ fill_ i32_ can_ implement - Implementability check for
masked_fill_i32. - baracuda_
kernels_ ⚠masked_ fill_ i32_ run - masked_fill — i32. - baracuda_
kernels_ ⚠mmvq_ batched_ bf16_ can_ implement baracuda_kernels_mmvq_batched_bf16_can_implement(baracuda kernels mmvq batched bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ batched_ bf16_ run - Batched MMV (non-quant) — bf16. # Safety: as f32.
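Looking back at the `masked_fill` family listed just before the batched MMV entries: the f32 forward summary says the caller passes `fill_value.to_bits() as i64`. The sketch below shows that packing on the host, plus a plain reference for the documented forward and backward formulas. The helper names are hypothetical, and treating the mask as one `u8` per element is an assumption based on the "bool (u8 storage)" note.

```rust
/// Packs an f32 fill value into an i64 slot by reinterpreting its bits.
/// `to_bits` yields a u32, so the cast zero-extends into the low 32 bits.
pub fn pack_fill_f32(fill_value: f32) -> i64 {
    fill_value.to_bits() as i64
}

/// Recovers the f32 from the low 32 bits of the packed value.
pub fn unpack_fill_f32(packed: i64) -> f32 {
    f32::from_bits(packed as u32)
}

/// Host reference for out[i] = mask[i] ? fill_value : src[i].
pub fn masked_fill_ref(src: &[f32], mask: &[u8], packed_fill: i64) -> Vec<f32> {
    let fill = unpack_fill_f32(packed_fill);
    src.iter()
        .zip(mask)
        .map(|(&s, &m)| if m != 0 { fill } else { s })
        .collect()
}

/// Host reference for dsrc[i] = mask[i] ? 0 : dout[i].
pub fn masked_fill_backward_ref(dout: &[f32], mask: &[u8]) -> Vec<f32> {
    dout.iter()
        .zip(mask)
        .map(|(&g, &m)| if m != 0 { 0.0 } else { g })
        .collect()
}
```

Passing the bit pattern rather than a float lets a single integer parameter slot carry the fill value across the C ABI; how the other dtypes pack their fill value is not stated on this page.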
- baracuda_
kernels_ ⚠mmvq_ batched_ f16_ can_ implement baracuda_kernels_mmvq_batched_f16_can_implement(baracuda kernels mmvq batched f16 can implement).- baracuda_
kernels_ ⚠mmvq_ batched_ f16_ run - Batched MMV (non-quant) — f16. # Safety: as f32.
- baracuda_
kernels_ ⚠mmvq_ batched_ f32_ can_ implement baracuda_kernels_mmvq_batched_f32_can_implement(baracuda kernels mmvq batched f32 can implement).- baracuda_
kernels_ ⚠mmvq_ batched_ f32_ run - Batched MMV (non-quant) — f32 weights + activation + output.
- baracuda_
kernels_ ⚠mmvq_ multim_ q2_ K_ m1_ can_ implement baracuda_kernels_mmvq_multim_q2_K_m1_can_implement(baracuda kernels mmvq multim q2 k m1 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q2_ K_ m1_ run baracuda_kernels_mmvq_multim_q2_K_m1_run(baracuda kernels mmvq multim q2 k m1 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q2_ K_ m2_ can_ implement baracuda_kernels_mmvq_multim_q2_K_m2_can_implement(baracuda kernels mmvq multim q2 k m2 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q2_ K_ m2_ run baracuda_kernels_mmvq_multim_q2_K_m2_run(baracuda kernels mmvq multim q2 k m2 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q2_ K_ m4_ can_ implement baracuda_kernels_mmvq_multim_q2_K_m4_can_implement(baracuda kernels mmvq multim q2 k m4 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q2_ K_ m4_ run baracuda_kernels_mmvq_multim_q2_K_m4_run(baracuda kernels mmvq multim q2 k m4 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q2_ K_ m8_ can_ implement baracuda_kernels_mmvq_multim_q2_K_m8_can_implement(baracuda kernels mmvq multim q2 k m8 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q2_ K_ m8_ run baracuda_kernels_mmvq_multim_q2_K_m8_run(baracuda kernels mmvq multim q2 k m8 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q3_ K_ m1_ can_ implement baracuda_kernels_mmvq_multim_q3_K_m1_can_implement(baracuda kernels mmvq multim q3 k m1 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q3_ K_ m1_ run baracuda_kernels_mmvq_multim_q3_K_m1_run(baracuda kernels mmvq multim q3 k m1 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q3_ K_ m2_ can_ implement baracuda_kernels_mmvq_multim_q3_K_m2_can_implement(baracuda kernels mmvq multim q3 k m2 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q3_ K_ m2_ run baracuda_kernels_mmvq_multim_q3_K_m2_run(baracuda kernels mmvq multim q3 k m2 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q3_ K_ m4_ can_ implement baracuda_kernels_mmvq_multim_q3_K_m4_can_implement(baracuda kernels mmvq multim q3 k m4 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q3_ K_ m4_ run baracuda_kernels_mmvq_multim_q3_K_m4_run(baracuda kernels mmvq multim q3 k m4 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q3_ K_ m8_ can_ implement baracuda_kernels_mmvq_multim_q3_K_m8_can_implement(baracuda kernels mmvq multim q3 k m8 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q3_ K_ m8_ run baracuda_kernels_mmvq_multim_q3_K_m8_run(baracuda kernels mmvq multim q3 k m8 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ 0_ m1_ can_ implement baracuda_kernels_mmvq_multim_q4_0_m1_can_implement(baracuda kernels mmvq multim q4 0 m1 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ 0_ m1_ run baracuda_kernels_mmvq_multim_q4_0_m1_run(baracuda kernels mmvq multim q4 0 m1 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ 0_ m2_ can_ implement baracuda_kernels_mmvq_multim_q4_0_m2_can_implement(baracuda kernels mmvq multim q4 0 m2 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ 0_ m2_ run baracuda_kernels_mmvq_multim_q4_0_m2_run(baracuda kernels mmvq multim q4 0 m2 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ 0_ m4_ can_ implement baracuda_kernels_mmvq_multim_q4_0_m4_can_implement(baracuda kernels mmvq multim q4 0 m4 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ 0_ m4_ run baracuda_kernels_mmvq_multim_q4_0_m4_run(baracuda kernels mmvq multim q4 0 m4 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ 0_ m8_ can_ implement baracuda_kernels_mmvq_multim_q4_0_m8_can_implement(baracuda kernels mmvq multim q4 0 m8 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ 0_ m8_ run baracuda_kernels_mmvq_multim_q4_0_m8_run(baracuda kernels mmvq multim q4 0 m8 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ 1_ m1_ can_ implement baracuda_kernels_mmvq_multim_q4_1_m1_can_implement(baracuda kernels mmvq multim q4 1 m1 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ 1_ m1_ run baracuda_kernels_mmvq_multim_q4_1_m1_run(baracuda kernels mmvq multim q4 1 m1 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ 1_ m2_ can_ implement baracuda_kernels_mmvq_multim_q4_1_m2_can_implement(baracuda kernels mmvq multim q4 1 m2 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ 1_ m2_ run baracuda_kernels_mmvq_multim_q4_1_m2_run(baracuda kernels mmvq multim q4 1 m2 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ 1_ m4_ can_ implement baracuda_kernels_mmvq_multim_q4_1_m4_can_implement(baracuda kernels mmvq multim q4 1 m4 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ 1_ m4_ run baracuda_kernels_mmvq_multim_q4_1_m4_run(baracuda kernels mmvq multim q4 1 m4 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ 1_ m8_ can_ implement baracuda_kernels_mmvq_multim_q4_1_m8_can_implement(baracuda kernels mmvq multim q4 1 m8 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ 1_ m8_ run baracuda_kernels_mmvq_multim_q4_1_m8_run(baracuda kernels mmvq multim q4 1 m8 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ K_ m1_ can_ implement baracuda_kernels_mmvq_multim_q4_K_m1_can_implement(baracuda kernels mmvq multim q4 k m1 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ K_ m1_ run baracuda_kernels_mmvq_multim_q4_K_m1_run(baracuda kernels mmvq multim q4 k m1 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ K_ m2_ can_ implement baracuda_kernels_mmvq_multim_q4_K_m2_can_implement(baracuda kernels mmvq multim q4 k m2 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ K_ m2_ run baracuda_kernels_mmvq_multim_q4_K_m2_run(baracuda kernels mmvq multim q4 k m2 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ K_ m4_ can_ implement baracuda_kernels_mmvq_multim_q4_K_m4_can_implement(baracuda kernels mmvq multim q4 k m4 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ K_ m4_ run baracuda_kernels_mmvq_multim_q4_K_m4_run(baracuda kernels mmvq multim q4 k m4 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ K_ m8_ can_ implement baracuda_kernels_mmvq_multim_q4_K_m8_can_implement(baracuda kernels mmvq multim q4 k m8 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q4_ K_ m8_ run baracuda_kernels_mmvq_multim_q4_K_m8_run(baracuda kernels mmvq multim q4 k m8 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ 0_ m1_ can_ implement baracuda_kernels_mmvq_multim_q5_0_m1_can_implement(baracuda kernels mmvq multim q5 0 m1 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ 0_ m1_ run baracuda_kernels_mmvq_multim_q5_0_m1_run(baracuda kernels mmvq multim q5 0 m1 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ 0_ m2_ can_ implement baracuda_kernels_mmvq_multim_q5_0_m2_can_implement(baracuda kernels mmvq multim q5 0 m2 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ 0_ m2_ run baracuda_kernels_mmvq_multim_q5_0_m2_run(baracuda kernels mmvq multim q5 0 m2 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ 0_ m4_ can_ implement baracuda_kernels_mmvq_multim_q5_0_m4_can_implement(baracuda kernels mmvq multim q5 0 m4 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ 0_ m4_ run baracuda_kernels_mmvq_multim_q5_0_m4_run(baracuda kernels mmvq multim q5 0 m4 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ 0_ m8_ can_ implement baracuda_kernels_mmvq_multim_q5_0_m8_can_implement(baracuda kernels mmvq multim q5 0 m8 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ 0_ m8_ run baracuda_kernels_mmvq_multim_q5_0_m8_run(baracuda kernels mmvq multim q5 0 m8 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ 1_ m1_ can_ implement baracuda_kernels_mmvq_multim_q5_1_m1_can_implement(baracuda kernels mmvq multim q5 1 m1 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ 1_ m1_ run baracuda_kernels_mmvq_multim_q5_1_m1_run(baracuda kernels mmvq multim q5 1 m1 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ 1_ m2_ can_ implement baracuda_kernels_mmvq_multim_q5_1_m2_can_implement(baracuda kernels mmvq multim q5 1 m2 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ 1_ m2_ run baracuda_kernels_mmvq_multim_q5_1_m2_run(baracuda kernels mmvq multim q5 1 m2 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ 1_ m4_ can_ implement baracuda_kernels_mmvq_multim_q5_1_m4_can_implement(baracuda kernels mmvq multim q5 1 m4 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ 1_ m4_ run baracuda_kernels_mmvq_multim_q5_1_m4_run(baracuda kernels mmvq multim q5 1 m4 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ 1_ m8_ can_ implement baracuda_kernels_mmvq_multim_q5_1_m8_can_implement(baracuda kernels mmvq multim q5 1 m8 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ 1_ m8_ run baracuda_kernels_mmvq_multim_q5_1_m8_run(baracuda kernels mmvq multim q5 1 m8 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ K_ m1_ can_ implement baracuda_kernels_mmvq_multim_q5_K_m1_can_implement(baracuda kernels mmvq multim q5 k m1 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ K_ m1_ run baracuda_kernels_mmvq_multim_q5_K_m1_run(baracuda kernels mmvq multim q5 k m1 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ K_ m2_ can_ implement baracuda_kernels_mmvq_multim_q5_K_m2_can_implement(baracuda kernels mmvq multim q5 k m2 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ K_ m2_ run baracuda_kernels_mmvq_multim_q5_K_m2_run(baracuda kernels mmvq multim q5 k m2 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ K_ m4_ can_ implement baracuda_kernels_mmvq_multim_q5_K_m4_can_implement(baracuda kernels mmvq multim q5 k m4 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ K_ m4_ run baracuda_kernels_mmvq_multim_q5_K_m4_run(baracuda kernels mmvq multim q5 k m4 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ K_ m8_ can_ implement baracuda_kernels_mmvq_multim_q5_K_m8_can_implement(baracuda kernels mmvq multim q5 k m8 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q5_ K_ m8_ run baracuda_kernels_mmvq_multim_q5_K_m8_run(baracuda kernels mmvq multim q5 k m8 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q6_ K_ m1_ can_ implement baracuda_kernels_mmvq_multim_q6_K_m1_can_implement(baracuda kernels mmvq multim q6 k m1 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q6_ K_ m1_ run baracuda_kernels_mmvq_multim_q6_K_m1_run(baracuda kernels mmvq multim q6 k m1 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q6_ K_ m2_ can_ implement baracuda_kernels_mmvq_multim_q6_K_m2_can_implement(baracuda kernels mmvq multim q6 k m2 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q6_ K_ m2_ run baracuda_kernels_mmvq_multim_q6_K_m2_run(baracuda kernels mmvq multim q6 k m2 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q6_ K_ m4_ can_ implement baracuda_kernels_mmvq_multim_q6_K_m4_can_implement(baracuda kernels mmvq multim q6 k m4 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q6_ K_ m4_ run baracuda_kernels_mmvq_multim_q6_K_m4_run(baracuda kernels mmvq multim q6 k m4 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q6_ K_ m8_ can_ implement baracuda_kernels_mmvq_multim_q6_K_m8_can_implement(baracuda kernels mmvq multim q6 k m8 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q6_ K_ m8_ run baracuda_kernels_mmvq_multim_q6_K_m8_run(baracuda kernels mmvq multim q6 k m8 run).- baracuda_
kernels_ ⚠mmvq_ multim_ q8_ 0_ m1_ can_ implement baracuda_kernels_mmvq_multim_q8_0_m1_can_implement(baracuda kernels mmvq multim q8 0 m1 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q8_ 0_ m1_ run - Multi-M MMVQ for Q8_0 weights, M=1 (decode regime). Computes
dst[0, r] = Σ_c W[r, c] * y[0, c] for r ∈ [0, nrows_x). - baracuda_
kernels_ ⚠mmvq_ multim_ q8_ 0_ m2_ can_ implement baracuda_kernels_mmvq_multim_q8_0_m2_can_implement(baracuda kernels mmvq multim q8 0 m2 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q8_ 0_ m2_ run - Multi-M MMVQ for Q8_0 weights, M=2. # Safety: as M=1.
- baracuda_
kernels_ ⚠mmvq_ multim_ q8_ 0_ m4_ can_ implement baracuda_kernels_mmvq_multim_q8_0_m4_can_implement(baracuda kernels mmvq multim q8 0 m4 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q8_ 0_ m4_ run - Multi-M MMVQ for Q8_0 weights, M=4. # Safety: as M=1.
- baracuda_
kernels_ ⚠mmvq_ multim_ q8_ 0_ m8_ can_ implement baracuda_kernels_mmvq_multim_q8_0_m8_can_implement(baracuda kernels mmvq multim q8 0 m8 can implement).- baracuda_
kernels_ ⚠mmvq_ multim_ q8_ 0_ m8_ run - Multi-M MMVQ for Q8_0 weights, M=8 (prefill regime, targeting a 3-7× speedup over the per-token M=1 dispatch). # Safety: as M=1.
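The Q8_0 multi-M entries above compute dst[0, r] = Σ_c W[r, c] * y[0, c] for the M=1 case. A host-side sketch of that sum over dequantized Q8_0 blocks is shown below. It assumes the usual GGUF Q8_0 layout of 32 signed 8-bit quants per block with one scale, but stores the scale as `f32` for simplicity (the on-device format uses a half-precision scale). `BlockQ8Ref` and `mmvq_q8_0_m1_ref` are hypothetical names.

```rust
/// Number of quantized values per Q8_0 block.
pub const QK8_0: usize = 32;

/// Simplified stand-in for a Q8_0 block: each weight is d * qs[j].
/// The scale is kept as f32 here purely for illustration.
pub struct BlockQ8Ref {
    pub d: f32,
    pub qs: [i8; QK8_0],
}

/// Hypothetical host reference for the M=1 case:
/// dst[r] = sum over c of W[r, c] * y[c], with W stored row-major in blocks.
pub fn mmvq_q8_0_m1_ref(w: &[BlockQ8Ref], y: &[f32], nrows_x: usize, ncols: usize) -> Vec<f32> {
    assert_eq!(ncols % QK8_0, 0, "ncols must be a multiple of the block size");
    let blocks_per_row = ncols / QK8_0;
    assert_eq!(w.len(), nrows_x * blocks_per_row);
    assert_eq!(y.len(), ncols);
    (0..nrows_x)
        .map(|r| {
            let mut acc = 0.0f32;
            for b in 0..blocks_per_row {
                let blk = &w[r * blocks_per_row + b];
                for (j, &q) in blk.qs.iter().enumerate() {
                    // Dequantize one weight and accumulate its product with y.
                    acc += blk.d * q as f32 * y[b * QK8_0 + j];
                }
            }
            acc
        })
        .collect()
}
```

The M=2, M=4, and M=8 variants would apply the same row sum to several activation rows at once, so each dequantized weight block is reused across them.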
- baracuda_
kernels_ ⚠mmvq_ q2_ K_ actstrided_ bf16_ can_ implement baracuda_kernels_mmvq_q2_K_actstrided_bf16_can_implement(baracuda kernels mmvq q2 k actstrided bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q2_ K_ actstrided_ bf16_ run - Strided MMVQ — Q2_K, bf16. # Safety: as Q2_K strided f16.
- baracuda_
kernels_ ⚠mmvq_ q2_ K_ actstrided_ can_ implement baracuda_kernels_mmvq_q2_K_actstrided_can_implement(baracuda kernels mmvq q2 k actstrided can implement).- baracuda_
kernels_ ⚠mmvq_ q2_ K_ actstrided_ f16_ can_ implement baracuda_kernels_mmvq_q2_K_actstrided_f16_can_implement(baracuda kernels mmvq q2 k actstrided f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q2_ K_ actstrided_ f16_ run - Strided MMVQ — Q2_K, f16. # Safety: as Q4_0 strided f16, ncols must be a multiple of 256.
- baracuda_
kernels_ ⚠mmvq_ q2_ K_ actstrided_ run - Strided MMVQ — GGUF
Q2_K. # Safety: as the contig sibling. - baracuda_
kernels_ ⚠mmvq_ q2_ K_ batched_ bf16_ can_ implement baracuda_kernels_mmvq_q2_K_batched_bf16_can_implement(baracuda kernels mmvq q2 k batched bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q2_ K_ batched_ bf16_ run - Batched MMVQ — Q2_K, bf16. # Safety: as Q2_K f32.
- baracuda_
kernels_ ⚠mmvq_ q2_ K_ batched_ can_ implement baracuda_kernels_mmvq_q2_K_batched_can_implement(baracuda kernels mmvq q2 k batched can implement).- baracuda_
kernels_ ⚠mmvq_ q2_ K_ batched_ f16_ can_ implement baracuda_kernels_mmvq_q2_K_batched_f16_can_implement(baracuda kernels mmvq q2 k batched f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q2_ K_ batched_ f16_ run - Batched MMVQ — Q2_K, f16. # Safety: as Q2_K f32.
- baracuda_
kernels_ ⚠mmvq_ q2_ K_ batched_ run - Batched MMVQ — Q2_K, f32. # Safety: as Q4_0, ncols must be a multiple of 256.
- baracuda_
kernels_ ⚠mmvq_ q2_ K_ bf16_ can_ implement baracuda_kernels_mmvq_q2_K_bf16_can_implement(baracuda kernels mmvq q2 k bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q2_ K_ bf16_ run - MMVQ — Q2_K, bf16. # Safety: as Q4_0 bf16, ncols must be multiple of 256.
- baracuda_
kernels_ ⚠mmvq_ q2_ K_ can_ implement baracuda_kernels_mmvq_q2_K_can_implement(baracuda kernels mmvq q2 k can implement).- baracuda_
kernels_ ⚠mmvq_ q2_ K_ f16_ can_ implement baracuda_kernels_mmvq_q2_K_f16_can_implement(baracuda kernels mmvq q2 k f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q2_ K_ f16_ run - MMVQ — Q2_K, f16. # Safety: as Q4_0 f16, ncols must be multiple of 256.
- baracuda_
kernels_ ⚠mmvq_ q2_ K_ run - GGUF
Q2_K MMVQ — FP-activation matrix-vector mul. ncols must be a multiple of 256. - baracuda_
kernels_ ⚠mmvq_ q3_ K_ actstrided_ bf16_ can_ implement baracuda_kernels_mmvq_q3_K_actstrided_bf16_can_implement(baracuda kernels mmvq q3 k actstrided bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q3_ K_ actstrided_ bf16_ run - Strided MMVQ — Q3_K, bf16. # Safety: as Q2_K strided bf16.
- baracuda_
kernels_ ⚠mmvq_ q3_ K_ actstrided_ can_ implement baracuda_kernels_mmvq_q3_K_actstrided_can_implement(baracuda kernels mmvq q3 k actstrided can implement).- baracuda_
kernels_ ⚠mmvq_ q3_ K_ actstrided_ f16_ can_ implement baracuda_kernels_mmvq_q3_K_actstrided_f16_can_implement(baracuda kernels mmvq q3 k actstrided f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q3_ K_ actstrided_ f16_ run - Strided MMVQ — Q3_K, f16. # Safety: as Q2_K strided f16.
- baracuda_
kernels_ ⚠mmvq_ q3_ K_ actstrided_ run - Strided MMVQ — GGUF
Q3_K. # Safety: as the contig sibling. - baracuda_
kernels_ ⚠mmvq_ q3_ K_ batched_ bf16_ can_ implement baracuda_kernels_mmvq_q3_K_batched_bf16_can_implement(baracuda kernels mmvq q3 k batched bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q3_ K_ batched_ bf16_ run - Batched MMVQ — Q3_K, bf16. # Safety: as Q2_K.
- baracuda_
kernels_ ⚠mmvq_ q3_ K_ batched_ can_ implement baracuda_kernels_mmvq_q3_K_batched_can_implement(baracuda kernels mmvq q3 k batched can implement).- baracuda_
kernels_ ⚠mmvq_ q3_ K_ batched_ f16_ can_ implement baracuda_kernels_mmvq_q3_K_batched_f16_can_implement(baracuda kernels mmvq q3 k batched f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q3_ K_ batched_ f16_ run - Batched MMVQ — Q3_K, f16. # Safety: as Q2_K.
- baracuda_
kernels_ ⚠mmvq_ q3_ K_ batched_ run - Batched MMVQ — Q3_K, f32. # Safety: as Q2_K.
- baracuda_
kernels_ ⚠mmvq_ q3_ K_ bf16_ can_ implement baracuda_kernels_mmvq_q3_K_bf16_can_implement(baracuda kernels mmvq q3 k bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q3_ K_ bf16_ run - MMVQ — Q3_K, bf16. # Safety: as Q2_K bf16.
- baracuda_
kernels_ ⚠mmvq_ q3_ K_ can_ implement baracuda_kernels_mmvq_q3_K_can_implement(baracuda kernels mmvq q3 k can implement).- baracuda_
kernels_ ⚠mmvq_ q3_ K_ f16_ can_ implement baracuda_kernels_mmvq_q3_K_f16_can_implement(baracuda kernels mmvq q3 k f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q3_ K_ f16_ run - MMVQ — Q3_K, f16. # Safety: as Q2_K f16.
- baracuda_
kernels_ ⚠mmvq_ q3_ K_ run - GGUF
Q3_K MMVQ. # Safety: as Q2_K. - baracuda_
kernels_ ⚠mmvq_ q4_ 0_ actstrided_ bf16_ can_ implement baracuda_kernels_mmvq_q4_0_actstrided_bf16_can_implement(baracuda kernels mmvq q4 0 actstrided bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ 0_ actstrided_ bf16_ run - Strided MMVQ — Q4_0, bf16. # Safety: as the f32 strided sibling.
- baracuda_
kernels_ ⚠mmvq_ q4_ 0_ actstrided_ can_ implement baracuda_kernels_mmvq_q4_0_actstrided_can_implement(baracuda kernels mmvq q4 0 actstrided can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ 0_ actstrided_ f16_ can_ implement baracuda_kernels_mmvq_q4_0_actstrided_f16_can_implement(baracuda kernels mmvq q4 0 actstrided f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ 0_ actstrided_ f16_ run - Strided MMVQ — Q4_0, f16. # Safety: as the f32 strided sibling.
- baracuda_
kernels_ ⚠mmvq_ q4_ 0_ actstrided_ run - Strided MMVQ — GGUF
Q4_0. # Safety: as the contig Q4_0 variant, plus y[k * stride_y] must be a valid f32 read for every k in 0..ncols. - baracuda_
kernels_ ⚠mmvq_ q4_ 0_ batched_ bf16_ can_ implement baracuda_kernels_mmvq_q4_0_batched_bf16_can_implement(baracuda kernels mmvq q4 0 batched bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ 0_ batched_ bf16_ run - Batched MMVQ — Q4_0, bf16. # Safety: as Q4_0 f32.
- baracuda_
kernels_ ⚠mmvq_ q4_ 0_ batched_ can_ implement baracuda_kernels_mmvq_q4_0_batched_can_implement(baracuda kernels mmvq q4 0 batched can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ 0_ batched_ f16_ can_ implement baracuda_kernels_mmvq_q4_0_batched_f16_can_implement(baracuda kernels mmvq q4 0 batched f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ 0_ batched_ f16_ run - Batched MMVQ — Q4_0, f16. # Safety: as Q4_0 f32.
- baracuda_
kernels_ ⚠mmvq_ q4_ 0_ batched_ run - Batched MMVQ — Q4_0, f32 activation + output. # Safety: device-resident
pointers; valid stream; workspace ≥ m_total * 4 bytes. - baracuda_
kernels_ ⚠mmvq_ q4_ 0_ bf16_ can_ implement baracuda_kernels_mmvq_q4_0_bf16_can_implement(baracuda kernels mmvq q4 0 bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ 0_ bf16_ run - MMVQ — Q4_0, bf16 activation + bf16 output. # Safety: as the f32 sibling
with y/dst typed __nv_bfloat16 and device-resident. - baracuda_
kernels_ ⚠mmvq_ q4_ 0_ can_ implement baracuda_kernels_mmvq_q4_0_can_implement(baracuda kernels mmvq q4 0 can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ 0_ f16_ can_ implement baracuda_kernels_mmvq_q4_0_f16_can_implement(baracuda kernels mmvq q4 0 f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ 0_ f16_ run - MMVQ — Q4_0, f16 activation + f16 output. # Safety: as the f32 sibling
with y/dst typed __half and device-resident. - baracuda_
kernels_ ⚠mmvq_ q4_ 0_ run - GGUF
Q4_0 MMVQ: FP-activation matrix-vector mul. ncols must be a multiple of 32. - baracuda_
kernels_ ⚠mmvq_ q4_ 1_ actstrided_ bf16_ can_ implement baracuda_kernels_mmvq_q4_1_actstrided_bf16_can_implement(baracuda kernels mmvq q4 1 actstrided bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ 1_ actstrided_ bf16_ run - Strided MMVQ — Q4_1, bf16. # Safety: as Q4_0 strided bf16.
- baracuda_
kernels_ ⚠mmvq_ q4_ 1_ actstrided_ can_ implement baracuda_kernels_mmvq_q4_1_actstrided_can_implement(baracuda kernels mmvq q4 1 actstrided can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ 1_ actstrided_ f16_ can_ implement baracuda_kernels_mmvq_q4_1_actstrided_f16_can_implement(baracuda kernels mmvq q4 1 actstrided f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ 1_ actstrided_ f16_ run - Strided MMVQ — Q4_1, f16. # Safety: as Q4_0 strided f16.
- baracuda_
kernels_ ⚠mmvq_ q4_ 1_ actstrided_ run - Strided MMVQ — GGUF
Q4_1. # Safety: as the contig sibling. - baracuda_
kernels_ ⚠mmvq_ q4_ 1_ batched_ bf16_ can_ implement baracuda_kernels_mmvq_q4_1_batched_bf16_can_implement(baracuda kernels mmvq q4 1 batched bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ 1_ batched_ bf16_ run - Batched MMVQ — Q4_1, bf16. # Safety: as Q4_0.
- baracuda_
kernels_ ⚠mmvq_ q4_ 1_ batched_ can_ implement baracuda_kernels_mmvq_q4_1_batched_can_implement(baracuda kernels mmvq q4 1 batched can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ 1_ batched_ f16_ can_ implement baracuda_kernels_mmvq_q4_1_batched_f16_can_implement(baracuda kernels mmvq q4 1 batched f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ 1_ batched_ f16_ run - Batched MMVQ — Q4_1, f16. # Safety: as Q4_0.
- baracuda_
kernels_ ⚠mmvq_ q4_ 1_ batched_ run - Batched MMVQ — Q4_1, f32. # Safety: as Q4_0.
- baracuda_
kernels_ ⚠mmvq_ q4_ 1_ bf16_ can_ implement baracuda_kernels_mmvq_q4_1_bf16_can_implement(baracuda kernels mmvq q4 1 bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ 1_ bf16_ run - MMVQ — Q4_1, bf16 activation + bf16 output. # Safety: as Q4_0 bf16.
- baracuda_
kernels_ ⚠mmvq_ q4_ 1_ can_ implement baracuda_kernels_mmvq_q4_1_can_implement(baracuda kernels mmvq q4 1 can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ 1_ f16_ can_ implement baracuda_kernels_mmvq_q4_1_f16_can_implement(baracuda kernels mmvq q4 1 f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ 1_ f16_ run - MMVQ — Q4_1, f16 activation + f16 output. # Safety: as Q4_0 f16.
- baracuda_
kernels_ ⚠mmvq_ q4_ 1_ run - GGUF
Q4_1 MMVQ. # Safety: as Q4_0. - baracuda_
kernels_ ⚠mmvq_ q4_ K_ actstrided_ bf16_ can_ implement baracuda_kernels_mmvq_q4_K_actstrided_bf16_can_implement(baracuda kernels mmvq q4 k actstrided bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ K_ actstrided_ bf16_ run - Strided MMVQ — Q4_K, bf16. # Safety: as Q2_K strided bf16.
- baracuda_
kernels_ ⚠mmvq_ q4_ K_ actstrided_ can_ implement baracuda_kernels_mmvq_q4_K_actstrided_can_implement(baracuda kernels mmvq q4 k actstrided can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ K_ actstrided_ f16_ can_ implement baracuda_kernels_mmvq_q4_K_actstrided_f16_can_implement(baracuda kernels mmvq q4 k actstrided f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ K_ actstrided_ f16_ run - Strided MMVQ — Q4_K, f16. # Safety: as Q2_K strided f16.
- baracuda_
kernels_ ⚠mmvq_ q4_ K_ actstrided_ run - Strided MMVQ — GGUF
Q4_K. # Safety: as the contig sibling. - baracuda_
kernels_ ⚠mmvq_ q4_ K_ batched_ bf16_ can_ implement baracuda_kernels_mmvq_q4_K_batched_bf16_can_implement(baracuda kernels mmvq q4 k batched bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ K_ batched_ bf16_ run - Batched MMVQ — Q4_K, bf16. # Safety: as Q2_K.
- baracuda_
kernels_ ⚠mmvq_ q4_ K_ batched_ can_ implement baracuda_kernels_mmvq_q4_K_batched_can_implement(baracuda kernels mmvq q4 k batched can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ K_ batched_ f16_ can_ implement baracuda_kernels_mmvq_q4_K_batched_f16_can_implement(baracuda kernels mmvq q4 k batched f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ K_ batched_ f16_ run - Batched MMVQ — Q4_K, f16. # Safety: as Q2_K.
- baracuda_
kernels_ ⚠mmvq_ q4_ K_ batched_ run - Batched MMVQ — Q4_K, f32. # Safety: as Q2_K.
- baracuda_
kernels_ ⚠mmvq_ q4_ K_ bf16_ can_ implement baracuda_kernels_mmvq_q4_K_bf16_can_implement(baracuda kernels mmvq q4 k bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ K_ bf16_ run - MMVQ — Q4_K, bf16. # Safety: as Q2_K bf16.
- baracuda_
kernels_ ⚠mmvq_ q4_ K_ can_ implement baracuda_kernels_mmvq_q4_K_can_implement(baracuda kernels mmvq q4 k can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ K_ f16_ can_ implement baracuda_kernels_mmvq_q4_K_f16_can_implement(baracuda kernels mmvq q4 k f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q4_ K_ f16_ run - MMVQ — Q4_K, f16. # Safety: as Q2_K f16.
- baracuda_
kernels_ ⚠mmvq_ q4_ K_ run - GGUF
Q4_K MMVQ. # Safety: as Q2_K. - baracuda_
kernels_ ⚠mmvq_ q5_ 0_ actstrided_ bf16_ can_ implement baracuda_kernels_mmvq_q5_0_actstrided_bf16_can_implement(baracuda kernels mmvq q5 0 actstrided bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ 0_ actstrided_ bf16_ run - Strided MMVQ — Q5_0, bf16. # Safety: as Q4_0 strided bf16.
- baracuda_
kernels_ ⚠mmvq_ q5_ 0_ actstrided_ can_ implement baracuda_kernels_mmvq_q5_0_actstrided_can_implement(baracuda kernels mmvq q5 0 actstrided can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ 0_ actstrided_ f16_ can_ implement baracuda_kernels_mmvq_q5_0_actstrided_f16_can_implement(baracuda kernels mmvq q5 0 actstrided f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ 0_ actstrided_ f16_ run - Strided MMVQ — Q5_0, f16. # Safety: as Q4_0 strided f16.
- baracuda_
kernels_ ⚠mmvq_ q5_ 0_ actstrided_ run - Strided MMVQ — GGUF
Q5_0. # Safety: as the contig sibling. - baracuda_
kernels_ ⚠mmvq_ q5_ 0_ batched_ bf16_ can_ implement baracuda_kernels_mmvq_q5_0_batched_bf16_can_implement(baracuda kernels mmvq q5 0 batched bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ 0_ batched_ bf16_ run - Batched MMVQ — Q5_0, bf16. # Safety: as Q4_0.
- baracuda_
kernels_ ⚠mmvq_ q5_ 0_ batched_ can_ implement baracuda_kernels_mmvq_q5_0_batched_can_implement(baracuda kernels mmvq q5 0 batched can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ 0_ batched_ f16_ can_ implement baracuda_kernels_mmvq_q5_0_batched_f16_can_implement(baracuda kernels mmvq q5 0 batched f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ 0_ batched_ f16_ run - Batched MMVQ — Q5_0, f16. # Safety: as Q4_0.
- baracuda_
kernels_ ⚠mmvq_ q5_ 0_ batched_ run - Batched MMVQ — Q5_0, f32. # Safety: as Q4_0.
- baracuda_
kernels_ ⚠mmvq_ q5_ 0_ bf16_ can_ implement baracuda_kernels_mmvq_q5_0_bf16_can_implement(baracuda kernels mmvq q5 0 bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ 0_ bf16_ run - MMVQ — Q5_0, bf16. # Safety: as Q4_0 bf16.
- baracuda_
kernels_ ⚠mmvq_ q5_ 0_ can_ implement baracuda_kernels_mmvq_q5_0_can_implement(baracuda kernels mmvq q5 0 can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ 0_ f16_ can_ implement baracuda_kernels_mmvq_q5_0_f16_can_implement(baracuda kernels mmvq q5 0 f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ 0_ f16_ run - MMVQ — Q5_0, f16. # Safety: as Q4_0 f16.
- baracuda_
kernels_ ⚠mmvq_ q5_ 0_ run - GGUF
Q5_0 MMVQ. # Safety: as Q4_0. - baracuda_
kernels_ ⚠mmvq_ q5_ 1_ actstrided_ bf16_ can_ implement baracuda_kernels_mmvq_q5_1_actstrided_bf16_can_implement(baracuda kernels mmvq q5 1 actstrided bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ 1_ actstrided_ bf16_ run - Strided MMVQ — Q5_1, bf16. # Safety: as Q4_0 strided bf16.
- baracuda_
kernels_ ⚠mmvq_ q5_ 1_ actstrided_ can_ implement baracuda_kernels_mmvq_q5_1_actstrided_can_implement(baracuda kernels mmvq q5 1 actstrided can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ 1_ actstrided_ f16_ can_ implement baracuda_kernels_mmvq_q5_1_actstrided_f16_can_implement(baracuda kernels mmvq q5 1 actstrided f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ 1_ actstrided_ f16_ run - Strided MMVQ — Q5_1, f16. # Safety: as Q4_0 strided f16.
- baracuda_
kernels_ ⚠mmvq_ q5_ 1_ actstrided_ run - Strided MMVQ — GGUF
Q5_1. # Safety: as the contig sibling. - baracuda_
kernels_ ⚠mmvq_ q5_ 1_ batched_ bf16_ can_ implement baracuda_kernels_mmvq_q5_1_batched_bf16_can_implement(baracuda kernels mmvq q5 1 batched bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ 1_ batched_ bf16_ run - Batched MMVQ — Q5_1, bf16. # Safety: as Q4_0.
- baracuda_
kernels_ ⚠mmvq_ q5_ 1_ batched_ can_ implement baracuda_kernels_mmvq_q5_1_batched_can_implement(baracuda kernels mmvq q5 1 batched can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ 1_ batched_ f16_ can_ implement baracuda_kernels_mmvq_q5_1_batched_f16_can_implement(baracuda kernels mmvq q5 1 batched f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ 1_ batched_ f16_ run - Batched MMVQ — Q5_1, f16. # Safety: as Q4_0.
- baracuda_
kernels_ ⚠mmvq_ q5_ 1_ batched_ run - Batched MMVQ — Q5_1, f32. # Safety: as Q4_0.
- baracuda_
kernels_ ⚠mmvq_ q5_ 1_ bf16_ can_ implement baracuda_kernels_mmvq_q5_1_bf16_can_implement(baracuda kernels mmvq q5 1 bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ 1_ bf16_ run - MMVQ — Q5_1, bf16. # Safety: as Q4_0 bf16.
- baracuda_
kernels_ ⚠mmvq_ q5_ 1_ can_ implement baracuda_kernels_mmvq_q5_1_can_implement(baracuda kernels mmvq q5 1 can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ 1_ f16_ can_ implement baracuda_kernels_mmvq_q5_1_f16_can_implement(baracuda kernels mmvq q5 1 f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ 1_ f16_ run - MMVQ — Q5_1, f16. # Safety: as Q4_0 f16.
- baracuda_
kernels_ ⚠mmvq_ q5_ 1_ run - GGUF
Q5_1 MMVQ. # Safety: as Q4_0. - baracuda_
kernels_ ⚠mmvq_ q5_ K_ actstrided_ bf16_ can_ implement baracuda_kernels_mmvq_q5_K_actstrided_bf16_can_implement(baracuda kernels mmvq q5 k actstrided bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ K_ actstrided_ bf16_ run - Strided MMVQ — Q5_K, bf16. # Safety: as Q2_K strided bf16.
- baracuda_
kernels_ ⚠mmvq_ q5_ K_ actstrided_ can_ implement baracuda_kernels_mmvq_q5_K_actstrided_can_implement(baracuda kernels mmvq q5 k actstrided can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ K_ actstrided_ f16_ can_ implement baracuda_kernels_mmvq_q5_K_actstrided_f16_can_implement(baracuda kernels mmvq q5 k actstrided f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ K_ actstrided_ f16_ run - Strided MMVQ — Q5_K, f16. # Safety: as Q2_K strided f16.
- baracuda_
kernels_ ⚠mmvq_ q5_ K_ actstrided_ run - Strided MMVQ — GGUF
Q5_K. # Safety: as the contig sibling. - baracuda_
kernels_ ⚠mmvq_ q5_ K_ batched_ bf16_ can_ implement baracuda_kernels_mmvq_q5_K_batched_bf16_can_implement(baracuda kernels mmvq q5 k batched bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ K_ batched_ bf16_ run - Batched MMVQ — Q5_K, bf16. # Safety: as Q2_K.
- baracuda_
kernels_ ⚠mmvq_ q5_ K_ batched_ can_ implement baracuda_kernels_mmvq_q5_K_batched_can_implement(baracuda kernels mmvq q5 k batched can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ K_ batched_ f16_ can_ implement baracuda_kernels_mmvq_q5_K_batched_f16_can_implement(baracuda kernels mmvq q5 k batched f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ K_ batched_ f16_ run - Batched MMVQ — Q5_K, f16. # Safety: as Q2_K.
- baracuda_
kernels_ ⚠mmvq_ q5_ K_ batched_ run - Batched MMVQ — Q5_K, f32. # Safety: as Q2_K.
- baracuda_
kernels_ ⚠mmvq_ q5_ K_ bf16_ can_ implement baracuda_kernels_mmvq_q5_K_bf16_can_implement(baracuda kernels mmvq q5 k bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ K_ bf16_ run - MMVQ — Q5_K, bf16. # Safety: as Q2_K bf16.
- baracuda_
kernels_ ⚠mmvq_ q5_ K_ can_ implement baracuda_kernels_mmvq_q5_K_can_implement(baracuda kernels mmvq q5 k can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ K_ f16_ can_ implement baracuda_kernels_mmvq_q5_K_f16_can_implement(baracuda kernels mmvq q5 k f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q5_ K_ f16_ run - MMVQ — Q5_K, f16. # Safety: as Q2_K f16.
- baracuda_
kernels_ ⚠mmvq_ q5_ K_ run - GGUF
Q5_K MMVQ. # Safety: as Q2_K. - baracuda_
kernels_ ⚠mmvq_ q6_ K_ actstrided_ bf16_ can_ implement baracuda_kernels_mmvq_q6_K_actstrided_bf16_can_implement(baracuda kernels mmvq q6 k actstrided bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q6_ K_ actstrided_ bf16_ run - Strided MMVQ — Q6_K, bf16. # Safety: as Q2_K strided bf16.
- baracuda_
kernels_ ⚠mmvq_ q6_ K_ actstrided_ can_ implement baracuda_kernels_mmvq_q6_K_actstrided_can_implement(baracuda kernels mmvq q6 k actstrided can implement).- baracuda_
kernels_ ⚠mmvq_ q6_ K_ actstrided_ f16_ can_ implement baracuda_kernels_mmvq_q6_K_actstrided_f16_can_implement(baracuda kernels mmvq q6 k actstrided f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q6_ K_ actstrided_ f16_ run - Strided MMVQ — Q6_K, f16. # Safety: as Q2_K strided f16.
- baracuda_
kernels_ ⚠mmvq_ q6_ K_ actstrided_ run - Strided MMVQ — GGUF
Q6_K. # Safety: as the contig sibling. - baracuda_
kernels_ ⚠mmvq_ q6_ K_ batched_ bf16_ can_ implement baracuda_kernels_mmvq_q6_K_batched_bf16_can_implement(baracuda kernels mmvq q6 k batched bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q6_ K_ batched_ bf16_ run - Batched MMVQ — Q6_K, bf16. # Safety: as Q2_K.
- baracuda_
kernels_ ⚠mmvq_ q6_ K_ batched_ can_ implement baracuda_kernels_mmvq_q6_K_batched_can_implement(baracuda kernels mmvq q6 k batched can implement).- baracuda_
kernels_ ⚠mmvq_ q6_ K_ batched_ f16_ can_ implement baracuda_kernels_mmvq_q6_K_batched_f16_can_implement(baracuda kernels mmvq q6 k batched f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q6_ K_ batched_ f16_ run - Batched MMVQ — Q6_K, f16. # Safety: as Q2_K.
- baracuda_
kernels_ ⚠mmvq_ q6_ K_ batched_ run - Batched MMVQ — Q6_K, f32. # Safety: as Q2_K.
- baracuda_
kernels_ ⚠mmvq_ q6_ K_ bf16_ can_ implement baracuda_kernels_mmvq_q6_K_bf16_can_implement(baracuda kernels mmvq q6 k bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q6_ K_ bf16_ run - MMVQ — Q6_K, bf16. # Safety: as Q2_K bf16.
- baracuda_
kernels_ ⚠mmvq_ q6_ K_ can_ implement baracuda_kernels_mmvq_q6_K_can_implement(baracuda kernels mmvq q6 k can implement).- baracuda_
kernels_ ⚠mmvq_ q6_ K_ f16_ can_ implement baracuda_kernels_mmvq_q6_K_f16_can_implement(baracuda kernels mmvq q6 k f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q6_ K_ f16_ run - MMVQ — Q6_K, f16. # Safety: as Q2_K f16.
- baracuda_
kernels_ ⚠mmvq_ q6_ K_ run - GGUF
Q6_K MMVQ. # Safety: as Q2_K. - baracuda_
kernels_ ⚠mmvq_ q8_ 0_ actstrided_ bf16_ can_ implement baracuda_kernels_mmvq_q8_0_actstrided_bf16_can_implement(baracuda kernels mmvq q8 0 actstrided bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q8_ 0_ actstrided_ bf16_ run - Strided MMVQ — Q8_0, bf16. # Safety: as Q4_0 strided bf16.
- baracuda_
kernels_ ⚠mmvq_ q8_ 0_ actstrided_ can_ implement baracuda_kernels_mmvq_q8_0_actstrided_can_implement(baracuda kernels mmvq q8 0 actstrided can implement).- baracuda_
kernels_ ⚠mmvq_ q8_ 0_ actstrided_ f16_ can_ implement baracuda_kernels_mmvq_q8_0_actstrided_f16_can_implement(baracuda kernels mmvq q8 0 actstrided f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q8_ 0_ actstrided_ f16_ run - Strided MMVQ — Q8_0, f16. # Safety: as Q4_0 strided f16.
- baracuda_
kernels_ ⚠mmvq_ q8_ 0_ actstrided_ run - Strided MMVQ — GGUF
Q8_0. # Safety: as the contig sibling. - baracuda_
kernels_ ⚠mmvq_ q8_ 0_ batched_ bf16_ can_ implement baracuda_kernels_mmvq_q8_0_batched_bf16_can_implement(baracuda kernels mmvq q8 0 batched bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q8_ 0_ batched_ bf16_ run - Batched MMVQ — Q8_0, bf16. # Safety: as Q4_0.
- baracuda_
kernels_ ⚠mmvq_ q8_ 0_ batched_ can_ implement baracuda_kernels_mmvq_q8_0_batched_can_implement(baracuda kernels mmvq q8 0 batched can implement).- baracuda_
kernels_ ⚠mmvq_ q8_ 0_ batched_ f16_ can_ implement baracuda_kernels_mmvq_q8_0_batched_f16_can_implement(baracuda kernels mmvq q8 0 batched f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q8_ 0_ batched_ f16_ run - Batched MMVQ — Q8_0, f16. # Safety: as Q4_0.
- baracuda_
kernels_ ⚠mmvq_ q8_ 0_ batched_ run - Batched MMVQ — Q8_0, f32. # Safety: as Q4_0.
- baracuda_
kernels_ ⚠mmvq_ q8_ 0_ bf16_ can_ implement baracuda_kernels_mmvq_q8_0_bf16_can_implement(baracuda kernels mmvq q8 0 bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q8_ 0_ bf16_ run - MMVQ — Q8_0, bf16. # Safety: as Q4_0 bf16.
- baracuda_
kernels_ ⚠mmvq_ q8_ 0_ can_ implement baracuda_kernels_mmvq_q8_0_can_implement(baracuda kernels mmvq q8 0 can implement).- baracuda_
kernels_ ⚠mmvq_ q8_ 0_ f16_ can_ implement baracuda_kernels_mmvq_q8_0_f16_can_implement(baracuda kernels mmvq q8 0 f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q8_ 0_ f16_ run - MMVQ — Q8_0, f16. # Safety: as Q4_0 f16.
- baracuda_
kernels_ ⚠mmvq_ q8_ 0_ run - GGUF
Q8_0 MMVQ. # Safety: as Q4_0. - baracuda_
kernels_ ⚠mmvq_ q8_ K_ actstrided_ bf16_ can_ implement baracuda_kernels_mmvq_q8_K_actstrided_bf16_can_implement(baracuda kernels mmvq q8 k actstrided bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q8_ K_ actstrided_ bf16_ run - Strided MMVQ — Q8_K, bf16 (bespoke). # Safety: as Q2_K strided bf16.
- baracuda_
kernels_ ⚠mmvq_ q8_ K_ actstrided_ can_ implement baracuda_kernels_mmvq_q8_K_actstrided_can_implement(baracuda kernels mmvq q8 k actstrided can implement).- baracuda_
kernels_ ⚠mmvq_ q8_ K_ actstrided_ f16_ can_ implement baracuda_kernels_mmvq_q8_K_actstrided_f16_can_implement(baracuda kernels mmvq q8 k actstrided f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q8_ K_ actstrided_ f16_ run - Strided MMVQ — Q8_K, f16 (bespoke). # Safety: as Q2_K strided f16.
- baracuda_
kernels_ ⚠mmvq_ q8_ K_ actstrided_ run - Strided MMVQ — GGUF
Q8_K (bespoke; Phase 11.4 + 14.5). - baracuda_
kernels_ ⚠mmvq_ q8_ K_ batched_ bf16_ can_ implement baracuda_kernels_mmvq_q8_K_batched_bf16_can_implement(baracuda kernels mmvq q8 k batched bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q8_ K_ batched_ bf16_ run - Batched MMVQ — Q8_K (bespoke), bf16. # Safety: as Q2_K.
- baracuda_
kernels_ ⚠mmvq_ q8_ K_ batched_ can_ implement baracuda_kernels_mmvq_q8_K_batched_can_implement(baracuda kernels mmvq q8 k batched can implement).- baracuda_
kernels_ ⚠mmvq_ q8_ K_ batched_ f16_ can_ implement baracuda_kernels_mmvq_q8_K_batched_f16_can_implement(baracuda kernels mmvq q8 k batched f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q8_ K_ batched_ f16_ run - Batched MMVQ — Q8_K (bespoke), f16. # Safety: as Q2_K.
- baracuda_
kernels_ ⚠mmvq_ q8_ K_ batched_ run - Batched MMVQ — Q8_K (bespoke), f32. # Safety: as Q2_K.
- baracuda_
kernels_ ⚠mmvq_ q8_ K_ bf16_ can_ implement baracuda_kernels_mmvq_q8_K_bf16_can_implement(baracuda kernels mmvq q8 k bf16 can implement).- baracuda_
kernels_ ⚠mmvq_ q8_ K_ bf16_ run - MMVQ — Q8_K, bf16 (bespoke; Phase 11.4 + 18.1). # Safety: as Q2_K bf16.
- baracuda_
kernels_ ⚠mmvq_ q8_ K_ can_ implement baracuda_kernels_mmvq_q8_K_can_implement(baracuda kernels mmvq q8 k can implement).- baracuda_
kernels_ ⚠mmvq_ q8_ K_ f16_ can_ implement baracuda_kernels_mmvq_q8_K_f16_can_implement(baracuda kernels mmvq q8 k f16 can implement).- baracuda_
kernels_ ⚠mmvq_ q8_ K_ f16_ run - MMVQ — Q8_K, f16 (bespoke; Phase 11.4 + 18.1). # Safety: as Q2_K f16.
- baracuda_
kernels_ ⚠mmvq_ q8_ K_ run - GGUF
Q8_K MMVQ: Phase 11.4 (bespoke, not vendored from llama.cpp). ncols must be a multiple of 256. # Safety: as Q2_K. - baracuda_
kernels_ ⚠moe_ scalar_ gguf_ can_ implement baracuda_kernels_moe_scalar_gguf_can_implement(baracuda kernels moe scalar gguf can implement).- baracuda_
kernels_ ⚠moe_ scalar_ gguf_ run - MoE forward — scalar dispatch path on GGUF-packed expert weights. f32 activations in, f32 output out.
- baracuda_
kernels_ ⚠moe_ wmma_ bf16_ can_ implement baracuda_kernels_moe_wmma_bf16_can_implement(baracuda kernels moe wmma bf16 can implement).- baracuda_
kernels_ ⚠moe_ wmma_ bf16_ run - MoE forward — WMMA FP weights, bf16 activations + weights, bf16 output.
- baracuda_
kernels_ ⚠moe_ wmma_ f16_ can_ implement baracuda_kernels_moe_wmma_f16_can_implement(baracuda kernels moe wmma f16 can implement).- baracuda_
kernels_ ⚠moe_ wmma_ f16_ run - MoE forward — WMMA FP weights, f16 activations + weights, f16 output.
Output buffer must be zero-initialized by the caller when topk_weights == null and topk > 1 (multiple writes per row). - baracuda_
kernels_ ⚠moe_ wmma_ gguf_ bf16_ can_ implement baracuda_kernels_moe_wmma_gguf_bf16_can_implement(baracuda kernels moe wmma gguf bf16 can implement).- baracuda_
kernels_ ⚠moe_ wmma_ gguf_ bf16_ run - MoE forward — WMMA + GGUF combined path, bf16 activations.
- baracuda_
kernels_ ⚠moe_ wmma_ gguf_ f16_ can_ implement baracuda_kernels_moe_wmma_gguf_f16_can_implement(baracuda kernels moe wmma gguf f16 can implement).- baracuda_
kernels_ ⚠moe_ wmma_ gguf_ f16_ run - MoE forward — WMMA + GGUF combined path. f16 activations, GGUF-packed weights, f32 output.
- baracuda_
kernels_ ⚠msort_ backward_ f32_ can_ implement baracuda_kernels_msort_backward_f32_can_implement(baracuda kernels msort backward f32 can implement).- baracuda_
kernels_ ⚠msort_ backward_ f32_ run - Msort backward, f32. Same scatter as sort backward; distinct symbol kept for FFI / telemetry parity.
- baracuda_
kernels_ ⚠msort_ backward_ f64_ can_ implement baracuda_kernels_msort_backward_f64_can_implement(baracuda kernels msort backward f64 can implement).- baracuda_
kernels_ ⚠msort_ backward_ f64_ run - Msort backward, f64.
- baracuda_
kernels_ ⚠msort_ f32_ can_ implement baracuda_kernels_msort_f32_can_implement(baracuda kernels msort f32 can implement).- baracuda_
kernels_ ⚠msort_ f32_ run - Stable block-bitonic sort, f32. Tie-break on original index so equal keys preserve input order.
- baracuda_
kernels_ ⚠msort_ f64_ can_ implement baracuda_kernels_msort_f64_can_implement(baracuda kernels msort f64 can implement).- baracuda_
kernels_ ⚠msort_ f64_ run - Stable block-bitonic sort, f64.
- baracuda_
kernels_ ⚠msort_ i32_ can_ implement baracuda_kernels_msort_i32_can_implement(baracuda kernels msort i32 can implement).- baracuda_
kernels_ ⚠msort_ i32_ run - Stable block-bitonic sort, i32.
- baracuda_
kernels_ ⚠msort_ i64_ can_ implement baracuda_kernels_msort_i64_can_implement(baracuda kernels msort i64 can implement).- baracuda_
kernels_ ⚠msort_ i64_ run - Stable block-bitonic sort, i64.
- baracuda_
kernels_ ⚠nms_ f32_ can_ implement baracuda_kernels_nms_f32_can_implement(baracuda kernels nms f32 can implement).- baracuda_
kernels_ ⚠nms_ f32_ run - nms(boxes, iou_thresh). Caller supplies boxes pre-sorted by score, descending. boxes: [num_boxes, 4] (x1, y1, x2, y2). keep_mask: [num_boxes] u8 (0 / 1); count_out: single i32. f32. # Safety: as above. - baracuda_
kernels_ ⚠nms_ f64_ can_ implement baracuda_kernels_nms_f64_can_implement(baracuda kernels nms f64 can implement).- baracuda_
kernels_ ⚠nms_ f64_ run - nms, f64. # Safety: as f32. - baracuda_
kernels_ ⚠nonzero_ bool_ can_ implement - Implementability check for nonzero_bool. - baracuda_
kernels_ ⚠nonzero_ bool_ run - nonzero: bool (u8) input. - baracuda_
kernels_ ⚠nonzero_ f32_ can_ implement - Implementability check for nonzero_f32. - baracuda_
kernels_ ⚠nonzero_ f32_ run - Coordinates where x[i] != 0. f32 input. - baracuda_
kernels_ ⚠nonzero_ f64_ can_ implement - Implementability check for nonzero_f64. - baracuda_
kernels_ ⚠nonzero_ f64_ run - nonzero: f64 input. - baracuda_
kernels_ ⚠nonzero_ i32_ can_ implement - Implementability check for nonzero_i32. - baracuda_
kernels_ ⚠nonzero_ i32_ run - nonzero: i32 input. - baracuda_
kernels_ ⚠nonzero_ i64idx_ bool_ can_ implement - Implementability check for nonzero_i64idx_bool. - baracuda_
kernels_ ⚠nonzero_ i64idx_ bool_ run - nonzero: bool input, i64 output coords. - baracuda_
kernels_ ⚠nonzero_ i64idx_ f32_ can_ implement - Implementability check for nonzero_i64idx_f32. - baracuda_
kernels_ ⚠nonzero_ i64idx_ f32_ run - nonzero: f32 input, i64 output coords. - baracuda_
kernels_ ⚠nonzero_ i64idx_ f64_ can_ implement - Implementability check for nonzero_i64idx_f64. - baracuda_
kernels_ ⚠nonzero_ i64idx_ f64_ run - nonzero: f64 input, i64 output coords. - baracuda_
kernels_ ⚠nonzero_ i64idx_ i32_ can_ implement - Implementability check for nonzero_i64idx_i32. - baracuda_
kernels_ ⚠nonzero_ i64idx_ i32_ run - nonzero: i32 input, i64 output coords. - baracuda_
kernels_ ⚠one_ hot_ bool_ can_ implement - Implementability check for one_hot_bool. - baracuda_
kernels_ ⚠one_ hot_ bool_ run - one_hot: bool output (u8 storage). - baracuda_
kernels_ ⚠one_ hot_ f32_ can_ implement - Implementability check for one_hot_f32. - baracuda_
kernels_ ⚠one_ hot_ f32_ run - out[..., c] = 1 if c == src[...] else 0. Output last axis has extent num_classes. Input dtype is always i32; output is f32. - baracuda_
kernels_ ⚠one_ hot_ f64_ can_ implement - Implementability check for one_hot_f64. - baracuda_
kernels_ ⚠one_ hot_ f64_ run - one_hot: f64 output. - baracuda_
kernels_ ⚠one_ hot_ i32_ can_ implement - Implementability check for one_hot_i32. - baracuda_
kernels_ ⚠one_ hot_ i32_ run - one_hot: i32 output. - baracuda_
kernels_ ⚠one_ hot_ i64idx_ bool_ can_ implement - Implementability check for one_hot_i64idx_bool. - baracuda_
kernels_ ⚠one_ hot_ i64idx_ bool_ run - one_hot: bool output, i64 indices. - baracuda_
kernels_ ⚠one_ hot_ i64idx_ f32_ can_ implement - Implementability check for one_hot_i64idx_f32. - baracuda_
kernels_ ⚠one_ hot_ i64idx_ f32_ run - one_hot: f32 output, i64 input class indices. - baracuda_
kernels_ ⚠one_ hot_ i64idx_ f64_ can_ implement - Implementability check for one_hot_i64idx_f64. - baracuda_
kernels_ ⚠one_ hot_ i64idx_ f64_ run - one_hot: f64 output, i64 indices. - baracuda_
kernels_ ⚠one_ hot_ i64idx_ i32_ can_ implement - Implementability check for one_hot_i64idx_i32. - baracuda_
kernels_ ⚠one_ hot_ i64idx_ i32_ run - one_hot: i32 output, i64 indices. - baracuda_
kernels_ ⚠ormqr_ f32_ run - Apply Householder-encoded Q (from a prior geqrf) to c_inout. side ∈ {0=Left, 1=Right}; op ∈ {0=N, 1=T, 2=C}. On Left + op=N, computes C := Q · C; pair with a pre-staged identity C to materialize dense Q. - baracuda_
kernels_ ⚠ormqr_ f64_ run - Apply Householder-encoded Q (from a prior geqrf) to c_inout. side ∈ {0=Left, 1=Right}; op ∈ {0=N, 1=T, 2=C}. On Left + op=N, computes C := Q · C; pair with a pre-staged identity C to materialize dense Q. - baracuda_
kernels_ ⚠pad_ circular_ bf16_ can_ implement baracuda_kernels_pad_circular_bf16_can_implement(baracuda kernels pad circular bf16 can implement).- baracuda_
kernels_ ⚠pad_ circular_ bf16_ run - Pad circular, bf16.
- baracuda_
kernels_ ⚠pad_ circular_ f16_ can_ implement baracuda_kernels_pad_circular_f16_can_implement(baracuda kernels pad circular f16 can implement).- baracuda_
kernels_ ⚠pad_ circular_ f16_ run - Pad circular, f16.
- baracuda_
kernels_ ⚠pad_ circular_ f32_ can_ implement baracuda_kernels_pad_circular_f32_can_implement(baracuda kernels pad circular f32 can implement).- baracuda_
kernels_ ⚠pad_ circular_ f32_ run - Pad circular, f32. Cyclic wrap from the opposite end of each axis.
- baracuda_
kernels_ ⚠pad_ circular_ f64_ can_ implement baracuda_kernels_pad_circular_f64_can_implement(baracuda kernels pad circular f64 can implement).- baracuda_
kernels_ ⚠pad_ circular_ f64_ run - Pad circular, f64.
- baracuda_
kernels_ ⚠pad_ constant_ backward_ bf16_ can_ implement baracuda_kernels_pad_constant_backward_bf16_can_implement(baracuda kernels pad constant backward bf16 can implement).- baracuda_
kernels_ ⚠pad_ constant_ backward_ bf16_ run - Pad-constant backward (slice), bf16.
- baracuda_
kernels_ ⚠pad_ constant_ backward_ f16_ can_ implement baracuda_kernels_pad_constant_backward_f16_can_implement(baracuda kernels pad constant backward f16 can implement).- baracuda_
kernels_ ⚠pad_ constant_ backward_ f16_ run - Pad-constant backward (slice), f16.
- baracuda_
kernels_ ⚠pad_ constant_ backward_ f32_ can_ implement baracuda_kernels_pad_constant_backward_f32_can_implement(baracuda kernels pad constant backward f32 can implement).- baracuda_
kernels_ ⚠pad_ constant_ backward_ f32_ run - Pad-constant backward (slice), f32.
- baracuda_
kernels_ ⚠pad_ constant_ backward_ f64_ can_ implement baracuda_kernels_pad_constant_backward_f64_can_implement(baracuda kernels pad constant backward f64 can implement).- baracuda_
kernels_ ⚠pad_ constant_ backward_ f64_ run - Pad-constant backward (slice), f64.
- baracuda_
kernels_ ⚠pad_ constant_ bf16_ can_ implement baracuda_kernels_pad_constant_bf16_can_implement(baracuda kernels pad constant bf16 can implement).- baracuda_
kernels_ ⚠pad_ constant_ bf16_ run - Pad with a constant value, bf16, contig output. The value argument carries the __nv_bfloat16 bit pattern as u16; Rust callers can produce it via half::bf16::to_bits(). - baracuda_
kernels_ ⚠pad_ constant_ f16_ can_ implement baracuda_kernels_pad_constant_f16_can_implement(baracuda kernels pad constant f16 can implement).- baracuda_
kernels_ ⚠pad_ constant_ f16_ run - Pad with a constant value, f16, contig output. The value argument carries the __half bit pattern as u16; Rust callers can produce it via half::f16::to_bits(). ABI-compatible because __half is a 2-byte __CUDA_ALIGN__(2) POD struct passed in the same register slot as unsigned short. - baracuda_
kernels_ ⚠pad_ constant_ f32_ can_ implement baracuda_kernels_pad_constant_f32_can_implement(baracuda kernels pad constant f32 can implement).- baracuda_
kernels_ ⚠pad_ constant_ f32_ run - Pad with a constant value, f32, contig output.
- baracuda_
kernels_ ⚠pad_ constant_ f64_ can_ implement - Implementability check for pad_constant_f64. - baracuda_
kernels_ ⚠pad_ constant_ f64_ run - Pad with a constant value, f64, contig output.
- baracuda_
kernels_ ⚠pad_ reflect_ bf16_ can_ implement - Implementability check for pad_reflect_bf16. - baracuda_
kernels_ ⚠pad_ reflect_ bf16_ run - Pad reflect, bf16.
- baracuda_
kernels_ ⚠pad_ reflect_ f16_ can_ implement - Implementability check for pad_reflect_f16. - baracuda_
kernels_ ⚠pad_ reflect_ f16_ run - Pad reflect, f16.
- baracuda_
kernels_ ⚠pad_ reflect_ f32_ can_ implement - Implementability check for pad_reflect_f32. - baracuda_
kernels_ ⚠pad_ reflect_ f32_ run - Pad reflect, f32. Mirror input across the boundary (no edge duplication).
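As a rough CPU illustration of "mirror input across the boundary (no edge duplication)", the index mapping along one axis might look like the sketch below. `reflect_index` and `pad_reflect_1d` are hypothetical helper names, not functions of this crate, and the sketch makes no claim about which pad widths the kernel itself accepts.

```rust
/// Hypothetical helper (not part of this crate): source index for one
/// output coordinate under reflect padding along a single axis.
fn reflect_index(out_i: usize, pad_left: usize, len: usize) -> usize {
    assert!(len >= 1, "input axis must be non-empty");
    if len == 1 {
        return 0; // a single element can only mirror onto itself
    }
    // Because edge elements are not repeated, the mirrored sequence is
    // periodic with period 2 * (len - 1).
    let period = 2 * (len as isize - 1);
    let i = (out_i as isize - pad_left as isize).rem_euclid(period);
    if i < len as isize {
        i as usize
    } else {
        (period - i) as usize
    }
}

/// Reflect-pad a 1-D slice by gathering through `reflect_index`.
fn pad_reflect_1d(input: &[f32], pad_left: usize, pad_right: usize) -> Vec<f32> {
    let out_len = pad_left + input.len() + pad_right;
    (0..out_len)
        .map(|o| input[reflect_index(o, pad_left, input.len())])
        .collect()
}
```

With input `[1, 2, 3, 4]` and two elements of padding on each side, the edge values 1 and 4 each appear only once at the original boundary.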
- baracuda_
kernels_ ⚠pad_ reflect_ f64_ can_ implement - Implementability check for pad_reflect_f64. - baracuda_
kernels_ ⚠pad_ reflect_ f64_ run - Pad reflect, f64.
- baracuda_
kernels_ ⚠pad_ replicate_ bf16_ can_ implement - Implementability check for
baracuda_kernels_pad_replicate_bf16. Host-side only. - baracuda_
kernels_ ⚠pad_ replicate_ bf16_ run - Pad replicate, bf16.
- baracuda_
kernels_ ⚠pad_ replicate_ f16_ can_ implement - Implementability check for
baracuda_kernels_pad_replicate_f16. Host-side only. - baracuda_
kernels_ ⚠pad_ replicate_ f16_ run - Pad replicate, f16.
- baracuda_
kernels_ ⚠pad_ replicate_ f32_ can_ implement - Implementability check for
baracuda_kernels_pad_replicate_f32. Host-side only. - baracuda_
kernels_ ⚠pad_ replicate_ f32_ run - Pad replicate, f32. Clamp to the edge value of the input.
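"Clamp to the edge value of the input" can be pictured as clamping the shifted output coordinate into the valid input range. The following is a minimal single-axis CPU sketch; `replicate_index` and `pad_replicate_1d` are hypothetical names used only for illustration.

```rust
/// Hypothetical helper (not part of this crate): source index for one
/// output coordinate under replicate padding along a single axis.
fn replicate_index(out_i: usize, pad_left: usize, len: usize) -> usize {
    assert!(len >= 1, "input axis must be non-empty");
    let shifted = out_i as isize - pad_left as isize;
    // Coordinates left of the input clamp to 0, right of it to len - 1.
    shifted.clamp(0, len as isize - 1) as usize
}

/// Replicate-pad a 1-D slice by gathering through `replicate_index`.
fn pad_replicate_1d(input: &[f32], pad_left: usize, pad_right: usize) -> Vec<f32> {
    let out_len = pad_left + input.len() + pad_right;
    (0..out_len)
        .map(|o| input[replicate_index(o, pad_left, input.len())])
        .collect()
}
```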
- baracuda_
kernels_ ⚠pad_ replicate_ f64_ can_ implement - Implementability check for
baracuda_kernels_pad_replicate_f64. Host-side only. - baracuda_
kernels_ ⚠pad_ replicate_ f64_ run - Pad replicate, f64.
- baracuda_
kernels_ ⚠permute_ bf16_ can_ implement - Pre-launch implementability check for
permute_bf16. - baracuda_
kernels_ ⚠permute_ bf16_ run - Materialized permute, bf16. Pure element copy — no math.
- baracuda_
kernels_ ⚠permute_ bf16_ strided_ can_ implement - Implementability check for permute_bf16_strided (companion to the strided run entry point). - baracuda_
kernels_ ⚠permute_ bf16_ strided_ run - Permute strided sibling, bf16.
- baracuda_
kernels_ ⚠permute_ f16_ can_ implement - Pre-launch implementability check for
permute_f16. - baracuda_
kernels_ ⚠permute_ f16_ run - Materialized permute, f16. Pure element copy — no math.
- baracuda_
kernels_ ⚠permute_ f16_ strided_ can_ implement - Implementability check for permute_f16_strided (companion to the strided run entry point). - baracuda_
kernels_ ⚠permute_ f16_ strided_ run - Permute strided sibling, f16.
- baracuda_
kernels_ ⚠permute_ f32_ can_ implement - Pre-launch implementability check for
permute_f32. - baracuda_
kernels_ ⚠permute_ f32_ run - Materialized permute, f32.
- baracuda_
kernels_ ⚠permute_ f32_ strided_ can_ implement - Implementability check for permute_f32_strided (companion to the strided run entry point). - baracuda_
kernels_ ⚠permute_ f32_ strided_ run - Permute strided sibling, f32.
- baracuda_
kernels_ ⚠permute_ f64_ can_ implement - Pre-launch implementability check for
permute_f64. - baracuda_
kernels_ ⚠permute_ f64_ run - Materialized permute, f64. Pure element copy — no math.
- baracuda_
kernels_ ⚠permute_ f64_ strided_ can_ implement - Implementability check for permute_f64_strided (companion to the strided run entry point). - baracuda_
kernels_ ⚠permute_ f64_ strided_ run - Permute strided sibling, f64.
- baracuda_
kernels_ ⚠pixel_ shuffle_ bf16_ can_ implement - Implementability check for pixel_shuffle_bf16. - baracuda_
kernels_ ⚠pixel_ shuffle_ bf16_ run - pixel_shuffle, bf16. Safety: as for the f32 variant. - baracuda_
kernels_ ⚠pixel_ shuffle_ f16_ can_ implement - Implementability check for pixel_shuffle_f16. - baracuda_
kernels_ ⚠pixel_ shuffle_ f16_ run - pixel_shuffle, f16. Safety: as for the f32 variant. - baracuda_
kernels_ ⚠pixel_ shuffle_ f32_ can_ implement - Implementability check for pixel_shuffle_f32. - baracuda_
kernels_ ⚠pixel_ shuffle_ f32_ run - pixel_shuffle(x, r): [N, C·r², H, W] → [N, C, H·r, W·r]. f32. Safety: as above. - baracuda_
kernels_ ⚠pixel_ shuffle_ f64_ can_ implement - Implementability check for pixel_shuffle_f64. - baracuda_
kernels_ ⚠pixel_ shuffle_ f64_ run - pixel_shuffle, f64. Safety: as for the f32 variant. - baracuda_
kernels_ ⚠pixel_ unshuffle_ bf16_ can_ implement - Implementability check for pixel_unshuffle_bf16. - baracuda_
kernels_ ⚠pixel_ unshuffle_ bf16_ run - pixel_unshuffle, bf16. Safety: as for the f32 variant. - baracuda_
kernels_ ⚠pixel_ unshuffle_ f16_ can_ implement - Implementability check for pixel_unshuffle_f16. - baracuda_
kernels_ ⚠pixel_ unshuffle_ f16_ run - pixel_unshuffle, f16. Safety: as for the f32 variant. - baracuda_
kernels_ ⚠pixel_ unshuffle_ f32_ can_ implement - Implementability check for pixel_unshuffle_f32. - baracuda_
kernels_ ⚠pixel_ unshuffle_ f32_ run - pixel_unshuffle(x, r): [N, C, H·r, W·r] → [N, C·r², H, W]. Inverse of pixel_shuffle (each is the other’s backward). f32. - baracuda_
kernels_ ⚠pixel_ unshuffle_ f64_ can_ implement - Implementability check for pixel_unshuffle_f64. - baracuda_
kernels_ ⚠pixel_ unshuffle_ f64_ run - pixel_unshuffle, f64. Safety: as for the f32 variant. - baracuda_
kernels_ ⚠prelu_ backward_ bf16_ can_ implement - Implementability check for
baracuda_kernels_prelu_backward_bf16. Host-side only. - baracuda_
kernels_ ⚠prelu_ backward_ bf16_ run - PReLU BW, bf16.
- baracuda_
kernels_ ⚠prelu_ backward_ f16_ can_ implement - Implementability check for
baracuda_kernels_prelu_backward_f16. Host-side only. - baracuda_
kernels_ ⚠prelu_ backward_ f16_ run - PReLU BW, f16.
- baracuda_
kernels_ ⚠prelu_ backward_ f32_ can_ implement - Implementability check for
baracuda_kernels_prelu_backward_f32. Host-side only. - baracuda_
kernels_ ⚠prelu_ backward_ f32_ run - PReLU BW, f32. ABI:
(numel, channel_stride, channel_extent, scalar_weight, dy, x, weight, dx, dweight, workspace, workspace_bytes, stream). - baracuda_
kernels_ ⚠prelu_ backward_ f64_ can_ implement - Implementability check for
baracuda_kernels_prelu_backward_f64. Host-side only. - baracuda_
kernels_ ⚠prelu_ backward_ f64_ run - PReLU BW, f64.
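For orientation, the PReLU backward math can be sketched on the CPU as below. This is a reference sketch, not the kernel: it assumes the channel of flat element `i` is `(i / channel_stride) % channel_extent`, that `scalar_weight` means a single shared weight, and that the gradient at `x == 0` follows the negative branch. None of these details are confirmed by the listing, and `prelu_backward` is a hypothetical name.

```rust
/// Hypothetical CPU reference for PReLU backward (not this crate's API).
/// Returns (dx, dweight).
fn prelu_backward(
    dy: &[f32],
    x: &[f32],
    weight: &[f32],
    channel_stride: usize,
    channel_extent: usize,
    scalar_weight: bool,
) -> (Vec<f32>, Vec<f32>) {
    assert_eq!(dy.len(), x.len());
    let mut dx = vec![0.0f32; x.len()];
    let mut dweight = vec![0.0f32; weight.len()];
    for i in 0..x.len() {
        // Assumed channel mapping; a scalar weight always uses slot 0.
        let c = if scalar_weight { 0 } else { (i / channel_stride) % channel_extent };
        if x[i] > 0.0 {
            // Positive branch: y = x, so the gradient passes through.
            dx[i] = dy[i];
        } else {
            // Negative branch: y = w * x.
            dx[i] = weight[c] * dy[i];
            dweight[c] += dy[i] * x[i];
        }
    }
    (dx, dweight)
}
```

The `dweight` accumulation is a reduction over all elements of a channel, which is one plausible reason the ABI carries a workspace pointer.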
- baracuda_
kernels_ ⚠prelu_ bf16_ can_ implement - Implementability check for prelu_bf16. - baracuda_
kernels_ ⚠prelu_ bf16_ run - PReLU FW, bf16.
- baracuda_
kernels_ ⚠prelu_ f16_ can_ implement - Implementability check for prelu_f16. - baracuda_
kernels_ ⚠prelu_ f16_ run - PReLU FW, f16.
- baracuda_
kernels_ ⚠prelu_ f32_ can_ implement - Implementability check for prelu_f32. - baracuda_
kernels_ ⚠prelu_ f32_ run - PReLU FW, f32. ABI:
(numel, channel_stride, channel_extent, scalar_weight, x, weight, y, workspace, workspace_bytes, stream). - baracuda_
kernels_ ⚠prelu_ f64_ can_ implement - Implementability check for prelu_f64. - baracuda_
kernels_ ⚠prelu_ f64_ run - PReLU FW, f64.
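The forward PReLU entries compute `y = x` for positive inputs and `y = weight * x` otherwise. A minimal CPU sketch follows, under the assumptions that the channel of flat element `i` is `(i / channel_stride) % channel_extent` and that `scalar_weight` selects a single shared weight. Both are guesses at the meaning of the ABI parameters, and `prelu_forward` is a hypothetical name.

```rust
/// Hypothetical CPU reference for PReLU forward (not this crate's API).
fn prelu_forward(
    x: &[f32],
    weight: &[f32],
    channel_stride: usize,
    channel_extent: usize,
    scalar_weight: bool,
) -> Vec<f32> {
    x.iter()
        .enumerate()
        .map(|(i, &v)| {
            // Assumed channel mapping; a scalar weight always uses slot 0.
            let c = if scalar_weight { 0 } else { (i / channel_stride) % channel_extent };
            if v > 0.0 { v } else { weight[c] * v }
        })
        .collect()
}
```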
- baracuda_
kernels_ ⚠qr_ f32_ run - QR factorization (packed Householder output,
m >= n required). a_inout is overwritten with R (upper triangle) + Householder reflectors (strict lower); tau_out is [min(m, n)]. - baracuda_
kernels_ ⚠qr_ f32_ workspace_ size - QR factorization workspace size in bytes for
geqrf. - baracuda_
kernels_ ⚠qr_ f64_ run - QR factorization (packed Householder output,
m >= n required). a_inout is overwritten with R (upper triangle) + Householder reflectors (strict lower); tau_out is [min(m, n)]. - baracuda_
kernels_ ⚠qr_ f64_ workspace_ size - QR factorization workspace size in bytes for
geqrf. - baracuda_
kernels_ ⚠quantize_ per_ channel_ backward_ bf16_ can_ implement - Implementability check for
quantize_per_channel_backward_bf16. - baracuda_
kernels_ ⚠quantize_ per_ channel_ backward_ bf16_ run quantize_per_channel_backward— bf16.- baracuda_
kernels_ ⚠quantize_ per_ channel_ backward_ f16_ can_ implement - Implementability check for
quantize_per_channel_backward_f16. - baracuda_
kernels_ ⚠quantize_ per_ channel_ backward_ f16_ run quantize_per_channel_backward— f16.- baracuda_
kernels_ ⚠quantize_ per_ channel_ backward_ f32_ can_ implement - Implementability check for
quantize_per_channel_backward_f32. - baracuda_
kernels_ ⚠quantize_ per_ channel_ backward_ f32_ run dx[i] = (dy[i] / scale[c]) * in_range_mask(x[i]). f32.- baracuda_
kernels_ ⚠quantize_ per_ channel_ backward_ f64_ can_ implement - Implementability check for
quantize_per_channel_backward_f64. - baracuda_
kernels_ ⚠quantize_ per_ channel_ backward_ f64_ run quantize_per_channel_backward— f64.- baracuda_
kernels_ ⚠quantize_ per_ channel_ bf16_ s8_ can_ implement - Implementability check for
quantize_per_channel_bf16_s8. - baracuda_
kernels_ ⚠quantize_ per_ channel_ bf16_ s8_ run quantize_per_channel— bf16 → s8.- baracuda_
kernels_ ⚠quantize_ per_ channel_ bf16_ u8_ can_ implement - Implementability check for
quantize_per_channel_bf16_u8. - baracuda_
kernels_ ⚠quantize_ per_ channel_ bf16_ u8_ run quantize_per_channel— bf16 → u8.- baracuda_
kernels_ ⚠quantize_ per_ channel_ f16_ s8_ can_ implement - Implementability check for
quantize_per_channel_f16_s8. - baracuda_
kernels_ ⚠quantize_ per_ channel_ f16_ s8_ run quantize_per_channel— f16 → s8.- baracuda_
kernels_ ⚠quantize_ per_ channel_ f16_ u8_ can_ implement - Implementability check for
quantize_per_channel_f16_u8. - baracuda_
kernels_ ⚠quantize_ per_ channel_ f16_ u8_ run quantize_per_channel— f16 → u8.- baracuda_
kernels_ ⚠quantize_ per_ channel_ f32_ s8_ can_ implement - Implementability check for
quantize_per_channel_f32_s8. - baracuda_
kernels_ ⚠quantize_ per_ channel_ f32_ s8_ run - q[i] = clamp(round(x[i]/scale[c]) + zp[c], qmin, qmax), where c = coord[axis]. f32 → s8. - baracuda_
kernels_ ⚠quantize_ per_ channel_ f32_ u8_ can_ implement - Implementability check for
quantize_per_channel_f32_u8. - baracuda_
kernels_ ⚠quantize_ per_ channel_ f32_ u8_ run quantize_per_channel— f32 → u8.- baracuda_
kernels_ ⚠quantize_ per_ channel_ f64_ s8_ can_ implement - Implementability check for
quantize_per_channel_f64_s8. - baracuda_
kernels_ ⚠quantize_ per_ channel_ f64_ s8_ run quantize_per_channel— f64 → s8.- baracuda_
kernels_ ⚠quantize_ per_ channel_ f64_ u8_ can_ implement - Implementability check for
quantize_per_channel_f64_u8. - baracuda_
kernels_ ⚠quantize_ per_ channel_ f64_ u8_ run quantize_per_channel— f64 → u8.- baracuda_
kernels_ ⚠quantize_ per_ group_ backward_ bf16_ can_ implement - Implementability check for
quantize_per_group_backward_bf16. - baracuda_
kernels_ ⚠quantize_ per_ group_ backward_ bf16_ run - STE BW — bf16.
- baracuda_
kernels_ ⚠quantize_ per_ group_ backward_ f16_ can_ implement - Implementability check for
quantize_per_group_backward_f16. - baracuda_
kernels_ ⚠quantize_ per_ group_ backward_ f16_ run - STE BW — f16.
- baracuda_
kernels_ ⚠quantize_ per_ group_ backward_ f32_ can_ implement - Implementability check for
quantize_per_group_backward_f32. - baracuda_
kernels_ ⚠quantize_ per_ group_ backward_ f32_ run - STE BW — f32.
- baracuda_
kernels_ ⚠quantize_ per_ group_ backward_ f64_ can_ implement - Implementability check for
quantize_per_group_backward_f64. - baracuda_
kernels_ ⚠quantize_ per_ group_ backward_ f64_ run - STE BW — f64.
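The per-channel formula listed earlier, `q[i] = clamp(round(x[i]/scale[c]) + zp[c], qmin, qmax)` with `c = coord[axis]`, can be sketched on the CPU as below. The sketch assumes a contiguous tensor viewed as `[outer, channels, inner]`, so that `c = (i / inner) % channels`, and uses Rust's `f32::round` (ties away from zero). The kernel's actual rounding mode and layout handling are not stated here, and `quantize_per_channel_s8` is a hypothetical name.

```rust
/// Hypothetical CPU reference for per-channel f32 -> s8 quantization.
/// `inner` is the number of contiguous elements per channel slice.
fn quantize_per_channel_s8(x: &[f32], scale: &[f32], zp: &[i32], inner: usize) -> Vec<i8> {
    assert_eq!(scale.len(), zp.len());
    assert!(inner >= 1 && !scale.is_empty());
    let channels = scale.len();
    let (qmin, qmax) = (i8::MIN as i32, i8::MAX as i32);
    x.iter()
        .enumerate()
        .map(|(i, &v)| {
            // Assumed layout: channel coordinate of flat index i.
            let c = (i / inner) % channels;
            // Float-to-int `as` casts saturate, so huge values stay finite.
            let q = ((v / scale[c]).round() as i32).saturating_add(zp[c]);
            q.clamp(qmin, qmax) as i8
        })
        .collect()
}
```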
- baracuda_
kernels_ ⚠quantize_ per_ group_ bf16_ s8_ can_ implement - Implementability check for
quantize_per_group_bf16_s8. - baracuda_
kernels_ ⚠quantize_ per_ group_ bf16_ s8_ run quantize_per_group— bf16 → s8.- baracuda_
kernels_ ⚠quantize_ per_ group_ bf16_ u8_ can_ implement - Implementability check for
quantize_per_group_bf16_u8. - baracuda_
kernels_ ⚠quantize_ per_ group_ bf16_ u8_ run quantize_per_group— bf16 → u8.- baracuda_
kernels_ ⚠quantize_ per_ group_ f16_ s8_ can_ implement - Implementability check for
quantize_per_group_f16_s8. - baracuda_
kernels_ ⚠quantize_ per_ group_ f16_ s8_ run quantize_per_group— f16 → s8.- baracuda_
kernels_ ⚠quantize_ per_ group_ f16_ u8_ can_ implement - Implementability check for
quantize_per_group_f16_u8. - baracuda_
kernels_ ⚠quantize_ per_ group_ f16_ u8_ run quantize_per_group— f16 → u8.- baracuda_
kernels_ ⚠quantize_ per_ group_ f32_ s8_ can_ implement - Implementability check for
quantize_per_group_f32_s8. - baracuda_
kernels_ ⚠quantize_ per_ group_ f32_ s8_ run quantize_per_group— f32 → s8.- baracuda_
kernels_ ⚠quantize_ per_ group_ f32_ u8_ can_ implement - Implementability check for
quantize_per_group_f32_u8. - baracuda_
kernels_ ⚠quantize_ per_ group_ f32_ u8_ run quantize_per_group— f32 → u8.- baracuda_
kernels_ ⚠quantize_ per_ group_ f64_ s8_ can_ implement - Implementability check for
quantize_per_group_f64_s8. - baracuda_
kernels_ ⚠quantize_ per_ group_ f64_ s8_ run quantize_per_group— f64 → s8.- baracuda_
kernels_ ⚠quantize_ per_ group_ f64_ u8_ can_ implement - Implementability check for
quantize_per_group_f64_u8. - baracuda_
kernels_ ⚠quantize_ per_ group_ f64_ u8_ run quantize_per_group— f64 → u8.- baracuda_
kernels_ ⚠quantize_ per_ tensor_ backward_ bf16_ can_ implement - Implementability check for
quantize_per_tensor_backward_bf16. - baracuda_
kernels_ ⚠quantize_ per_ tensor_ backward_ bf16_ run quantize_per_tensor_backward— bf16.- baracuda_
kernels_ ⚠quantize_ per_ tensor_ backward_ f16_ can_ implement - Implementability check for
quantize_per_tensor_backward_f16. - baracuda_
kernels_ ⚠quantize_ per_ tensor_ backward_ f16_ run quantize_per_tensor_backward— f16.- baracuda_
kernels_ ⚠quantize_ per_ tensor_ backward_ f32_ can_ implement - Implementability check for
quantize_per_tensor_backward_f32. - baracuda_
kernels_ ⚠quantize_ per_ tensor_ backward_ f32_ run dx = (dy / scale) * in_range_mask(x). f32.- baracuda_
kernels_ ⚠quantize_ per_ tensor_ backward_ f64_ can_ implement - Implementability check for
quantize_per_tensor_backward_f64. - baracuda_
kernels_ ⚠quantize_ per_ tensor_ backward_ f64_ run quantize_per_tensor_backward— f64 (f64 scale).- baracuda_
kernels_ ⚠quantize_ per_ tensor_ bf16_ s8_ can_ implement - Implementability check for
quantize_per_tensor_bf16_s8. - baracuda_
kernels_ ⚠quantize_ per_ tensor_ bf16_ s8_ run quantize_per_tensor— bf16 → s8.- baracuda_
kernels_ ⚠quantize_ per_ tensor_ bf16_ u8_ can_ implement - Implementability check for
quantize_per_tensor_bf16_u8. - baracuda_
kernels_ ⚠quantize_ per_ tensor_ bf16_ u8_ run quantize_per_tensor— bf16 → u8.- baracuda_
kernels_ ⚠quantize_ per_ tensor_ f16_ s8_ can_ implement - Implementability check for
quantize_per_tensor_f16_s8. - baracuda_
kernels_ ⚠quantize_ per_ tensor_ f16_ s8_ run quantize_per_tensor— f16 → s8.- baracuda_
kernels_ ⚠quantize_ per_ tensor_ f16_ u8_ can_ implement - Implementability check for
quantize_per_tensor_f16_u8. - baracuda_
kernels_ ⚠quantize_ per_ tensor_ f16_ u8_ run quantize_per_tensor— f16 → u8.- baracuda_
kernels_ ⚠quantize_ per_ tensor_ f32_ s8_ can_ implement - Implementability check for
quantize_per_tensor_f32_s8. - baracuda_
kernels_ ⚠quantize_ per_ tensor_ f32_ s8_ run q = clamp(round(x/scale)+zp, qmin, qmax). f32 input, s8 output.- baracuda_
kernels_ ⚠quantize_ per_ tensor_ f32_ u8_ can_ implement - Implementability check for
quantize_per_tensor_f32_u8. - baracuda_
kernels_ ⚠quantize_ per_ tensor_ f32_ u8_ run quantize_per_tensor— f32 → u8.- baracuda_
kernels_ ⚠quantize_ per_ tensor_ f64_ s8_ can_ implement - Implementability check for
quantize_per_tensor_f64_s8. - baracuda_
kernels_ ⚠quantize_ per_ tensor_ f64_ s8_ run quantize_per_tensor— f64 → s8 (f64 scale).- baracuda_
kernels_ ⚠quantize_ per_ tensor_ f64_ u8_ can_ implement - Implementability check for
quantize_per_tensor_f64_u8. - baracuda_
kernels_ ⚠quantize_ per_ tensor_ f64_ u8_ run quantize_per_tensor— f64 → u8 (f64 scale).- baracuda_
kernels_ ⚠quantize_ per_ token_ backward_ bf16_ can_ implement - Implementability check for
quantize_per_token_backward_bf16. - baracuda_
kernels_ ⚠quantize_ per_ token_ backward_ bf16_ run - STE backward — bf16.
- baracuda_
kernels_ ⚠quantize_ per_ token_ backward_ f16_ can_ implement - Implementability check for
quantize_per_token_backward_f16. - baracuda_
kernels_ ⚠quantize_ per_ token_ backward_ f16_ run - STE backward — f16.
- baracuda_
kernels_ ⚠quantize_ per_ token_ backward_ f32_ can_ implement - Implementability check for
quantize_per_token_backward_f32. - baracuda_
kernels_ ⚠quantize_ per_ token_ backward_ f32_ run - STE backward — f32.
- baracuda_
kernels_ ⚠quantize_ per_ token_ backward_ f64_ can_ implement - Implementability check for
quantize_per_token_backward_f64. - baracuda_
kernels_ ⚠quantize_ per_ token_ backward_ f64_ run - STE backward — f64.
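The straight-through-estimator (STE) backward entries carry no formula of their own, but the per-tensor variant listed earlier gives `dx = (dy / scale) * in_range_mask(x)` alongside the forward `q = clamp(round(x/scale)+zp, qmin, qmax)`. The CPU sketch below pairs the two for s8 output. It assumes `in_range_mask(x)` is 1 exactly when the unclamped quantized value lies in `[qmin, qmax]`, which is one common STE convention and not confirmed by the listing. All function names are hypothetical.

```rust
/// Unclamped quantized value, shared by forward and backward.
fn unclamped_q(v: f32, scale: f32, zp: i32) -> i32 {
    ((v / scale).round() as i32).saturating_add(zp)
}

/// Hypothetical CPU reference: per-tensor f32 -> s8 quantization.
fn quantize_per_tensor_s8(x: &[f32], scale: f32, zp: i32) -> Vec<i8> {
    x.iter()
        .map(|&v| unclamped_q(v, scale, zp).clamp(i8::MIN as i32, i8::MAX as i32) as i8)
        .collect()
}

/// Hypothetical CPU reference: STE backward for the function above.
/// Assumption: the gradient is zeroed wherever the forward pass clamped.
fn quantize_per_tensor_backward(dy: &[f32], x: &[f32], scale: f32, zp: i32) -> Vec<f32> {
    assert_eq!(dy.len(), x.len());
    dy.iter()
        .zip(x)
        .map(|(&g, &v)| {
            let q = unclamped_q(v, scale, zp);
            let in_range = q >= i8::MIN as i32 && q <= i8::MAX as i32;
            if in_range { g / scale } else { 0.0 }
        })
        .collect()
}
```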
- baracuda_
kernels_ ⚠quantize_ per_ token_ bf16_ s8_ can_ implement - Implementability check for
quantize_per_token_bf16_s8. - baracuda_
kernels_ ⚠quantize_ per_ token_ bf16_ s8_ run quantize_per_token— bf16 → s8.- baracuda_
kernels_ ⚠quantize_ per_ token_ bf16_ u8_ can_ implement - Implementability check for
quantize_per_token_bf16_u8. - baracuda_
kernels_ ⚠quantize_ per_ token_ bf16_ u8_ run quantize_per_token— bf16 → u8.- baracuda_
kernels_ ⚠quantize_ per_ token_ f16_ s8_ can_ implement - Implementability check for
quantize_per_token_f16_s8. - baracuda_
kernels_ ⚠quantize_ per_ token_ f16_ s8_ run quantize_per_token— f16 → s8.- baracuda_
kernels_ ⚠quantize_ per_ token_ f16_ u8_ can_ implement - Implementability check for
quantize_per_token_f16_u8. - baracuda_
kernels_ ⚠quantize_ per_ token_ f16_ u8_ run quantize_per_token— f16 → u8.- baracuda_
kernels_ ⚠quantize_ per_ token_ f32_ s8_ can_ implement - Implementability check for
quantize_per_token_f32_s8. - baracuda_
kernels_ ⚠quantize_ per_ token_ f32_ s8_ run quantize_per_token— TIn f32, TOut s8. Status codes as elsewhere.- baracuda_
kernels_ ⚠quantize_ per_ token_ f32_ u8_ can_ implement - Implementability check for
quantize_per_token_f32_u8. - baracuda_
kernels_ ⚠quantize_ per_ token_ f32_ u8_ run quantize_per_token— f32 → u8.- baracuda_
kernels_ ⚠quantize_ per_ token_ f64_ s8_ can_ implement - Implementability check for
quantize_per_token_f64_s8. - baracuda_
kernels_ ⚠quantize_ per_ token_ f64_ s8_ run quantize_per_token— f64 → s8.- baracuda_
kernels_ ⚠quantize_ per_ token_ f64_ u8_ can_ implement - Implementability check for
quantize_per_token_f64_u8. - baracuda_
kernels_ ⚠quantize_ per_ token_ f64_ u8_ run quantize_per_token— f64 → u8.- baracuda_
kernels_ ⚠quantize_ q8_ 1_ bf16_ can_ implement - Implementability check for quantize_q8_1_bf16. - baracuda_
kernels_ ⚠quantize_ q8_ 1_ bf16_ run - Q8_1 activation staging, bf16 source. Safety: as for the f32 variant.
- baracuda_
kernels_ ⚠quantize_ q8_ 1_ f16_ can_ implement - Implementability check for quantize_q8_1_f16. - baracuda_
kernels_ ⚠quantize_ q8_ 1_ f16_ run - Q8_1 activation staging, f16 source. Safety: as for the f32 variant.
- baracuda_
kernels_ ⚠quantize_ q8_ 1_ f32_ can_ implement - Implementability check for quantize_q8_1_f32. - baracuda_
kernels_ ⚠quantize_ q8_ 1_ f32_ run - Q8_1 activation staging — f32 source.
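The Q8_1 entries in this list size the staging workspace as `ny * ceil(kx / 32) * 36` bytes, with 0 for non-positive arguments. A host-side sketch of that arithmetic is shown below. `q8_1_workspace_bytes` and its integer types are hypothetical stand-ins rather than the crate's signature. The 36-byte block size comes straight from the formula. It is consistent with 32 one-byte quants plus 4 bytes of per-block metadata, though the block layout is not spelled out here.

```rust
/// Elements covered by one Q8_1 block, per the listed formula.
const Q8_1_BLOCK_ELEMS: u64 = 32;
/// Bytes per Q8_1 block, per the listed formula.
const Q8_1_BLOCK_BYTES: u64 = 36;

/// Hypothetical host-side mirror of the workspace-size formula.
fn q8_1_workspace_bytes(ny: i64, kx: i64) -> u64 {
    if ny <= 0 || kx <= 0 {
        return 0; // invalid (non-positive) arguments
    }
    // ceil(kx / 32) in integer arithmetic.
    let blocks_per_row = (kx as u64 + Q8_1_BLOCK_ELEMS - 1) / Q8_1_BLOCK_ELEMS;
    ny as u64 * blocks_per_row * Q8_1_BLOCK_BYTES
}
```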
- baracuda_
kernels_ ⚠quantize_ q8_ 1_ workspace_ bytes - Returns workspace bytes needed to stage
ny × kx activations into Q8_1: ny * ceil(kx / 32) * 36. Returns 0 on invalid (non-positive) arguments. - baracuda_
kernels_ ⚠quantized_ linear_ w8a8_ f32_ can_ implement - Implementability check for
quantized_linear_w8a8_f32. - baracuda_
kernels_ ⚠quantized_ linear_ w8a8_ f32_ run quantized_linear_w8a8— TIn = f32.- baracuda_
kernels_ ⚠quantized_ linear_ w8a8_ f64_ can_ implement - Implementability check for
quantized_linear_w8a8_f64. - baracuda_
kernels_ ⚠quantized_ linear_ w8a8_ f64_ run quantized_linear_w8a8— TIn = f64.- baracuda_
kernels_ ⚠reduce_ all_ bf16_ can_ implement - Pre-launch implementability check for
reduce_all_bf16. - baracuda_
kernels_ ⚠reduce_ all_ bf16_ run - all(x, axis=k) with bf16 input, uint8_t Bool output. - baracuda_
kernels_ ⚠reduce_ all_ bool_ can_ implement - Pre-launch implementability check for
reduce_all_bool. - baracuda_
kernels_ ⚠reduce_ all_ bool_ run - all(x, axis=k) with Bool (uint8_t) input, uint8_t Bool output. - baracuda_
kernels_ ⚠reduce_ all_ f16_ can_ implement - Pre-launch implementability check for
reduce_all_f16. - baracuda_
kernels_ ⚠reduce_ all_ f16_ run - all(x, axis=k) with f16 input, uint8_t Bool output. - baracuda_
kernels_ ⚠reduce_ all_ f32_ can_ implement - Pre-launch implementability check for
reduce_all_f32. - baracuda_
kernels_ ⚠reduce_ all_ f32_ run - all(x, axis=k) with f32 input, uint8_t Bool output. - baracuda_
kernels_ ⚠reduce_ all_ f64_ can_ implement - Pre-launch implementability check for
reduce_all_f64. - baracuda_
kernels_ ⚠reduce_ all_ f64_ run - all(x, axis=k) with f64 input, uint8_t Bool output. - baracuda_
kernels_ ⚠reduce_ all_ i32_ can_ implement - Pre-launch implementability check for
reduce_all_i32. - baracuda_
kernels_ ⚠reduce_ all_ i32_ run - all(x, axis=k) with i32 input, uint8_t Bool output. - baracuda_
kernels_ ⚠reduce_ all_ i64_ can_ implement - Pre-launch implementability check for
reduce_all_i64. - baracuda_
kernels_ ⚠reduce_ all_ i64_ run - all(x, axis=k) with i64 input, uint8_t Bool output. - baracuda_
kernels_ ⚠reduce_ any_ bf16_ can_ implement - Pre-launch implementability check for
reduce_any_bf16. - baracuda_
kernels_ ⚠reduce_ any_ bf16_ run - any(x, axis=k) with bf16 input, uint8_t Bool output. - baracuda_
kernels_ ⚠reduce_ any_ bool_ can_ implement - Pre-launch implementability check for
reduce_any_bool. - baracuda_
kernels_ ⚠reduce_ any_ bool_ run - any(x, axis=k) with Bool (uint8_t) input, uint8_t Bool output. - baracuda_
kernels_ ⚠reduce_ any_ f16_ can_ implement - Pre-launch implementability check for
reduce_any_f16. - baracuda_
kernels_ ⚠reduce_ any_ f16_ run - any(x, axis=k) with f16 input, uint8_t Bool output. - baracuda_
kernels_ ⚠reduce_ any_ f32_ can_ implement - Pre-launch implementability check for
reduce_any_f32. - baracuda_
kernels_ ⚠reduce_ any_ f32_ run - any(x, axis=k) with f32 input, uint8_t Bool output. - baracuda_
kernels_ ⚠reduce_ any_ f64_ can_ implement - Pre-launch implementability check for
reduce_any_f64. - baracuda_
kernels_ ⚠reduce_ any_ f64_ run - any(x, axis=k) with f64 input, uint8_t Bool output. - baracuda_
kernels_ ⚠reduce_ any_ i32_ can_ implement - Pre-launch implementability check for
reduce_any_i32. - baracuda_
kernels_ ⚠reduce_ any_ i32_ run - any(x, axis=k) with i32 input, uint8_t Bool output. - baracuda_
kernels_ ⚠reduce_ any_ i64_ can_ implement - Pre-launch implementability check for
reduce_any_i64. - baracuda_
kernels_ ⚠reduce_ any_ i64_ run - any(x, axis=k) with i64 input, uint8_t Bool output. - baracuda_
kernels_ ⚠reduce_ count_ nonzero_ bf16_ can_ implement - Pre-launch implementability check for
reduce_count_nonzero_bf16. - baracuda_
kernels_ ⚠reduce_ count_ nonzero_ bf16_ run - count_nonzero(x, axis=k) with bf16 input, i64 output. - baracuda_
kernels_ ⚠reduce_ count_ nonzero_ bool_ can_ implement - Pre-launch implementability check for
reduce_count_nonzero_bool. - baracuda_
kernels_ ⚠reduce_ count_ nonzero_ bool_ run - count_nonzero(x, axis=k) with Bool (uint8_t) input, i64 output. - baracuda_
kernels_ ⚠reduce_ count_ nonzero_ f16_ can_ implement - Pre-launch implementability check for
reduce_count_nonzero_f16. - baracuda_
kernels_ ⚠reduce_ count_ nonzero_ f16_ run - count_nonzero(x, axis=k) with f16 input, i64 output. - baracuda_
kernels_ ⚠reduce_ count_ nonzero_ f32_ can_ implement - Pre-launch implementability check for
reduce_count_nonzero_f32. - baracuda_
kernels_ ⚠reduce_ count_ nonzero_ f32_ run - count_nonzero(x, axis=k) with f32 input, i64 output. - baracuda_
kernels_ ⚠reduce_ count_ nonzero_ f64_ can_ implement - Pre-launch implementability check for
reduce_count_nonzero_f64. - baracuda_
kernels_ ⚠reduce_ count_ nonzero_ f64_ run - count_nonzero(x, axis=k) with f64 input, i64 output. - baracuda_
kernels_ ⚠reduce_ count_ nonzero_ i32_ can_ implement - Pre-launch implementability check for
reduce_count_nonzero_i32. - baracuda_
kernels_ ⚠reduce_ count_ nonzero_ i32_ run - count_nonzero(x, axis=k) with i32 input, i64 output. - baracuda_
kernels_ ⚠reduce_ count_ nonzero_ i64_ can_ implement - Pre-launch implementability check for
reduce_count_nonzero_i64. - baracuda_
kernels_ ⚠reduce_ count_ nonzero_ i64_ run - count_nonzero(x, axis=k) with i64 input, i64 output. - baracuda_
kernels_ ⚠reduce_ logsumexp_ backward_ bf16_ can_ implement - Pre-launch implementability check for
reduce_logsumexp_backward_bf16. - baracuda_
kernels_ ⚠reduce_ logsumexp_ backward_ bf16_ run - LogSumExp reduction backward, bf16.
- baracuda_
kernels_ ⚠reduce_ logsumexp_ backward_ f16_ can_ implement - Pre-launch implementability check for
reduce_logsumexp_backward_f16. - baracuda_
kernels_ ⚠reduce_ logsumexp_ backward_ f16_ run - LogSumExp reduction backward, f16.
- baracuda_
kernels_ ⚠reduce_ logsumexp_ backward_ f32_ can_ implement - Pre-launch implementability check for
reduce_logsumexp_backward_f32. - baracuda_
kernels_ ⚠reduce_ logsumexp_ backward_ f32_ run - LogSumExp reduction backward, f32.
- baracuda_
kernels_ ⚠reduce_ logsumexp_ backward_ f64_ can_ implement - Pre-launch implementability check for
reduce_logsumexp_backward_f64. - baracuda_
kernels_ ⚠reduce_ logsumexp_ backward_ f64_ run - LogSumExp reduction backward, f64.
- baracuda_
kernels_ ⚠reduce_ logsumexp_ bf16_ can_ implement - Implementability check for
baracuda_kernels_reduce_logsumexp_bf16. Host-side only. - baracuda_
kernels_ ⚠reduce_ logsumexp_ bf16_ run - LogSumExp reduction along one axis, bf16 (f32-detour throughout).
- baracuda_
kernels_ ⚠reduce_ logsumexp_ f16_ can_ implement - Implementability check for
baracuda_kernels_reduce_logsumexp_f16. Host-side only. - baracuda_
kernels_ ⚠reduce_ logsumexp_ f16_ run - LogSumExp reduction along one axis, f16 (f32-detour throughout).
- baracuda_
kernels_ ⚠reduce_ logsumexp_ f32_ can_ implement - Implementability check for
baracuda_kernels_reduce_logsumexp_f32. Host-side only. - baracuda_
kernels_ ⚠reduce_ logsumexp_ f32_ run - LogSumExp reduction along one axis, f32 — numerically stable two-pass max-then-sum-exp. Shares the simple-reduce parameter shape so the Rust dispatcher can reach it through the same FFI signature; the kernel internally performs two passes over the reduce axis.
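The "two-pass max-then-sum-exp" scheme mentioned for the f32 entry can be sketched for a single reduce row as follows. This is a CPU illustration of the numerical idea only. `logsumexp` is a hypothetical helper, and the early return for an infinite maximum is a choice made in this sketch, not documented kernel behavior.

```rust
/// Hypothetical CPU reference: numerically stable log(sum(exp(x))).
fn logsumexp(x: &[f32]) -> f32 {
    // Pass 1: running maximum, starting from -inf.
    let m = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    if m.is_infinite() {
        // Empty or all -inf input (or a +inf element): subtracting the
        // maximum would produce NaN, so return it directly.
        return m;
    }
    // Pass 2: every exponent is <= 0, so exp() cannot overflow.
    let s: f32 = x.iter().map(|&v| (v - m).exp()).sum();
    m + s.ln()
}
```

A naive `ln(sum(exp(x)))` overflows to infinity for inputs near 1000 in f32, whereas the shifted form stays finite.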
- baracuda_
kernels_ ⚠reduce_ logsumexp_ f64_ can_ implement - Implementability check for
baracuda_kernels_reduce_logsumexp_f64. Host-side only. - baracuda_
kernels_ ⚠reduce_ logsumexp_ f64_ run - LogSumExp reduction along one axis, f64.
- baracuda_
kernels_ ⚠reduce_ max_ bf16_ can_ implement - Pre-launch implementability check for
reduce_max_bf16. - baracuda_
kernels_ ⚠reduce_ max_ bf16_ run - Max reduction along one axis, bf16 (f32-detour fmaxf).
- baracuda_
kernels_ ⚠reduce_ max_ f16_ can_ implement - Pre-launch implementability check for
reduce_max_f16. - baracuda_
kernels_ ⚠reduce_ max_ f16_ run - Max reduction along one axis, f16 (f32-detour fmaxf).
- baracuda_
kernels_ ⚠reduce_ max_ f32_ can_ implement - Pre-launch implementability check for
reduce_max_f32. - baracuda_
kernels_ ⚠reduce_ max_ f32_ run - Max reduction along one axis, f32.
init = -INFINITY, fmaxf. - baracuda_
kernels_ ⚠reduce_ max_ f64_ can_ implement - Pre-launch implementability check for
reduce_max_f64. - baracuda_
kernels_ ⚠reduce_ max_ f64_ run - Max reduction along one axis, f64.
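A max reduction along one axis with `init = -INFINITY` and `fmaxf`-style combination can be sketched on the CPU as below. The sketch assumes a contiguous tensor viewed as `[outer, reduce, inner]`, which is an illustrative layout rather than the documented ABI. `reduce_max_axis` is a hypothetical name. Rust's `f32::max` returns the non-NaN operand when one argument is NaN, which matches C's `fmaxf`.

```rust
/// Hypothetical CPU reference: max over the middle axis of a contiguous
/// [outer, reduce, inner] tensor, producing [outer, inner].
fn reduce_max_axis(x: &[f32], outer: usize, reduce: usize, inner: usize) -> Vec<f32> {
    assert_eq!(x.len(), outer * reduce * inner);
    // Identity element: stays -inf when the reduce extent is zero.
    let mut out = vec![f32::NEG_INFINITY; outer * inner];
    for o in 0..outer {
        for r in 0..reduce {
            for i in 0..inner {
                let v = x[(o * reduce + r) * inner + i];
                let slot = &mut out[o * inner + i];
                *slot = slot.max(v);
            }
        }
    }
    out
}
```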
- baracuda_
kernels_ ⚠reduce_ max_ i8_ can_ implement - Pre-launch implementability check for
reduce_max_i8. - baracuda_
kernels_ ⚠reduce_ max_ i8_ run - max(x, axis=k) with i8 input/output (init = INT8_MIN). - baracuda_
kernels_ ⚠reduce_ max_ i16_ can_ implement - Pre-launch implementability check for
reduce_max_i16. - baracuda_
kernels_ ⚠reduce_ max_ i16_ run - max(x, axis=k) with i16 input/output (init = INT16_MIN). - baracuda_
kernels_ ⚠reduce_ max_ i32_ can_ implement - Pre-launch implementability check for
reduce_max_i32. - baracuda_
kernels_ ⚠reduce_ max_ i32_ run - max(x, axis=k) with i32 input/output (init = INT32_MIN). - baracuda_
kernels_ ⚠reduce_ max_ i64_ can_ implement - Pre-launch implementability check for
reduce_max_i64. - baracuda_
kernels_ ⚠reduce_ max_ i64_ run - max(x, axis=k) with i64 input/output (init = INT64_MIN). - baracuda_
kernels_ ⚠reduce_ max_ min_ backward_ bf16_ can_ implement - Pre-launch implementability check for
reduce_max_min_backward_bf16. - baracuda_
kernels_ ⚠reduce_ max_ min_ backward_ bf16_ run - Max/Min reduction backward, bf16.
- baracuda_
kernels_ ⚠reduce_ max_ min_ backward_ f16_ can_ implement - Pre-launch implementability check for
reduce_max_min_backward_f16. - baracuda_
kernels_ ⚠reduce_ max_ min_ backward_ f16_ run - Max/Min reduction backward, f16.
- baracuda_
kernels_ ⚠reduce_ max_ min_ backward_ f32_ can_ implement - Pre-launch implementability check for
reduce_max_min_backward_f32. - baracuda_
kernels_ ⚠reduce_ max_ min_ backward_ f32_ run - Max/Min reduction backward, f32.
- baracuda_
kernels_ ⚠reduce_ max_ min_ backward_ f64_ can_ implement - Pre-launch implementability check for
reduce_max_min_backward_f64. - baracuda_
kernels_ ⚠reduce_ max_ min_ backward_ f64_ run - Max/Min reduction backward, f64.
- baracuda_
kernels_ ⚠reduce_ max_ to_ bf16_ can_ implement - Implementability check for reduce_max_to_bf16. - baracuda_
kernels_ ⚠reduce_ max_ to_ bf16_ run - reduce_max_to, bf16. - baracuda_
kernels_ ⚠reduce_ max_ to_ f16_ can_ implement - Implementability check for reduce_max_to_f16. - baracuda_
kernels_ ⚠reduce_ max_ to_ f16_ run - reduce_max_to, f16. Identity is -FLT_MAX in f32 accumulator space, narrowed back to f16 on store. - baracuda_
kernels_ ⚠reduce_ max_ to_ f32_ can_ implement - Implementability check for reduce_max_to_f32. - baracuda_
kernels_ ⚠reduce_ max_ to_ f32_ run - reduce_max_to, f32. Identity is -FLT_MAX when the broadcast set is empty. - baracuda_
kernels_ ⚠reduce_ max_ to_ f64_ can_ implement - Implementability check for reduce_max_to_f64. - baracuda_
kernels_ ⚠reduce_ max_ to_ f64_ run - reduce_max_to, f64. Identity is -DBL_MAX. - baracuda_
kernels_ ⚠reduce_ max_ u8_ can_ implement - Pre-launch implementability check for
reduce_max_u8. - baracuda_
kernels_ ⚠reduce_ max_ u8_ run - max(x, axis=k) with u8 input/output (init = 0). - baracuda_
kernels_ ⚠reduce_ max_ u32_ can_ implement - Pre-launch implementability check for
reduce_max_u32. - baracuda_
kernels_ ⚠reduce_ max_ u32_ run - max(x, axis=k) with u32 input/output (init = 0). - baracuda_
kernels_ ⚠reduce_ mean_ backward_ bf16_ can_ implement - Pre-launch implementability check for
reduce_mean_backward_bf16. - baracuda_
kernels_ ⚠reduce_ mean_ backward_ bf16_ run - Mean reduction backward, bf16.
- baracuda_
kernels_ ⚠reduce_ mean_ backward_ f16_ can_ implement - Pre-launch implementability check for
reduce_mean_backward_f16. - baracuda_
kernels_ ⚠reduce_ mean_ backward_ f16_ run - Mean reduction backward, f16.
- baracuda_
kernels_ ⚠reduce_ mean_ backward_ f32_ can_ implement - Pre-launch implementability check for
reduce_mean_backward_f32. - baracuda_
kernels_ ⚠reduce_ mean_ backward_ f32_ run - Mean reduction backward, f32. Same as Sum BW with extra
1/k scale (inv_extent is 1.0 / reduced_extent, computed in f64 on the host). - baracuda_
kernels_ ⚠reduce_ mean_ backward_ f64_ can_ implement - Pre-launch implementability check for
reduce_mean_backward_f64. - baracuda_
kernels_ ⚠reduce_ mean_ backward_ f64_ run - Mean reduction backward, f64.
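The mean-backward entries describe the gradient as the sum-backward broadcast with an extra 1/k scale. A minimal host-side sketch of that rule follows. The helper name `mean_backward_ref` and the contiguous `[outer, reduced, inner]` layout are assumptions made for illustration; they are not this crate's ABI.

```rust
/// Host-side reference for mean backward over the middle axis of a
/// contiguous `[outer, reduced, inner]` tensor.
/// Hypothetical helper, not part of this crate.
fn mean_backward_ref(dy: &[f32], outer: usize, reduced: usize, inner: usize) -> Vec<f32> {
    assert_eq!(dy.len(), outer * inner);
    // The scale is computed in f64 on the host and narrowed once.
    let inv_extent = (1.0_f64 / reduced as f64) as f32;
    let mut dx = vec![0.0_f32; outer * reduced * inner];
    for o in 0..outer {
        for r in 0..reduced {
            for i in 0..inner {
                // Every element along the reduced axis reads the same dy cell.
                dx[(o * reduced + r) * inner + i] = dy[o * inner + i] * inv_extent;
            }
        }
    }
    dx
}
```

A reference like this can serve as a CPU oracle when testing the device kernels on small shapes.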
- baracuda_
kernels_ ⚠reduce_ mean_ bf16_ can_ implement - Pre-launch implementability check for
reduce_mean_bf16. - baracuda_
kernels_ ⚠reduce_ mean_ bf16_ run - Mean reduction along one axis, bf16 (f32-detour for sum + divide).
- baracuda_
kernels_ ⚠reduce_ mean_ f16_ can_ implement - Pre-launch implementability check for
reduce_mean_f16. - baracuda_
kernels_ ⚠reduce_ mean_ f16_ run - Mean reduction along one axis, f16 (f32-detour for sum + divide).
- baracuda_
kernels_ ⚠reduce_ mean_ f32_ can_ implement - Pre-launch implementability check for
reduce_mean_f32. - baracuda_
kernels_ ⚠reduce_ mean_ f32_ run - Mean reduction along one axis, f32. Sum then divide by extent.
- baracuda_
kernels_ ⚠reduce_ mean_ f64_ can_ implement - Pre-launch implementability check for
reduce_mean_f64. - baracuda_
kernels_ ⚠reduce_ mean_ f64_ run - Mean reduction along one axis, f64.
- baracuda_
kernels_ ⚠reduce_ min_ bf16_ can_ implement - Pre-launch implementability check for
reduce_min_bf16. - baracuda_
kernels_ ⚠reduce_ min_ bf16_ run - Min reduction along one axis, bf16 (f32-detour fminf).
- baracuda_
kernels_ ⚠reduce_ min_ f16_ can_ implement - Pre-launch implementability check for
reduce_min_f16. - baracuda_
kernels_ ⚠reduce_ min_ f16_ run - Min reduction along one axis, f16 (f32-detour fminf).
- baracuda_
kernels_ ⚠reduce_ min_ f32_ can_ implement - Pre-launch implementability check for
reduce_min_f32. - baracuda_
kernels_ ⚠reduce_ min_ f32_ run - Min reduction along one axis, f32.
init = +INFINITY, op = fminf. - baracuda_
kernels_ ⚠reduce_ min_ f64_ can_ implement - Pre-launch implementability check for
reduce_min_f64. - baracuda_
kernels_ ⚠reduce_ min_ f64_ run - Min reduction along one axis, f64.
- baracuda_
kernels_ ⚠reduce_ min_ i8_ can_ implement - Pre-launch implementability check for
reduce_min_i8. - baracuda_
kernels_ ⚠reduce_ min_ i8_ run - min(x, axis=k) with i8 input/output (init = INT8_MAX). - baracuda_
kernels_ ⚠reduce_ min_ i16_ can_ implement - Pre-launch implementability check for
reduce_min_i16. - baracuda_
kernels_ ⚠reduce_ min_ i16_ run - min(x, axis=k) with i16 input/output (init = INT16_MAX). - baracuda_
kernels_ ⚠reduce_ min_ i32_ can_ implement - Pre-launch implementability check for
reduce_min_i32. - baracuda_
kernels_ ⚠reduce_ min_ i32_ run - min(x, axis=k) with i32 input/output (init = INT32_MAX). - baracuda_
kernels_ ⚠reduce_ min_ i64_ can_ implement - Pre-launch implementability check for
reduce_min_i64. - baracuda_
kernels_ ⚠reduce_ min_ i64_ run - min(x, axis=k) with i64 input/output (init = INT64_MAX). - baracuda_
kernels_ ⚠reduce_ min_ to_ bf16_ can_ implement - Pre-launch implementability check for reduce_min_to_bf16. - baracuda_
kernels_ ⚠reduce_ min_ to_ bf16_ run - reduce_min_to, bf16. - baracuda_
kernels_ ⚠reduce_ min_ to_ f16_ can_ implement - Pre-launch implementability check for reduce_min_to_f16. - baracuda_
kernels_ ⚠reduce_ min_ to_ f16_ run - reduce_min_to, f16. Accumulator widens to f32; identity is +FLT_MAX in f32 accumulator space, narrowing to +inf on store. - baracuda_
kernels_ ⚠reduce_ min_ to_ f32_ can_ implement - Pre-launch implementability check for reduce_min_to_f32. - baracuda_
kernels_ ⚠reduce_ min_ to_ f32_ run - reduce_min_to, f32. Identity is +FLT_MAX when the broadcast set is empty. - baracuda_
kernels_ ⚠reduce_ min_ to_ f64_ can_ implement - Pre-launch implementability check for reduce_min_to_f64. - baracuda_
kernels_ ⚠reduce_ min_ to_ f64_ run - reduce_min_to, f64. Identity is +DBL_MAX. - baracuda_
kernels_ ⚠reduce_ min_ u8_ can_ implement - Pre-launch implementability check for
reduce_min_u8. - baracuda_
kernels_ ⚠reduce_ min_ u8_ run - min(x, axis=k) with u8 input/output (same-dtype, init = UINT8_MAX). - baracuda_
kernels_ ⚠reduce_ min_ u32_ can_ implement - Pre-launch implementability check for
reduce_min_u32. - baracuda_
kernels_ ⚠reduce_ min_ u32_ run - min(x, axis=k) with u32 input/output (init = UINT32_MAX). - baracuda_
kernels_ ⚠reduce_ norm2_ backward_ bf16_ can_ implement - Pre-launch implementability check for
reduce_norm2_backward_bf16. - baracuda_
kernels_ ⚠reduce_ norm2_ backward_ bf16_ run - Norm2 reduction backward, bf16.
- baracuda_
kernels_ ⚠reduce_ norm2_ backward_ f16_ can_ implement - Pre-launch implementability check for
reduce_norm2_backward_f16. - baracuda_
kernels_ ⚠reduce_ norm2_ backward_ f16_ run - Norm2 reduction backward, f16.
- baracuda_
kernels_ ⚠reduce_ norm2_ backward_ f32_ can_ implement - Pre-launch implementability check for
reduce_norm2_backward_f32. - baracuda_
kernels_ ⚠reduce_ norm2_ backward_ f32_ run - Norm2 reduction backward, f32.
- baracuda_
kernels_ ⚠reduce_ norm2_ backward_ f64_ can_ implement - Pre-launch implementability check for
reduce_norm2_backward_f64. - baracuda_
kernels_ ⚠reduce_ norm2_ backward_ f64_ run - Norm2 reduction backward, f64.
- baracuda_
kernels_ ⚠reduce_ norm2_ bf16_ can_ implement - Pre-launch implementability check for
reduce_norm2_bf16. - baracuda_
kernels_ ⚠reduce_ norm2_ bf16_ run - Norm2 reduction along one axis, bf16 (f32-detour functor + sqrt).
- baracuda_
kernels_ ⚠reduce_ norm2_ f16_ can_ implement - Pre-launch implementability check for
reduce_norm2_f16. - baracuda_
kernels_ ⚠reduce_ norm2_ f16_ run - Norm2 reduction along one axis, f16 (f32-detour functor + sqrt).
- baracuda_
kernels_ ⚠reduce_ norm2_ f32_ can_ implement - Pre-launch implementability check for
reduce_norm2_f32. - baracuda_
kernels_ ⚠reduce_ norm2_ f32_ run - Norm2 reduction along one axis, f32.
y = sqrt(sum(x*x)); shares the simple-reduce parameter shape. - baracuda_
kernels_ ⚠reduce_ norm2_ f64_ can_ implement - Pre-launch implementability check for
reduce_norm2_f64. - baracuda_
kernels_ ⚠reduce_ norm2_ f64_ run - Norm2 reduction along one axis, f64.
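The Norm2 entries compute `y = sqrt(sum(x*x))` along one axis. The sketch below shows that computation on the host for the simplest case, reducing the last axis of a contiguous `[rows, cols]` matrix. The helper name `norm2_rows_ref` and the layout are assumptions for illustration only.

```rust
/// Host-side reference: L2 norm of each row of a contiguous
/// `[rows, cols]` matrix, reducing over the last axis.
/// Hypothetical helper, not part of this crate.
fn norm2_rows_ref(x: &[f64], rows: usize, cols: usize) -> Vec<f64> {
    assert_eq!(x.len(), rows * cols);
    (0..rows)
        .map(|r| {
            let row = &x[r * cols..(r + 1) * cols];
            // Sum of squares first, one sqrt at the end.
            row.iter().map(|v| v * v).sum::<f64>().sqrt()
        })
        .collect()
}
```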
- baracuda_
kernels_ ⚠reduce_ prod_ backward_ bf16_ can_ implement - Pre-launch implementability check for
reduce_prod_backward_bf16. - baracuda_
kernels_ ⚠reduce_ prod_ backward_ bf16_ run - Prod reduction backward, bf16.
- baracuda_
kernels_ ⚠reduce_ prod_ backward_ f16_ can_ implement - Pre-launch implementability check for
reduce_prod_backward_f16. - baracuda_
kernels_ ⚠reduce_ prod_ backward_ f16_ run - Prod reduction backward, f16.
- baracuda_
kernels_ ⚠reduce_ prod_ backward_ f32_ can_ implement - Pre-launch implementability check for
reduce_prod_backward_f32. - baracuda_
kernels_ ⚠reduce_ prod_ backward_ f32_ run - Prod reduction backward, f32.
- baracuda_
kernels_ ⚠reduce_ prod_ backward_ f64_ can_ implement - Pre-launch implementability check for
reduce_prod_backward_f64. - baracuda_
kernels_ ⚠reduce_ prod_ backward_ f64_ run - Prod reduction backward, f64.
- baracuda_
kernels_ ⚠reduce_ prod_ bf16_ can_ implement - Pre-launch implementability check for
reduce_prod_bf16. - baracuda_
kernels_ ⚠reduce_ prod_ bf16_ run - Product reduction along one axis, bf16 (f32-detour multiply).
- baracuda_
kernels_ ⚠reduce_ prod_ f16_ can_ implement - Pre-launch implementability check for
reduce_prod_f16. - baracuda_
kernels_ ⚠reduce_ prod_ f16_ run - Product reduction along one axis, f16 (f32-detour multiply).
- baracuda_
kernels_ ⚠reduce_ prod_ f32_ can_ implement - Pre-launch implementability check for
reduce_prod_f32. - baracuda_
kernels_ ⚠reduce_ prod_ f32_ run - Product reduction along one axis, f32.
init = 1, op = *. - baracuda_
kernels_ ⚠reduce_ prod_ f64_ can_ implement - Pre-launch implementability check for
reduce_prod_f64. - baracuda_
kernels_ ⚠reduce_ prod_ f64_ run - Product reduction along one axis, f64.
- baracuda_
kernels_ ⚠reduce_ prod_ i8_ can_ implement - Pre-launch implementability check for
reduce_prod_i8. - baracuda_
kernels_ ⚠reduce_ prod_ i8_ run - prod(x, axis=k) with i8 input/output (wider i64 accumulator). - baracuda_
kernels_ ⚠reduce_ prod_ i16_ can_ implement - Pre-launch implementability check for
reduce_prod_i16. - baracuda_
kernels_ ⚠reduce_ prod_ i16_ run - prod(x, axis=k) with i16 input/output (wider i64 accumulator). - baracuda_
kernels_ ⚠reduce_ prod_ i32_ can_ implement - Pre-launch implementability check for
reduce_prod_i32. - baracuda_
kernels_ ⚠reduce_ prod_ i32_ run - prod(x, axis=k) with i32 input/output (wider i64 accumulator). - baracuda_
kernels_ ⚠reduce_ prod_ i64_ can_ implement - Pre-launch implementability check for
reduce_prod_i64. - baracuda_
kernels_ ⚠reduce_ prod_ i64_ run - prod(x, axis=k) with i64 input/output. Modulo-2^64 wrap. - baracuda_
kernels_ ⚠reduce_ prod_ to_ bf16_ can_ implement - Pre-launch implementability check for reduce_prod_to_bf16. - baracuda_
kernels_ ⚠reduce_ prod_ to_ bf16_ run - reduce_prod_to, bf16. - baracuda_
kernels_ ⚠reduce_ prod_ to_ f16_ can_ implement - Pre-launch implementability check for reduce_prod_to_f16. - baracuda_
kernels_ ⚠reduce_ prod_ to_ f16_ run - reduce_prod_to, f16. Cumulative product overflows fast in half-precision; callers should keep values close to 1. - baracuda_
kernels_ ⚠reduce_ prod_ to_ f32_ can_ implement - Pre-launch implementability check for reduce_prod_to_f32. - baracuda_
kernels_ ⚠reduce_ prod_ to_ f32_ run - reduce_prod_to, f32. Identity is 1 (multiplicative). Half dtypes accumulate in f32 then narrow on store. - baracuda_
kernels_ ⚠reduce_ prod_ to_ f64_ can_ implement - Pre-launch implementability check for reduce_prod_to_f64. - baracuda_
kernels_ ⚠reduce_ prod_ to_ f64_ run - reduce_prod_to, f64. - baracuda_
kernels_ ⚠reduce_ prod_ u8_ can_ implement - Pre-launch implementability check for
reduce_prod_u8. - baracuda_
kernels_ ⚠reduce_ prod_ u8_ run - prod(x, axis=k) with u8 input/output (wider u64 accumulator, wrap-on-overflow narrow on store). - baracuda_
kernels_ ⚠reduce_ prod_ u32_ can_ implement - Pre-launch implementability check for
reduce_prod_u32. - baracuda_
kernels_ ⚠reduce_ prod_ u32_ run - prod(x, axis=k) with u32 input/output (wider u64 accumulator). - baracuda_
kernels_ ⚠reduce_ std_ backward_ bf16_ can_ implement - Pre-launch implementability check for
reduce_std_backward_bf16. - baracuda_
kernels_ ⚠reduce_ std_ backward_ bf16_ run - Std-dev reduction backward, bf16.
- baracuda_
kernels_ ⚠reduce_ std_ backward_ f16_ can_ implement - Pre-launch implementability check for
reduce_std_backward_f16. - baracuda_
kernels_ ⚠reduce_ std_ backward_ f16_ run - Std-dev reduction backward, f16.
- baracuda_
kernels_ ⚠reduce_ std_ backward_ f32_ can_ implement - Pre-launch implementability check for
reduce_std_backward_f32. - baracuda_
kernels_ ⚠reduce_ std_ backward_ f32_ run - Std-dev reduction backward, f32 (Welford BW + sqrt term).
- baracuda_
kernels_ ⚠reduce_ std_ backward_ f64_ can_ implement - Pre-launch implementability check for
reduce_std_backward_f64. - baracuda_
kernels_ ⚠reduce_ std_ backward_ f64_ run - Std-dev reduction backward, f64 (Welford BW in f64 + sqrt term).
- baracuda_
kernels_ ⚠reduce_ std_ bf16_ can_ implement - Pre-launch implementability check for
reduce_std_bf16. - baracuda_
kernels_ ⚠reduce_ std_ bf16_ run - Std-dev along one axis, bf16.
- baracuda_
kernels_ ⚠reduce_ std_ f16_ can_ implement - Pre-launch implementability check for
reduce_std_f16. - baracuda_
kernels_ ⚠reduce_ std_ f16_ run - Std-dev along one axis, f16.
- baracuda_
kernels_ ⚠reduce_ std_ f32_ can_ implement - Pre-launch implementability check for
reduce_std_f32. - baracuda_
kernels_ ⚠reduce_ std_ f32_ run - Std-dev along one axis, f32, Welford + sqrt.
- baracuda_
kernels_ ⚠reduce_ std_ f64_ can_ implement - Pre-launch implementability check for
reduce_std_f64. - baracuda_
kernels_ ⚠reduce_ std_ f64_ run - Std-dev along one axis, f64 (Welford in f64 + sqrt).
- baracuda_
kernels_ ⚠reduce_ sum_ backward_ bf16_ can_ implement - Pre-launch implementability check for
reduce_sum_backward_bf16. - baracuda_
kernels_ ⚠reduce_ sum_ backward_ bf16_ run - Sum reduction backward, bf16.
- baracuda_
kernels_ ⚠reduce_ sum_ backward_ f16_ can_ implement - Pre-launch implementability check for
reduce_sum_backward_f16. - baracuda_
kernels_ ⚠reduce_ sum_ backward_ f16_ run - Sum reduction backward, f16.
- baracuda_
kernels_ ⚠reduce_ sum_ backward_ f32_ can_ implement - Pre-launch implementability check for
reduce_sum_backward_f32. - baracuda_
kernels_ ⚠reduce_ sum_ backward_ f32_ run - Sum reduction backward, f32.
dx[c] = dy[c_with_reduce_axis_0], realized via stride-0 broadcast on the reduce axis. - baracuda_
kernels_ ⚠reduce_ sum_ backward_ f64_ can_ implement - Pre-launch implementability check for
reduce_sum_backward_f64. - baracuda_
kernels_ ⚠reduce_ sum_ backward_ f64_ run - Sum reduction backward, f64.
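The sum-backward entries describe the gradient as a stride-0 broadcast of `dy` along the reduce axis. The sketch below illustrates the idea for a 2-D tensor on the host. The helper name `sum_backward_ref` and the fixed rank of 2 are assumptions for illustration, not the kernel's actual parameter shape.

```rust
/// Hypothetical host-side sketch: read `dy` through strides where the
/// reduced axis has stride 0, so every `dx` cell along that axis sees
/// the same `dy` value.
fn sum_backward_ref(dy: &[f64], shape: [usize; 2], reduce_axis: usize) -> Vec<f64> {
    assert!(reduce_axis < 2);
    let kept = 1 - reduce_axis;
    assert_eq!(dy.len(), shape[kept]);
    // The kept axis walks dy contiguously; the reduced axis stays at stride 0.
    let mut dy_strides = [0usize; 2];
    dy_strides[kept] = 1;
    let mut dx = Vec::with_capacity(shape[0] * shape[1]);
    for i in 0..shape[0] {
        for j in 0..shape[1] {
            dx.push(dy[i * dy_strides[0] + j * dy_strides[1]]);
        }
    }
    dx
}
```

No data is copied to build the broadcast view; only the index arithmetic changes.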
- baracuda_
kernels_ ⚠reduce_ sum_ bf16_ can_ implement - Pre-launch implementability check for
reduce_sum_bf16. - baracuda_
kernels_ ⚠reduce_ sum_ bf16_ run - Sum reduction along one axis, bf16 (f32-detour functor).
- baracuda_
kernels_ ⚠reduce_ sum_ f16_ can_ implement - Pre-launch implementability check for
reduce_sum_f16. - baracuda_
kernels_ ⚠reduce_ sum_ f16_ run - Sum reduction along one axis, f16.
- baracuda_
kernels_ ⚠reduce_ sum_ f32_ can_ implement - Pre-launch implementability check for
reduce_sum_f32. - baracuda_
kernels_ ⚠reduce_ sum_ f32_ run - Sum reduction along one axis, f32, naive thread-per-output-cell.
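To make "thread-per-output-cell" concrete, here is a host-side sketch in which each loop iteration plays the role of one device thread and walks the reduced axis serially. The helper name `sum_axis_ref` and the contiguous `[outer, reduced, inner]` layout are assumptions for illustration.

```rust
/// Hypothetical host reference mirroring a thread-per-output-cell layout:
/// each output cell (o, i) walks the reduced axis serially.
fn sum_axis_ref(x: &[f32], outer: usize, reduced: usize, inner: usize) -> Vec<f32> {
    assert_eq!(x.len(), outer * reduced * inner);
    let mut y = vec![0.0_f32; outer * inner];
    for cell in 0..outer * inner {
        // On the device, `cell` would be the global thread index.
        let (o, i) = (cell / inner, cell % inner);
        let mut acc = 0.0_f32;
        for r in 0..reduced {
            acc += x[(o * reduced + r) * inner + i];
        }
        y[cell] = acc;
    }
    y
}
```

This layout is simple but leaves parallelism unused when the reduced extent is large and the output is small.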
- baracuda_
kernels_ ⚠reduce_ sum_ f64_ can_ implement - Pre-launch implementability check for
reduce_sum_f64. - baracuda_
kernels_ ⚠reduce_ sum_ f64_ run - Sum reduction along one axis, f64.
- baracuda_
kernels_ ⚠reduce_ sum_ i8_ can_ implement - Pre-launch implementability check for
reduce_sum_i8. - baracuda_
kernels_ ⚠reduce_ sum_ i8_ run - sum(x, axis=k) with i8 input/output (wider i64 accumulator). - baracuda_
kernels_ ⚠reduce_ sum_ i16_ can_ implement - Pre-launch implementability check for
reduce_sum_i16. - baracuda_
kernels_ ⚠reduce_ sum_ i16_ run - sum(x, axis=k) with i16 input/output (wider i64 accumulator). - baracuda_
kernels_ ⚠reduce_ sum_ i32_ can_ implement - Pre-launch implementability check for
reduce_sum_i32. - baracuda_
kernels_ ⚠reduce_ sum_ i32_ run - sum(x, axis=k) with i32 input/output (wider i64 accumulator). - baracuda_
kernels_ ⚠reduce_ sum_ i64_ can_ implement - Pre-launch implementability check for
reduce_sum_i64. - baracuda_
kernels_ ⚠reduce_ sum_ i64_ run - sum(x, axis=k) with i64 input/output. Accumulator and output share dtype; modulo-2^64 wrap is the natural device behaviour. - baracuda_
kernels_ ⚠reduce_ sum_ to_ bf16_ can_ implement - Pre-launch implementability check for reduce_sum_to_bf16. - baracuda_
kernels_ ⚠reduce_ sum_ to_ bf16_ run - reduce_sum_to, bf16. Accumulator widens to f32. - baracuda_
kernels_ ⚠reduce_ sum_ to_ f16_ can_ implement - Pre-launch implementability check for reduce_sum_to_f16. - baracuda_
kernels_ ⚠reduce_ sum_ to_ f16_ run - reduce_sum_to, f16. Accumulator widens to f32 per the rest of the family’s convention. - baracuda_
kernels_ ⚠reduce_ sum_ to_ f32_ can_ implement - Pre-launch implementability check for reduce_sum_to_f32. - baracuda_
kernels_ ⚠reduce_ sum_ to_ f32_ run - reduce_sum_to, f32. Broadcast-reverse Σ. Phase 31. - baracuda_
kernels_ ⚠reduce_ sum_ to_ f64_ can_ implement - Pre-launch implementability check for reduce_sum_to_f64. - baracuda_
kernels_ ⚠reduce_ sum_ to_ f64_ run - reduce_sum_to, f64. - baracuda_
kernels_ ⚠reduce_ sum_ u8_ can_ implement - Pre-launch implementability check for
reduce_sum_u8. - baracuda_
kernels_ ⚠reduce_ sum_ u8_ run - sum(x, axis=k) with u8 input/output (wider u64 accumulator, wrap-on-overflow narrow on store). - baracuda_
kernels_ ⚠reduce_ sum_ u32_ can_ implement - Pre-launch implementability check for
reduce_sum_u32. - baracuda_
kernels_ ⚠reduce_ sum_ u32_ run - sum(x, axis=k) with u32 input/output (wider u64 accumulator). - baracuda_
kernels_ ⚠reduce_ var_ backward_ bf16_ can_ implement - Pre-launch implementability check for
reduce_var_backward_bf16. - baracuda_
kernels_ ⚠reduce_ var_ backward_ bf16_ run - Variance reduction backward, bf16.
- baracuda_
kernels_ ⚠reduce_ var_ backward_ f16_ can_ implement - Pre-launch implementability check for
reduce_var_backward_f16. - baracuda_
kernels_ ⚠reduce_ var_ backward_ f16_ run - Variance reduction backward, f16.
- baracuda_
kernels_ ⚠reduce_ var_ backward_ f32_ can_ implement - Pre-launch implementability check for
reduce_var_backward_f32. - baracuda_
kernels_ ⚠reduce_ var_ backward_ f32_ run - Variance reduction backward, f32 (Welford BW).
- baracuda_
kernels_ ⚠reduce_ var_ backward_ f64_ can_ implement - Pre-launch implementability check for
reduce_var_backward_f64. - baracuda_
kernels_ ⚠reduce_ var_ backward_ f64_ run - Variance reduction backward, f64 (Welford BW in f64).
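For orientation, the textbook gradient of a variance reduction over one row is `dx[i] = dy * 2 * (x[i] - mean) / (n - correction)`. The sketch below evaluates that formula on the host. It is a mathematical reference only: the helper name `var_backward_ref` is hypothetical, and the listing does not show how the kernel itself organizes the computation.

```rust
/// Hypothetical host reference for the variance gradient along one row:
/// dx[i] = dy * 2 * (x[i] - mean) / (n - correction).
fn var_backward_ref(x: &[f64], dy: f64, correction: usize) -> Vec<f64> {
    let n = x.len() as f64;
    let mean = x.iter().sum::<f64>() / n;
    let scale = 2.0 * dy / (n - correction as f64);
    x.iter().map(|&v| scale * (v - mean)).collect()
}
```

The gradients along a row sum to zero, which makes a cheap sanity check for any implementation.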
- baracuda_
kernels_ ⚠reduce_ var_ bf16_ can_ implement - Pre-launch implementability check for
reduce_var_bf16. - baracuda_
kernels_ ⚠reduce_ var_ bf16_ run - Variance reduction along one axis, bf16.
- baracuda_
kernels_ ⚠reduce_ var_ f16_ can_ implement - Pre-launch implementability check for
reduce_var_f16. - baracuda_
kernels_ ⚠reduce_ var_ f16_ run - Variance reduction along one axis, f16.
- baracuda_
kernels_ ⚠reduce_ var_ f32_ can_ implement - Pre-launch implementability check for
reduce_var_f32. - baracuda_
kernels_ ⚠reduce_ var_ f32_ run - Variance reduction along one axis, f32, Welford one-pass.
correction = 1 for Bessel-corrected sample variance, 0 for population variance. - baracuda_
kernels_ ⚠reduce_ var_ f64_ can_ implement - Pre-launch implementability check for
reduce_var_f64. - baracuda_
kernels_ ⚠reduce_ var_ f64_ run - Variance reduction along one axis, f64 (Welford in f64).
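The variance entries name Welford's one-pass algorithm and a `correction` parameter. A minimal host-side sketch of Welford's update for a single row is shown here. The helper name `welford_var` is hypothetical, and a device kernel may organize the same update differently.

```rust
/// One-pass Welford variance (hypothetical host reference).
/// `correction = 1` gives the Bessel-corrected sample variance,
/// `correction = 0` the population variance.
fn welford_var(xs: &[f64], correction: usize) -> f64 {
    let mut count = 0.0_f64;
    let mut mean = 0.0_f64;
    let mut m2 = 0.0_f64; // running sum of squared deviations from the mean
    for &x in xs {
        count += 1.0;
        let delta = x - mean;
        mean += delta / count;
        // Uses the updated mean, which keeps the update numerically stable.
        m2 += delta * (x - mean);
    }
    // Not finite when the row has `correction` or fewer elements.
    m2 / (count - correction as f64)
}
```

Taking the square root of the result gives the standard deviation with the same correction.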
- baracuda_
kernels_ ⚠repeat_ backward_ bf16_ can_ implement - Pre-launch implementability check for repeat_backward_bf16. - baracuda_
kernels_ ⚠repeat_ backward_ bf16_ run - Repeat backward (gather-adjoint sum), bf16. Accumulates in float.
- baracuda_
kernels_ ⚠repeat_ backward_ f16_ can_ implement - Pre-launch implementability check for repeat_backward_f16. - baracuda_
kernels_ ⚠repeat_ backward_ f16_ run - Repeat backward (gather-adjoint sum), f16. Accumulates in float.
- baracuda_
kernels_ ⚠repeat_ backward_ f32_ can_ implement - Pre-launch implementability check for repeat_backward_f32. - baracuda_
kernels_ ⚠repeat_ backward_ f32_ run - Repeat backward (gather-adjoint sum), f32.
- baracuda_
kernels_ ⚠repeat_ backward_ f64_ can_ implement - Pre-launch implementability check for repeat_backward_f64. - baracuda_
kernels_ ⚠repeat_ backward_ f64_ run - Repeat backward (gather-adjoint sum), f64.
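"Gather-adjoint sum" means each input cell accumulates the gradients of every output cell that was copied from it. The sketch below shows this for a 2-D tensor on the host. The helper name `repeat_backward_2d_ref` and the fixed rank are assumptions for illustration.

```rust
/// Hypothetical 2-D host reference for repeat backward: every output
/// gradient is added into the input cell it was tiled from.
fn repeat_backward_2d_ref(dy: &[f64], shape: [usize; 2], repeats: [usize; 2]) -> Vec<f64> {
    let out_shape = [shape[0] * repeats[0], shape[1] * repeats[1]];
    assert_eq!(dy.len(), out_shape[0] * out_shape[1]);
    let mut dx = vec![0.0_f64; shape[0] * shape[1]];
    for r in 0..out_shape[0] {
        for c in 0..out_shape[1] {
            // Same coordinate map as the forward pass, used as a scatter-add.
            dx[(r % shape[0]) * shape[1] + (c % shape[1])] += dy[r * out_shape[1] + c];
        }
    }
    dx
}
```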
- baracuda_
kernels_ ⚠repeat_ bf16_ can_ implement - Pre-launch implementability check for
repeat_bf16. - baracuda_
kernels_ ⚠repeat_ bf16_ run - Repeat (per-axis tile), bf16.
- baracuda_
kernels_ ⚠repeat_ f16_ can_ implement - Pre-launch implementability check for
repeat_f16. - baracuda_
kernels_ ⚠repeat_ f16_ run - Repeat (per-axis tile), f16. Same parameter shape as the f32 variant — pure copy, no arithmetic.
- baracuda_
kernels_ ⚠repeat_ f32_ can_ implement - Pre-launch implementability check for
repeat_f32. - baracuda_
kernels_ ⚠repeat_ f32_ run - Repeat (per-axis tile), f32.
output.shape[d] = input.shape[d] * repeats[d]. Kernel computes input_coord[d] = output_coord[d] % input.shape[d]. - baracuda_
kernels_ ⚠repeat_ f64_ can_ implement - Pre-launch implementability check for
repeat_f64. - baracuda_
kernels_ ⚠repeat_ f64_ run - Repeat (per-axis tile), f64.
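The repeat entries map each output coordinate back to the input with a per-axis modulo. A small host-side sketch of that coordinate map for the 2-D case follows. The helper name `repeat_2d_ref` and the fixed rank are assumptions for illustration.

```rust
/// Hypothetical 2-D host reference for per-axis tiling.
fn repeat_2d_ref(input: &[f64], shape: [usize; 2], repeats: [usize; 2]) -> Vec<f64> {
    assert_eq!(input.len(), shape[0] * shape[1]);
    // output.shape[d] = input.shape[d] * repeats[d]
    let out_shape = [shape[0] * repeats[0], shape[1] * repeats[1]];
    let mut out = Vec::with_capacity(out_shape[0] * out_shape[1]);
    for r in 0..out_shape[0] {
        for c in 0..out_shape[1] {
            // input_coord[d] = output_coord[d] % input.shape[d]
            out.push(input[(r % shape[0]) * shape[1] + (c % shape[1])]);
        }
    }
    out
}
```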
- baracuda_
kernels_ ⚠rfft_ 1d_ f32_ run - 1-D R2C FFT (real → Hermitian-half complex). Unnormalized
(matches PyTorch’s
norm="backward"). - baracuda_
kernels_ ⚠rfft_ 1d_ f32_ workspace_ size - 1-D R2C FFT workspace size in bytes — always
0. - baracuda_
kernels_ ⚠rfft_ 1d_ f64_ run - 1-D R2C FFT (real → Hermitian-half complex). Unnormalized
(matches PyTorch’s
norm="backward"). - baracuda_
kernels_ ⚠rfft_ 1d_ f64_ workspace_ size - 1-D R2C FFT workspace size in bytes — always
0. - baracuda_
kernels_ ⚠rfft_ nd_ f32_ run - ND R2C FFT (real → Hermitian-half complex). Unnormalized.
dims[..rank] are real-side extents; complex output has dims[rank-1] / 2 + 1 on the last transformed axis. - baracuda_
kernels_ ⚠rfft_ nd_ f32_ workspace_ size - ND R2C FFT workspace size in bytes — always
0. - baracuda_
kernels_ ⚠rfft_ nd_ f64_ run - ND R2C FFT (real → Hermitian-half complex). Unnormalized.
dims[..rank] are real-side extents; complex output has dims[rank-1] / 2 + 1 on the last transformed axis. - baracuda_
kernels_ ⚠rfft_ nd_ f64_ workspace_ size - ND R2C FFT workspace size in bytes — always
0. - baracuda_
kernels_ ⚠rms_ norm_ backward_ bf16_ can_ implement - Pre-launch implementability check for rms_norm_backward_bf16. - baracuda_
kernels_ ⚠rms_ norm_ backward_ bf16_ run - RMSNorm BW, bf16.
- baracuda_
kernels_ ⚠rms_ norm_ backward_ bf16_ strided_ can_ implement - Pre-launch implementability check for rms_norm_backward_bf16_strided. - baracuda_
kernels_ ⚠rms_ norm_ backward_ bf16_ strided_ run - RMSNorm BW strided sibling, bf16.
- baracuda_
kernels_ ⚠rms_ norm_ backward_ f16_ can_ implement - Pre-launch implementability check for rms_norm_backward_f16. - baracuda_
kernels_ ⚠rms_ norm_ backward_ f16_ run - RMSNorm BW, f16.
- baracuda_
kernels_ ⚠rms_ norm_ backward_ f16_ strided_ can_ implement - Pre-launch implementability check for rms_norm_backward_f16_strided. - baracuda_
kernels_ ⚠rms_ norm_ backward_ f16_ strided_ run - RMSNorm BW strided sibling, f16.
- baracuda_
kernels_ ⚠rms_ norm_ backward_ f32_ can_ implement - Pre-launch implementability check for rms_norm_backward_f32. - baracuda_
kernels_ ⚠rms_ norm_ backward_ f32_ run - RMSNorm BW, f32. Computes
dx and (when dgamma != null) dgamma[i] = Σ over outer cells dy[..., i] · (x[..., i] / rms[..., 0]), where i ranges over the joint normalized region of length norm_total_extent. - baracuda_
kernels_ ⚠rms_ norm_ backward_ f32_ strided_ can_ implement - Pre-launch implementability check for rms_norm_backward_f32_strided. - baracuda_
kernels_ ⚠rms_ norm_ backward_ f32_ strided_ run - RMSNorm BW strided sibling, f32. Same contract as
baracuda_kernels_rms_norm_backward_f32_run; identical underlying launcher. - baracuda_
kernels_ ⚠rms_ norm_ backward_ f64_ can_ implement - Pre-launch implementability check for rms_norm_backward_f64. - baracuda_
kernels_ ⚠rms_ norm_ backward_ f64_ run - RMSNorm BW, f64.
- baracuda_
kernels_ ⚠rms_ norm_ backward_ f64_ strided_ can_ implement - Pre-launch implementability check for rms_norm_backward_f64_strided. - baracuda_
kernels_ ⚠rms_ norm_ backward_ f64_ strided_ run - RMSNorm BW strided sibling, f64.
- baracuda_
kernels_ ⚠rms_ norm_ bf16_ can_ implement - Pre-launch implementability check for rms_norm_bf16. - baracuda_
kernels_ ⚠rms_ norm_ bf16_ run - RMSNorm FW, bf16. f32 accumulator inside the kernel.
- baracuda_
kernels_ ⚠rms_ norm_ bf16_ strided_ can_ implement - Pre-launch implementability check for rms_norm_bf16_strided. - baracuda_
kernels_ ⚠rms_ norm_ bf16_ strided_ run - RMSNorm FW strided sibling, bf16. See
rms_norm_f32_strided_run. - baracuda_
kernels_ ⚠rms_ norm_ f16_ can_ implement - Pre-launch implementability check for rms_norm_f16. - baracuda_
kernels_ ⚠rms_ norm_ f16_ run - RMSNorm FW, f16. f32 accumulator inside the kernel.
- baracuda_
kernels_ ⚠rms_ norm_ f16_ strided_ can_ implement - Pre-launch implementability check for rms_norm_f16_strided. - baracuda_
kernels_ ⚠rms_ norm_ f16_ strided_ run - RMSNorm FW strided sibling, f16. See
rms_norm_f32_strided_run. - baracuda_
kernels_ ⚠rms_ norm_ f32_ can_ implement - Pre-launch implementability check for rms_norm_f32. - baracuda_
kernels_ ⚠rms_ norm_ f32_ run - RMSNorm FW, f32.
y = x / sqrt(mean(x², over norm_axes) + eps) * gamma. norm_axes_mask is a bitmask over input axes (suffix of [0, rank)); norm_total_extent is the product of those axes’ extents. gamma may be null (treated as 1). rms_out shape equals input shape with norm axes collapsed to 1; only the slot at inner_lin == 0 within each row is written. - baracuda_
kernels_ ⚠rms_ norm_ f32_ strided_ can_ implement - Pre-launch implementability check for rms_norm_f32_strided. - baracuda_
kernels_ ⚠rms_ norm_ f32_ strided_ run - RMSNorm FW strided sibling, f32. Same contract as
baracuda_kernels_rms_norm_f32_run; identical underlying launcher. - baracuda_
kernels_ ⚠rms_ norm_ f64_ can_ implement - Pre-launch implementability check for rms_norm_f64. - baracuda_
kernels_ ⚠rms_ norm_ f64_ run - RMSNorm FW, f64.
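The RMSNorm forward entries give the formula `y = x / sqrt(mean(x²) + eps) * gamma`, with a null `gamma` treated as 1. The sketch below applies it on the host, normalizing over the last axis of a contiguous `[rows, cols]` input. The helper name `rms_norm_ref`, the 2-D layout, and the choice to return `sqrt(mean(x²) + eps)` as the per-row rms value are assumptions for illustration.

```rust
/// Hypothetical host reference for RMSNorm over the last axis of a
/// contiguous `[rows, cols]` input. Returns `(y, rms)`, one rms per row.
fn rms_norm_ref(
    x: &[f64],
    gamma: Option<&[f64]>,
    rows: usize,
    cols: usize,
    eps: f64,
) -> (Vec<f64>, Vec<f64>) {
    assert_eq!(x.len(), rows * cols);
    let mut y = vec![0.0; rows * cols];
    let mut rms = vec![0.0; rows];
    for r in 0..rows {
        let row = &x[r * cols..(r + 1) * cols];
        let mean_sq = row.iter().map(|v| v * v).sum::<f64>() / cols as f64;
        rms[r] = (mean_sq + eps).sqrt();
        for (c, &v) in row.iter().enumerate() {
            // A missing gamma is treated as 1.
            let g = gamma.map_or(1.0, |g| g[c]);
            y[r * cols + c] = v / rms[r] * g;
        }
    }
    (y, rms)
}
```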
- baracuda_
kernels_ ⚠rms_ norm_ f64_ strided_ can_ implement - Pre-launch implementability check for rms_norm_f64_strided. - baracuda_
kernels_ ⚠rms_ norm_ f64_ strided_ run - RMSNorm FW strided sibling, f64. See
rms_norm_f32_strided_run. - baracuda_
kernels_ ⚠roi_ align_ backward_ f32_ can_ implement - Pre-launch implementability check for roi_align_backward_f32. - baracuda_
kernels_ ⚠roi_ align_ backward_ f32_ run - roi_align BW, f32. Caller pre-zeros dinput. Safety: as FW. - baracuda_
kernels_ ⚠roi_ align_ backward_ f64_ can_ implement - Pre-launch implementability check for roi_align_backward_f64. - baracuda_
kernels_ ⚠roi_ align_ backward_ f64_ run - roi_align BW, f64. Safety: as f32 BW. - baracuda_
kernels_ ⚠roi_ align_ f32_ can_ implement - Pre-launch implementability check for roi_align_f32. - baracuda_
kernels_ ⚠roi_ align_ f32_ run - roi_align, f32. rois: [num_rois, 5] (batch_idx, x1, y1, x2, y2) in INPUT-pixel coords (scaled by spatial_scale inside the kernel). sampling_ratio == 0 selects adaptive sampling. aligned == 0 is PyTorch’s pre-0.6 convention. - baracuda_
kernels_ ⚠roi_ align_ f64_ can_ implement - Pre-launch implementability check for roi_align_f64. - baracuda_
kernels_ ⚠roi_ align_ f64_ run - roi_align, f64. Safety: as f32. - baracuda_
kernels_ ⚠roi_ pool_ backward_ f32_ can_ implement - Pre-launch implementability check for roi_pool_backward_f32. - baracuda_
kernels_ ⚠roi_ pool_ backward_ f32_ run - roi_pool BW, f32. Caller pre-zeros dinput. Safety: as FW. - baracuda_
kernels_ ⚠roi_ pool_ backward_ f64_ can_ implement - Pre-launch implementability check for roi_pool_backward_f64. - baracuda_
kernels_ ⚠roi_ pool_ backward_ f64_ run - roi_pool BW, f64. Safety: as f32 BW. - baracuda_
kernels_ ⚠roi_ pool_ f32_ can_ implement - Pre-launch implementability check for roi_pool_f32. - baracuda_
kernels_ ⚠roi_ pool_ f32_ run - roi_pool, f32. Writes output AND argmax (i32 linear plane-relative index per output cell; -1 for empty bins). - baracuda_
kernels_ ⚠roi_ pool_ f64_ can_ implement - Pre-launch implementability check for roi_pool_f64. - baracuda_
kernels_ ⚠roi_ pool_ f64_ run - roi_pool, f64. Safety: as f32. - baracuda_
kernels_ ⚠roll_ bf16_ can_ implement - Pre-launch implementability check for
roll_bf16. - baracuda_
kernels_ ⚠roll_ bf16_ run - Roll, bf16. Pure element copy — no math.
- baracuda_
kernels_ ⚠roll_ bf16_ strided_ can_ implement - Pre-launch implementability check for roll_bf16_strided. - baracuda_
kernels_ ⚠roll_ bf16_ strided_ run - Roll strided sibling, bf16.
- baracuda_
kernels_ ⚠roll_ f16_ can_ implement - Pre-launch implementability check for
roll_f16. - baracuda_
kernels_ ⚠roll_ f16_ run - Roll, f16. Pure element copy — no math.
- baracuda_
kernels_ ⚠roll_ f16_ strided_ can_ implement - Pre-launch implementability check for roll_f16_strided. - baracuda_
kernels_ ⚠roll_ f16_ strided_ run - Roll strided sibling, f16.
- baracuda_
kernels_ ⚠roll_ f32_ can_ implement - Pre-launch implementability check for
roll_f32. - baracuda_
kernels_ ⚠roll_ f32_ run - Roll (cyclic shift along axes), f32.
shifts[d] is the shift amount on axis d (positive or negative, mod shape[d]). - baracuda_
kernels_ ⚠roll_ f32_ strided_ can_ implement - Pre-launch implementability check for roll_f32_strided. - baracuda_
kernels_ ⚠roll_ f32_ strided_ run - Roll strided sibling, f32.
- baracuda_
kernels_ ⚠roll_ f64_ can_ implement - Pre-launch implementability check for
roll_f64. - baracuda_
kernels_ ⚠roll_ f64_ run - Roll, f64. Pure element copy — no math.
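A cyclic shift with a signed amount reduces to a Euclidean modulo on the shift. The 1-D host sketch below shows one way to do it. The helper name `roll_1d_ref` is hypothetical, and the shift direction (elements move toward higher indices for positive shifts, as in `torch.roll`) is an assumption, since the listing does not state it.

```rust
/// Hypothetical 1-D host reference for a cyclic shift:
/// out[(i + shift) mod n] = input[i], for positive or negative shift.
fn roll_1d_ref(input: &[f64], shift: i64) -> Vec<f64> {
    let n = input.len();
    if n == 0 {
        return Vec::new();
    }
    // rem_euclid keeps the result in [0, n) even for negative shifts.
    let s = shift.rem_euclid(n as i64) as usize;
    let mut out = vec![0.0; n];
    for (i, &v) in input.iter().enumerate() {
        out[(i + s) % n] = v;
    }
    out
}
```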
- baracuda_
kernels_ ⚠roll_ f64_ strided_ can_ implement - Pre-launch implementability check for roll_f64_strided. - baracuda_
kernels_ ⚠roll_ f64_ strided_ run - Roll strided sibling, f64.
- baracuda_
kernels_ ⚠rope_ apply_ backward_ bf16_ can_ implement - Pre-launch implementability check for rope_apply_backward_bf16. - baracuda_
kernels_ ⚠rope_ apply_ backward_ bf16_ run - RoPE apply BW, bf16.
- baracuda_
kernels_ ⚠rope_ apply_ backward_ f16_ can_ implement - Pre-launch implementability check for rope_apply_backward_f16. - baracuda_
kernels_ ⚠rope_ apply_ backward_ f16_ run - RoPE apply BW, f16.
- baracuda_
kernels_ ⚠rope_ apply_ backward_ f32_ can_ implement - Pre-launch implementability check for rope_apply_backward_f32. - baracuda_
kernels_ ⚠rope_ apply_ backward_ f32_ run - RoPE apply BW, f32. Same cos/sin tables as FW; orthogonal-rotation reverse.
- baracuda_
kernels_ ⚠rope_ apply_ backward_ f64_ can_ implement - Pre-launch implementability check for rope_apply_backward_f64. - baracuda_
kernels_ ⚠rope_ apply_ backward_ f64_ run - RoPE apply BW, f64.
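The RoPE backward entries describe the gradient as the reverse of an orthogonal rotation. For a single rotated pair this means applying the transpose of the forward 2×2 rotation to the incoming gradient, using the same cos/sin values. The sketch below shows that on the host. The helper names `rope_pair_fw` and `rope_pair_bw` are hypothetical, and how pairs are laid out in memory is not covered.

```rust
/// Forward: rotate one (a, b) pair by the angle with the given cos/sin.
/// Hypothetical helper, not part of this crate.
fn rope_pair_fw(a: f64, b: f64, cos: f64, sin: f64) -> (f64, f64) {
    (a * cos - b * sin, a * sin + b * cos)
}

/// Backward: apply the transpose of the rotation (its inverse, since the
/// rotation is orthogonal) to the incoming gradient pair.
fn rope_pair_bw(da: f64, db: f64, cos: f64, sin: f64) -> (f64, f64) {
    (da * cos + db * sin, -da * sin + db * cos)
}
```

Because the rotation is orthogonal, running the backward map on a forward result recovers the original pair, which is a convenient test.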
- baracuda_
kernels_ ⚠rope_ apply_ bf16_ can_ implement - Pre-launch implementability check for rope_apply_bf16. - baracuda_
kernels_ ⚠rope_ apply_ bf16_ run - RoPE apply FW, bf16 (f32 trig table, f32 multiply detour).
- baracuda_
kernels_ ⚠rope_ apply_ f16_ can_ implement - Pre-launch implementability check for rope_apply_f16. - baracuda_
kernels_ ⚠rope_ apply_ f16_ run - RoPE apply FW, f16 (f32 trig table, f32 multiply detour).
- baracuda_
kernels_ ⚠rope_ apply_ f32_ can_ implement - Implementability check for
rope_apply_f32. Host-side only. - baracuda_
kernels_ ⚠rope_ apply_ f32_ run - RoPE apply FW, f32. Cos/sin tables provided by caller.
- baracuda_
kernels_ ⚠rope_ apply_ f64_ can_ implement - Pre-launch implementability check for rope_apply_f64. - baracuda_
kernels_ ⚠rope_ apply_ f64_ run - RoPE apply FW, f64 (f32 trig table promoted to double at load).
- baracuda_
kernels_ ⚠rope_ apply_ interleaved_ backward_ bf16_ can_ implement - Pre-launch implementability check for rope_apply_interleaved_backward_bf16. - baracuda_
kernels_ ⚠rope_ apply_ interleaved_ backward_ bf16_ run - RoPE apply interleaved BW, bf16.
- baracuda_
kernels_ ⚠rope_ apply_ interleaved_ backward_ f16_ can_ implement - Pre-launch implementability check for rope_apply_interleaved_backward_f16. - baracuda_
kernels_ ⚠rope_ apply_ interleaved_ backward_ f16_ run - RoPE apply interleaved BW, f16.
- baracuda_
kernels_ ⚠rope_ apply_ interleaved_ backward_ f32_ can_ implement - Pre-launch implementability check for rope_apply_interleaved_backward_f32. - baracuda_
kernels_ ⚠rope_ apply_ interleaved_ backward_ f32_ run - RoPE apply interleaved BW, f32.
- baracuda_
kernels_ ⚠rope_ apply_ interleaved_ backward_ f64_ can_ implement baracuda_kernels_rope_apply_interleaved_backward_f64_can_implement(baracuda kernels rope apply interleaved backward f64 can implement).- baracuda_
kernels_ ⚠rope_ apply_ interleaved_ backward_ f64_ run - RoPE apply interleaved BW, f64.
- baracuda_
kernels_ ⚠rope_ apply_ interleaved_ bf16_ can_ implement baracuda_kernels_rope_apply_interleaved_bf16_can_implement(baracuda kernels rope apply interleaved bf16 can implement).- baracuda_
kernels_ ⚠rope_ apply_ interleaved_ bf16_ run - RoPE apply interleaved FW, bf16.
- baracuda_
kernels_ ⚠rope_ apply_ interleaved_ f16_ can_ implement baracuda_kernels_rope_apply_interleaved_f16_can_implement(baracuda kernels rope apply interleaved f16 can implement).- baracuda_
kernels_ ⚠rope_ apply_ interleaved_ f16_ run - RoPE apply interleaved FW, f16.
- baracuda_
kernels_ ⚠rope_ apply_ interleaved_ f32_ can_ implement baracuda_kernels_rope_apply_interleaved_f32_can_implement(baracuda kernels rope apply interleaved f32 can implement).- baracuda_
kernels_ ⚠rope_ apply_ interleaved_ f32_ run - RoPE apply interleaved FW, f32.
- baracuda_
kernels_ ⚠rope_ apply_ interleaved_ f64_ can_ implement baracuda_kernels_rope_apply_interleaved_f64_can_implement(baracuda kernels rope apply interleaved f64 can implement).- baracuda_
kernels_ ⚠rope_ apply_ interleaved_ f64_ run - RoPE apply interleaved FW, f64.
- baracuda_
kernels_ ⚠rope_ apply_ thd_ backward_ bf16_ can_ implement baracuda_kernels_rope_apply_thd_backward_bf16_can_implement(baracuda kernels rope apply thd backward bf16 can implement).- baracuda_
kernels_ ⚠rope_ apply_ thd_ backward_ bf16_ run - RoPE apply THD BW, bf16.
- baracuda_
kernels_ ⚠rope_ apply_ thd_ backward_ f16_ can_ implement baracuda_kernels_rope_apply_thd_backward_f16_can_implement(baracuda kernels rope apply thd backward f16 can implement).- baracuda_
kernels_ ⚠rope_ apply_ thd_ backward_ f16_ run - RoPE apply THD BW, f16.
- baracuda_
kernels_ ⚠rope_ apply_ thd_ backward_ f32_ can_ implement baracuda_kernels_rope_apply_thd_backward_f32_can_implement(baracuda kernels rope apply thd backward f32 can implement).- baracuda_
kernels_ ⚠rope_ apply_ thd_ backward_ f32_ run - RoPE apply THD BW, f32.
- baracuda_
kernels_ ⚠rope_ apply_ thd_ backward_ f64_ can_ implement baracuda_kernels_rope_apply_thd_backward_f64_can_implement(baracuda kernels rope apply thd backward f64 can implement).- baracuda_
kernels_ ⚠rope_ apply_ thd_ backward_ f64_ run - RoPE apply THD BW, f64.
- baracuda_
kernels_ ⚠rope_ apply_ thd_ bf16_ can_ implement baracuda_kernels_rope_apply_thd_bf16_can_implement(baracuda kernels rope apply thd bf16 can implement).- baracuda_
kernels_ ⚠rope_ apply_ thd_ bf16_ run - RoPE apply THD FW, bf16.
- baracuda_
kernels_ ⚠rope_ apply_ thd_ f16_ can_ implement baracuda_kernels_rope_apply_thd_f16_can_implement(baracuda kernels rope apply thd f16 can implement).- baracuda_
kernels_ ⚠rope_ apply_ thd_ f16_ run - RoPE apply THD FW, f16.
- baracuda_
kernels_ ⚠rope_ apply_ thd_ f32_ can_ implement baracuda_kernels_rope_apply_thd_f32_can_implement(baracuda kernels rope apply thd f32 can implement).- baracuda_
kernels_ ⚠rope_ apply_ thd_ f32_ run - RoPE apply THD FW, f32.
- baracuda_
kernels_ ⚠rope_ apply_ thd_ f64_ can_ implement baracuda_kernels_rope_apply_thd_f64_can_implement(baracuda kernels rope apply thd f64 can implement).- baracuda_
kernels_ ⚠rope_ apply_ thd_ f64_ run - RoPE apply THD FW, f64.
- baracuda_
kernels_ ⚠rope_ backward_ bf16_ can_ implement - Implementability check for
rope_backward_bf16. Host-side only. - baracuda_
kernels_ ⚠rope_ backward_ bf16_ run - RoPE BW, bf16.
- baracuda_
kernels_ ⚠rope_ backward_ bf16_ strided_ can_ implement - Implementability check for
rope_backward_bf16_strided. Host-side only. - baracuda_
kernels_ ⚠rope_ backward_ bf16_ strided_ run - RoPE BW strided, bf16.
- baracuda_
kernels_ ⚠rope_ backward_ f16_ can_ implement - Implementability check for
rope_backward_f16. Host-side only. - baracuda_
kernels_ ⚠rope_ backward_ f16_ run - RoPE BW, f16.
- baracuda_
kernels_ ⚠rope_ backward_ f16_ strided_ can_ implement - Implementability check for
rope_backward_f16_strided. Host-side only. - baracuda_
kernels_ ⚠rope_ backward_ f16_ strided_ run - RoPE BW strided, f16.
- baracuda_
kernels_ ⚠rope_ backward_ f32_ can_ implement - Implementability check for
rope_backward_f32. Host-side only. - baracuda_
kernels_ ⚠rope_ backward_ f32_ run - RoPE BW, f32. Same shape as FW; computes
`dx` from `dy` by rotation through `-θ`. - baracuda_
kernels_ ⚠rope_ backward_ f32_ strided_ can_ implement - Implementability check for
rope_backward_f32_strided. Host-side only. - baracuda_
kernels_ ⚠rope_ backward_ f32_ strided_ run - RoPE BW strided, f32. Strides apply to
`dy` (input) and `dx` (output). - baracuda_
kernels_ ⚠rope_ backward_ f64_ can_ implement - Implementability check for
rope_backward_f64. Host-side only. - baracuda_
kernels_ ⚠rope_ backward_ f64_ run - RoPE BW, f64.
- baracuda_
kernels_ ⚠rope_ backward_ f64_ strided_ can_ implement - Implementability check for
rope_backward_f64_strided. Host-side only. - baracuda_
kernels_ ⚠rope_ backward_ f64_ strided_ run - RoPE BW strided, f64.
- baracuda_
kernels_ ⚠rope_ bf16_ can_ implement - Implementability check for
rope_bf16. Host-side only. - baracuda_
kernels_ ⚠rope_ bf16_ run - RoPE FW, bf16.
- baracuda_
kernels_ ⚠rope_ bf16_ strided_ can_ implement - Implementability check for
rope_bf16_strided. Host-side only. - baracuda_
kernels_ ⚠rope_ bf16_ strided_ run - RoPE FW strided, bf16.
- baracuda_
kernels_ ⚠rope_ f16_ can_ implement - Implementability check for
rope_f16. Host-side only. - baracuda_
kernels_ ⚠rope_ f16_ run - RoPE FW, f16 (f32 trig detour internally).
- baracuda_
kernels_ ⚠rope_ f16_ strided_ can_ implement - Implementability check for
rope_f16_strided. Host-side only. - baracuda_
kernels_ ⚠rope_ f16_ strided_ run - RoPE FW strided, f16.
- baracuda_
kernels_ ⚠rope_ f32_ can_ implement - Implementability check for
rope_f32. Host-side only. - baracuda_
kernels_ ⚠rope_ f32_ run - RoPE FW, f32. Input/output are [B, H, S, D] contiguous row-major;
`head_dim` (D) must be even. When `pos_default_flag != 0`, the kernel ignores `positions` and uses position index = sequence index; otherwise `positions` is `int64_t[seq]`. - baracuda_
kernels_ ⚠rope_ f32_ strided_ can_ implement - Implementability check for
rope_f32_strided. Host-side only. - baracuda_
kernels_ ⚠rope_ f32_ strided_ run - RoPE FW strided, f32.
- baracuda_
kernels_ ⚠rope_ f64_ can_ implement - Implementability check for
rope_f64. Host-side only. - baracuda_
kernels_ ⚠rope_ f64_ run - RoPE FW, f64.
- baracuda_
kernels_ ⚠rope_ f64_ strided_ can_ implement - Implementability check for
rope_f64_strided. Host-side only. - baracuda_
kernels_ ⚠rope_ f64_ strided_ run - RoPE FW strided, f64.
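As a host-side reference for the RoPE entries above, the following sketch rotates a single `[S, D]` slice on the CPU. It is not the kernel ABI: the function name `rope_rotate`, the adjacent-pair layout `(x[2i], x[2i+1])`, and the frequency base passed as `base` are assumptions made for illustration. Only the even `head_dim` requirement, the default of position index = sequence index, and the backward pass being a rotation through `-θ` come from the entries above.

```rust
/// Hypothetical CPU reference, not part of this crate.
/// Rotates each adjacent pair (x[2i], x[2i+1]) of every row by the angle
/// sign * pos * base^(-2i / head_dim).
/// `sign = 1.0` corresponds to the forward pass, `sign = -1.0` to the
/// backward pass (rotation through -theta).
fn rope_rotate(
    x: &mut [f32],
    seq: usize,
    head_dim: usize,
    positions: Option<&[i64]>,
    base: f32,
    sign: f32,
) {
    assert!(head_dim % 2 == 0, "head_dim must be even");
    assert_eq!(x.len(), seq * head_dim);
    for s in 0..seq {
        // Without explicit positions, the position index is the sequence index.
        let pos = positions.map_or(s as f32, |p| p[s] as f32);
        for i in 0..head_dim / 2 {
            let theta = sign * pos * base.powf(-2.0 * i as f32 / head_dim as f32);
            let (sin, cos) = theta.sin_cos();
            let (a, b) = (x[s * head_dim + 2 * i], x[s * head_dim + 2 * i + 1]);
            x[s * head_dim + 2 * i] = a * cos - b * sin;
            x[s * head_dim + 2 * i + 1] = a * sin + b * cos;
        }
    }
}
```

Because a rotation through `θ` followed by a rotation through `-θ` is the identity, calling the sketch with `sign = 1.0` and then `sign = -1.0` should return the original data up to rounding, which makes a convenient sanity check when validating a device result against a host reference.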
- baracuda_
kernels_ ⚠scale_ inplace_ c32_ can_ implement - Implementability check for
baracuda_kernels_scale_inplace_c32. Host-side only. - baracuda_
kernels_ ⚠scale_ inplace_ c32_ run - In-place scale of a
`cufftComplex` buffer by a real scalar: `y[i].x *= scale; y[i].y *= scale;`. Applied after `cufftExecC2C` in the inverse direction to bake in the 1/N normalization PyTorch expects. - baracuda_
kernels_ ⚠scale_ inplace_ c64_ can_ implement - Implementability check for
baracuda_kernels_scale_inplace_c64. Host-side only. - baracuda_
kernels_ ⚠scale_ inplace_ c64_ run - In-place scale of a
`cufftDoubleComplex` buffer by a real scalar. f64 analogue of `baracuda_kernels_scale_inplace_c32_run`. - baracuda_
kernels_ ⚠scale_ inplace_ real_ f32_ can_ implement - Implementability check for
baracuda_kernels_scale_inplace_real_f32. Host-side only. - baracuda_
kernels_ ⚠scale_ inplace_ real_ f32_ run - In-place scale of a real
`f32` buffer. Used to bake the `1/N` normalization into the output of `cufftExecC2R` (IRFFT). - baracuda_
kernels_ ⚠scale_ inplace_ real_ f64_ can_ implement - Implementability check for
baracuda_kernels_scale_inplace_real_f64. Host-side only. - baracuda_
kernels_ ⚠scale_ inplace_ real_ f64_ run - In-place scale of a real
`f64` buffer. f64 analogue of `baracuda_kernels_scale_inplace_real_f32_run`. - baracuda_
kernels_ ⚠scan_ cummax_ backward_ bf16_ can_ implement - Pre-launch implementability check for
scan_cummax_backward_bf16. - baracuda_
kernels_ ⚠scan_ cummax_ backward_ bf16_ run - Cummax backward, bf16.
- baracuda_
kernels_ ⚠scan_ cummax_ backward_ f16_ can_ implement - Pre-launch implementability check for
scan_cummax_backward_f16. - baracuda_
kernels_ ⚠scan_ cummax_ backward_ f16_ run - Cummax backward, f16.
- baracuda_
kernels_ ⚠scan_ cummax_ backward_ f32_ can_ implement - Pre-launch implementability check for
scan_cummax_backward_f32. - baracuda_
kernels_ ⚠scan_ cummax_ backward_ f32_ run - Cummax backward, f32. Walks the forward scan tracking first-occurrence argmax; gradient flows to the source position.
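The first-occurrence rule described above can be sketched on the host for a single 1-D scan line. The function name `cummax_backward_ref` and the slice-based signature are hypothetical and unrelated to the raw entry point; the sketch also assumes a forward (non-reversed) scan and does not model NaN handling.

```rust
/// Hypothetical CPU reference for one scan line.
/// Replays the forward running max, remembering the index of the first
/// element that attained it, and routes each dy[i] to that index.
fn cummax_backward_ref(x: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(x.len(), dy.len());
    let mut dx = vec![0.0; x.len()];
    let mut arg = 0; // index of the current running max
    for i in 0..x.len() {
        // Strict `>` keeps the first occurrence on ties.
        if x[i] > x[arg] {
            arg = i;
        }
        dx[arg] += dy[i];
    }
    dx
}
```

For `x = [1, 3, 2, 3, 5]` the running argmax is `0, 1, 1, 1, 4`, so a gradient of all ones accumulates as `[1, 3, 0, 0, 1]`: the second `3` ties with the first and receives nothing.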
- baracuda_
kernels_ ⚠scan_ cummax_ backward_ f64_ can_ implement - Pre-launch implementability check for
scan_cummax_backward_f64. - baracuda_
kernels_ ⚠scan_ cummax_ backward_ f64_ run - Cummax backward, f64.
- baracuda_
kernels_ ⚠scan_ cummax_ bf16_ can_ implement - Pre-launch implementability check for
scan_cummax_bf16. - baracuda_
kernels_ ⚠scan_ cummax_ bf16_ run - Cummax, bf16.
- baracuda_
kernels_ ⚠scan_ cummax_ f16_ can_ implement - Pre-launch implementability check for
scan_cummax_f16. - baracuda_
kernels_ ⚠scan_ cummax_ f16_ run - Cummax, f16.
- baracuda_
kernels_ ⚠scan_ cummax_ f32_ can_ implement - Pre-launch implementability check for
scan_cummax_f32. - baracuda_
kernels_ ⚠scan_ cummax_ f32_ run - Cummax (inclusive prefix running max), f32.
- baracuda_
kernels_ ⚠scan_ cummax_ f64_ can_ implement - Pre-launch implementability check for
scan_cummax_f64. - baracuda_
kernels_ ⚠scan_ cummax_ f64_ run - Cummax, f64.
- baracuda_
kernels_ ⚠scan_ cummin_ backward_ bf16_ can_ implement - Pre-launch implementability check for
scan_cummin_backward_bf16. - baracuda_
kernels_ ⚠scan_ cummin_ backward_ bf16_ run - Cummin backward, bf16.
- baracuda_
kernels_ ⚠scan_ cummin_ backward_ f16_ can_ implement - Pre-launch implementability check for
scan_cummin_backward_f16. - baracuda_
kernels_ ⚠scan_ cummin_ backward_ f16_ run - Cummin backward, f16.
- baracuda_
kernels_ ⚠scan_ cummin_ backward_ f32_ can_ implement - Pre-launch implementability check for
scan_cummin_backward_f32. - baracuda_
kernels_ ⚠scan_ cummin_ backward_ f32_ run - Cummin backward, f32. Same kernel shape as Cummax BW with
`<` instead of `>` for the tie-tracking comparison. - baracuda_
kernels_ ⚠scan_ cummin_ backward_ f64_ can_ implement - Pre-launch implementability check for
scan_cummin_backward_f64. - baracuda_
kernels_ ⚠scan_ cummin_ backward_ f64_ run - Cummin backward, f64.
- baracuda_
kernels_ ⚠scan_ cummin_ bf16_ can_ implement - Pre-launch implementability check for
scan_cummin_bf16. - baracuda_
kernels_ ⚠scan_ cummin_ bf16_ run - Cummin, bf16.
- baracuda_
kernels_ ⚠scan_ cummin_ f16_ can_ implement - Pre-launch implementability check for
scan_cummin_f16. - baracuda_
kernels_ ⚠scan_ cummin_ f16_ run - Cummin, f16.
- baracuda_
kernels_ ⚠scan_ cummin_ f32_ can_ implement - Pre-launch implementability check for
scan_cummin_f32. - baracuda_
kernels_ ⚠scan_ cummin_ f32_ run - Cummin (inclusive prefix running min), f32.
- baracuda_
kernels_ ⚠scan_ cummin_ f64_ can_ implement - Pre-launch implementability check for
scan_cummin_f64. - baracuda_
kernels_ ⚠scan_ cummin_ f64_ run - Cummin, f64.
- baracuda_
kernels_ ⚠scan_ cumprod_ backward_ bf16_ can_ implement - Pre-launch implementability check for
scan_cumprod_backward_bf16. - baracuda_
kernels_ ⚠scan_ cumprod_ backward_ bf16_ run - Cumprod backward, bf16.
- baracuda_
kernels_ ⚠scan_ cumprod_ backward_ f16_ can_ implement - Pre-launch implementability check for
scan_cumprod_backward_f16. - baracuda_
kernels_ ⚠scan_ cumprod_ backward_ f16_ run - Cumprod backward, f16. f32-detour accumulator.
- baracuda_
kernels_ ⚠scan_ cumprod_ backward_ f32_ can_ implement - Pre-launch implementability check for
scan_cumprod_backward_f32. - baracuda_
kernels_ ⚠scan_ cumprod_ backward_ f32_ run - Cumprod backward, f32. Per-cell suffix accumulator of
`dy[i] * y[i] / x[j]`. Caller must ensure `x` has no zeros along the scan axis. - baracuda_
kernels_ ⚠scan_ cumprod_ backward_ f64_ can_ implement - Pre-launch implementability check for
scan_cumprod_backward_f64. - baracuda_
kernels_ ⚠scan_ cumprod_ backward_ f64_ run - Cumprod backward, f64.
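The suffix accumulator in the cumprod backward entries above can be written out for one 1-D scan line as follows. This is a host-side sketch under stated assumptions: the name `cumprod_backward_ref` is hypothetical, the scan is taken in the forward direction, and, as the f32 entry requires, `x` must contain no zeros.

```rust
/// Hypothetical CPU reference for one scan line of a forward cumprod.
/// dx[j] = (sum over i >= j of dy[i] * y[i]) / x[j], where y = cumprod(x).
fn cumprod_backward_ref(x: &[f32], y: &[f32], dy: &[f32]) -> Vec<f32> {
    assert!(x.len() == y.len() && x.len() == dy.len());
    let mut dx = vec![0.0; x.len()];
    let mut suffix = 0.0f32; // running sum of dy[i] * y[i] for i >= j
    for j in (0..x.len()).rev() {
        suffix += dy[j] * y[j];
        dx[j] = suffix / x[j]; // division is why x must be non-zero
    }
    dx
}
```

With `x = [2, 3, 4]`, `y = [2, 6, 24]` and `dy = [1, 1, 1]`, the result is `[16, 10, 6]`, matching the hand-derived partial derivatives of `x0 + x0*x1 + x0*x1*x2`.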
- baracuda_
kernels_ ⚠scan_ cumprod_ bf16_ can_ implement - Pre-launch implementability check for
scan_cumprod_bf16. - baracuda_
kernels_ ⚠scan_ cumprod_ bf16_ run - Cumprod, bf16.
- baracuda_
kernels_ ⚠scan_ cumprod_ f16_ can_ implement - Pre-launch implementability check for
scan_cumprod_f16. - baracuda_
kernels_ ⚠scan_ cumprod_ f16_ run - Cumprod, f16. f32-detour accumulator.
- baracuda_
kernels_ ⚠scan_ cumprod_ f32_ can_ implement - Pre-launch implementability check for
scan_cumprod_f32. - baracuda_
kernels_ ⚠scan_ cumprod_ f32_ run - Cumprod (inclusive prefix product), f32. Same ABI as cumsum.
- baracuda_
kernels_ ⚠scan_ cumprod_ f64_ can_ implement - Pre-launch implementability check for
scan_cumprod_f64. - baracuda_
kernels_ ⚠scan_ cumprod_ f64_ run - Cumprod, f64.
- baracuda_
kernels_ ⚠scan_ cumsum_ bf16_ can_ implement - Pre-launch implementability check for
scan_cumsum_bf16. - baracuda_
kernels_ ⚠scan_ cumsum_ bf16_ run - Cumsum, bf16.
- baracuda_
kernels_ ⚠scan_ cumsum_ f16_ can_ implement - Pre-launch implementability check for
scan_cumsum_f16. - baracuda_
kernels_ ⚠scan_ cumsum_ f16_ run - Cumsum, f16. f32-detour accumulator inside the kernel.
- baracuda_
kernels_ ⚠scan_ cumsum_ f32_ can_ implement - Pre-launch implementability check for
scan_cumsum_f32. - baracuda_
kernels_ ⚠scan_ cumsum_ f32_ run - Inclusive prefix sum (
`cumsum`) along a single axis, f32. `reverse != 0` flips the scan direction. - baracuda_
kernels_ ⚠scan_ cumsum_ f64_ can_ implement - Pre-launch implementability check for
scan_cumsum_f64. - baracuda_
kernels_ ⚠scan_ cumsum_ f64_ run - Cumsum, f64.
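To make the `reverse` flag of the cumsum entries above concrete, here is a minimal host-side sketch for one scan line. The name `cumsum_ref` and the `bool` parameter are illustrative assumptions; the raw entry point takes an integer flag and device pointers instead.

```rust
/// Hypothetical CPU reference: inclusive prefix sum over one scan line.
/// With `reverse` set, the scan runs from the last element to the first,
/// so y[i] holds the sum of x[i..].
fn cumsum_ref(x: &[f32], reverse: bool) -> Vec<f32> {
    let mut y = vec![0.0; x.len()];
    let mut acc = 0.0;
    for step in 0..x.len() {
        let i = if reverse { x.len() - 1 - step } else { step };
        acc += x[i];
        y[i] = acc;
    }
    y
}
```

For `x = [1, 2, 3, 4]` the forward scan gives `[1, 3, 6, 10]` and the reversed scan gives `[10, 9, 7, 4]`.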
- baracuda_
kernels_ ⚠scan_ log_ cumsum_ exp_ backward_ bf16_ can_ implement - `baracuda_kernels_scan_log_cumsum_exp_backward_bf16_can_implement` (baracuda kernels scan log cumsum exp backward bf16 can implement). - baracuda_
kernels_ ⚠scan_ log_ cumsum_ exp_ backward_ bf16_ run - LogCumsumExp BW, bf16.
- baracuda_
kernels_ ⚠scan_ log_ cumsum_ exp_ backward_ f16_ can_ implement - `baracuda_kernels_scan_log_cumsum_exp_backward_f16_can_implement` (baracuda kernels scan log cumsum exp backward f16 can implement). - baracuda_
kernels_ ⚠scan_ log_ cumsum_ exp_ backward_ f16_ run - LogCumsumExp BW, f16. f32-detour accumulator.
- baracuda_
kernels_ ⚠scan_ log_ cumsum_ exp_ backward_ f32_ can_ implement - `baracuda_kernels_scan_log_cumsum_exp_backward_f32_can_implement` (baracuda kernels scan log cumsum exp backward f32 can implement). - baracuda_
kernels_ ⚠scan_ log_ cumsum_ exp_ backward_ f32_ run - LogCumsumExp BW, f32. Per-cell accumulator of
`Σ dy[i] * exp(x[k] - y[i])` over the FW-direction-dependent `i` range. Needs both saved `x` and saved `y` (same shape since scans are length-preserving). Stable by construction: `x[k] - y[i] ≤ 0` so `exp(.) ∈ [0, 1]`. - baracuda_
kernels_ ⚠scan_ log_ cumsum_ exp_ backward_ f64_ can_ implement - `baracuda_kernels_scan_log_cumsum_exp_backward_f64_can_implement` (baracuda kernels scan log cumsum exp backward f64 can implement). - baracuda_
kernels_ ⚠scan_ log_ cumsum_ exp_ backward_ f64_ run - LogCumsumExp BW, f64.
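The per-cell sum used by the LogCumsumExp backward entries above can be checked against a direct quadratic-time host reference. The sketch assumes a forward (non-reversed) scan, so the `i` range for cell `k` is `i >= k`; the name `log_cumsum_exp_backward_ref` is hypothetical.

```rust
/// Hypothetical CPU reference for one scan line of a forward LogCumsumExp.
/// dx[k] = sum over i >= k of dy[i] * exp(x[k] - y[i]), with y saved from FW.
/// Each exponent is <= 0 because y[i] is the log-sum-exp of a set that
/// contains x[k], so the exponential cannot overflow.
fn log_cumsum_exp_backward_ref(x: &[f32], y: &[f32], dy: &[f32]) -> Vec<f32> {
    assert!(x.len() == y.len() && x.len() == dy.len());
    (0..x.len())
        .map(|k| {
            (k..x.len())
                .map(|i| dy[i] * (x[k] - y[i]).exp())
                .sum::<f32>()
        })
        .collect()
}
```

For `x = [0, 0]` the forward output is `y = [0, ln 2]`, and an upstream gradient of ones yields approximately `[1.5, 0.5]`.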
- baracuda_
kernels_ ⚠scan_ log_ cumsum_ exp_ bf16_ can_ implement - `baracuda_kernels_scan_log_cumsum_exp_bf16_can_implement` (baracuda kernels scan log cumsum exp bf16 can implement). - baracuda_
kernels_ ⚠scan_ log_ cumsum_ exp_ bf16_ run - LogCumsumExp FW, bf16.
- baracuda_
kernels_ ⚠scan_ log_ cumsum_ exp_ f16_ can_ implement - `baracuda_kernels_scan_log_cumsum_exp_f16_can_implement` (baracuda kernels scan log cumsum exp f16 can implement). - baracuda_
kernels_ ⚠scan_ log_ cumsum_ exp_ f16_ run - LogCumsumExp FW, f16. f32-detour accumulator inside the kernel.
- baracuda_
kernels_ ⚠scan_ log_ cumsum_ exp_ f32_ can_ implement - `baracuda_kernels_scan_log_cumsum_exp_f32_can_implement` (baracuda kernels scan log cumsum exp f32 can implement). - baracuda_
kernels_ ⚠scan_ log_ cumsum_ exp_ f32_ run - LogCumsumExp FW, f32.
`y[k] = log(Σ_{j ≤ k} exp(x[j]))` (or suffix-LSE when `reverse != 0`). Numerically stable via the online running-max algorithm. Same ABI as cumsum. - baracuda_
kernels_ ⚠scan_ log_ cumsum_ exp_ f64_ can_ implement - `baracuda_kernels_scan_log_cumsum_exp_f64_can_implement` (baracuda kernels scan log cumsum exp f64 can implement). - baracuda_
kernels_ ⚠scan_ log_ cumsum_ exp_ f64_ run - LogCumsumExp FW, f64.
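One common way to realize an online running-max log-sum-exp scan, as named in the f32 entry above, is to carry the current maximum `m` and the sum of `exp(x[j] - m)` and rescale the sum whenever the maximum grows. The sketch below is a sequential host version for one forward scan line; the name `log_cumsum_exp_ref` is hypothetical, and the kernel's actual formulation may differ.

```rust
/// Hypothetical CPU reference: y[k] = log(sum over j <= k of exp(x[j])).
/// Keeps a running max `m` and `s` = sum of exp(x[j] - m) so that no
/// exponent is ever positive.
fn log_cumsum_exp_ref(x: &[f32]) -> Vec<f32> {
    let mut m = f32::NEG_INFINITY;
    let mut s = 0.0f32;
    x.iter()
        .map(|&v| {
            if v > m {
                // New maximum: rescale the old sum to the new reference point.
                s = s * (m - v).exp() + 1.0;
                m = v;
            } else if v != f32::NEG_INFINITY {
                s += (v - m).exp();
            }
            // v == -inf contributes exp(-inf) = 0 and is skipped to avoid
            // evaluating (-inf) - (-inf).
            m + s.ln()
        })
        .collect()
}
```

With inputs around `1000.0`, a naive `exp` followed by `ln` would overflow in f32, while this form returns roughly `1000 + ln 2` for the second element of `[1000, 1000]`.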
- baracuda_
kernels_ ⚠scatter_ add_ f32_ can_ implement - Implementability check for
scatter_add_f32. - baracuda_
kernels_ ⚠scatter_ add_ f32_ run - `out[..., index[..., j, ...], ...] += updates[..., j, ...]` along `scatter_dim`. f32 (atomicAdd). - baracuda_
kernels_ ⚠scatter_ add_ f64_ can_ implement - Implementability check for
scatter_add_f64. - baracuda_
kernels_ ⚠scatter_ add_ f64_ run - `scatter_add`, f64 (atomicAdd). - baracuda_
kernels_ ⚠scatter_ add_ i64idx_ f32_ can_ implement - Implementability check for
scatter_add_i64idx_f32. - baracuda_
kernels_ ⚠scatter_ add_ i64idx_ f32_ run - `scatter_add`, f32, i64 indices (atomicAdd). - baracuda_
kernels_ ⚠scatter_ add_ i64idx_ f64_ can_ implement - Implementability check for
scatter_add_i64idx_f64. - baracuda_
kernels_ ⚠scatter_ add_ i64idx_ f64_ run - `scatter_add`, f64, i64 indices. - baracuda_
kernels_ ⚠scatter_ bf16_ can_ implement - Implementability check for
scatter_bf16. - baracuda_
kernels_ ⚠scatter_ bf16_ run - `scatter`, bf16, i32 idx. - baracuda_
kernels_ ⚠scatter_ f16_ can_ implement - Implementability check for
scatter_f16. - baracuda_
kernels_ ⚠scatter_ f16_ run - `scatter`, f16, i32 idx. - baracuda_
kernels_ ⚠scatter_ f32_ can_ implement - Implementability check for
scatter_f32. - baracuda_
kernels_ ⚠scatter_ f32_ run - `scatter`: `out[index] = updates`, f32, i32 idx. NO accumulation. - baracuda_
kernels_ ⚠scatter_ f64_ can_ implement - Implementability check for
scatter_f64. - baracuda_
kernels_ ⚠scatter_ f64_ run - `scatter`, f64, i32 idx. - baracuda_
kernels_ ⚠scatter_ i8_ can_ implement - `baracuda_kernels_scatter_i8_can_implement` (baracuda kernels scatter i8 can implement). - baracuda_
kernels_ ⚠scatter_ i8_ run - `baracuda_kernels_scatter_i8_run` (baracuda kernels scatter i8 run). - baracuda_
kernels_ ⚠scatter_ i16_ can_ implement - `baracuda_kernels_scatter_i16_can_implement` (baracuda kernels scatter i16 can implement). - baracuda_
kernels_ ⚠scatter_ i16_ run - `baracuda_kernels_scatter_i16_run` (baracuda kernels scatter i16 run). - baracuda_
kernels_ ⚠scatter_ i32_ can_ implement - `baracuda_kernels_scatter_i32_can_implement` (baracuda kernels scatter i32 can implement). - baracuda_
kernels_ ⚠scatter_ i32_ run - `baracuda_kernels_scatter_i32_run` (baracuda kernels scatter i32 run). - baracuda_
kernels_ ⚠scatter_ i64_ can_ implement - `baracuda_kernels_scatter_i64_can_implement` (baracuda kernels scatter i64 can implement). - baracuda_
kernels_ ⚠scatter_ i64_ run - `baracuda_kernels_scatter_i64_run` (baracuda kernels scatter i64 run). - baracuda_
kernels_ ⚠scatter_ i64idx_ bf16_ can_ implement - Implementability check for
scatter_i64idx_bf16. - baracuda_
kernels_ ⚠scatter_ i64idx_ bf16_ run - `scatter`, bf16, i64 idx. - baracuda_
kernels_ ⚠scatter_ i64idx_ f16_ can_ implement - Implementability check for
scatter_i64idx_f16. - baracuda_
kernels_ ⚠scatter_ i64idx_ f16_ run - `scatter`, f16, i64 idx. - baracuda_
kernels_ ⚠scatter_ i64idx_ f32_ can_ implement - Implementability check for
scatter_i64idx_f32. - baracuda_
kernels_ ⚠scatter_ i64idx_ f32_ run - `scatter`, f32, i64 idx. - baracuda_
kernels_ ⚠scatter_ i64idx_ f64_ can_ implement - Implementability check for
scatter_i64idx_f64. - baracuda_
kernels_ ⚠scatter_ i64idx_ f64_ run - `scatter`, f64, i64 idx. - baracuda_
kernels_ ⚠scatter_ i64idx_ i8_ can_ implement - `baracuda_kernels_scatter_i64idx_i8_can_implement` (baracuda kernels scatter i64idx i8 can implement). - baracuda_
kernels_ ⚠scatter_ i64idx_ i8_ run - `baracuda_kernels_scatter_i64idx_i8_run` (baracuda kernels scatter i64idx i8 run). - baracuda_
kernels_ ⚠scatter_ i64idx_ i16_ can_ implement - `baracuda_kernels_scatter_i64idx_i16_can_implement` (baracuda kernels scatter i64idx i16 can implement). - baracuda_
kernels_ ⚠scatter_ i64idx_ i16_ run - `baracuda_kernels_scatter_i64idx_i16_run` (baracuda kernels scatter i64idx i16 run). - baracuda_
kernels_ ⚠scatter_ i64idx_ i32_ can_ implement - `baracuda_kernels_scatter_i64idx_i32_can_implement` (baracuda kernels scatter i64idx i32 can implement). - baracuda_
kernels_ ⚠scatter_ i64idx_ i32_ run - `baracuda_kernels_scatter_i64idx_i32_run` (baracuda kernels scatter i64idx i32 run). - baracuda_
kernels_ ⚠scatter_ i64idx_ i64_ can_ implement - `baracuda_kernels_scatter_i64idx_i64_can_implement` (baracuda kernels scatter i64idx i64 can implement). - baracuda_
kernels_ ⚠scatter_ i64idx_ i64_ run - `baracuda_kernels_scatter_i64idx_i64_run` (baracuda kernels scatter i64idx i64 run). - baracuda_
kernels_ ⚠scatter_ i64idx_ u8_ can_ implement - `baracuda_kernels_scatter_i64idx_u8_can_implement` (baracuda kernels scatter i64idx u8 can implement). - baracuda_
kernels_ ⚠scatter_ i64idx_ u8_ run - `baracuda_kernels_scatter_i64idx_u8_run` (baracuda kernels scatter i64idx u8 run). - baracuda_
kernels_ ⚠scatter_ i64idx_ u16_ can_ implement - `baracuda_kernels_scatter_i64idx_u16_can_implement` (baracuda kernels scatter i64idx u16 can implement). - baracuda_
kernels_ ⚠scatter_ i64idx_ u16_ run - `baracuda_kernels_scatter_i64idx_u16_run` (baracuda kernels scatter i64idx u16 run). - baracuda_
kernels_ ⚠scatter_ i64idx_ u32_ can_ implement - `baracuda_kernels_scatter_i64idx_u32_can_implement` (baracuda kernels scatter i64idx u32 can implement). - baracuda_
kernels_ ⚠scatter_ i64idx_ u32_ run - `baracuda_kernels_scatter_i64idx_u32_run` (baracuda kernels scatter i64idx u32 run). - baracuda_
kernels_ ⚠scatter_ u8_ can_ implement - `baracuda_kernels_scatter_u8_can_implement` (baracuda kernels scatter u8 can implement). - baracuda_
kernels_ ⚠scatter_ u8_ run - `baracuda_kernels_scatter_u8_run` (baracuda kernels scatter u8 run). - baracuda_
kernels_ ⚠scatter_ u16_ can_ implement - `baracuda_kernels_scatter_u16_can_implement` (baracuda kernels scatter u16 can implement). - baracuda_
kernels_ ⚠scatter_ u16_ run - `baracuda_kernels_scatter_u16_run` (baracuda kernels scatter u16 run). - baracuda_
kernels_ ⚠scatter_ u32_ can_ implement - `baracuda_kernels_scatter_u32_can_implement` (baracuda kernels scatter u32 can implement). - baracuda_
kernels_ ⚠scatter_ u32_ run - `baracuda_kernels_scatter_u32_run` (baracuda kernels scatter u32 run). - baracuda_
kernels_ ⚠sdpa_ backward_ bf16_ can_ implement - Implementability check for
sdpa_backward_bf16. Host-side only. - baracuda_
kernels_ ⚠sdpa_ backward_ bf16_ run - SDPA BW, bf16.
- baracuda_
kernels_ ⚠sdpa_ backward_ bf16_ strided_ can_ implement - Implementability check for
sdpa_backward_bf16_strided. Host-side only. - baracuda_
kernels_ ⚠sdpa_ backward_ bf16_ strided_ run - SDPA BW strided, bf16.
- baracuda_
kernels_ ⚠sdpa_ backward_ f16_ can_ implement - Implementability check for
sdpa_backward_f16. Host-side only. - baracuda_
kernels_ ⚠sdpa_ backward_ f16_ run - SDPA BW, f16.
- baracuda_
kernels_ ⚠sdpa_ backward_ f16_ strided_ can_ implement - Implementability check for
sdpa_backward_f16_strided. Host-side only. - baracuda_
kernels_ ⚠sdpa_ backward_ f16_ strided_ run - SDPA BW strided, f16.
- baracuda_
kernels_ ⚠sdpa_ backward_ f32_ can_ implement - Implementability check for
sdpa_backward_f32. Host-side only. - baracuda_
kernels_ ⚠sdpa_ backward_ f32_ run - SDPA BW, f32. Given the FW-saved
`attn` ([B, H, Q, K]), `Q`, `K`, `V`, and upstream `dy`, computes `dQ`, `dK`, `dV`. The `dscores_ws` argument is a caller-allocated [B, H, Q, K] scratch buffer reused as the dattn → dscores intermediate; size matches the FW `attn` tensor. - baracuda_
kernels_ ⚠sdpa_ backward_ f32_ strided_ can_ implement - Implementability check for
sdpa_backward_f32_strided. Host-side only. - baracuda_
kernels_ ⚠sdpa_ backward_ f32_ strided_ run - SDPA BW strided, f32.
- baracuda_
kernels_ ⚠sdpa_ backward_ f64_ can_ implement - Implementability check for
sdpa_backward_f64. Host-side only. - baracuda_
kernels_ ⚠sdpa_ backward_ f64_ run - SDPA BW, f64.
- baracuda_
kernels_ ⚠sdpa_ backward_ f64_ strided_ can_ implement - Implementability check for
sdpa_backward_f64_strided. Host-side only. - baracuda_
kernels_ ⚠sdpa_ backward_ f64_ strided_ run - SDPA BW strided, f64.
- baracuda_
kernels_ ⚠sdpa_ bf16_ arbmask_ can_ implement - Arbitrary-mask SDPA host-side can-implement, bf16.
- baracuda_
kernels_ ⚠sdpa_ bf16_ arbmask_ run - Arbitrary additive-mask SDPA FW, bf16 (f32 accumulators).
- baracuda_
kernels_ ⚠sdpa_ bf16_ can_ implement - Implementability check for
sdpa_bf16. Host-side only. - baracuda_
kernels_ ⚠sdpa_ bf16_ run - SDPA FW, bf16 (f32 accumulators).
- baracuda_
kernels_ ⚠sdpa_ bf16_ strided_ can_ implement - Implementability check for
sdpa_bf16_strided. Host-side only. - baracuda_
kernels_ ⚠sdpa_ bf16_ strided_ run - SDPA FW strided, bf16.
- baracuda_
kernels_ ⚠sdpa_ f16_ arbmask_ can_ implement - Arbitrary-mask SDPA host-side can-implement, f16.
- baracuda_
kernels_ ⚠sdpa_ f16_ arbmask_ run - Arbitrary additive-mask SDPA FW, f16 (f32 accumulators).
- baracuda_
kernels_ ⚠sdpa_ f16_ can_ implement - Implementability check for
sdpa_f16. Host-side only. - baracuda_
kernels_ ⚠sdpa_ f16_ run - SDPA FW, f16 (f32 accumulators).
- baracuda_
kernels_ ⚠sdpa_ f16_ strided_ can_ implement - Implementability check for
sdpa_f16_strided. Host-side only. - baracuda_
kernels_ ⚠sdpa_ f16_ strided_ run - SDPA FW strided, f16.
- baracuda_
kernels_ ⚠sdpa_ f32_ arbmask_ can_ implement - Arbitrary-mask SDPA host-side can-implement, f32.
- baracuda_
kernels_ ⚠sdpa_ f32_ arbmask_ run - Arbitrary additive-mask SDPA FW, f32.
`mask` shape `[B, H, Q, K]` f32, applied as an additive bias on the score tile before softmax. - baracuda_
kernels_ ⚠sdpa_ f32_ can_ implement - Implementability check for
sdpa_f32. Host-side only. - baracuda_
kernels_ ⚠sdpa_ f32_ run - SDPA FW, f32. Computes
`y = softmax(Q·K^T·scale + mask) · V`. The `attn` buffer ([B, H, Q, K]) doubles as the scores intermediate and is overwritten in place with the softmax output (saved for BW). Pass `has_mask = 0` and `mask = nullptr` to skip the mask add. `is_causal = 1` applies an upper-triangular -inf mask inside the scores kernel. - baracuda_
kernels_ ⚠sdpa_ f32_ strided_ can_ implement - Implementability check for
sdpa_f32_strided. Host-side only. - baracuda_
kernels_ ⚠sdpa_ f32_ strided_ run - SDPA FW strided, f32.
- baracuda_
kernels_ ⚠sdpa_ f64_ arbmask_ can_ implement - Arbitrary-mask SDPA host-side can-implement, f64.
- baracuda_
kernels_ ⚠sdpa_ f64_ arbmask_ run - Arbitrary additive-mask SDPA FW, f64.
- baracuda_
kernels_ ⚠sdpa_ f64_ can_ implement - Implementability check for
sdpa_f64. Host-side only. - baracuda_
kernels_ ⚠sdpa_ f64_ run - SDPA FW, f64.
- baracuda_
kernels_ ⚠sdpa_ f64_ strided_ can_ implement - Implementability check for
sdpa_f64_strided. Host-side only. - baracuda_
kernels_ ⚠sdpa_ f64_ strided_ run - SDPA FW strided, f64.
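The formula `y = softmax(Q·K^T·scale + mask) · V` from the SDPA forward entries above can be spelled out for a single batch and head as a host-side reference. Everything about the signature is an assumption for illustration: the name `sdpa_ref`, the use of one `d` for Q, K and V, the optional `mask` slice standing in for `has_mask`/`mask`, and the causal rule `j > i` (key index greater than query index). Like the kernel description, the sketch returns the softmax output alongside `y`.

```rust
/// Hypothetical single-head CPU reference.
/// q is [n_q, d], k and v are [n_k, d], mask (if any) is [n_q, n_k],
/// all row-major. Returns (y, attn) where attn holds the softmax output.
fn sdpa_ref(
    q: &[f32],
    k: &[f32],
    v: &[f32],
    n_q: usize,
    n_k: usize,
    d: usize,
    scale: f32,
    mask: Option<&[f32]>,
    is_causal: bool,
) -> (Vec<f32>, Vec<f32>) {
    let mut attn = vec![0.0f32; n_q * n_k];
    let mut y = vec![0.0f32; n_q * d];
    for i in 0..n_q {
        let row = &mut attn[i * n_k..(i + 1) * n_k];
        // Scores: scaled dot product plus optional additive mask.
        for j in 0..n_k {
            let dot: f32 = (0..d).map(|t| q[i * d + t] * k[j * d + t]).sum();
            let mut s = dot * scale + mask.map_or(0.0, |m| m[i * n_k + j]);
            if is_causal && j > i {
                s = f32::NEG_INFINITY; // upper-triangular -inf mask
            }
            row[j] = s;
        }
        // Softmax in place, subtracting the row max for stability.
        let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let mut sum = 0.0;
        for s in row.iter_mut() {
            *s = (*s - max).exp();
            sum += *s;
        }
        for s in row.iter_mut() {
            *s /= sum;
        }
        // Weighted sum of value rows.
        for j in 0..n_k {
            for t in 0..d {
                y[i * d + t] += row[j] * v[j * d + t];
            }
        }
    }
    (y, attn)
}
```

With all scores equal (for example `k` all zeros), each query averages the value rows; enabling `is_causal` restricts query `i` to keys `0..=i`, so the first query returns the first value row unchanged.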
- baracuda_
kernels_ ⚠searchsorted_ f32_ can_ implement - `baracuda_kernels_searchsorted_f32_can_implement` (baracuda kernels searchsorted f32 can implement). - baracuda_
kernels_ ⚠searchsorted_ f32_ run - `searchsorted`, f32. `right == 0` = lower_bound; `right == 1` = upper_bound. - baracuda_
kernels_ ⚠searchsorted_ f64_ can_ implement - `baracuda_kernels_searchsorted_f64_can_implement` (baracuda kernels searchsorted f64 can implement). - baracuda_
kernels_ ⚠searchsorted_ f64_ run - `searchsorted`, f64. - baracuda_
kernels_ ⚠searchsorted_ i32_ can_ implement - `baracuda_kernels_searchsorted_i32_can_implement` (baracuda kernels searchsorted i32 can implement). - baracuda_
kernels_ ⚠searchsorted_ i32_ run - `searchsorted`, i32. - baracuda_
kernels_ ⚠searchsorted_ i64_ can_ implement - `baracuda_kernels_searchsorted_i64_can_implement` (baracuda kernels searchsorted i64 can implement). - baracuda_
kernels_ ⚠searchsorted_ i64_ run - `searchsorted`, i64. - baracuda_
kernels_ ⚠segment_ max_ backward_ f32_ can_ implement - Implementability check for
segment_max_backward_f32. - baracuda_
kernels_ ⚠segment_ max_ backward_ f32_ run - `d_input[k, d] = d_output[seg, d]` iff k is the (first) max-argument of the segment in column d, else 0. Sorted seg ids. f32. - baracuda_
kernels_ ⚠segment_ max_ backward_ f64_ can_ implement - Implementability check for
segment_max_backward_f64. - baracuda_
kernels_ ⚠segment_ max_ backward_ f64_ run - `segment_max_backward`, f64. - baracuda_
kernels_ ⚠segment_ max_ f32_ can_ implement - Implementability check for
segment_max_f32. - baracuda_
kernels_ ⚠segment_ max_ f32_ run - `out[s, d] = max_{n : seg[n] == s} input[n, d]`, sorted. f32. - baracuda_
kernels_ ⚠segment_ max_ f64_ can_ implement - Implementability check for
segment_max_f64. - baracuda_
kernels_ ⚠segment_ max_ f64_ run - `segment_max`, f64. - baracuda_
kernels_ ⚠segment_ max_ i64idx_ f32_ can_ implement - `baracuda_kernels_segment_max_i64idx_f32_can_implement` (baracuda kernels segment max i64idx f32 can implement). - baracuda_
kernels_ ⚠segment_ max_ i64idx_ f32_ run - `baracuda_kernels_segment_max_i64idx_f32_run` (baracuda kernels segment max i64idx f32 run). - baracuda_
kernels_ ⚠segment_ max_ i64idx_ f64_ can_ implement - `baracuda_kernels_segment_max_i64idx_f64_can_implement` (baracuda kernels segment max i64idx f64 can implement). - baracuda_
kernels_ ⚠segment_ max_ i64idx_ f64_ run - `baracuda_kernels_segment_max_i64idx_f64_run` (baracuda kernels segment max i64idx f64 run). - baracuda_
kernels_ ⚠segment_ mean_ backward_ f32_ can_ implement - Implementability check for
segment_mean_backward_f32. - baracuda_
kernels_ ⚠segment_ mean_ backward_ f32_ run - `d_input[n, d] = d_output[seg[n], d] / count[seg[n]]`. Workspace: `num_segments * sizeof(i32)`. f32. - baracuda_
kernels_ ⚠segment_ mean_ backward_ f64_ can_ implement - Implementability check for
segment_mean_backward_f64. - baracuda_
kernels_ ⚠segment_ mean_ backward_ f64_ run - `segment_mean_backward`, f64. - baracuda_
kernels_ ⚠segment_ mean_ backward_ i64idx_ f32_ can_ implement - `baracuda_kernels_segment_mean_backward_i64idx_f32_can_implement` (baracuda kernels segment mean backward i64idx f32 can implement). - baracuda_
kernels_ ⚠segment_ mean_ backward_ i64idx_ f32_ run - `baracuda_kernels_segment_mean_backward_i64idx_f32_run` (baracuda kernels segment mean backward i64idx f32 run). - baracuda_
kernels_ ⚠segment_ mean_ backward_ i64idx_ f64_ can_ implement - `baracuda_kernels_segment_mean_backward_i64idx_f64_can_implement` (baracuda kernels segment mean backward i64idx f64 can implement). - baracuda_
kernels_ ⚠segment_ mean_ backward_ i64idx_ f64_ run - `baracuda_kernels_segment_mean_backward_i64idx_f64_run` (baracuda kernels segment mean backward i64idx f64 run). - baracuda_
kernels_ ⚠segment_ mean_ f32_ can_ implement - Implementability check for
segment_mean_f32. - baracuda_
kernels_ ⚠segment_ mean_ f32_ run - `out[s, d] = mean_{n : seg[n] == s} input[n, d]`, sorted. f32. - baracuda_
kernels_ ⚠segment_ mean_ f64_ can_ implement - Implementability check for
segment_mean_f64. - baracuda_
kernels_ ⚠segment_ mean_ f64_ run - `segment_mean`, f64. - baracuda_
kernels_ ⚠segment_ mean_ i64idx_ f32_ can_ implement - `baracuda_kernels_segment_mean_i64idx_f32_can_implement` (baracuda kernels segment mean i64idx f32 can implement). - baracuda_
kernels_ ⚠segment_ mean_ i64idx_ f32_ run - `baracuda_kernels_segment_mean_i64idx_f32_run` (baracuda kernels segment mean i64idx f32 run). - baracuda_
kernels_ ⚠segment_ mean_ i64idx_ f64_ can_ implement - `baracuda_kernels_segment_mean_i64idx_f64_can_implement` (baracuda kernels segment mean i64idx f64 can implement). - baracuda_
kernels_ ⚠segment_ mean_ i64idx_ f64_ run - `baracuda_kernels_segment_mean_i64idx_f64_run` (baracuda kernels segment mean i64idx f64 run). - baracuda_
kernels_ ⚠segment_ min_ backward_ f32_ can_ implement - Implementability check for
segment_min_backward_f32. - baracuda_
kernels_ ⚠segment_ min_ backward_ f32_ run segment_min_backward— f32.- baracuda_
kernels_ ⚠segment_ min_ backward_ f64_ can_ implement - Implementability check for
segment_min_backward_f64. - baracuda_
kernels_ ⚠segment_ min_ backward_ f64_ run segment_min_backward— f64.- baracuda_
kernels_ ⚠segment_ min_ f32_ can_ implement - Implementability check for
segment_min_f32. - baracuda_
kernels_ ⚠segment_ min_ f32_ run out[s, d] = min_{n : seg[n] == s} input[n, d], sorted seg ids. f32. - baracuda_
kernels_ ⚠segment_ min_ f64_ can_ implement - Implementability check for
segment_min_f64. - baracuda_
kernels_ ⚠segment_ min_ f64_ run segment_min— f64.- baracuda_
kernels_ ⚠segment_ min_ i64idx_ f32_ can_ implement baracuda_kernels_segment_min_i64idx_f32_can_implement(baracuda kernels segment min i64idx f32 can implement).- baracuda_
kernels_ ⚠segment_ min_ i64idx_ f32_ run baracuda_kernels_segment_min_i64idx_f32_run(baracuda kernels segment min i64idx f32 run).- baracuda_
kernels_ ⚠segment_ min_ i64idx_ f64_ can_ implement baracuda_kernels_segment_min_i64idx_f64_can_implement(baracuda kernels segment min i64idx f64 can implement).- baracuda_
kernels_ ⚠segment_ min_ i64idx_ f64_ run baracuda_kernels_segment_min_i64idx_f64_run(baracuda kernels segment min i64idx f64 run).- baracuda_
kernels_ ⚠segment_ prod_ backward_ f32_ can_ implement - Implementability check for
segment_prod_backward_f32. - baracuda_
kernels_ ⚠segment_ prod_ backward_ f32_ run segment_prod_backward— f32.- baracuda_
kernels_ ⚠segment_ prod_ backward_ f64_ can_ implement - Implementability check for
segment_prod_backward_f64. - baracuda_
kernels_ ⚠segment_ prod_ backward_ f64_ run segment_prod_backward— f64.- baracuda_
kernels_ ⚠segment_ prod_ f32_ can_ implement - Implementability check for
segment_prod_f32. - baracuda_
kernels_ ⚠segment_ prod_ f32_ run out[s, d] = prod_{n : seg[n] == s} input[n, d], sorted seg ids. f32. - baracuda_
kernels_ ⚠segment_ prod_ f64_ can_ implement - Implementability check for
segment_prod_f64. - baracuda_
kernels_ ⚠segment_ prod_ f64_ run segment_prod— f64.- baracuda_
kernels_ ⚠segment_ prod_ i64idx_ f32_ can_ implement baracuda_kernels_segment_prod_i64idx_f32_can_implement(baracuda kernels segment prod i64idx f32 can implement).- baracuda_
kernels_ ⚠segment_ prod_ i64idx_ f32_ run baracuda_kernels_segment_prod_i64idx_f32_run(baracuda kernels segment prod i64idx f32 run).- baracuda_
kernels_ ⚠segment_ prod_ i64idx_ f64_ can_ implement baracuda_kernels_segment_prod_i64idx_f64_can_implement(baracuda kernels segment prod i64idx f64 can implement).- baracuda_
kernels_ ⚠segment_ prod_ i64idx_ f64_ run baracuda_kernels_segment_prod_i64idx_f64_run(baracuda kernels segment prod i64idx f64 run).- baracuda_
kernels_ ⚠segment_ sum_ backward_ f32_ can_ implement - Implementability check for
segment_sum_backward_f32. - baracuda_
kernels_ ⚠segment_ sum_ backward_ f32_ run d_input[n, d] = d_output[seg[n], d]. f32.- baracuda_
kernels_ ⚠segment_ sum_ backward_ f64_ can_ implement - Implementability check for
segment_sum_backward_f64. - baracuda_
kernels_ ⚠segment_ sum_ backward_ f64_ run segment_sum_backward— f64.- baracuda_
kernels_ ⚠segment_ sum_ backward_ i64idx_ f32_ can_ implement baracuda_kernels_segment_sum_backward_i64idx_f32_can_implement(baracuda kernels segment sum backward i64idx f32 can implement).- baracuda_
kernels_ ⚠segment_ sum_ backward_ i64idx_ f32_ run baracuda_kernels_segment_sum_backward_i64idx_f32_run(baracuda kernels segment sum backward i64idx f32 run).- baracuda_
kernels_ ⚠segment_ sum_ backward_ i64idx_ f64_ can_ implement baracuda_kernels_segment_sum_backward_i64idx_f64_can_implement(baracuda kernels segment sum backward i64idx f64 can implement).- baracuda_
kernels_ ⚠segment_ sum_ backward_ i64idx_ f64_ run baracuda_kernels_segment_sum_backward_i64idx_f64_run(baracuda kernels segment sum backward i64idx f64 run).- baracuda_
kernels_ ⚠segment_ sum_ f32_ can_ implement - Implementability check for
segment_sum_f32. - baracuda_
kernels_ ⚠segment_ sum_ f32_ run out[s, d] = Σ_{n : seg[n] == s} input[n, d], sorted seg ids (monotonically non-decreasing). f32. - baracuda_
kernels_ ⚠segment_ sum_ f64_ can_ implement - Implementability check for
segment_sum_f64. - baracuda_
kernels_ ⚠segment_ sum_ f64_ run segment_sum— f64.- baracuda_
kernels_ ⚠segment_ sum_ i64idx_ f32_ can_ implement baracuda_kernels_segment_sum_i64idx_f32_can_implement(baracuda kernels segment sum i64idx f32 can implement).- baracuda_
kernels_ ⚠segment_ sum_ i64idx_ f32_ run baracuda_kernels_segment_sum_i64idx_f32_run(baracuda kernels segment sum i64idx f32 run).- baracuda_
kernels_ ⚠segment_ sum_ i64idx_ f64_ can_ implement baracuda_kernels_segment_sum_i64idx_f64_can_implement(baracuda kernels segment sum i64idx f64 can implement).- baracuda_
kernels_ ⚠segment_ sum_ i64idx_ f64_ run baracuda_kernels_segment_sum_i64idx_f64_run(baracuda kernels segment sum i64idx f64 run).- baracuda_
kernels_ ⚠softmax_ backward_ bf16_ can_ implement baracuda_kernels_softmax_backward_bf16_can_implement(baracuda kernels softmax backward bf16 can implement).- baracuda_
kernels_ ⚠softmax_ backward_ bf16_ run - Softmax BW, bf16.
- baracuda_
kernels_ ⚠softmax_ backward_ bf16_ strided_ can_ implement softmax_backward_bf16_strided_can_implement companion. - baracuda_
kernels_ ⚠softmax_ backward_ bf16_ strided_ run - Softmax BW strided sibling, bf16.
- baracuda_
kernels_ ⚠softmax_ backward_ f16_ can_ implement baracuda_kernels_softmax_backward_f16_can_implement(baracuda kernels softmax backward f16 can implement).- baracuda_
kernels_ ⚠softmax_ backward_ f16_ run - Softmax BW, f16.
- baracuda_
kernels_ ⚠softmax_ backward_ f16_ strided_ can_ implement softmax_backward_f16_strided_can_implement companion. - baracuda_
kernels_ ⚠softmax_ backward_ f16_ strided_ run - Softmax BW strided sibling, f16.
- baracuda_
kernels_ ⚠softmax_ backward_ f32_ can_ implement baracuda_kernels_softmax_backward_f32_can_implement(baracuda kernels softmax backward f32 can implement).- baracuda_
kernels_ ⚠softmax_ backward_ f32_ run - Softmax BW, f32.
dx[k] = y[k] * (dy[k] - Σ_j y[j] * dy[j]). Caller passes the saved forward output y. - baracuda_
kernels_ ⚠softmax_ backward_ f32_ strided_ can_ implement softmax_backward_f32_strided_can_implement companion. - baracuda_
kernels_ ⚠softmax_ backward_ f32_ strided_ run - Softmax BW strided sibling, f32.
- baracuda_
kernels_ ⚠softmax_ backward_ f64_ can_ implement baracuda_kernels_softmax_backward_f64_can_implement(baracuda kernels softmax backward f64 can implement).- baracuda_
kernels_ ⚠softmax_ backward_ f64_ run - Softmax BW, f64.
- baracuda_
kernels_ ⚠softmax_ backward_ f64_ strided_ can_ implement softmax_backward_f64_strided_can_implement companion. - baracuda_
kernels_ ⚠softmax_ backward_ f64_ strided_ run - Softmax BW strided sibling, f64.
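As a host-side cross-check for the softmax backward launchers, the documented f32 formula dx[k] = y[k] * (dy[k] - Σ_j y[j] * dy[j]) can be sketched for a single row in plain Rust. The helper name `softmax_backward_row` is hypothetical and not part of this crate; it only illustrates the arithmetic, not the device launch or its argument layout.

```rust
/// CPU reference for softmax backward on one row (hypothetical helper).
/// `y` is the saved forward output, `dy` is the upstream gradient.
fn softmax_backward_row(y: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(y.len(), dy.len());
    // dot = Σ_j y[j] * dy[j]
    let dot: f32 = y.iter().zip(dy).map(|(a, b)| a * b).sum();
    // dx[k] = y[k] * (dy[k] - dot)
    y.iter().zip(dy).map(|(yk, dyk)| yk * (dyk - dot)).collect()
}
```

Because the entries of `y` sum to 1, the entries of the returned gradient sum to 0, which makes a convenient sanity check when comparing against device output.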
- baracuda_
kernels_ ⚠softmax_ bf16_ can_ implement baracuda_kernels_softmax_bf16_can_implement(baracuda kernels softmax bf16 can implement).- baracuda_
kernels_ ⚠softmax_ bf16_ run - Softmax FW, bf16.
- baracuda_
kernels_ ⚠softmax_ bf16_ strided_ can_ implement softmax_bf16_strided_can_implement companion. - baracuda_
kernels_ ⚠softmax_ bf16_ strided_ run - Softmax FW strided sibling, bf16.
- baracuda_
kernels_ ⚠softmax_ f16_ can_ implement baracuda_kernels_softmax_f16_can_implement(baracuda kernels softmax f16 can implement).- baracuda_
kernels_ ⚠softmax_ f16_ run - Softmax FW, f16. f32 accumulator inside the kernel.
- baracuda_
kernels_ ⚠softmax_ f16_ strided_ can_ implement softmax_f16_strided_can_implement companion. - baracuda_
kernels_ ⚠softmax_ f16_ strided_ run - Softmax FW strided sibling, f16.
- baracuda_
kernels_ ⚠softmax_ f32_ can_ implement baracuda_kernels_softmax_f32_can_implement(baracuda kernels softmax f32 can implement).- baracuda_
kernels_ ⚠softmax_ f32_ run - Softmax FW, f32.
y[k] = exp(x[k] - max) / Σ exp(x[j] - max) along softmax_axis. Numerically stable. - baracuda_
kernels_ ⚠softmax_ f32_ strided_ can_ implement softmax_f32_strided_can_implement companion. - baracuda_
kernels_ ⚠softmax_ f32_ strided_ run - Softmax FW strided sibling, f32. Same contract as
baracuda_kernels_softmax_f32_run; identical underlying launcher. - baracuda_
kernels_ ⚠softmax_ f64_ can_ implement baracuda_kernels_softmax_f64_can_implement(baracuda kernels softmax f64 can implement).- baracuda_
kernels_ ⚠softmax_ f64_ run - Softmax FW, f64.
- baracuda_
kernels_ ⚠softmax_ f64_ strided_ can_ implement softmax_f64_strided_can_implement companion. - baracuda_
kernels_ ⚠softmax_ f64_ strided_ run - Softmax FW strided sibling, f64.
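The numerically stable forward formula documented for the f32 softmax launcher, y[k] = exp(x[k] - max) / Σ exp(x[j] - max), can be illustrated with a small CPU reference over one row. This is a minimal sketch; `softmax_row` is a hypothetical name, and the real launchers operate on device buffers along `softmax_axis`.

```rust
/// CPU reference for a numerically stable softmax over one row
/// (hypothetical helper, not part of this crate).
fn softmax_row(x: &[f32]) -> Vec<f32> {
    // Subtracting the row maximum keeps every exponent <= 0,
    // so exp() cannot overflow even for large inputs.
    let max = x.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = x.iter().map(|v| (v - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}
```

Without the max subtraction, an input such as 1000.0 would overflow `exp` in f32; with it, two equal inputs of 1000.0 yield 0.5 each.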
- baracuda_
kernels_ ⚠solve_ f32_ run - Linear-system solve
A · X = B via fused getrf + getrs. a_inout is overwritten with packed LU factors; b_inout is overwritten with the solution X. pivots_out is [n] (1-based per LAPACK convention). - baracuda_
kernels_ ⚠solve_ f32_ workspace_ size - Solve workspace size — uses the
getrf query (cuSOLVER’s getrs is workspace-free). - baracuda_
kernels_ ⚠solve_ f64_ run - Linear-system solve
A · X = B via fused getrf + getrs. a_inout is overwritten with packed LU factors; b_inout is overwritten with the solution X. pivots_out is [n] (1-based per LAPACK convention). - baracuda_
kernels_ ⚠solve_ f64_ workspace_ size - Solve workspace size — uses the
getrf query (cuSOLVER’s getrs is workspace-free). - baracuda_
kernels_ ⚠sort_ backward_ f32_ can_ implement baracuda_kernels_sort_backward_f32_can_implement(baracuda kernels sort backward f32 can implement).- baracuda_
kernels_ ⚠sort_ backward_ f32_ run - Sort BW, f32.
dx[indices[i]] = dy[i]; launcher zeros dx first. - baracuda_
kernels_ ⚠sort_ backward_ f64_ can_ implement baracuda_kernels_sort_backward_f64_can_implement(baracuda kernels sort backward f64 can implement).- baracuda_
kernels_ ⚠sort_ backward_ f64_ run - Sort BW, f64.
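The sort backward rule dx[indices[i]] = dy[i], with dx zeroed first, is a plain scatter through the indices saved by the forward sort. A minimal CPU sketch for one row follows; `sort_backward_row` is a hypothetical name, and using `usize` for the index type is an assumption made for illustration (the device index dtype is not stated here).

```rust
/// CPU reference for sort backward on one row (hypothetical helper).
/// `indices[i]` is the original position of the i-th sorted element.
fn sort_backward_row(dy: &[f32], indices: &[usize], row_len: usize) -> Vec<f32> {
    assert_eq!(dy.len(), indices.len());
    // Mirrors the documented behaviour: dx is zeroed before the scatter.
    let mut dx = vec![0.0f32; row_len];
    for (i, &src) in indices.iter().enumerate() {
        dx[src] = dy[i];
    }
    dx
}
```

For example, with indices [2, 0, 1] the gradient of the first sorted element flows back to input position 2.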
- baracuda_
kernels_ ⚠sort_ f32_ can_ implement baracuda_kernels_sort_f32_can_implement(baracuda kernels sort f32 can implement).- baracuda_
kernels_ ⚠sort_ f32_ run - Block-bitonic sort, f32. Emits sorted values + sorted indices
(saved-indices contract for BW).
descending == 0 means ascending. - baracuda_
kernels_ ⚠sort_ f64_ can_ implement baracuda_kernels_sort_f64_can_implement(baracuda kernels sort f64 can implement).- baracuda_
kernels_ ⚠sort_ f64_ run - Block-bitonic sort, f64.
- baracuda_
kernels_ ⚠sort_ i32_ can_ implement baracuda_kernels_sort_i32_can_implement(baracuda kernels sort i32 can implement).- baracuda_
kernels_ ⚠sort_ i32_ run - Block-bitonic sort, i32.
- baracuda_
kernels_ ⚠sort_ i64_ can_ implement baracuda_kernels_sort_i64_can_implement(baracuda kernels sort i64 can implement).- baracuda_
kernels_ ⚠sort_ i64_ run - Block-bitonic sort, i64.
- baracuda_
kernels_ ⚠sparsemax_ backward_ bf16_ can_ implement baracuda_kernels_sparsemax_backward_bf16_can_implement(baracuda kernels sparsemax backward bf16 can implement).- baracuda_
kernels_ ⚠sparsemax_ backward_ bf16_ run - Sparsemax BW, bf16.
- baracuda_
kernels_ ⚠sparsemax_ backward_ f16_ can_ implement baracuda_kernels_sparsemax_backward_f16_can_implement(baracuda kernels sparsemax backward f16 can implement).- baracuda_
kernels_ ⚠sparsemax_ backward_ f16_ run - Sparsemax BW, f16.
- baracuda_
kernels_ ⚠sparsemax_ backward_ f32_ can_ implement baracuda_kernels_sparsemax_backward_f32_can_implement(baracuda kernels sparsemax backward f32 can implement).- baracuda_
kernels_ ⚠sparsemax_ backward_ f32_ run - Sparsemax BW, f32.
dx[i] = dy[i] - sum_dy_active / n_active for active positions (y > 0), 0 elsewhere. - baracuda_
kernels_ ⚠sparsemax_ backward_ f64_ can_ implement baracuda_kernels_sparsemax_backward_f64_can_implement(baracuda kernels sparsemax backward f64 can implement).- baracuda_
kernels_ ⚠sparsemax_ backward_ f64_ run - Sparsemax BW, f64.
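The sparsemax backward rule documented for the f32 launcher, dx[i] = dy[i] - sum_dy_active / n_active on active positions (y > 0) and 0 elsewhere, can be sketched on the host as follows. The helper `sparsemax_backward_row` is hypothetical, and returning all zeros when no position is active is an assumption made here to avoid a division by zero.

```rust
/// CPU reference for sparsemax backward on one row (hypothetical helper).
/// `y` is the saved forward output, `dy` is the upstream gradient.
fn sparsemax_backward_row(y: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(y.len(), dy.len());
    let mut sum_dy_active = 0.0f32;
    let mut n_active = 0usize;
    for (yi, dyi) in y.iter().zip(dy) {
        if *yi > 0.0 {
            sum_dy_active += *dyi;
            n_active += 1;
        }
    }
    if n_active == 0 {
        // Assumption: no active support means no gradient flows.
        return vec![0.0; y.len()];
    }
    let mean = sum_dy_active / n_active as f32;
    y.iter()
        .zip(dy)
        .map(|(yi, dyi)| if *yi > 0.0 { *dyi - mean } else { 0.0 })
        .collect()
}
```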
- baracuda_
kernels_ ⚠sparsemax_ bf16_ can_ implement baracuda_kernels_sparsemax_bf16_can_implement(baracuda kernels sparsemax bf16 can implement).- baracuda_
kernels_ ⚠sparsemax_ bf16_ run - Sparsemax FW, bf16.
- baracuda_
kernels_ ⚠sparsemax_ f16_ can_ implement baracuda_kernels_sparsemax_f16_can_implement(baracuda kernels sparsemax f16 can implement).- baracuda_
kernels_ ⚠sparsemax_ f16_ run - Sparsemax FW, f16.
- baracuda_
kernels_ ⚠sparsemax_ f32_ can_ implement baracuda_kernels_sparsemax_f32_can_implement(baracuda kernels sparsemax f32 can implement).- baracuda_
kernels_ ⚠sparsemax_ f32_ run - Sparsemax FW, f32.
y = ProjSimplex(x) via threshold τ found after sorting the row descending. Row extent limited to 64. - baracuda_
kernels_ ⚠sparsemax_ f64_ can_ implement baracuda_kernels_sparsemax_f64_can_implement(baracuda kernels sparsemax f64 can implement).- baracuda_
kernels_ ⚠sparsemax_ f64_ run - Sparsemax FW, f64.
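The sparsemax forward pass projects a row onto the probability simplex using a threshold τ found after sorting the row in descending order. The sketch below uses the standard sort-based threshold search as a CPU reference; whether the kernel uses exactly this search is not stated here, so treat it as an assumption. `sparsemax_row` is a hypothetical name, and returning `None` for rows longer than 64 only mirrors the documented row-extent limit.

```rust
/// CPU reference for sparsemax on one row (hypothetical helper).
/// Returns None for empty rows or rows beyond the documented extent of 64.
fn sparsemax_row(x: &[f32]) -> Option<Vec<f32>> {
    if x.is_empty() || x.len() > 64 {
        return None;
    }
    let mut z = x.to_vec();
    z.sort_by(|a, b| b.total_cmp(a)); // descending order
    let mut cumsum = 0.0f32;
    let mut tau = 0.0f32;
    for (j, &zj) in z.iter().enumerate() {
        cumsum += zj;
        let k = (j + 1) as f32;
        // The support condition holds for a prefix of the sorted row;
        // the last k that satisfies it determines the threshold.
        if 1.0 + k * zj > cumsum {
            tau = (cumsum - 1.0) / k;
        }
    }
    // y = max(x - tau, 0), applied in the original (unsorted) order.
    Some(x.iter().map(|&v| (v - tau).max(0.0)).collect())
}
```

The output is non-negative and sums to 1, and unlike softmax it can contain exact zeros.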
- baracuda_
kernels_ ⚠svd_ batched_ f32_ run - Batched Jacobi-SVD on square input. Returns
V (not V^T). When jobz == 0, u_out / v_out may be null. - baracuda_
kernels_ ⚠svd_ batched_ f32_ workspace_ size - Batched Jacobi-SVD workspace size in bytes.
- baracuda_
kernels_ ⚠svd_ batched_ f64_ run - Batched Jacobi-SVD on square input. Returns
V (not V^T). When jobz == 0, u_out / v_out may be null. - baracuda_
kernels_ ⚠svd_ batched_ f64_ workspace_ size - Batched Jacobi-SVD workspace size in bytes.
- baracuda_
kernels_ ⚠svd_ f32_ run - SVD
A = U · diag(S) · V^T. Requires m >= n. a_inout is overwritten by cuSOLVER as scratch. - baracuda_
kernels_ ⚠svd_ f32_ workspace_ size - SVD workspace size in bytes for
gesvd. - baracuda_
kernels_ ⚠svd_ f64_ run - SVD
A = U · diag(S) · V^T. Requires m >= n. a_inout is overwritten by cuSOLVER as scratch. - baracuda_
kernels_ ⚠svd_ f64_ workspace_ size - SVD workspace size in bytes for
gesvd. - baracuda_
kernels_ ⚠svda_ batched_ f32_ run - Approximate (Jacobi-bidiagonal) batched SVD on rectangular
input. Returns
V (not V^T). The h_r_nrm_f_out buffer is host-resident and receives per-slot residual Frobenius norms (cuSOLVER signature). Passing null is meant to discard them, but cuSOLVER may dereference the pointer even then, so callers should pass a real buffer of batch_size f64s. - baracuda_
kernels_ ⚠svda_ batched_ f32_ workspace_ size - Approximate batched SVD workspace size in bytes.
- baracuda_
kernels_ ⚠svda_ batched_ f64_ run - Approximate (Jacobi-bidiagonal) batched SVD on rectangular
input. Returns
V (not V^T). The h_r_nrm_f_out buffer is host-resident and receives per-slot residual Frobenius norms (cuSOLVER signature). Passing null is meant to discard them, but cuSOLVER may dereference the pointer even then, so callers should pass a real buffer of batch_size f64s. - baracuda_
kernels_ ⚠svda_ batched_ f64_ workspace_ size - Approximate batched SVD workspace size in bytes.
- baracuda_
kernels_ ⚠ternary_ addcdiv_ backward_ bf16_ can_ implement - Pre-launch check for
ternary_addcdiv_backward_bf16. - baracuda_
kernels_ ⚠ternary_ addcdiv_ backward_ bf16_ run - Addcdiv backward, bf16.
- baracuda_
kernels_ ⚠ternary_ addcdiv_ backward_ f16_ can_ implement - Pre-launch check for
ternary_addcdiv_backward_f16. - baracuda_
kernels_ ⚠ternary_ addcdiv_ backward_ f16_ run - Addcdiv backward, f16.
- baracuda_
kernels_ ⚠ternary_ addcdiv_ backward_ f32_ can_ implement - Pre-launch check for
ternary_addcdiv_backward_f32. - baracuda_
kernels_ ⚠ternary_ addcdiv_ backward_ f32_ run - Addcdiv backward, f32. Reads
desc.scale. Writes da = dy, db = dy*scale/c, dc = -dy*scale*b/c². - baracuda_
kernels_ ⚠ternary_ addcdiv_ backward_ f64_ can_ implement - Pre-launch check for
ternary_addcdiv_backward_f64. - baracuda_
kernels_ ⚠ternary_ addcdiv_ backward_ f64_ run - Addcdiv backward, f64.
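The three addcdiv gradients documented for the f32 launcher (da = dy, db = dy*scale/c, dc = -dy*scale*b/c²) are the partial derivatives of y = a + scale * (b / c). A per-element CPU sketch is shown below; `addcdiv_backward` is a hypothetical name, and passing `scale` as a plain argument stands in for the `desc.scale` field, whose descriptor layout is not shown here.

```rust
/// Per-element CPU reference for addcdiv backward (hypothetical helper).
/// Returns (da, db, dc) for y = a + scale * (b / c).
fn addcdiv_backward(dy: f32, b: f32, c: f32, scale: f32) -> (f32, f32, f32) {
    let da = dy; // ∂y/∂a = 1
    let db = dy * scale / c; // ∂y/∂b = scale / c
    let dc = -dy * scale * b / (c * c); // ∂y/∂c = -scale * b / c²
    (da, db, dc)
}
```

Note that `a` does not appear in any gradient, so only `b`, `c`, and `dy` need to be available on the backward pass.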
- baracuda_
kernels_ ⚠ternary_ addcdiv_ bf16_ can_ implement - Pre-launch check for
addcdiv_bf16. - baracuda_
kernels_ ⚠ternary_ addcdiv_ bf16_ run addcdiv, bf16, contig.- baracuda_
kernels_ ⚠ternary_ addcdiv_ bf16_ strided_ can_ implement - Pre-launch check for
addcdiv_bf16_strided. - baracuda_
kernels_ ⚠ternary_ addcdiv_ bf16_ strided_ run addcdiv, bf16, strided.- baracuda_
kernels_ ⚠ternary_ addcdiv_ f16_ can_ implement - Pre-launch check for
addcdiv_f16. - baracuda_
kernels_ ⚠ternary_ addcdiv_ f16_ run addcdiv, f16, contig.- baracuda_
kernels_ ⚠ternary_ addcdiv_ f16_ strided_ can_ implement - Pre-launch check for
addcdiv_f16_strided. - baracuda_
kernels_ ⚠ternary_ addcdiv_ f16_ strided_ run addcdiv, f16, strided.- baracuda_
kernels_ ⚠ternary_ addcdiv_ f32_ can_ implement - Pre-launch check for
addcdiv_f32. - baracuda_
kernels_ ⚠ternary_ addcdiv_ f32_ run y = a + scale * (b / c), f32, contig.- baracuda_
kernels_ ⚠ternary_ addcdiv_ f32_ strided_ can_ implement - Pre-launch check for
addcdiv_f32_strided. - baracuda_
kernels_ ⚠ternary_ addcdiv_ f32_ strided_ run addcdiv, f32, strided.- baracuda_
kernels_ ⚠ternary_ addcdiv_ f64_ can_ implement - Pre-launch check for
addcdiv_f64. - baracuda_
kernels_ ⚠ternary_ addcdiv_ f64_ run addcdiv, f64, contig.- baracuda_
kernels_ ⚠ternary_ addcdiv_ f64_ strided_ can_ implement - Pre-launch check for
addcdiv_f64_strided. - baracuda_
kernels_ ⚠ternary_ addcdiv_ f64_ strided_ run addcdiv, f64, strided.- baracuda_
kernels_ ⚠ternary_ addcmul_ backward_ bf16_ can_ implement - Pre-launch check for
ternary_addcmul_backward_bf16. - baracuda_
kernels_ ⚠ternary_ addcmul_ backward_ bf16_ run - Addcmul backward, bf16.
- baracuda_
kernels_ ⚠ternary_ addcmul_ backward_ f16_ can_ implement - Pre-launch check for
ternary_addcmul_backward_f16. - baracuda_
kernels_ ⚠ternary_ addcmul_ backward_ f16_ run - Addcmul backward, f16.
- baracuda_
kernels_ ⚠ternary_ addcmul_ backward_ f32_ can_ implement - Pre-launch check for
ternary_addcmul_backward_f32. - baracuda_
kernels_ ⚠ternary_ addcmul_ backward_ f32_ run - Addcmul backward, f32. Reads
desc.scale. Writes da = dy, db = dy*scale*c, dc = dy*scale*b. - baracuda_
kernels_ ⚠ternary_ addcmul_ backward_ f64_ can_ implement - Pre-launch check for
ternary_addcmul_backward_f64. - baracuda_
kernels_ ⚠ternary_ addcmul_ backward_ f64_ run - Addcmul backward, f64.
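For addcmul, the documented gradients da = dy, db = dy*scale*c, dc = dy*scale*b follow from y = a + scale * b * c. The sketch below pairs a forward and a backward helper so the two can be checked against each other on the host; both function names are hypothetical, and `scale` is passed directly rather than through a descriptor.

```rust
/// Per-element CPU reference for addcmul forward (hypothetical helper).
fn addcmul_forward(a: f32, b: f32, c: f32, scale: f32) -> f32 {
    a + scale * b * c
}

/// Per-element CPU reference for addcmul backward (hypothetical helper).
/// Returns (da, db, dc).
fn addcmul_backward(dy: f32, b: f32, c: f32, scale: f32) -> (f32, f32, f32) {
    (dy, dy * scale * c, dy * scale * b)
}
```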
- baracuda_
kernels_ ⚠ternary_ addcmul_ bf16_ can_ implement - Pre-launch check for
addcmul_bf16. - baracuda_
kernels_ ⚠ternary_ addcmul_ bf16_ run addcmul, bf16, contig.- baracuda_
kernels_ ⚠ternary_ addcmul_ bf16_ strided_ can_ implement - Pre-launch check for
addcmul_bf16_strided. - baracuda_
kernels_ ⚠ternary_ addcmul_ bf16_ strided_ run addcmul, bf16, strided.- baracuda_
kernels_ ⚠ternary_ addcmul_ f16_ can_ implement - Pre-launch check for
addcmul_f16. - baracuda_
kernels_ ⚠ternary_ addcmul_ f16_ run addcmul, f16, contig.- baracuda_
kernels_ ⚠ternary_ addcmul_ f16_ strided_ can_ implement - Pre-launch check for
addcmul_f16_strided. - baracuda_
kernels_ ⚠ternary_ addcmul_ f16_ strided_ run addcmul, f16, strided.- baracuda_
kernels_ ⚠ternary_ addcmul_ f32_ can_ implement - Pre-launch implementability check for
addcmul_f32. - baracuda_
kernels_ ⚠ternary_ addcmul_ f32_ run y = a + scale * b * c, f32, contig fast path.- baracuda_
kernels_ ⚠ternary_ addcmul_ f32_ strided_ can_ implement - Pre-launch check for
addcmul_f32_strided. - baracuda_
kernels_ ⚠ternary_ addcmul_ f32_ strided_ run y = a + scale * b * c, f32, strided / broadcast path.- baracuda_
kernels_ ⚠ternary_ addcmul_ f64_ can_ implement - Pre-launch check for
addcmul_f64. - baracuda_
kernels_ ⚠ternary_ addcmul_ f64_ run addcmul, f64, contig.- baracuda_
kernels_ ⚠ternary_ addcmul_ f64_ strided_ can_ implement - Pre-launch check for
addcmul_f64_strided. - baracuda_
kernels_ ⚠ternary_ addcmul_ f64_ strided_ run addcmul, f64, strided.- baracuda_
kernels_ ⚠ternary_ clamp_ backward_ bf16_ can_ implement - Pre-launch check for
ternary_clamp_backward_bf16. - baracuda_
kernels_ ⚠ternary_ clamp_ backward_ bf16_ run - Clamp backward, bf16.
- baracuda_
kernels_ ⚠ternary_ clamp_ backward_ f16_ can_ implement - Pre-launch check for
ternary_clamp_backward_f16. - baracuda_
kernels_ ⚠ternary_ clamp_ backward_ f16_ run - Clamp backward, f16.
- baracuda_
kernels_ ⚠ternary_ clamp_ backward_ f32_ can_ implement - Pre-launch check for
ternary_clamp_backward_f32. - baracuda_
kernels_ ⚠ternary_ clamp_ backward_ f32_ run - Clamp backward, f32. Writes mask × dy per axis (a/b/c).
- baracuda_
kernels_ ⚠ternary_ clamp_ backward_ f64_ can_ implement - Pre-launch check for
ternary_clamp_backward_f64. - baracuda_
kernels_ ⚠ternary_ clamp_ backward_ f64_ run - Clamp backward, f64.
- baracuda_
kernels_ ⚠ternary_ clamp_ bf16_ can_ implement - Pre-launch implementability check for
ternary_clamp_bf16. - baracuda_
kernels_ ⚠ternary_ clamp_ bf16_ run - Ternary elementwise
clamp, bf16, contig fast path. - baracuda_
kernels_ ⚠ternary_ clamp_ bf16_ strided_ can_ implement - Pre-launch implementability check for
ternary_clamp_bf16_strided. - baracuda_
kernels_ ⚠ternary_ clamp_ bf16_ strided_ run - Ternary elementwise
clamp, bf16, strided / broadcast path. - baracuda_
kernels_ ⚠ternary_ clamp_ f16_ can_ implement - Pre-launch implementability check for
ternary_clamp_f16. - baracuda_
kernels_ ⚠ternary_ clamp_ f16_ run - Ternary elementwise
clamp, f16, contig fast path. - baracuda_
kernels_ ⚠ternary_ clamp_ f16_ strided_ can_ implement - Pre-launch implementability check for
ternary_clamp_f16_strided. - baracuda_
kernels_ ⚠ternary_ clamp_ f16_ strided_ run - Ternary elementwise
clamp, f16, strided / broadcast path. - baracuda_
kernels_ ⚠ternary_ clamp_ f32_ can_ implement - Pre-launch implementability check for
ternary_clamp_f32. - baracuda_
kernels_ ⚠ternary_ clamp_ f32_ run - Ternary elementwise
clamp, f32, contig fast path. - baracuda_
kernels_ ⚠ternary_ clamp_ f32_ strided_ can_ implement - Pre-launch implementability check for
ternary_clamp_f32_strided. - baracuda_
kernels_ ⚠ternary_ clamp_ f32_ strided_ run - Ternary elementwise
clamp, f32, strided / broadcast path. This is the ternary-strided trailblazer — its safety contract (including aliasing) carries over to every ternary strided launcher across all dtypes. - baracuda_
kernels_ ⚠ternary_ clamp_ f64_ can_ implement - Pre-launch implementability check for
ternary_clamp_f64. - baracuda_
kernels_ ⚠ternary_ clamp_ f64_ run - Ternary elementwise
clamp, f64, contig fast path. - baracuda_
kernels_ ⚠ternary_ clamp_ f64_ strided_ can_ implement - Pre-launch implementability check for
ternary_clamp_f64_strided. - baracuda_
kernels_ ⚠ternary_ clamp_ f64_ strided_ run - Ternary elementwise
clamp, f64, strided / broadcast path. - baracuda_
kernels_ ⚠ternary_ fma_ backward_ bf16_ can_ implement - Pre-launch check for
ternary_fma_backward_bf16. - baracuda_
kernels_ ⚠ternary_ fma_ backward_ bf16_ run - Fma backward, bf16.
- baracuda_
kernels_ ⚠ternary_ fma_ backward_ f16_ can_ implement - Pre-launch check for
ternary_fma_backward_f16. - baracuda_
kernels_ ⚠ternary_ fma_ backward_ f16_ run - Fma backward, f16.
- baracuda_
kernels_ ⚠ternary_ fma_ backward_ f32_ can_ implement - Pre-launch check for
ternary_fma_backward_f32. - baracuda_
kernels_ ⚠ternary_ fma_ backward_ f32_ run - Fma backward, f32. Writes
da = dy*b, db = dy*a, dc = dy. - baracuda_
kernels_ ⚠ternary_ fma_ backward_ f64_ can_ implement - Pre-launch check for
ternary_fma_backward_f64. - baracuda_
kernels_ ⚠ternary_ fma_ backward_ f64_ run - Fma backward, f64.
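The fma gradients documented for the f32 launcher (da = dy*b, db = dy*a, dc = dy) are consistent with a forward of y = a * b + c; that forward form is inferred from the gradients rather than quoted from the listing. A per-element CPU sketch, with the hypothetical name `fma_backward`:

```rust
/// Per-element CPU reference for fma backward (hypothetical helper).
/// Assumes the forward is y = a * b + c. Returns (da, db, dc).
fn fma_backward(dy: f32, a: f32, b: f32) -> (f32, f32, f32) {
    let da = dy * b; // ∂y/∂a = b
    let db = dy * a; // ∂y/∂b = a
    let dc = dy; // ∂y/∂c = 1
    (da, db, dc)
}
```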
- baracuda_
kernels_ ⚠ternary_ fma_ bf16_ can_ implement - Pre-launch implementability check for
ternary_fma_bf16. - baracuda_
kernels_ ⚠ternary_ fma_ bf16_ run - Ternary elementwise
fma, bf16, contig fast path. - baracuda_
kernels_ ⚠ternary_ fma_ bf16_ strided_ can_ implement - Pre-launch implementability check for
ternary_fma_bf16_strided. - baracuda_
kernels_ ⚠ternary_ fma_ bf16_ strided_ run - Ternary elementwise
fma, bf16, strided / broadcast path. - baracuda_
kernels_ ⚠ternary_ fma_ f16_ can_ implement - Pre-launch implementability check for
ternary_fma_f16. - baracuda_
kernels_ ⚠ternary_ fma_ f16_ run - Ternary elementwise
fma, f16, contig fast path. - baracuda_
kernels_ ⚠ternary_ fma_ f16_ strided_ can_ implement - Pre-launch implementability check for
ternary_fma_f16_strided. - baracuda_
kernels_ ⚠ternary_ fma_ f16_ strided_ run - Ternary elementwise
fma, f16, strided / broadcast path. - baracuda_
kernels_ ⚠ternary_ fma_ f32_ can_ implement - Pre-launch implementability check for
ternary_fma_f32. - baracuda_
kernels_ ⚠ternary_ fma_ f32_ run - Ternary elementwise
fma, f32, contig fast path. - baracuda_
kernels_ ⚠ternary_ fma_ f32_ strided_ can_ implement - Pre-launch implementability check for
ternary_fma_f32_strided. - baracuda_
kernels_ ⚠ternary_ fma_ f32_ strided_ run - Ternary elementwise
fma, f32, strided / broadcast path. - baracuda_
kernels_ ⚠ternary_ fma_ f64_ can_ implement - Pre-launch implementability check for
ternary_fma_f64. - baracuda_
kernels_ ⚠ternary_ fma_ f64_ run - Ternary elementwise
fma, f64, contig fast path. - baracuda_
kernels_ ⚠ternary_ fma_ f64_ strided_ can_ implement - Pre-launch implementability check for
ternary_fma_f64_strided. - baracuda_
kernels_ ⚠ternary_ fma_ f64_ strided_ run - Ternary elementwise
fma, f64, strided / broadcast path. - baracuda_
kernels_ ⚠topk_ backward_ f32_ can_ implement baracuda_kernels_topk_backward_f32_can_implement(baracuda kernels topk backward f32 can implement).- baracuda_
kernels_ ⚠topk_ backward_ f32_ run - Top-k BW, f32. Scatter k-wide
dy into row_len-wide dx (zero-init) via saved indices. - baracuda_
kernels_ ⚠topk_ backward_ f64_ can_ implement baracuda_kernels_topk_backward_f64_can_implement(baracuda kernels topk backward f64 can implement).- baracuda_
kernels_ ⚠topk_ backward_ f64_ run - Top-k BW, f64.
- baracuda_
kernels_ ⚠topk_ f32_ can_ implement baracuda_kernels_topk_f32_can_implement(baracuda kernels topk f32 can implement).- baracuda_
kernels_ ⚠topk_ f32_ run - Block-bitonic top-k, f32. Caps
k ≤ 64 and row_len ≤ 1024. largest == 1 means top-k by value; largest == 0 means bottom-k. - baracuda_
kernels_ ⚠topk_ f64_ can_ implement baracuda_kernels_topk_f64_can_implement(baracuda kernels topk f64 can implement).- baracuda_
kernels_ ⚠topk_ f64_ run - Block-bitonic top-k, f64.
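To make the top-k contract concrete, here is a minimal CPU sketch of the forward selection and the matching backward scatter for one row. Both helper names are hypothetical. Returning `None` outside k ≤ 64 and row_len ≤ 1024 only mirrors the documented caps; rejecting k > row_len, the `usize` index type, and the sorted output order are assumptions for illustration, and a full host sort stands in for the block-bitonic network.

```rust
/// CPU reference for top-k on one row (hypothetical helper).
/// Returns (values, indices), or None outside the documented caps.
fn topk_row(x: &[f32], k: usize, largest: bool) -> Option<(Vec<f32>, Vec<usize>)> {
    if k > 64 || x.len() > 1024 || k > x.len() {
        return None;
    }
    let mut idx: Vec<usize> = (0..x.len()).collect();
    // total_cmp gives a total order over f32, so NaN inputs cannot panic.
    idx.sort_by(|&a, &b| {
        let ord = x[a].total_cmp(&x[b]);
        if largest { ord.reverse() } else { ord }
    });
    idx.truncate(k);
    let vals: Vec<f32> = idx.iter().map(|&i| x[i]).collect();
    Some((vals, idx))
}

/// CPU reference for top-k backward: scatter the k-wide `dy` into a
/// zero-initialised `row_len`-wide `dx` through the saved indices.
fn topk_backward_row(dy: &[f32], indices: &[usize], row_len: usize) -> Vec<f32> {
    assert_eq!(dy.len(), indices.len());
    let mut dx = vec![0.0f32; row_len];
    for (&g, &i) in dy.iter().zip(indices) {
        dx[i] = g;
    }
    dx
}
```

Positions that were not selected by the forward pass receive a zero gradient, which is why the indices must be saved between the two calls.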
- baracuda_
kernels_ ⚠trace_ bf16_ can_ implement baracuda_kernels_trace_bf16_can_implement(baracuda kernels trace bf16 can implement).- baracuda_
kernels_ ⚠trace_ bf16_ run - Trace, bf16 (f32-detour accumulator).
- baracuda_
kernels_ ⚠trace_ f16_ can_ implement baracuda_kernels_trace_f16_can_implement(baracuda kernels trace f16 can implement).- baracuda_
kernels_ ⚠trace_ f16_ run - Trace, f16 (f32-detour accumulator).
- baracuda_
kernels_ ⚠trace_ f32_ can_ implement baracuda_kernels_trace_f32_can_implement(baracuda kernels trace f32 can implement).- baracuda_
kernels_ ⚠trace_ f32_ run - Trace of a 2-D square matrix, f32.
y[0] = Σ x[i * stride_row + i * stride_col] for i in 0..rows. Output is a single scalar. - baracuda_
kernels_ ⚠trace_ f64_ can_ implement baracuda_kernels_trace_f64_can_implement(baracuda kernels trace f64 can implement).- baracuda_
kernels_ ⚠trace_ f64_ run - Trace, f64.
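The strided trace formula documented for the f32 launcher, y[0] = Σ x[i * stride_row + i * stride_col] for i in 0..rows, can be sketched on the host as below. `trace_strided` is a hypothetical name, and measuring strides in elements (not bytes) is an assumption made for this illustration.

```rust
/// CPU reference for the trace of a square matrix stored with arbitrary
/// row and column strides, in elements (hypothetical helper).
fn trace_strided(x: &[f32], rows: usize, stride_row: usize, stride_col: usize) -> f32 {
    (0..rows)
        .map(|i| x[i * stride_row + i * stride_col])
        .sum()
}
```

Because only diagonal elements are read, a row-major matrix and its transposed view (strides swapped) give the same result.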
- baracuda_
kernels_ ⚠tril_ bf16_ can_ implement - Implementability check for
tril_bf16. - baracuda_
kernels_ ⚠tril_ bf16_ run - Tril, bf16.
- baracuda_
kernels_ ⚠tril_ bf16_ strided_ can_ implement - Implementability check for
tril_bf16_strided. - baracuda_
kernels_ ⚠tril_ bf16_ strided_ run - Tril strided, bf16.
- baracuda_
kernels_ ⚠tril_ bool_ can_ implement - Implementability check for
tril_bool. - baracuda_
kernels_ ⚠tril_ bool_ run - Tril, Bool (storage = u8).
- baracuda_
kernels_ ⚠tril_ bool_ strided_ can_ implement - Implementability check for
tril_bool_strided. - baracuda_
kernels_ ⚠tril_ bool_ strided_ run - Tril strided, Bool (storage = u8).
- baracuda_
kernels_ ⚠tril_ f16_ can_ implement - Implementability check for
tril_f16. - baracuda_
kernels_ ⚠tril_ f16_ run - Tril, f16.
- baracuda_
kernels_ ⚠tril_ f16_ strided_ can_ implement - Implementability check for
tril_f16_strided. - baracuda_
kernels_ ⚠tril_ f16_ strided_ run - Tril strided, f16.
- baracuda_
kernels_ ⚠tril_ f32_ can_ implement - Implementability check for
tril_f32. - baracuda_
kernels_ ⚠tril_ f32_ run - Tril, f32.
- baracuda_
kernels_ ⚠tril_ f32_ strided_ can_ implement - Implementability check for
tril_f32_strided. - baracuda_
kernels_ ⚠tril_ f32_ strided_ run - Tril strided, f32.
- baracuda_
kernels_ ⚠tril_ f64_ can_ implement - Implementability check for
tril_f64. - baracuda_
kernels_ ⚠tril_ f64_ run - Tril, f64.
- baracuda_
kernels_ ⚠tril_ f64_ strided_ can_ implement - Implementability check for
tril_f64_strided. - baracuda_
kernels_ ⚠tril_ f64_ strided_ run - Tril strided, f64.
- baracuda_
kernels_ ⚠tril_ i32_ can_ implement - Implementability check for
tril_i32. - baracuda_
kernels_ ⚠tril_ i32_ run - Tril, i32.
- baracuda_
kernels_ ⚠tril_ i32_ strided_ can_ implement - Implementability check for
tril_i32_strided. - baracuda_
kernels_ ⚠tril_ i32_ strided_ run - Tril strided, i32.
- baracuda_
kernels_ ⚠tril_ i64_ can_ implement - Implementability check for
tril_i64. - baracuda_
kernels_ ⚠tril_ i64_ run - Tril, i64.
- baracuda_
kernels_ ⚠tril_ i64_ strided_ can_ implement - Implementability check for
tril_i64_strided. - baracuda_
kernels_ ⚠tril_ i64_ strided_ run - Tril strided, i64.
- baracuda_
kernels_ ⚠triu_ bf16_ can_ implement - Implementability check for
triu_bf16. - baracuda_
kernels_ ⚠triu_ bf16_ run - Triu, bf16.
- baracuda_
kernels_ ⚠triu_ bf16_ strided_ can_ implement - Implementability check for
triu_bf16_strided. - baracuda_
kernels_ ⚠triu_ bf16_ strided_ run - Triu strided, bf16.
- baracuda_
kernels_ ⚠triu_ bool_ can_ implement - Implementability check for
triu_bool. - baracuda_
kernels_ ⚠triu_ bool_ run - Triu, Bool (storage = u8).
- baracuda_
kernels_ ⚠triu_ bool_ strided_ can_ implement - Implementability check for
triu_bool_strided. - baracuda_
kernels_ ⚠triu_ bool_ strided_ run - Triu strided, Bool (storage = u8).
- baracuda_
kernels_ ⚠triu_ f16_ can_ implement - Implementability check for
triu_f16. - baracuda_
kernels_ ⚠triu_ f16_ run - Triu, f16.
- baracuda_
kernels_ ⚠triu_ f16_ strided_ can_ implement - Implementability check for
triu_f16_strided. - baracuda_
kernels_ ⚠triu_ f16_ strided_ run - Triu strided, f16.
- baracuda_
kernels_ ⚠triu_ f32_ can_ implement - Implementability check for
triu_f32. - baracuda_
kernels_ ⚠triu_ f32_ run - Triu, f32. This is the triu trailblazer — its aliasing contract
carries over to every other
triu_<dt>_run, triu_<dt>_strided_run, and the sibling tril_* family. - baracuda_
kernels_ ⚠triu_ f32_ strided_ can_ implement - Implementability check for
triu_f32_strided. - baracuda_
kernels_ ⚠triu_ f32_ strided_ run - Triu strided, f32.
- baracuda_
kernels_ ⚠triu_ f64_ can_ implement - Implementability check for
triu_f64. - baracuda_
kernels_ ⚠triu_ f64_ run - Triu, f64.
- baracuda_
kernels_ ⚠triu_ f64_ strided_ can_ implement - Implementability check for
triu_f64_strided. - baracuda_
kernels_ ⚠triu_ f64_ strided_ run - Triu strided, f64.
- baracuda_
kernels_ ⚠triu_ i32_ can_ implement - Implementability check for
triu_i32. - baracuda_
kernels_ ⚠triu_ i32_ run - Triu, i32.
- baracuda_
kernels_ ⚠triu_ i32_ strided_ can_ implement - Implementability check for
triu_i32_strided. - baracuda_
kernels_ ⚠triu_ i32_ strided_ run - Triu strided, i32.
- baracuda_
kernels_ ⚠triu_ i64_ can_ implement - Implementability check for
triu_i64. - baracuda_
kernels_ ⚠triu_ i64_ run - Triu, i64.
- baracuda_
kernels_ ⚠triu_ i64_ strided_ can_ implement - Implementability check for
triu_i64_strided. - baracuda_
kernels_ ⚠triu_ i64_ strided_ run - Triu strided, i64.
- baracuda_kernels_unary_abs_bf16_can_implement ⚠ - Pre-launch implementability check for unary_abs_bf16.
- baracuda_kernels_unary_abs_bf16_run ⚠ - Unary elementwise abs, bf16 dtype, contiguous fast path.
- baracuda_kernels_unary_abs_bf16_strided_can_implement ⚠ - Pre-launch implementability check for unary_abs_bf16_strided.
- baracuda_kernels_unary_abs_bf16_strided_run ⚠ - Unary elementwise abs, bf16 dtype, strided path.
- baracuda_kernels_unary_abs_f16_can_implement ⚠ - Pre-launch implementability check for unary_abs_f16.
- baracuda_kernels_unary_abs_f16_run ⚠ - Unary elementwise abs, f16 dtype, contiguous fast path.
- baracuda_kernels_unary_abs_f16_strided_can_implement ⚠ - Pre-launch implementability check for unary_abs_f16_strided.
- baracuda_kernels_unary_abs_f16_strided_run ⚠ - Unary elementwise abs, f16 dtype, strided path.
- baracuda_kernels_unary_abs_f32_can_implement ⚠ - Pre-launch implementability check for unary_abs_f32.
- baracuda_kernels_unary_abs_f32_run ⚠ - Unary elementwise abs, f32 dtype, contiguous fast path.
- baracuda_kernels_unary_abs_f32_strided_can_implement ⚠ - Pre-launch implementability check for unary_abs_f32_strided.
- baracuda_kernels_unary_abs_f32_strided_run ⚠ - Unary elementwise abs, f32 dtype, strided path.
- baracuda_kernels_unary_abs_f64_can_implement ⚠ - Pre-launch implementability check for unary_abs_f64.
- baracuda_kernels_unary_abs_f64_run ⚠ - Unary elementwise abs, f64 dtype, contiguous fast path.
- baracuda_kernels_unary_abs_f64_strided_can_implement ⚠ - Pre-launch implementability check for unary_abs_f64_strided.
- baracuda_kernels_unary_abs_f64_strided_run ⚠ - Unary elementwise abs, f64 dtype, strided path.
- baracuda_kernels_unary_acos_backward_bf16_can_implement ⚠ - Pre-launch implementability check for unary_acos_backward_bf16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_acos_backward_bf16_run ⚠ - Acos backward, bf16.
- baracuda_kernels_unary_acos_backward_f16_can_implement ⚠ - Pre-launch implementability check for unary_acos_backward_f16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_acos_backward_f16_run ⚠ - Acos backward, f16.
- baracuda_kernels_unary_acos_backward_f32_can_implement ⚠ - Pre-launch implementability check for unary_acos_backward_f32. Host-side validation; no kernel launch.
- baracuda_kernels_unary_acos_backward_f32_run ⚠ - Acos backward, f32. dx = -dy / sqrt(1 - x²). Saved-x. Domain: |x| < 1.
- baracuda_kernels_unary_acos_backward_f64_can_implement ⚠ - Pre-launch implementability check for unary_acos_backward_f64. Host-side validation; no kernel launch.
- baracuda_kernels_unary_acos_backward_f64_run ⚠ - Acos backward, f64.
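The acos backward formula listed above, dx = -dy / sqrt(1 - x²), can be reproduced on the host to validate device results. The following is a minimal sketch, not the crate's implementation: acos_backward_ref and acos_backward_slice are hypothetical helper names, and the behaviour at and beyond the domain boundary is simply what IEEE arithmetic gives this expression, which the kernels may or may not match.

```rust
/// Host-side reference for the documented acos backward formula:
/// dx = -dy / sqrt(1 - x^2), evaluated from the saved input x.
/// Hypothetical helper for illustration; not part of baracuda-kernels-sys.
fn acos_backward_ref(x: f32, dy: f32) -> f32 {
    -dy / (1.0 - x * x).sqrt()
}

/// Applies the reference formula elementwise over host slices.
/// All three slices must have the same length.
fn acos_backward_slice(x: &[f32], dy: &[f32], dx: &mut [f32]) {
    assert_eq!(x.len(), dy.len());
    assert_eq!(x.len(), dx.len());
    for i in 0..x.len() {
        dx[i] = acos_backward_ref(x[i], dy[i]);
    }
}
```

In this sketch, |x| = 1 produces an infinite gradient and |x| > 1 produces NaN, which is why the listing states the domain |x| < 1.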
- baracuda_kernels_unary_acos_bf16_can_implement ⚠ - Pre-launch implementability check for unary_acos_bf16.
- baracuda_kernels_unary_acos_bf16_run ⚠ - Unary elementwise acos, bf16 dtype, contiguous fast path.
- baracuda_kernels_unary_acos_bf16_strided_can_implement ⚠ - Pre-launch implementability check for unary_acos_bf16_strided.
- baracuda_kernels_unary_acos_bf16_strided_run ⚠ - Unary elementwise acos, bf16 dtype, strided path.
- baracuda_kernels_unary_acos_f16_can_implement ⚠ - Pre-launch implementability check for unary_acos_f16.
- baracuda_kernels_unary_acos_f16_run ⚠ - Unary elementwise acos, f16 dtype, contiguous fast path.
- baracuda_kernels_unary_acos_f16_strided_can_implement ⚠ - Pre-launch implementability check for unary_acos_f16_strided.
- baracuda_kernels_unary_acos_f16_strided_run ⚠ - Unary elementwise acos, f16 dtype, strided path.
- baracuda_kernels_unary_acos_f32_can_implement ⚠ - Pre-launch implementability check for unary_acos_f32.
- baracuda_kernels_unary_acos_f32_run ⚠ - Unary elementwise acos, f32 dtype, contiguous fast path.
- baracuda_kernels_unary_acos_f32_strided_can_implement ⚠ - Pre-launch implementability check for unary_acos_f32_strided.
- baracuda_kernels_unary_acos_f32_strided_run ⚠ - Unary elementwise acos, f32 dtype, strided path.
- baracuda_kernels_unary_acos_f64_can_implement ⚠ - Pre-launch implementability check for unary_acos_f64.
- baracuda_kernels_unary_acos_f64_run ⚠ - Unary elementwise acos, f64 dtype, contiguous fast path.
- baracuda_kernels_unary_acos_f64_strided_can_implement ⚠ - Pre-launch implementability check for unary_acos_f64_strided.
- baracuda_kernels_unary_acos_f64_strided_run ⚠ - Unary elementwise acos, f64 dtype, strided path.
- baracuda_kernels_unary_acosh_backward_bf16_can_implement ⚠ - Pre-launch implementability check for unary_acosh_backward_bf16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_acosh_backward_bf16_run ⚠ - Acosh backward, bf16.
- baracuda_kernels_unary_acosh_backward_f16_can_implement ⚠ - Pre-launch implementability check for unary_acosh_backward_f16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_acosh_backward_f16_run ⚠ - Acosh backward, f16.
- baracuda_kernels_unary_acosh_backward_f32_can_implement ⚠ - Pre-launch implementability check for unary_acosh_backward_f32. Host-side validation; no kernel launch.
- baracuda_kernels_unary_acosh_backward_f32_run ⚠ - Acosh backward, f32. dx = dy / sqrt(x² - 1). Saved-x. Domain: x > 1.
- baracuda_kernels_unary_acosh_backward_f64_can_implement ⚠ - Pre-launch implementability check for unary_acosh_backward_f64. Host-side validation; no kernel launch.
- baracuda_kernels_unary_acosh_backward_f64_run ⚠ - Acosh backward, f64.
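One way to sanity-check the acosh backward formula above, dx = dy / sqrt(x² - 1), is to compare it with a central finite difference of the standard library's f64::acosh. This is an illustrative host-side sketch; acosh_backward_ref and acosh_finite_diff are hypothetical names, and f64 is used only to keep the numerical comparison tight.

```rust
/// Documented acosh backward formula, evaluated from the saved input x:
/// dx = dy / sqrt(x^2 - 1). Meaningful for x > 1.
/// Hypothetical helper for illustration; not part of baracuda-kernels-sys.
fn acosh_backward_ref(x: f64, dy: f64) -> f64 {
    dy / (x * x - 1.0).sqrt()
}

/// Central finite-difference estimate of d/dx acosh(x) with step h.
/// Both x - h and x + h must stay above 1 for the estimate to be finite.
fn acosh_finite_diff(x: f64, h: f64) -> f64 {
    ((x + h).acosh() - (x - h).acosh()) / (2.0 * h)
}
```

With dy = 1, the two functions should agree closely for any x comfortably inside the domain, for example x = 2.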
- baracuda_kernels_unary_acosh_bf16_can_implement ⚠ - Pre-launch implementability check for unary_acosh_bf16.
- baracuda_kernels_unary_acosh_bf16_run ⚠ - Unary elementwise acosh, bf16 dtype, contiguous fast path.
- baracuda_kernels_unary_acosh_bf16_strided_can_implement ⚠ - Pre-launch implementability check for unary_acosh_bf16_strided.
- baracuda_kernels_unary_acosh_bf16_strided_run ⚠ - Unary elementwise acosh, bf16 dtype, strided path.
- baracuda_kernels_unary_acosh_f16_can_implement ⚠ - Pre-launch implementability check for unary_acosh_f16.
- baracuda_kernels_unary_acosh_f16_run ⚠ - Unary elementwise acosh, f16 dtype, contiguous fast path.
- baracuda_kernels_unary_acosh_f16_strided_can_implement ⚠ - Pre-launch implementability check for unary_acosh_f16_strided.
- baracuda_kernels_unary_acosh_f16_strided_run ⚠ - Unary elementwise acosh, f16 dtype, strided path.
- baracuda_kernels_unary_acosh_f32_can_implement ⚠ - Pre-launch implementability check for unary_acosh_f32.
- baracuda_kernels_unary_acosh_f32_run ⚠ - Unary elementwise acosh, f32 dtype, contiguous fast path.
- baracuda_kernels_unary_acosh_f32_strided_can_implement ⚠ - Pre-launch implementability check for unary_acosh_f32_strided.
- baracuda_kernels_unary_acosh_f32_strided_run ⚠ - Unary elementwise acosh, f32 dtype, strided path.
- baracuda_kernels_unary_acosh_f64_can_implement ⚠ - Pre-launch implementability check for unary_acosh_f64.
- baracuda_kernels_unary_acosh_f64_run ⚠ - Unary elementwise acosh, f64 dtype, contiguous fast path.
- baracuda_kernels_unary_acosh_f64_strided_can_implement ⚠ - Pre-launch implementability check for unary_acosh_f64_strided.
- baracuda_kernels_unary_acosh_f64_strided_run ⚠ - Unary elementwise acosh, f64 dtype, strided path.
- baracuda_kernels_unary_asin_backward_bf16_can_implement ⚠ - Pre-launch implementability check for unary_asin_backward_bf16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_asin_backward_bf16_run ⚠ - Asin backward, bf16.
- baracuda_kernels_unary_asin_backward_f16_can_implement ⚠ - Pre-launch implementability check for unary_asin_backward_f16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_asin_backward_f16_run ⚠ - Asin backward, f16.
- baracuda_kernels_unary_asin_backward_f32_can_implement ⚠ - Pre-launch implementability check for unary_asin_backward_f32. Host-side validation; no kernel launch.
- baracuda_kernels_unary_asin_backward_f32_run ⚠ - Asin backward, f32. dx = dy / sqrt(1 - x²). Saved-x. Domain: |x| < 1.
- baracuda_kernels_unary_asin_backward_f64_can_implement ⚠ - Pre-launch implementability check for unary_asin_backward_f64. Host-side validation; no kernel launch.
- baracuda_kernels_unary_asin_backward_f64_run ⚠ - Asin backward, f64.
- baracuda_kernels_unary_asin_bf16_can_implement ⚠ - Pre-launch implementability check for unary_asin_bf16.
- baracuda_kernels_unary_asin_bf16_run ⚠ - Unary elementwise asin, bf16 dtype, contiguous fast path.
- baracuda_kernels_unary_asin_bf16_strided_can_implement ⚠ - Pre-launch implementability check for unary_asin_bf16_strided.
- baracuda_kernels_unary_asin_bf16_strided_run ⚠ - Unary elementwise asin, bf16 dtype, strided path.
- baracuda_kernels_unary_asin_f16_can_implement ⚠ - Pre-launch implementability check for unary_asin_f16.
- baracuda_kernels_unary_asin_f16_run ⚠ - Unary elementwise asin, f16 dtype, contiguous fast path.
- baracuda_kernels_unary_asin_f16_strided_can_implement ⚠ - Pre-launch implementability check for unary_asin_f16_strided.
- baracuda_kernels_unary_asin_f16_strided_run ⚠ - Unary elementwise asin, f16 dtype, strided path.
- baracuda_kernels_unary_asin_f32_can_implement ⚠ - Pre-launch implementability check for unary_asin_f32.
- baracuda_kernels_unary_asin_f32_run ⚠ - Unary elementwise asin, f32 dtype, contiguous fast path.
- baracuda_kernels_unary_asin_f32_strided_can_implement ⚠ - Pre-launch implementability check for unary_asin_f32_strided.
- baracuda_kernels_unary_asin_f32_strided_run ⚠ - Unary elementwise asin, f32 dtype, strided path.
- baracuda_kernels_unary_asin_f64_can_implement ⚠ - Pre-launch implementability check for unary_asin_f64.
- baracuda_kernels_unary_asin_f64_run ⚠ - Unary elementwise asin, f64 dtype, contiguous fast path.
- baracuda_kernels_unary_asin_f64_strided_can_implement ⚠ - Pre-launch implementability check for unary_asin_f64_strided.
- baracuda_kernels_unary_asin_f64_strided_run ⚠ - Unary elementwise asin, f64 dtype, strided path.
- baracuda_kernels_unary_asinh_backward_bf16_can_implement ⚠ - Pre-launch implementability check for unary_asinh_backward_bf16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_asinh_backward_bf16_run ⚠ - Asinh backward, bf16.
- baracuda_kernels_unary_asinh_backward_f16_can_implement ⚠ - Pre-launch implementability check for unary_asinh_backward_f16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_asinh_backward_f16_run ⚠ - Asinh backward, f16.
- baracuda_kernels_unary_asinh_backward_f32_can_implement ⚠ - Pre-launch implementability check for unary_asinh_backward_f32. Host-side validation; no kernel launch.
- baracuda_kernels_unary_asinh_backward_f32_run ⚠ - Asinh backward, f32. dx = dy / sqrt(1 + x²). Saved-x.
- baracuda_kernels_unary_asinh_backward_f64_can_implement ⚠ - Pre-launch implementability check for unary_asinh_backward_f64. Host-side validation; no kernel launch.
- baracuda_kernels_unary_asinh_backward_f64_run ⚠ - Asinh backward, f64.
- baracuda_kernels_unary_asinh_bf16_can_implement ⚠ - Pre-launch implementability check for unary_asinh_bf16.
- baracuda_kernels_unary_asinh_bf16_run ⚠ - Unary elementwise asinh, bf16 dtype, contiguous fast path.
- baracuda_kernels_unary_asinh_bf16_strided_can_implement ⚠ - Pre-launch implementability check for unary_asinh_bf16_strided.
- baracuda_kernels_unary_asinh_bf16_strided_run ⚠ - Unary elementwise asinh, bf16 dtype, strided path.
- baracuda_kernels_unary_asinh_f16_can_implement ⚠ - Pre-launch implementability check for unary_asinh_f16.
- baracuda_kernels_unary_asinh_f16_run ⚠ - Unary elementwise asinh, f16 dtype, contiguous fast path.
- baracuda_kernels_unary_asinh_f16_strided_can_implement ⚠ - Pre-launch implementability check for unary_asinh_f16_strided.
- baracuda_kernels_unary_asinh_f16_strided_run ⚠ - Unary elementwise asinh, f16 dtype, strided path.
- baracuda_kernels_unary_asinh_f32_can_implement ⚠ - Pre-launch implementability check for unary_asinh_f32.
- baracuda_kernels_unary_asinh_f32_run ⚠ - Unary elementwise asinh, f32 dtype, contiguous fast path.
- baracuda_kernels_unary_asinh_f32_strided_can_implement ⚠ - Pre-launch implementability check for unary_asinh_f32_strided.
- baracuda_kernels_unary_asinh_f32_strided_run ⚠ - Unary elementwise asinh, f32 dtype, strided path.
- baracuda_kernels_unary_asinh_f64_can_implement ⚠ - Pre-launch implementability check for unary_asinh_f64.
- baracuda_kernels_unary_asinh_f64_run ⚠ - Unary elementwise asinh, f64 dtype, contiguous fast path.
- baracuda_kernels_unary_asinh_f64_strided_can_implement ⚠ - Pre-launch implementability check for unary_asinh_f64_strided.
- baracuda_kernels_unary_asinh_f64_strided_run ⚠ - Unary elementwise asinh, f64 dtype, strided path.
- baracuda_kernels_unary_atan_backward_bf16_can_implement ⚠ - Pre-launch implementability check for unary_atan_backward_bf16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_atan_backward_bf16_run ⚠ - Atan backward, bf16.
- baracuda_kernels_unary_atan_backward_f16_can_implement ⚠ - Pre-launch implementability check for unary_atan_backward_f16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_atan_backward_f16_run ⚠ - Atan backward, f16.
- baracuda_kernels_unary_atan_backward_f32_can_implement ⚠ - Pre-launch implementability check for unary_atan_backward_f32. Host-side validation; no kernel launch.
- baracuda_kernels_unary_atan_backward_f32_run ⚠ - Atan backward, f32. dx = dy / (1 + x²).
- baracuda_kernels_unary_atan_backward_f64_can_implement ⚠ - Pre-launch implementability check for unary_atan_backward_f64. Host-side validation; no kernel launch.
- baracuda_kernels_unary_atan_backward_f64_run ⚠ - Atan backward, f64.
- baracuda_kernels_unary_atan_bf16_can_implement ⚠ - Pre-launch implementability check for unary_atan_bf16.
- baracuda_kernels_unary_atan_bf16_run ⚠ - Unary elementwise atan, bf16 dtype, contiguous fast path.
- baracuda_kernels_unary_atan_bf16_strided_can_implement ⚠ - Pre-launch implementability check for unary_atan_bf16_strided.
- baracuda_kernels_unary_atan_bf16_strided_run ⚠ - Unary elementwise atan, bf16 dtype, strided path.
- baracuda_kernels_unary_atan_f16_can_implement ⚠ - Pre-launch implementability check for unary_atan_f16.
- baracuda_kernels_unary_atan_f16_run ⚠ - Unary elementwise atan, f16 dtype, contiguous fast path.
- baracuda_kernels_unary_atan_f16_strided_can_implement ⚠ - Pre-launch implementability check for unary_atan_f16_strided.
- baracuda_kernels_unary_atan_f16_strided_run ⚠ - Unary elementwise atan, f16 dtype, strided path.
- baracuda_kernels_unary_atan_f32_can_implement ⚠ - Pre-launch implementability check for unary_atan_f32.
- baracuda_kernels_unary_atan_f32_run ⚠ - Unary elementwise atan, f32 dtype, contiguous fast path.
- baracuda_kernels_unary_atan_f32_strided_can_implement ⚠ - Pre-launch implementability check for unary_atan_f32_strided.
- baracuda_kernels_unary_atan_f32_strided_run ⚠ - Unary elementwise atan, f32 dtype, strided path.
- baracuda_kernels_unary_atan_f64_can_implement ⚠ - Pre-launch implementability check for unary_atan_f64.
- baracuda_kernels_unary_atan_f64_run ⚠ - Unary elementwise atan, f64 dtype, contiguous fast path.
- baracuda_kernels_unary_atan_f64_strided_can_implement ⚠ - Pre-launch implementability check for unary_atan_f64_strided.
- baracuda_kernels_unary_atan_f64_strided_run ⚠ - Unary elementwise atan, f64 dtype, strided path.
- baracuda_kernels_unary_atanh_backward_bf16_can_implement ⚠ - Pre-launch implementability check for unary_atanh_backward_bf16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_atanh_backward_bf16_run ⚠ - Atanh backward, bf16.
- baracuda_kernels_unary_atanh_backward_f16_can_implement ⚠ - Pre-launch implementability check for unary_atanh_backward_f16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_atanh_backward_f16_run ⚠ - Atanh backward, f16.
- baracuda_kernels_unary_atanh_backward_f32_can_implement ⚠ - Pre-launch implementability check for unary_atanh_backward_f32. Host-side validation; no kernel launch.
- baracuda_kernels_unary_atanh_backward_f32_run ⚠ - Atanh backward, f32. dx = dy / (1 - x²). Saved-x. Domain: |x| < 1.
- baracuda_kernels_unary_atanh_backward_f64_can_implement ⚠ - Pre-launch implementability check for unary_atanh_backward_f64. Host-side validation; no kernel launch.
- baracuda_kernels_unary_atanh_backward_f64_run ⚠ - Atanh backward, f64.
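The atanh backward entry above states both a formula, dx = dy / (1 - x²), and a domain, |x| < 1. A caller preparing host-side test data may want to make that domain explicit. The sketch below does so by returning None outside the domain; this guard is an assumption of the example (atanh_backward_checked is a hypothetical name), since the listing does not say how the kernels treat out-of-domain inputs.

```rust
/// Documented atanh backward formula with an explicit domain guard.
/// Returns Some(dy / (1 - x^2)) when |x| < 1, and None otherwise
/// (including NaN inputs). The guard is an assumption of this sketch;
/// it is not documented behaviour of the device kernels.
fn atanh_backward_checked(x: f64, dy: f64) -> Option<f64> {
    if x.abs() < 1.0 {
        Some(dy / (1.0 - x * x))
    } else {
        None
    }
}
```

For example, x = 0.5 with dy = 3 gives 3 / 0.75 = 4, while x = 1 is rejected because the denominator would be zero.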
- baracuda_kernels_unary_atanh_bf16_can_implement ⚠ - Pre-launch implementability check for unary_atanh_bf16.
- baracuda_kernels_unary_atanh_bf16_run ⚠ - Unary elementwise atanh, bf16 dtype, contiguous fast path.
- baracuda_kernels_unary_atanh_bf16_strided_can_implement ⚠ - Pre-launch implementability check for unary_atanh_bf16_strided.
- baracuda_kernels_unary_atanh_bf16_strided_run ⚠ - Unary elementwise atanh, bf16 dtype, strided path.
- baracuda_kernels_unary_atanh_f16_can_implement ⚠ - Pre-launch implementability check for unary_atanh_f16.
- baracuda_kernels_unary_atanh_f16_run ⚠ - Unary elementwise atanh, f16 dtype, contiguous fast path.
- baracuda_kernels_unary_atanh_f16_strided_can_implement ⚠ - Pre-launch implementability check for unary_atanh_f16_strided.
- baracuda_kernels_unary_atanh_f16_strided_run ⚠ - Unary elementwise atanh, f16 dtype, strided path.
- baracuda_kernels_unary_atanh_f32_can_implement ⚠ - Pre-launch implementability check for unary_atanh_f32.
- baracuda_kernels_unary_atanh_f32_run ⚠ - Unary elementwise atanh, f32 dtype, contiguous fast path.
- baracuda_kernels_unary_atanh_f32_strided_can_implement ⚠ - Pre-launch implementability check for unary_atanh_f32_strided.
- baracuda_kernels_unary_atanh_f32_strided_run ⚠ - Unary elementwise atanh, f32 dtype, strided path.
- baracuda_kernels_unary_atanh_f64_can_implement ⚠ - Pre-launch implementability check for unary_atanh_f64.
- baracuda_kernels_unary_atanh_f64_run ⚠ - Unary elementwise atanh, f64 dtype, contiguous fast path.
- baracuda_kernels_unary_atanh_f64_strided_can_implement ⚠ - Pre-launch implementability check for unary_atanh_f64_strided.
- baracuda_kernels_unary_atanh_f64_strided_run ⚠ - Unary elementwise atanh, f64 dtype, strided path.
- baracuda_kernels_unary_cbrt_bf16_can_implement ⚠ - Pre-launch implementability check for unary_cbrt_bf16.
- baracuda_kernels_unary_cbrt_bf16_run ⚠ - Unary elementwise cbrt, bf16 dtype, contiguous fast path.
- baracuda_kernels_unary_cbrt_bf16_strided_can_implement ⚠ - Pre-launch implementability check for unary_cbrt_bf16_strided.
- baracuda_kernels_unary_cbrt_bf16_strided_run ⚠ - Unary elementwise cbrt, bf16 dtype, strided path.
- baracuda_kernels_unary_cbrt_f16_can_implement ⚠ - Pre-launch implementability check for unary_cbrt_f16.
- baracuda_kernels_unary_cbrt_f16_run ⚠ - Unary elementwise cbrt, f16 dtype, contiguous fast path.
- baracuda_kernels_unary_cbrt_f16_strided_can_implement ⚠ - Pre-launch implementability check for unary_cbrt_f16_strided.
- baracuda_kernels_unary_cbrt_f16_strided_run ⚠ - Unary elementwise cbrt, f16 dtype, strided path.
- baracuda_kernels_unary_cbrt_f32_can_implement ⚠ - Pre-launch implementability check for unary_cbrt_f32.
- baracuda_kernels_unary_cbrt_f32_run ⚠ - Unary elementwise cbrt, f32 dtype, contiguous fast path.
- baracuda_kernels_unary_cbrt_f32_strided_can_implement ⚠ - Pre-launch implementability check for unary_cbrt_f32_strided.
- baracuda_kernels_unary_cbrt_f32_strided_run ⚠ - Unary elementwise cbrt, f32 dtype, strided path.
- baracuda_kernels_unary_cbrt_f64_can_implement ⚠ - Pre-launch implementability check for unary_cbrt_f64.
- baracuda_kernels_unary_cbrt_f64_run ⚠ - Unary elementwise cbrt, f64 dtype, contiguous fast path.
- baracuda_kernels_unary_cbrt_f64_strided_can_implement ⚠ - Pre-launch implementability check for unary_cbrt_f64_strided.
- baracuda_kernels_unary_cbrt_f64_strided_run ⚠ - Unary elementwise cbrt, f64 dtype, strided path.
- baracuda_kernels_unary_ceil_bf16_can_implement ⚠ - Pre-launch implementability check for unary_ceil_bf16.
- baracuda_kernels_unary_ceil_bf16_run ⚠ - Unary elementwise ceil, bf16 dtype, contiguous fast path.
- baracuda_kernels_unary_ceil_bf16_strided_can_implement ⚠ - Pre-launch implementability check for unary_ceil_bf16_strided.
- baracuda_kernels_unary_ceil_bf16_strided_run ⚠ - Unary elementwise ceil, bf16 dtype, strided path.
- baracuda_kernels_unary_ceil_f16_can_implement ⚠ - Pre-launch implementability check for unary_ceil_f16.
- baracuda_kernels_unary_ceil_f16_run ⚠ - Unary elementwise ceil, f16 dtype, contiguous fast path.
- baracuda_kernels_unary_ceil_f16_strided_can_implement ⚠ - Pre-launch implementability check for unary_ceil_f16_strided.
- baracuda_kernels_unary_ceil_f16_strided_run ⚠ - Unary elementwise ceil, f16 dtype, strided path.
- baracuda_kernels_unary_ceil_f32_can_implement ⚠ - Pre-launch implementability check for unary_ceil_f32.
- baracuda_kernels_unary_ceil_f32_run ⚠ - Unary elementwise ceil, f32 dtype, contiguous fast path.
- baracuda_kernels_unary_ceil_f32_strided_can_implement ⚠ - Pre-launch implementability check for unary_ceil_f32_strided.
- baracuda_kernels_unary_ceil_f32_strided_run ⚠ - Unary elementwise ceil, f32 dtype, strided path.
- baracuda_kernels_unary_ceil_f64_can_implement ⚠ - Pre-launch implementability check for unary_ceil_f64.
- baracuda_kernels_unary_ceil_f64_run ⚠ - Unary elementwise ceil, f64 dtype, contiguous fast path.
- baracuda_kernels_unary_ceil_f64_strided_can_implement ⚠ - Pre-launch implementability check for unary_ceil_f64_strided.
- baracuda_kernels_unary_ceil_f64_strided_run ⚠ - Unary elementwise ceil, f64 dtype, strided path.
- baracuda_kernels_unary_cos_backward_bf16_can_implement ⚠ - Pre-launch implementability check for unary_cos_backward_bf16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_cos_backward_bf16_run ⚠ - Cos backward, bf16.
- baracuda_kernels_unary_cos_backward_f16_can_implement ⚠ - Pre-launch implementability check for unary_cos_backward_f16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_cos_backward_f16_run ⚠ - Cos backward, f16.
- baracuda_kernels_unary_cos_backward_f32_can_implement ⚠ - Pre-launch implementability check for unary_cos_backward_f32. Host-side validation; no kernel launch.
- baracuda_kernels_unary_cos_backward_f32_run ⚠ - Cos backward, f32. dx = -dy * sin(x). Saved-x.
- baracuda_kernels_unary_cos_backward_f64_can_implement ⚠ - Pre-launch implementability check for unary_cos_backward_f64. Host-side validation; no kernel launch.
- baracuda_kernels_unary_cos_backward_f64_run ⚠ - Cos backward, f64.
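Each backward entry in this listing pairs a forward function with a closed-form gradient, such as dx = -dy * sin(x) for cos. A small generic checker can compare any such pair against central finite differences on the host. The sketch below is illustrative only: cos_backward_ref and max_grad_error are hypothetical names, and the check runs in f64 on the CPU rather than calling any kernel.

```rust
/// Documented cos backward formula from the saved input x: dx = -dy * sin(x).
/// Hypothetical helper for illustration; not part of baracuda-kernels-sys.
fn cos_backward_ref(x: f64, dy: f64) -> f64 {
    -dy * x.sin()
}

/// Largest absolute difference between a closed-form backward function
/// (called with dy = 1) and a central finite difference of `forward`,
/// taken over the sample points `xs` with step `h`.
fn max_grad_error<F, G>(forward: F, backward: G, xs: &[f64], h: f64) -> f64
where
    F: Fn(f64) -> f64,
    G: Fn(f64, f64) -> f64,
{
    xs.iter()
        .map(|&x| {
            let numeric = (forward(x + h) - forward(x - h)) / (2.0 * h);
            (numeric - backward(x, 1.0)).abs()
        })
        .fold(0.0, f64::max)
}
```

Calling max_grad_error(f64::cos, cos_backward_ref, &[-2.0, 0.0, 0.5, 3.0], 1e-5) should return a value near zero; the same helper can be pointed at the other formulas in this listing by swapping the two function arguments.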
- baracuda_kernels_unary_cos_bf16_can_implement ⚠ - Pre-launch implementability check for unary_cos_bf16.
- baracuda_kernels_unary_cos_bf16_run ⚠ - Unary elementwise cos, bf16 dtype, contiguous fast path.
- baracuda_kernels_unary_cos_bf16_strided_can_implement ⚠ - Pre-launch implementability check for unary_cos_bf16_strided.
- baracuda_kernels_unary_cos_bf16_strided_run ⚠ - Unary elementwise cos, bf16 dtype, strided path.
- baracuda_kernels_unary_cos_f16_can_implement ⚠ - Pre-launch implementability check for unary_cos_f16.
- baracuda_kernels_unary_cos_f16_run ⚠ - Unary elementwise cos, f16 dtype, contiguous fast path.
- baracuda_kernels_unary_cos_f16_strided_can_implement ⚠ - Pre-launch implementability check for unary_cos_f16_strided.
- baracuda_kernels_unary_cos_f16_strided_run ⚠ - Unary elementwise cos, f16 dtype, strided path.
- baracuda_kernels_unary_cos_f32_can_implement ⚠ - Pre-launch implementability check for unary_cos_f32.
- baracuda_kernels_unary_cos_f32_run ⚠ - Unary elementwise cos, f32 dtype, contiguous fast path.
- baracuda_kernels_unary_cos_f32_strided_can_implement ⚠ - Pre-launch implementability check for unary_cos_f32_strided.
- baracuda_kernels_unary_cos_f32_strided_run ⚠ - Unary elementwise cos, f32 dtype, strided path.
- baracuda_kernels_unary_cos_f64_can_implement ⚠ - Pre-launch implementability check for unary_cos_f64.
- baracuda_kernels_unary_cos_f64_run ⚠ - Unary elementwise cos, f64 dtype, contiguous fast path.
- baracuda_kernels_unary_cos_f64_strided_can_implement ⚠ - Pre-launch implementability check for unary_cos_f64_strided.
- baracuda_kernels_unary_cos_f64_strided_run ⚠ - Unary elementwise cos, f64 dtype, strided path.
- baracuda_kernels_unary_cosh_backward_bf16_can_implement ⚠ - Pre-launch implementability check for unary_cosh_backward_bf16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_cosh_backward_bf16_run ⚠ - Cosh backward, bf16.
- baracuda_kernels_unary_cosh_backward_f16_can_implement ⚠ - Pre-launch implementability check for unary_cosh_backward_f16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_cosh_backward_f16_run ⚠ - Cosh backward, f16.
- baracuda_kernels_unary_cosh_backward_f32_can_implement ⚠ - Pre-launch implementability check for unary_cosh_backward_f32. Host-side validation; no kernel launch.
- baracuda_kernels_unary_cosh_backward_f32_run ⚠ - Cosh backward, f32. dx = dy * sinh(x). Saved-x.
- baracuda_kernels_unary_cosh_backward_f64_can_implement ⚠ - Pre-launch implementability check for unary_cosh_backward_f64. Host-side validation; no kernel launch.
- baracuda_kernels_unary_cosh_backward_f64_run ⚠ - Cosh backward, f64.
- baracuda_kernels_unary_cosh_bf16_can_implement ⚠ - Pre-launch implementability check for unary_cosh_bf16.
- baracuda_kernels_unary_cosh_bf16_run ⚠ - Unary elementwise cosh, bf16 dtype, contiguous fast path.
- baracuda_kernels_unary_cosh_bf16_strided_can_implement ⚠ - Pre-launch implementability check for unary_cosh_bf16_strided.
- baracuda_kernels_unary_cosh_bf16_strided_run ⚠ - Unary elementwise cosh, bf16 dtype, strided path.
- baracuda_kernels_unary_cosh_f16_can_implement ⚠ - Pre-launch implementability check for unary_cosh_f16.
- baracuda_kernels_unary_cosh_f16_run ⚠ - Unary elementwise cosh, f16 dtype, contiguous fast path.
- baracuda_kernels_unary_cosh_f16_strided_can_implement ⚠ - Pre-launch implementability check for unary_cosh_f16_strided.
- baracuda_kernels_unary_cosh_f16_strided_run ⚠ - Unary elementwise cosh, f16 dtype, strided path.
- baracuda_kernels_unary_cosh_f32_can_implement ⚠ - Pre-launch implementability check for unary_cosh_f32.
- baracuda_kernels_unary_cosh_f32_run ⚠ - Unary elementwise cosh, f32 dtype, contiguous fast path.
- baracuda_kernels_unary_cosh_f32_strided_can_implement ⚠ - Pre-launch implementability check for unary_cosh_f32_strided.
- baracuda_kernels_unary_cosh_f32_strided_run ⚠ - Unary elementwise cosh, f32 dtype, strided path.
- baracuda_kernels_unary_cosh_f64_can_implement ⚠ - Pre-launch implementability check for unary_cosh_f64.
- baracuda_kernels_unary_cosh_f64_run ⚠ - Unary elementwise cosh, f64 dtype, contiguous fast path.
- baracuda_kernels_unary_cosh_f64_strided_can_implement ⚠ - Pre-launch implementability check for unary_cosh_f64_strided.
- baracuda_kernels_unary_cosh_f64_strided_run ⚠ - Unary elementwise cosh, f64 dtype, strided path.
- baracuda_kernels_unary_cube_backward_bf16_can_implement ⚠ - Pre-launch implementability check for unary_cube_backward_bf16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_cube_backward_bf16_run ⚠ - Cube backward, bf16.
- baracuda_kernels_unary_cube_backward_f16_can_implement ⚠ - Pre-launch implementability check for unary_cube_backward_f16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_cube_backward_f16_run ⚠ - Cube backward, f16.
- baracuda_kernels_unary_cube_backward_f32_can_implement ⚠ - Pre-launch implementability check for unary_cube_backward_f32. Host-side validation; no kernel launch.
- baracuda_kernels_unary_cube_backward_f32_run ⚠ - Cube backward, f32. dx = dy * 3 * x².
- baracuda_kernels_unary_cube_backward_f64_can_implement ⚠ - Pre-launch implementability check for unary_cube_backward_f64. Host-side validation; no kernel launch.
- baracuda_kernels_unary_cube_backward_f64_run ⚠ - Cube backward, f64.
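The cube backward formula above, dx = dy * 3 * x², is simple enough to compute exactly on small inputs, which makes it a convenient smoke test for a backward pipeline. Here is a minimal host-side sketch over whole buffers; cube_backward_ref is a hypothetical name and the Vec-returning shape is chosen for brevity, not to mirror the raw pointer interface of this crate.

```rust
/// Host-side reference for the documented cube backward formula,
/// dx[i] = dy[i] * 3 * x[i]^2, applied elementwise.
/// Hypothetical helper for illustration; not part of baracuda-kernels-sys.
fn cube_backward_ref(x: &[f32], dy: &[f32]) -> Vec<f32> {
    assert_eq!(x.len(), dy.len(), "x and dy must have the same length");
    x.iter()
        .zip(dy)
        .map(|(&xi, &dyi)| dyi * 3.0 * xi * xi)
        .collect()
}
```

With x = [2, -1, 0] and dy = [1, 0.5, 4] the expected gradients are 12, 1.5 and 0, all exactly representable in f32.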
- baracuda_kernels_unary_cube_bf16_can_implement ⚠ - Pre-launch implementability check for unary_cube_bf16.
- baracuda_kernels_unary_cube_bf16_run ⚠ - Unary elementwise cube, bf16 dtype, contiguous fast path.
- baracuda_kernels_unary_cube_bf16_strided_can_implement ⚠ - Pre-launch implementability check for unary_cube_bf16_strided.
- baracuda_kernels_unary_cube_bf16_strided_run ⚠ - Unary elementwise cube, bf16 dtype, strided path.
- baracuda_kernels_unary_cube_f16_can_implement ⚠ - Pre-launch implementability check for unary_cube_f16.
- baracuda_kernels_unary_cube_f16_run ⚠ - Unary elementwise cube, f16 dtype, contiguous fast path.
- baracuda_kernels_unary_cube_f16_strided_can_implement ⚠ - Pre-launch implementability check for unary_cube_f16_strided.
- baracuda_kernels_unary_cube_f16_strided_run ⚠ - Unary elementwise cube, f16 dtype, strided path.
- baracuda_kernels_unary_cube_f32_can_implement ⚠ - Pre-launch implementability check for unary_cube_f32.
- baracuda_kernels_unary_cube_f32_run ⚠ - Unary elementwise cube, f32 dtype, contiguous fast path.
- baracuda_kernels_unary_cube_f32_strided_can_implement ⚠ - Pre-launch implementability check for unary_cube_f32_strided.
- baracuda_kernels_unary_cube_f32_strided_run ⚠ - Unary elementwise cube, f32 dtype, strided path.
- baracuda_kernels_unary_cube_f64_can_implement ⚠ - Pre-launch implementability check for unary_cube_f64.
- baracuda_kernels_unary_cube_f64_run ⚠ - Unary elementwise cube, f64 dtype, contiguous fast path.
- baracuda_kernels_unary_cube_f64_strided_can_implement ⚠ - Pre-launch implementability check for unary_cube_f64_strided.
- baracuda_kernels_unary_cube_f64_strided_run ⚠ - Unary elementwise cube, f64 dtype, strided path.
- baracuda_kernels_unary_elu_backward_bf16_can_implement ⚠ - Pre-launch implementability check for unary_elu_backward_bf16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_elu_backward_bf16_run ⚠ - ELU backward, bf16.
- baracuda_kernels_unary_elu_backward_f16_can_implement ⚠ - Pre-launch implementability check for unary_elu_backward_f16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_elu_backward_f16_run ⚠ - ELU backward, f16.
- baracuda_kernels_unary_elu_backward_f32_can_implement ⚠ - Pre-launch implementability check for unary_elu_backward_f32. Host-side validation; no kernel launch.
- baracuda_kernels_unary_elu_backward_f32_run ⚠ - ELU backward, f32. dx = (x > 0) ? dy : dy·α·exp(x) with α=1.0. Saved-x.
- baracuda_kernels_unary_elu_backward_f64_can_implement ⚠ - Pre-launch implementability check for unary_elu_backward_f64. Host-side validation; no kernel launch.
- baracuda_kernels_unary_elu_backward_f64_run ⚠ - ELU backward, f64.
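The ELU entries in this listing give a piecewise forward definition, elu(x; α) = x if x > 0 else α·(exp(x) - 1), and a backward formula fixed at α = 1.0, dx = (x > 0) ? dy : dy·α·exp(x). The sketch below writes both out on the host so the branch at x = 0 is explicit. The names elu_ref and elu_backward_ref are hypothetical, and the code mirrors only the documented formulas, not the kernel source.

```rust
/// Forward ELU as documented: x for x > 0, otherwise alpha * (exp(x) - 1).
/// Hypothetical helper for illustration; not part of baracuda-kernels-sys.
fn elu_ref(x: f32, alpha: f32) -> f32 {
    if x > 0.0 {
        x
    } else {
        alpha * (x.exp() - 1.0)
    }
}

/// Backward ELU as documented, with alpha fixed at 1.0 and the gradient
/// computed from the saved input x: dy for x > 0, otherwise dy * alpha * exp(x).
fn elu_backward_ref(x: f32, dy: f32) -> f32 {
    const ALPHA: f32 = 1.0;
    if x > 0.0 {
        dy
    } else {
        dy * ALPHA * x.exp()
    }
}
```

Note that x = 0 takes the exponential branch in both functions; with α = 1 that branch evaluates to dy there, so the gradient is continuous at the origin.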
- baracuda_kernels_unary_elu_bf16_can_implement ⚠ - Pre-launch implementability check for unary_elu_bf16.
- baracuda_kernels_unary_elu_bf16_run ⚠ - Unary elementwise elu(x; α), bf16, contiguous.
- baracuda_kernels_unary_elu_bf16_strided_can_implement ⚠ - Implementability check for baracuda_kernels_unary_elu_bf16_strided. Host-side only.
- baracuda_kernels_unary_elu_bf16_strided_run ⚠ - Unary elementwise elu(x; α), bf16, strided.
- baracuda_kernels_unary_elu_f16_can_implement ⚠ - Pre-launch implementability check for unary_elu_f16.
- baracuda_kernels_unary_elu_f16_run ⚠ - Unary elementwise elu(x; α), f16, contiguous.
- baracuda_kernels_unary_elu_f16_strided_can_implement ⚠ - Implementability check for baracuda_kernels_unary_elu_f16_strided. Host-side only.
- baracuda_kernels_unary_elu_f16_strided_run ⚠ - Unary elementwise elu(x; α), f16, strided.
- baracuda_kernels_unary_elu_f32_can_implement ⚠ - Pre-launch implementability check for unary_elu_f32.
- baracuda_kernels_unary_elu_f32_run ⚠ - Unary elementwise elu(x; α) = x if x > 0 else α·(exp(x) - 1), f32, contiguous.
- baracuda_kernels_unary_elu_f32_strided_can_implement ⚠ - Implementability check for baracuda_kernels_unary_elu_f32_strided. Host-side only.
- baracuda_kernels_unary_elu_f32_strided_run ⚠ - Unary elementwise elu(x; α), f32, strided.
- baracuda_kernels_unary_elu_f64_can_implement ⚠ - Pre-launch implementability check for unary_elu_f64.
- baracuda_kernels_unary_elu_f64_run ⚠ - Unary elementwise elu(x; α), f64, contiguous.
- baracuda_kernels_unary_elu_f64_strided_can_implement ⚠ - Implementability check for baracuda_kernels_unary_elu_f64_strided. Host-side only.
- baracuda_kernels_unary_elu_f64_strided_run ⚠ - Unary elementwise elu(x; α), f64, strided.
- baracuda_kernels_unary_erf_backward_bf16_can_implement ⚠ - Pre-launch implementability check for unary_erf_backward_bf16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_erf_backward_bf16_run ⚠ - Erf backward, bf16.
- baracuda_kernels_unary_erf_backward_f16_can_implement ⚠ - Pre-launch implementability check for unary_erf_backward_f16. Host-side validation; no kernel launch.
- baracuda_kernels_unary_erf_backward_f16_run ⚠ - Erf backward, f16.
- baracuda_kernels_unary_erf_backward_f32_can_implement ⚠ - Pre-launch implementability check for unary_erf_backward_f32. Host-side validation; no kernel launch.
- baracuda_kernels_unary_erf_backward_f32_run ⚠ - Erf backward, f32. dx = dy * (2/√π) * exp(-x²).
- baracuda_kernels_unary_erf_backward_f64_can_implement ⚠ - Pre-launch implementability check for unary_erf_backward_f64. Host-side validation; no kernel launch.
- baracuda_kernels_unary_erf_backward_f64_run ⚠ - Erf backward, f64.
- baracuda_
kernels_ ⚠unary_ erf_ bf16_ can_ implement - Pre-launch implementability check for
unary_erf_bf16. - baracuda_
kernels_ ⚠unary_ erf_ bf16_ run - Unary elementwise
erf, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ erf_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_erf_bf16_strided. - baracuda_
kernels_ ⚠unary_ erf_ bf16_ strided_ run - Unary elementwise
erf, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ erf_ f16_ can_ implement - Pre-launch implementability check for
unary_erf_f16. - baracuda_
kernels_ ⚠unary_ erf_ f16_ run - Unary elementwise
erf, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ erf_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_erf_f16_strided. - baracuda_
kernels_ ⚠unary_ erf_ f16_ strided_ run - Unary elementwise
erf, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ erf_ f32_ can_ implement - Pre-launch implementability check for
unary_erf_f32. - baracuda_
kernels_ ⚠unary_ erf_ f32_ run - Unary elementwise
erf, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ erf_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_erf_f32_strided. - baracuda_
kernels_ ⚠unary_ erf_ f32_ strided_ run - Unary elementwise
erf, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ erf_ f64_ can_ implement - Pre-launch implementability check for
unary_erf_f64. - baracuda_
kernels_ ⚠unary_ erf_ f64_ run - Unary elementwise
erf, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ erf_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_erf_f64_strided. - baracuda_
kernels_ ⚠unary_ erf_ f64_ strided_ run - Unary elementwise
erf, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ erfc_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_erfc_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ erfc_ backward_ bf16_ run - Erfc backward, bf16.
- baracuda_
kernels_ ⚠unary_ erfc_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_erfc_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ erfc_ backward_ f16_ run - Erfc backward, f16.
- baracuda_
kernels_ ⚠unary_ erfc_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_erfc_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ erfc_ backward_ f32_ run - Erfc backward, f32.
dx = -dy * (2/√π) * exp(-x²). - baracuda_
kernels_ ⚠unary_ erfc_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_erfc_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ erfc_ backward_ f64_ run - Erfc backward, f64.
- baracuda_
kernels_ ⚠unary_ erfc_ bf16_ can_ implement - Pre-launch implementability check for
unary_erfc_bf16. - baracuda_
kernels_ ⚠unary_ erfc_ bf16_ run - Unary elementwise
erfc, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ erfc_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_erfc_bf16_strided. - baracuda_
kernels_ ⚠unary_ erfc_ bf16_ strided_ run - Unary elementwise
erfc, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ erfc_ f16_ can_ implement - Pre-launch implementability check for
unary_erfc_f16. - baracuda_
kernels_ ⚠unary_ erfc_ f16_ run - Unary elementwise
erfc, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ erfc_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_erfc_f16_strided. - baracuda_
kernels_ ⚠unary_ erfc_ f16_ strided_ run - Unary elementwise
erfc, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ erfc_ f32_ can_ implement - Pre-launch implementability check for
unary_erfc_f32. - baracuda_
kernels_ ⚠unary_ erfc_ f32_ run - Unary elementwise
erfc, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ erfc_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_erfc_f32_strided. - baracuda_
kernels_ ⚠unary_ erfc_ f32_ strided_ run - Unary elementwise
erfc, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ erfc_ f64_ can_ implement - Pre-launch implementability check for
unary_erfc_f64. - baracuda_
kernels_ ⚠unary_ erfc_ f64_ run - Unary elementwise
erfc, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ erfc_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_erfc_f64_strided. - baracuda_
kernels_ ⚠unary_ erfc_ f64_ strided_ run - Unary elementwise
erfc, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ exp2_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_exp2_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ exp2_ backward_ bf16_ run - Exp2 backward, bf16.
- baracuda_
kernels_ ⚠unary_ exp2_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_exp2_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ exp2_ backward_ f16_ run - Exp2 backward, f16.
- baracuda_
kernels_ ⚠unary_ exp2_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_exp2_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ exp2_ backward_ f32_ run - Exp2 backward, f32.
dx = dy * y * ln(2). - baracuda_
kernels_ ⚠unary_ exp2_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_exp2_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ exp2_ backward_ f64_ run - Exp2 backward, f64.
- baracuda_
kernels_ ⚠unary_ exp2_ bf16_ can_ implement - Pre-launch implementability check for
unary_exp2_bf16. - baracuda_
kernels_ ⚠unary_ exp2_ bf16_ run - Unary elementwise
exp2, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ exp2_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_exp2_bf16_strided. - baracuda_
kernels_ ⚠unary_ exp2_ bf16_ strided_ run - Unary elementwise
exp2, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ exp2_ f16_ can_ implement - Pre-launch implementability check for
unary_exp2_f16. - baracuda_
kernels_ ⚠unary_ exp2_ f16_ run - Unary elementwise
exp2, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ exp2_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_exp2_f16_strided. - baracuda_
kernels_ ⚠unary_ exp2_ f16_ strided_ run - Unary elementwise
exp2, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ exp2_ f32_ can_ implement - Pre-launch implementability check for
unary_exp2_f32. - baracuda_
kernels_ ⚠unary_ exp2_ f32_ run - Unary elementwise
exp2, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ exp2_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_exp2_f32_strided. - baracuda_
kernels_ ⚠unary_ exp2_ f32_ strided_ run - Unary elementwise
exp2, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ exp2_ f64_ can_ implement - Pre-launch implementability check for
unary_exp2_f64. - baracuda_
kernels_ ⚠unary_ exp2_ f64_ run - Unary elementwise
exp2, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ exp2_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_exp2_f64_strided. - baracuda_
kernels_ ⚠unary_ exp2_ f64_ strided_ run - Unary elementwise
exp2, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ exp_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_exp_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ exp_ backward_ bf16_ run - Exp backward, bf16.
- baracuda_
kernels_ ⚠unary_ exp_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_exp_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ exp_ backward_ f16_ run - Exp backward, f16.
- baracuda_
kernels_ ⚠unary_ exp_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_exp_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ exp_ backward_ f32_ run - Exp backward, f32.
dx = dy * y. Caller must pass the forward output y as saved. - baracuda_
kernels_ ⚠unary_ exp_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_exp_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ exp_ backward_ f64_ run - Exp backward, f64.
- baracuda_
kernels_ ⚠unary_ exp_ bf16_ can_ implement - Pre-launch implementability check for
unary_exp_bf16. - baracuda_
kernels_ ⚠unary_ exp_ bf16_ run - Unary elementwise
exp, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ exp_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_exp_bf16_strided. - baracuda_
kernels_ ⚠unary_ exp_ bf16_ strided_ run - Unary elementwise
exp, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ exp_ f16_ can_ implement - Pre-launch implementability check for
unary_exp_f16. - baracuda_
kernels_ ⚠unary_ exp_ f16_ run - Unary elementwise
exp, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ exp_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_exp_f16_strided. - baracuda_
kernels_ ⚠unary_ exp_ f16_ strided_ run - Unary elementwise
exp, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ exp_ f32_ can_ implement - Pre-launch implementability check for
unary_exp_f32. - baracuda_
kernels_ ⚠unary_ exp_ f32_ run - Unary elementwise
exp, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ exp_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_exp_f32_strided. - baracuda_
kernels_ ⚠unary_ exp_ f32_ strided_ run - Unary elementwise
exp, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ exp_ f64_ can_ implement - Pre-launch implementability check for
unary_exp_f64. - baracuda_
kernels_ ⚠unary_ exp_ f64_ run - Unary elementwise
exp, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ exp_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_exp_f64_strided. - baracuda_
kernels_ ⚠unary_ exp_ f64_ strided_ run - Unary elementwise
exp, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ expm1_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_expm1_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ expm1_ backward_ bf16_ run - Expm1 backward, bf16.
- baracuda_
kernels_ ⚠unary_ expm1_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_expm1_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ expm1_ backward_ f16_ run - Expm1 backward, f16.
- baracuda_
kernels_ ⚠unary_ expm1_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_expm1_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ expm1_ backward_ f32_ run - Expm1 backward, f32.
dx = dy * (y + 1). Saved-y. - baracuda_
kernels_ ⚠unary_ expm1_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_expm1_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ expm1_ backward_ f64_ run - Expm1 backward, f64.
- baracuda_
kernels_ ⚠unary_ expm1_ bf16_ can_ implement - Pre-launch implementability check for
unary_expm1_bf16. - baracuda_
kernels_ ⚠unary_ expm1_ bf16_ run - Unary elementwise
expm1, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ expm1_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_expm1_bf16_strided. - baracuda_
kernels_ ⚠unary_ expm1_ bf16_ strided_ run - Unary elementwise
expm1, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ expm1_ f16_ can_ implement - Pre-launch implementability check for
unary_expm1_f16. - baracuda_
kernels_ ⚠unary_ expm1_ f16_ run - Unary elementwise
expm1, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ expm1_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_expm1_f16_strided. - baracuda_
kernels_ ⚠unary_ expm1_ f16_ strided_ run - Unary elementwise
expm1, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ expm1_ f32_ can_ implement - Pre-launch implementability check for
unary_expm1_f32. - baracuda_
kernels_ ⚠unary_ expm1_ f32_ run - Unary elementwise
expm1, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ expm1_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_expm1_f32_strided. - baracuda_
kernels_ ⚠unary_ expm1_ f32_ strided_ run - Unary elementwise
expm1, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ expm1_ f64_ can_ implement - Pre-launch implementability check for
unary_expm1_f64. - baracuda_
kernels_ ⚠unary_ expm1_ f64_ run - Unary elementwise
expm1, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ expm1_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_expm1_f64_strided. - baracuda_
kernels_ ⚠unary_ expm1_ f64_ strided_ run - Unary elementwise
expm1, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ floor_ bf16_ can_ implement - Pre-launch implementability check for
unary_floor_bf16. - baracuda_
kernels_ ⚠unary_ floor_ bf16_ run - Unary elementwise
floor, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ floor_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_floor_bf16_strided. - baracuda_
kernels_ ⚠unary_ floor_ bf16_ strided_ run - Unary elementwise
floor, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ floor_ f16_ can_ implement - Pre-launch implementability check for
unary_floor_f16. - baracuda_
kernels_ ⚠unary_ floor_ f16_ run - Unary elementwise
floor, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ floor_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_floor_f16_strided. - baracuda_
kernels_ ⚠unary_ floor_ f16_ strided_ run - Unary elementwise
floor, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ floor_ f32_ can_ implement - Pre-launch implementability check for
unary_floor_f32. - baracuda_
kernels_ ⚠unary_ floor_ f32_ run - Unary elementwise
floor, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ floor_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_floor_f32_strided. - baracuda_
kernels_ ⚠unary_ floor_ f32_ strided_ run - Unary elementwise
floor, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ floor_ f64_ can_ implement - Pre-launch implementability check for
unary_floor_f64. - baracuda_
kernels_ ⚠unary_ floor_ f64_ run - Unary elementwise
floor, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ floor_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_floor_f64_strided. - baracuda_
kernels_ ⚠unary_ floor_ f64_ strided_ run - Unary elementwise
floor, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ frac_ bf16_ can_ implement - Pre-launch implementability check for
unary_frac_bf16. - baracuda_
kernels_ ⚠unary_ frac_ bf16_ run - Unary elementwise
frac, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ frac_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_frac_bf16_strided. - baracuda_
kernels_ ⚠unary_ frac_ bf16_ strided_ run - Unary elementwise
frac, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ frac_ f16_ can_ implement - Pre-launch implementability check for
unary_frac_f16. - baracuda_
kernels_ ⚠unary_ frac_ f16_ run - Unary elementwise
frac, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ frac_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_frac_f16_strided. - baracuda_
kernels_ ⚠unary_ frac_ f16_ strided_ run - Unary elementwise
frac, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ frac_ f32_ can_ implement - Pre-launch implementability check for
unary_frac_f32. - baracuda_
kernels_ ⚠unary_ frac_ f32_ run - Unary elementwise
frac, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ frac_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_frac_f32_strided. - baracuda_
kernels_ ⚠unary_ frac_ f32_ strided_ run - Unary elementwise
frac, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ frac_ f64_ can_ implement - Pre-launch implementability check for
unary_frac_f64. - baracuda_
kernels_ ⚠unary_ frac_ f64_ run - Unary elementwise
frac, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ frac_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_frac_f64_strided. - baracuda_
kernels_ ⚠unary_ frac_ f64_ strided_ run - Unary elementwise
frac, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ gelu_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_gelu_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ gelu_ backward_ bf16_ run - GELU (erf-based) backward, bf16.
- baracuda_
kernels_ ⚠unary_ gelu_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_gelu_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ gelu_ backward_ f16_ run - GELU (erf-based) backward, f16.
- baracuda_
kernels_ ⚠unary_ gelu_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_gelu_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ gelu_ backward_ f32_ run - GELU (exact / erf-based) backward, f32.
dx = dy * (Φ(x) + x*φ(x)). Saved-x. - baracuda_
kernels_ ⚠unary_ gelu_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_gelu_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ gelu_ backward_ f64_ run - GELU (erf-based) backward, f64.
- baracuda_
kernels_ ⚠unary_ gelu_ bf16_ can_ implement - Pre-launch implementability check for
unary_gelu_bf16. - baracuda_
kernels_ ⚠unary_ gelu_ bf16_ run - Unary elementwise
gelu, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ gelu_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_gelu_bf16_strided. - baracuda_
kernels_ ⚠unary_ gelu_ bf16_ strided_ run - Unary elementwise
gelu, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ gelu_ erf_ bf16_ can_ implement - Pre-launch implementability check for
unary_gelu_erf_bf16. - baracuda_
kernels_ ⚠unary_ gelu_ erf_ bf16_ run - Unary elementwise
gelu_erf, bf16, contig. - baracuda_
kernels_ ⚠unary_ gelu_ erf_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_gelu_erf_bf16_strided. - baracuda_
kernels_ ⚠unary_ gelu_ erf_ bf16_ strided_ run - Unary elementwise
gelu_erf, bf16, strided. - baracuda_
kernels_ ⚠unary_ gelu_ erf_ f16_ can_ implement - Pre-launch implementability check for
unary_gelu_erf_f16. - baracuda_
kernels_ ⚠unary_ gelu_ erf_ f16_ run - Unary elementwise
gelu_erf, f16, contig. - baracuda_
kernels_ ⚠unary_ gelu_ erf_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_gelu_erf_f16_strided. - baracuda_
kernels_ ⚠unary_ gelu_ erf_ f16_ strided_ run - Unary elementwise
gelu_erf, f16, strided. - baracuda_
kernels_ ⚠unary_ gelu_ erf_ f32_ can_ implement - Pre-launch implementability check for
unary_gelu_erf_f32. - baracuda_
kernels_ ⚠unary_ gelu_ erf_ f32_ run - Unary elementwise
gelu_erf, f32, contig. - baracuda_
kernels_ ⚠unary_ gelu_ erf_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_gelu_erf_f32_strided. - baracuda_
kernels_ ⚠unary_ gelu_ erf_ f32_ strided_ run - Unary elementwise
gelu_erf, f32, strided. - baracuda_
kernels_ ⚠unary_ gelu_ erf_ f64_ can_ implement - Pre-launch implementability check for
unary_gelu_erf_f64. - baracuda_
kernels_ ⚠unary_ gelu_ erf_ f64_ run - Unary elementwise
gelu_erf, f64, contig. - baracuda_
kernels_ ⚠unary_ gelu_ erf_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_gelu_erf_f64_strided. - baracuda_
kernels_ ⚠unary_ gelu_ erf_ f64_ strided_ run - Unary elementwise
gelu_erf, f64, strided. - baracuda_
kernels_ ⚠unary_ gelu_ f16_ can_ implement - Pre-launch implementability check for
unary_gelu_f16. - baracuda_
kernels_ ⚠unary_ gelu_ f16_ run - Unary elementwise
gelu, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ gelu_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_gelu_f16_strided. - baracuda_
kernels_ ⚠unary_ gelu_ f16_ strided_ run - Unary elementwise
gelu, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ gelu_ f32_ can_ implement - Pre-launch implementability check for
unary_gelu_f32. - baracuda_
kernels_ ⚠unary_ gelu_ f32_ run - Unary elementwise
gelu, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ gelu_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_gelu_f32_strided. - baracuda_
kernels_ ⚠unary_ gelu_ f32_ strided_ run - Unary elementwise
gelu, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ gelu_ f64_ can_ implement - Pre-launch implementability check for
unary_gelu_f64. - baracuda_
kernels_ ⚠unary_ gelu_ f64_ run - Unary elementwise
gelu, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ gelu_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_gelu_f64_strided. - baracuda_
kernels_ ⚠unary_ gelu_ f64_ strided_ run - Unary elementwise
gelu, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ gelu_ tanh_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_gelu_tanh_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ gelu_ tanh_ backward_ bf16_ run - GELU (tanh approximation) backward, bf16.
- baracuda_
kernels_ ⚠unary_ gelu_ tanh_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_gelu_tanh_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ gelu_ tanh_ backward_ f16_ run - GELU (tanh approximation) backward, f16.
- baracuda_
kernels_ ⚠unary_ gelu_ tanh_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_gelu_tanh_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ gelu_ tanh_ backward_ f32_ run - GELU (tanh approximation) backward, f32. Saved-x.
- baracuda_
kernels_ ⚠unary_ gelu_ tanh_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_gelu_tanh_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ gelu_ tanh_ backward_ f64_ run - GELU (tanh approximation) backward, f64.
- baracuda_
kernels_ ⚠unary_ gelu_ tanh_ bf16_ can_ implement - Pre-launch implementability check for
unary_gelu_tanh_bf16. - baracuda_
kernels_ ⚠unary_ gelu_ tanh_ bf16_ run - Unary elementwise
gelu_tanh, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ gelu_ tanh_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_gelu_tanh_bf16_strided. - baracuda_
kernels_ ⚠unary_ gelu_ tanh_ bf16_ strided_ run - Unary elementwise
gelu_tanh, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ gelu_ tanh_ f16_ can_ implement - Pre-launch implementability check for
unary_gelu_tanh_f16. - baracuda_
kernels_ ⚠unary_ gelu_ tanh_ f16_ run - Unary elementwise
gelu_tanh, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ gelu_ tanh_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_gelu_tanh_f16_strided. - baracuda_
kernels_ ⚠unary_ gelu_ tanh_ f16_ strided_ run - Unary elementwise
gelu_tanh, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ gelu_ tanh_ f32_ can_ implement - Pre-launch implementability check for
unary_gelu_tanh_f32. - baracuda_
kernels_ ⚠unary_ gelu_ tanh_ f32_ run - Unary elementwise
gelu_tanh, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ gelu_ tanh_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_gelu_tanh_f32_strided. - baracuda_
kernels_ ⚠unary_ gelu_ tanh_ f32_ strided_ run - Unary elementwise
gelu_tanh, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ gelu_ tanh_ f64_ can_ implement - Pre-launch implementability check for
unary_gelu_tanh_f64. - baracuda_
kernels_ ⚠unary_ gelu_ tanh_ f64_ run - Unary elementwise
gelu_tanh, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ gelu_ tanh_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_gelu_tanh_f64_strided. - baracuda_
kernels_ ⚠unary_ gelu_ tanh_ f64_ strided_ run - Unary elementwise
gelu_tanh, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ hardshrink_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_hardshrink_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ hardshrink_ backward_ bf16_ run - Hardshrink backward, bf16.
- baracuda_
kernels_ ⚠unary_ hardshrink_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_hardshrink_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ hardshrink_ backward_ f16_ run - Hardshrink backward, f16.
- baracuda_
kernels_ ⚠unary_ hardshrink_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_hardshrink_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ hardshrink_ backward_ f32_ run - Hardshrink backward, f32.
dx = (|x| > λ) ? dy : 0 with λ=0.5. Saved-x. - baracuda_
kernels_ ⚠unary_ hardshrink_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_hardshrink_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ hardshrink_ backward_ f64_ run - Hardshrink backward, f64.
- baracuda_
kernels_ ⚠unary_ hardshrink_ bf16_ can_ implement - Pre-launch implementability check for
unary_hardshrink_bf16. - baracuda_
kernels_ ⚠unary_ hardshrink_ bf16_ run - Unary elementwise
hardshrink(λ=0.5), bf16, contig. - baracuda_
kernels_ ⚠unary_ hardshrink_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_hardshrink_bf16_strided. - baracuda_
kernels_ ⚠unary_ hardshrink_ bf16_ strided_ run - Unary elementwise
hardshrink(λ=0.5), bf16, strided. - baracuda_
kernels_ ⚠unary_ hardshrink_ f16_ can_ implement - Pre-launch implementability check for
unary_hardshrink_f16. - baracuda_
kernels_ ⚠unary_ hardshrink_ f16_ run - Unary elementwise
hardshrink(λ=0.5), f16, contig. - baracuda_
kernels_ ⚠unary_ hardshrink_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_hardshrink_f16_strided. - baracuda_
kernels_ ⚠unary_ hardshrink_ f16_ strided_ run - Unary elementwise
hardshrink(λ=0.5), f16, strided. - baracuda_
kernels_ ⚠unary_ hardshrink_ f32_ can_ implement - Pre-launch implementability check for
unary_hardshrink_f32. - baracuda_
kernels_ ⚠unary_ hardshrink_ f32_ run - Unary elementwise
hardshrink(λ=0.5), f32, contig. - baracuda_
kernels_ ⚠unary_ hardshrink_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_hardshrink_f32_strided. - baracuda_
kernels_ ⚠unary_ hardshrink_ f32_ strided_ run - Unary elementwise
hardshrink(λ=0.5), f32, strided. - baracuda_
kernels_ ⚠unary_ hardshrink_ f64_ can_ implement - Pre-launch implementability check for
unary_hardshrink_f64. - baracuda_
kernels_ ⚠unary_ hardshrink_ f64_ run - Unary elementwise
hardshrink(λ=0.5), f64, contig. - baracuda_
kernels_ ⚠unary_ hardshrink_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_hardshrink_f64_strided. - baracuda_
kernels_ ⚠unary_ hardshrink_ f64_ strided_ run - Unary elementwise
hardshrink(λ=0.5), f64, strided. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_hardsigmoid_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ backward_ bf16_ run - Hardsigmoid backward, bf16.
- baracuda_
kernels_ ⚠unary_ hardsigmoid_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_hardsigmoid_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ backward_ f16_ run - Hardsigmoid backward, f16.
- baracuda_
kernels_ ⚠unary_ hardsigmoid_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_hardsigmoid_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ backward_ f32_ run - Hardsigmoid backward, f32.
dx = (-3 < x < 3) ? dy / 6 : 0. Saved-x. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_hardsigmoid_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ backward_ f64_ run - Hardsigmoid backward, f64.
- baracuda_
kernels_ ⚠unary_ hardsigmoid_ bf16_ can_ implement - Pre-launch implementability check for
unary_hardsigmoid_bf16. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ bf16_ run - Unary elementwise
hardsigmoid, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_hardsigmoid_bf16_strided. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ bf16_ strided_ run - Unary elementwise
hardsigmoid, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ f16_ can_ implement - Pre-launch implementability check for
unary_hardsigmoid_f16. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ f16_ run - Unary elementwise
hardsigmoid, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_hardsigmoid_f16_strided. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ f16_ strided_ run - Unary elementwise
hardsigmoid, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ f32_ can_ implement - Pre-launch implementability check for
unary_hardsigmoid_f32. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ f32_ run - Unary elementwise
hardsigmoid, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_hardsigmoid_f32_strided. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ f32_ strided_ run - Unary elementwise
hardsigmoid, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ f64_ can_ implement - Pre-launch implementability check for
unary_hardsigmoid_f64. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ f64_ run - Unary elementwise
hardsigmoid, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_hardsigmoid_f64_strided. - baracuda_
kernels_ ⚠unary_ hardsigmoid_ f64_ strided_ run - Unary elementwise
hardsigmoid, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ hardswish_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_hardswish_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ hardswish_ backward_ bf16_ run - Hardswish backward, bf16.
- baracuda_
kernels_ ⚠unary_ hardswish_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_hardswish_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ hardswish_ backward_ f16_ run - Hardswish backward, f16.
- baracuda_
kernels_ ⚠unary_ hardswish_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_hardswish_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ hardswish_ backward_ f32_ run - Hardswish backward, f32. Three-region piecewise gradient,
with slope (2x+3)/6 in the middle region. Saved-x. - baracuda_
kernels_ ⚠unary_ hardswish_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_hardswish_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ hardswish_ backward_ f64_ run - Hardswish backward, f64.
- baracuda_
kernels_ ⚠unary_ hardswish_ bf16_ can_ implement - Pre-launch implementability check for
unary_hardswish_bf16. - baracuda_
kernels_ ⚠unary_ hardswish_ bf16_ run - Unary elementwise
hardswish, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ hardswish_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_hardswish_bf16_strided. - baracuda_
kernels_ ⚠unary_ hardswish_ bf16_ strided_ run - Unary elementwise
hardswish, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ hardswish_ f16_ can_ implement - Pre-launch implementability check for
unary_hardswish_f16. - baracuda_
kernels_ ⚠unary_ hardswish_ f16_ run - Unary elementwise
hardswish, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ hardswish_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_hardswish_f16_strided. - baracuda_
kernels_ ⚠unary_ hardswish_ f16_ strided_ run - Unary elementwise
hardswish, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ hardswish_ f32_ can_ implement - Pre-launch implementability check for
unary_hardswish_f32. - baracuda_
kernels_ ⚠unary_ hardswish_ f32_ run - Unary elementwise
hardswish, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ hardswish_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_hardswish_f32_strided. - baracuda_
kernels_ ⚠unary_ hardswish_ f32_ strided_ run - Unary elementwise
hardswish, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ hardswish_ f64_ can_ implement - Pre-launch implementability check for
unary_hardswish_f64. - baracuda_
kernels_ ⚠unary_ hardswish_ f64_ run - Unary elementwise
hardswish, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ hardswish_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_hardswish_f64_strided. - baracuda_
kernels_ ⚠unary_ hardswish_ f64_ strided_ run - Unary elementwise
hardswish, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ hardtanh_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_hardtanh_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ hardtanh_ backward_ bf16_ run - Hardtanh backward, bf16.
- baracuda_
kernels_ ⚠unary_ hardtanh_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_hardtanh_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ hardtanh_ backward_ f16_ run - Hardtanh backward, f16.
- baracuda_
kernels_ ⚠unary_ hardtanh_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_hardtanh_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ hardtanh_ backward_ f32_ run - Hardtanh backward, f32.
dx = (-1 < x < 1) ? dy : 0. Saved-x. - baracuda_
kernels_ ⚠unary_ hardtanh_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_hardtanh_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ hardtanh_ backward_ f64_ run - Hardtanh backward, f64.
- baracuda_
kernels_ ⚠unary_ hardtanh_ bf16_ can_ implement - Pre-launch implementability check for
unary_hardtanh_bf16. - baracuda_
kernels_ ⚠unary_ hardtanh_ bf16_ run - Unary elementwise
hardtanh, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ hardtanh_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_hardtanh_bf16_strided. - baracuda_
kernels_ ⚠unary_ hardtanh_ bf16_ strided_ run - Unary elementwise
hardtanh, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ hardtanh_ f16_ can_ implement - Pre-launch implementability check for
unary_hardtanh_f16. - baracuda_
kernels_ ⚠unary_ hardtanh_ f16_ run - Unary elementwise
hardtanh, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ hardtanh_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_hardtanh_f16_strided. - baracuda_
kernels_ ⚠unary_ hardtanh_ f16_ strided_ run - Unary elementwise
hardtanh, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ hardtanh_ f32_ can_ implement - Pre-launch implementability check for
unary_hardtanh_f32. - baracuda_
kernels_ ⚠unary_ hardtanh_ f32_ run - Unary elementwise
hardtanh, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ hardtanh_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_hardtanh_f32_strided. - baracuda_
kernels_ ⚠unary_ hardtanh_ f32_ strided_ run - Unary elementwise
hardtanh, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ hardtanh_ f64_ can_ implement - Pre-launch implementability check for
unary_hardtanh_f64. - baracuda_
kernels_ ⚠unary_ hardtanh_ f64_ run - Unary elementwise
hardtanh, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ hardtanh_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_hardtanh_f64_strided. - baracuda_
kernels_ ⚠unary_ hardtanh_ f64_ strided_ run - Unary elementwise
hardtanh, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_leaky_relu_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ backward_ bf16_ run - LeakyReLU backward, bf16.
- baracuda_
kernels_ ⚠unary_ leaky_ relu_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_leaky_relu_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ backward_ f16_ run - LeakyReLU backward, f16.
- baracuda_
kernels_ ⚠unary_ leaky_ relu_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_leaky_relu_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ backward_ f32_ run - LeakyReLU backward, f32.
dx = (x > 0) ? dy : dy·α with α=0.01. Saved-x. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_leaky_relu_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ backward_ f64_ run - LeakyReLU backward, f64.
- baracuda_
kernels_ ⚠unary_ leaky_ relu_ bf16_ can_ implement - Pre-launch implementability check for
unary_leaky_relu_bf16. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ bf16_ run - Unary elementwise
leaky_relu(α=0.01), bf16, contig. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_leaky_relu_bf16_strided. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ bf16_ strided_ run - Unary elementwise
leaky_relu(α=0.01), bf16, strided. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ f16_ can_ implement - Pre-launch implementability check for
unary_leaky_relu_f16. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ f16_ run - Unary elementwise
leaky_relu(α=0.01), f16, contig. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_leaky_relu_f16_strided. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ f16_ strided_ run - Unary elementwise
leaky_relu(α=0.01), f16, strided. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ f32_ can_ implement - Pre-launch implementability check for
unary_leaky_relu_f32. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ f32_ run - Unary elementwise
leaky_relu(α=0.01), f32, contig. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_leaky_relu_f32_strided. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ f32_ strided_ run - Unary elementwise
leaky_relu(α=0.01), f32, strided. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ f64_ can_ implement - Pre-launch implementability check for
unary_leaky_relu_f64. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ f64_ run - Unary elementwise
leaky_relu(α=0.01), f64, contig. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_leaky_relu_f64_strided. - baracuda_
kernels_ ⚠unary_ leaky_ relu_ f64_ strided_ run - Unary elementwise
leaky_relu(α=0.01), f64, strided. - baracuda_
kernels_ ⚠unary_ lgamma_ bf16_ can_ implement - Pre-launch implementability check for
unary_lgamma_bf16. - baracuda_
kernels_ ⚠unary_ lgamma_ bf16_ run - Unary elementwise
lgamma, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ lgamma_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_lgamma_bf16_strided. - baracuda_
kernels_ ⚠unary_ lgamma_ bf16_ strided_ run - Unary elementwise
lgamma, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ lgamma_ f16_ can_ implement - Pre-launch implementability check for
unary_lgamma_f16. - baracuda_
kernels_ ⚠unary_ lgamma_ f16_ run - Unary elementwise
lgamma, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ lgamma_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_lgamma_f16_strided. - baracuda_
kernels_ ⚠unary_ lgamma_ f16_ strided_ run - Unary elementwise
lgamma, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ lgamma_ f32_ can_ implement - Pre-launch implementability check for
unary_lgamma_f32. - baracuda_
kernels_ ⚠unary_ lgamma_ f32_ run - Unary elementwise
lgamma, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ lgamma_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_lgamma_f32_strided. - baracuda_
kernels_ ⚠unary_ lgamma_ f32_ strided_ run - Unary elementwise
lgamma, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ lgamma_ f64_ can_ implement - Pre-launch implementability check for
unary_lgamma_f64. - baracuda_
kernels_ ⚠unary_ lgamma_ f64_ run - Unary elementwise
lgamma, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ lgamma_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_lgamma_f64_strided. - baracuda_
kernels_ ⚠unary_ lgamma_ f64_ strided_ run - Unary elementwise
lgamma, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ log1p_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_log1p_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ log1p_ backward_ bf16_ run - Log1p backward, bf16.
- baracuda_
kernels_ ⚠unary_ log1p_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_log1p_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ log1p_ backward_ f16_ run - Log1p backward, f16.
- baracuda_
kernels_ ⚠unary_ log1p_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_log1p_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ log1p_ backward_ f32_ run - Log1p backward, f32.
dx = dy / (1 + x). - baracuda_
kernels_ ⚠unary_ log1p_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_log1p_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ log1p_ backward_ f64_ run - Log1p backward, f64.
- baracuda_
kernels_ ⚠unary_ log1p_ bf16_ can_ implement - Pre-launch implementability check for
unary_log1p_bf16. - baracuda_
kernels_ ⚠unary_ log1p_ bf16_ run - Unary elementwise
log1p, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ log1p_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_log1p_bf16_strided. - baracuda_
kernels_ ⚠unary_ log1p_ bf16_ strided_ run - Unary elementwise
log1p, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ log1p_ f16_ can_ implement - Pre-launch implementability check for
unary_log1p_f16. - baracuda_
kernels_ ⚠unary_ log1p_ f16_ run - Unary elementwise
log1p, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ log1p_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_log1p_f16_strided. - baracuda_
kernels_ ⚠unary_ log1p_ f16_ strided_ run - Unary elementwise
log1p, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ log1p_ f32_ can_ implement - Pre-launch implementability check for
unary_log1p_f32. - baracuda_
kernels_ ⚠unary_ log1p_ f32_ run - Unary elementwise
log1p, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ log1p_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_log1p_f32_strided. - baracuda_
kernels_ ⚠unary_ log1p_ f32_ strided_ run - Unary elementwise
log1p, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ log1p_ f64_ can_ implement - Pre-launch implementability check for
unary_log1p_f64. - baracuda_
kernels_ ⚠unary_ log1p_ f64_ run - Unary elementwise
log1p, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ log1p_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_log1p_f64_strided. - baracuda_
kernels_ ⚠unary_ log1p_ f64_ strided_ run - Unary elementwise
log1p, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ log2_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_log2_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ log2_ backward_ bf16_ run - Log2 backward, bf16.
- baracuda_
kernels_ ⚠unary_ log2_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_log2_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ log2_ backward_ f16_ run - Log2 backward, f16.
- baracuda_
kernels_ ⚠unary_ log2_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_log2_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ log2_ backward_ f32_ run - Log2 backward, f32.
dx = dy / (x * ln(2)). - baracuda_
kernels_ ⚠unary_ log2_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_log2_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ log2_ backward_ f64_ run - Log2 backward, f64.
- baracuda_
kernels_ ⚠unary_ log2_ bf16_ can_ implement - Pre-launch implementability check for
unary_log2_bf16. - baracuda_
kernels_ ⚠unary_ log2_ bf16_ run - Unary elementwise
log2, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ log2_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_log2_bf16_strided. - baracuda_
kernels_ ⚠unary_ log2_ bf16_ strided_ run - Unary elementwise
log2, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ log2_ f16_ can_ implement - Pre-launch implementability check for
unary_log2_f16. - baracuda_
kernels_ ⚠unary_ log2_ f16_ run - Unary elementwise
log2, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ log2_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_log2_f16_strided. - baracuda_
kernels_ ⚠unary_ log2_ f16_ strided_ run - Unary elementwise
log2, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ log2_ f32_ can_ implement - Pre-launch implementability check for
unary_log2_f32. - baracuda_
kernels_ ⚠unary_ log2_ f32_ run - Unary elementwise
log2, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ log2_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_log2_f32_strided. - baracuda_
kernels_ ⚠unary_ log2_ f32_ strided_ run - Unary elementwise
log2, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ log2_ f64_ can_ implement - Pre-launch implementability check for
unary_log2_f64. - baracuda_
kernels_ ⚠unary_ log2_ f64_ run - Unary elementwise
log2, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ log2_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_log2_f64_strided. - baracuda_
kernels_ ⚠unary_ log2_ f64_ strided_ run - Unary elementwise
log2, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ log10_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_log10_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ log10_ backward_ bf16_ run - Log10 backward, bf16.
- baracuda_
kernels_ ⚠unary_ log10_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_log10_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ log10_ backward_ f16_ run - Log10 backward, f16.
- baracuda_
kernels_ ⚠unary_ log10_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_log10_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ log10_ backward_ f32_ run - Log10 backward, f32.
dx = dy / (x * ln(10)). - baracuda_
kernels_ ⚠unary_ log10_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_log10_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ log10_ backward_ f64_ run - Log10 backward, f64.
- baracuda_
kernels_ ⚠unary_ log10_ bf16_ can_ implement - Pre-launch implementability check for
unary_log10_bf16. - baracuda_
kernels_ ⚠unary_ log10_ bf16_ run - Unary elementwise
log10, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ log10_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_log10_bf16_strided. - baracuda_
kernels_ ⚠unary_ log10_ bf16_ strided_ run - Unary elementwise
log10, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ log10_ f16_ can_ implement - Pre-launch implementability check for
unary_log10_f16. - baracuda_
kernels_ ⚠unary_ log10_ f16_ run - Unary elementwise
log10, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ log10_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_log10_f16_strided. - baracuda_
kernels_ ⚠unary_ log10_ f16_ strided_ run - Unary elementwise
log10, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ log10_ f32_ can_ implement - Pre-launch implementability check for
unary_log10_f32. - baracuda_
kernels_ ⚠unary_ log10_ f32_ run - Unary elementwise
log10, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ log10_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_log10_f32_strided. - baracuda_
kernels_ ⚠unary_ log10_ f32_ strided_ run - Unary elementwise
log10, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ log10_ f64_ can_ implement - Pre-launch implementability check for
unary_log10_f64. - baracuda_
kernels_ ⚠unary_ log10_ f64_ run - Unary elementwise
log10, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ log10_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_log10_f64_strided. - baracuda_
kernels_ ⚠unary_ log10_ f64_ strided_ run - Unary elementwise
log10, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ log_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_log_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ log_ backward_ bf16_ run - Log backward, bf16.
- baracuda_
kernels_ ⚠unary_ log_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_log_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ log_ backward_ f16_ run - Log backward, f16.
- baracuda_
kernels_ ⚠unary_ log_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_log_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ log_ backward_ f32_ run - Log backward, f32.
dx = dy / x. - baracuda_
kernels_ ⚠unary_ log_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_log_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ log_ backward_ f64_ run - Log backward, f64.
- baracuda_
kernels_ ⚠unary_ log_ bf16_ can_ implement - Pre-launch implementability check for
unary_log_bf16. - baracuda_
kernels_ ⚠unary_ log_ bf16_ run - Unary elementwise
log, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ log_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_log_bf16_strided. - baracuda_
kernels_ ⚠unary_ log_ bf16_ strided_ run - Unary elementwise
log, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ log_ f16_ can_ implement - Pre-launch implementability check for
unary_log_f16. - baracuda_
kernels_ ⚠unary_ log_ f16_ run - Unary elementwise
log, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ log_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_log_f16_strided. - baracuda_
kernels_ ⚠unary_ log_ f16_ strided_ run - Unary elementwise
log, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ log_ f32_ can_ implement - Pre-launch implementability check for
unary_log_f32. - baracuda_
kernels_ ⚠unary_ log_ f32_ run - Unary elementwise
log, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ log_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_log_f32_strided. - baracuda_
kernels_ ⚠unary_ log_ f32_ strided_ run - Unary elementwise
log, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ log_ f64_ can_ implement - Pre-launch implementability check for
unary_log_f64. - baracuda_
kernels_ ⚠unary_ log_ f64_ run - Unary elementwise
log, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ log_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_log_f64_strided. - baracuda_
kernels_ ⚠unary_ log_ f64_ strided_ run - Unary elementwise
log, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ logit_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_logit_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ logit_ backward_ bf16_ run - Logit backward, bf16.
- baracuda_
kernels_ ⚠unary_ logit_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_logit_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ logit_ backward_ f16_ run - Logit backward, f16.
- baracuda_
kernels_ ⚠unary_ logit_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_logit_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ logit_ backward_ f32_ run - Logit backward, f32.
dx = dy / (x * (1 - x)). Domain 0 < x < 1. - baracuda_
kernels_ ⚠unary_ logit_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_logit_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ logit_ backward_ f64_ run - Logit backward, f64.
- baracuda_
kernels_ ⚠unary_ logit_ bf16_ can_ implement - Pre-launch implementability check for
unary_logit_bf16. - baracuda_
kernels_ ⚠unary_ logit_ bf16_ run - Unary elementwise
logit, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ logit_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_logit_bf16_strided. - baracuda_
kernels_ ⚠unary_ logit_ bf16_ strided_ run - Unary elementwise
logit, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ logit_ f16_ can_ implement - Pre-launch implementability check for
unary_logit_f16. - baracuda_
kernels_ ⚠unary_ logit_ f16_ run - Unary elementwise
logit, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ logit_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_logit_f16_strided. - baracuda_
kernels_ ⚠unary_ logit_ f16_ strided_ run - Unary elementwise
logit, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ logit_ f32_ can_ implement - Pre-launch implementability check for
unary_logit_f32. - baracuda_
kernels_ ⚠unary_ logit_ f32_ run - Unary elementwise
logit, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ logit_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_logit_f32_strided. - baracuda_
kernels_ ⚠unary_ logit_ f32_ strided_ run - Unary elementwise
logit, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ logit_ f64_ can_ implement - Pre-launch implementability check for
unary_logit_f64. - baracuda_
kernels_ ⚠unary_ logit_ f64_ run - Unary elementwise
logit, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ logit_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_logit_f64_strided. - baracuda_
kernels_ ⚠unary_ logit_ f64_ strided_ run - Unary elementwise
logit, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ mish_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_mish_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ mish_ backward_ bf16_ run - Mish backward, bf16.
- baracuda_
kernels_ ⚠unary_ mish_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_mish_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ mish_ backward_ f16_ run - Mish backward, f16.
- baracuda_
kernels_ ⚠unary_ mish_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_mish_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ mish_ backward_ f32_ run - Mish backward, f32.
dx = dy * (tanh(sp) + x*s*(1 - tanh(sp)^2)), sp = softplus(x), s = sigmoid(x). Saved-x. - baracuda_
kernels_ ⚠unary_ mish_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_mish_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ mish_ backward_ f64_ run - Mish backward, f64.
- baracuda_
kernels_ ⚠unary_ mish_ bf16_ can_ implement - Pre-launch implementability check for
unary_mish_bf16. - baracuda_
kernels_ ⚠unary_ mish_ bf16_ run - Unary elementwise
mish, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ mish_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_mish_bf16_strided. - baracuda_
kernels_ ⚠unary_ mish_ bf16_ strided_ run - Unary elementwise
mish, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ mish_ f16_ can_ implement - Pre-launch implementability check for
unary_mish_f16. - baracuda_
kernels_ ⚠unary_ mish_ f16_ run - Unary elementwise
mish, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ mish_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_mish_f16_strided. - baracuda_
kernels_ ⚠unary_ mish_ f16_ strided_ run - Unary elementwise
mish, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ mish_ f32_ can_ implement - Pre-launch implementability check for
unary_mish_f32. - baracuda_
kernels_ ⚠unary_ mish_ f32_ run - Unary elementwise
mish, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ mish_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_mish_f32_strided. - baracuda_
kernels_ ⚠unary_ mish_ f32_ strided_ run - Unary elementwise
mish, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ mish_ f64_ can_ implement - Pre-launch implementability check for
unary_mish_f64. - baracuda_
kernels_ ⚠unary_ mish_ f64_ run - Unary elementwise
mish, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ mish_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_mish_f64_strided. - baracuda_
kernels_ ⚠unary_ mish_ f64_ strided_ run - Unary elementwise
mish, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ neg_ bf16_ can_ implement - Pre-launch implementability check for
unary_neg_bf16. - baracuda_
kernels_ ⚠unary_ neg_ bf16_ run - Unary elementwise
neg, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ neg_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_neg_bf16_strided. - baracuda_
kernels_ ⚠unary_ neg_ bf16_ strided_ run - Unary elementwise
neg, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ neg_ f16_ can_ implement - Pre-launch implementability check for
unary_neg_f16. - baracuda_
kernels_ ⚠unary_ neg_ f16_ run - Unary elementwise
neg, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ neg_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_neg_f16_strided. - baracuda_
kernels_ ⚠unary_ neg_ f16_ strided_ run - Unary elementwise
neg, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ neg_ f32_ can_ implement - Pre-launch implementability check for
unary_neg_f32. Validates the problem size without launching a kernel. Returns the standard status code mapping. - baracuda_
kernels_ ⚠unary_ neg_ f32_ run - Unary elementwise
neg, f32 dtype, contiguous fast path. This is the unary-pointwise trailblazer: its safety contract carries over to every plain unary launcher (neg, abs, sqr, sqrt, rsqrt, recip, exp, log, sin, cos, tan, sign, floor, ceil, round, erf, relu, silu, gelu, tanh, sigmoid, etc.) and every parameterized-unary launcher (unary_param_* family: powi, threshold, elu, prelu, lerp, etc.) across all dtypes. See also binary_add_f32_run for the binary contig aliasing contract and ternary_clamp_f32_run for the ternary one. - baracuda_
kernels_ ⚠unary_ neg_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_neg_f32_strided. - baracuda_
kernels_ ⚠unary_ neg_ f32_ strided_ run - Unary elementwise
neg, f32 dtype, strided path. This is the unary-strided trailblazer: its safety contract (including aliasing) carries over to every other unary strided launcher and every parameterized-unary strided launcher (powi, threshold, elu, prelu, lerp) across all dtypes. - baracuda_
kernels_ ⚠unary_ neg_ f64_ can_ implement - Pre-launch implementability check for
unary_neg_f64. - baracuda_
kernels_ ⚠unary_ neg_ f64_ run - Unary elementwise
neg, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ neg_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_neg_f64_strided. - baracuda_
kernels_ ⚠unary_ neg_ f64_ strided_ run - Unary elementwise
neg, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ powf_ bf16_ can_ implement - Implementability check for unary_powf_bf16. - baracuda_
kernels_ ⚠unary_ powf_ bf16_ run - unary_powf, bf16, contig. - baracuda_
kernels_ ⚠unary_ powf_ bf16_ strided_ can_ implement - Implementability check for
baracuda_kernels_unary_powf_bf16_strided. Host-side only. - baracuda_
kernels_ ⚠unary_ powf_ bf16_ strided_ run - unary_powf, bf16, strided. - baracuda_
kernels_ ⚠unary_ powf_ f16_ can_ implement - Implementability check for unary_powf_f16. - baracuda_
kernels_ ⚠unary_ powf_ f16_ run - unary_powf, f16, contig. Computed via an f32 detour. - baracuda_
kernels_ ⚠unary_ powf_ f16_ strided_ can_ implement - Implementability check for
baracuda_kernels_unary_powf_f16_strided. Host-side only. - baracuda_
kernels_ ⚠unary_ powf_ f16_ strided_ run - unary_powf, f16, strided. - baracuda_
kernels_ ⚠unary_ powf_ f32_ can_ implement - Implementability check for
unary_powf_f32. - baracuda_
kernels_ ⚠unary_ powf_ f32_ run - Unary elementwise
pow(x, exponent), f32, contiguous. - baracuda_
kernels_ ⚠unary_ powf_ f32_ strided_ can_ implement - Implementability check for
baracuda_kernels_unary_powf_f32_strided. Host-side only. - baracuda_
kernels_ ⚠unary_ powf_ f32_ strided_ run - unary_powf, f32, strided sibling. - baracuda_
kernels_ ⚠unary_ powf_ f64_ can_ implement - Implementability check for unary_powf_f64. - baracuda_
kernels_ ⚠unary_ powf_ f64_ run - unary_powf, f64, contiguous. pow (libdevice) is full double precision; the f32 exponent is widened once at kernel entry. - baracuda_
kernels_ ⚠unary_ powf_ f64_ strided_ can_ implement - Implementability check for
baracuda_kernels_unary_powf_f64_strided. Host-side only. - baracuda_
kernels_ ⚠unary_ powf_ f64_ strided_ run - unary_powf, f64, strided sibling. - baracuda_
kernels_ ⚠unary_ powi_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_powi_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ powi_ backward_ bf16_ run - powi backward, bf16. - baracuda_
kernels_ ⚠unary_ powi_ backward_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_powi_backward_bf16_strided. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ powi_ backward_ bf16_ strided_ run - powi backward, bf16, strided. - baracuda_
kernels_ ⚠unary_ powi_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_powi_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ powi_ backward_ f16_ run - powi backward, f16. - baracuda_
kernels_ ⚠unary_ powi_ backward_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_powi_backward_f16_strided. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ powi_ backward_ f16_ strided_ run - powi backward, f16, strided. - baracuda_
kernels_ ⚠unary_ powi_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_powi_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ powi_ backward_ f32_ run - powi backward: dx = n · x^(n-1) · dy, f32. Saved-x. - baracuda_
kernels_ ⚠unary_ powi_ backward_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_powi_backward_f32_strided. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ powi_ backward_ f32_ strided_ run - powi backward, f32, strided. - baracuda_
kernels_ ⚠unary_ powi_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_powi_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ powi_ backward_ f64_ run - powi backward, f64. - baracuda_
kernels_ ⚠unary_ powi_ backward_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_powi_backward_f64_strided. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ powi_ backward_ f64_ strided_ run - powi backward, f64, strided. - baracuda_
kernels_ ⚠unary_ powi_ bf16_ can_ implement - Implementability check for
baracuda_kernels_unary_powi_bf16. Host-side only. - baracuda_
kernels_ ⚠unary_ powi_ bf16_ run - powi forward, bf16. - baracuda_
kernels_ ⚠unary_ powi_ bf16_ strided_ can_ implement - Implementability check for
baracuda_kernels_unary_powi_bf16_strided. Host-side only. - baracuda_
kernels_ ⚠unary_ powi_ bf16_ strided_ run - powi forward, bf16, strided. - baracuda_
kernels_ ⚠unary_ powi_ f16_ can_ implement - Implementability check for
baracuda_kernels_unary_powi_f16. Host-side only. - baracuda_
kernels_ ⚠unary_ powi_ f16_ run - powi forward, f16. - baracuda_
kernels_ ⚠unary_ powi_ f16_ strided_ can_ implement - Implementability check for
baracuda_kernels_unary_powi_f16_strided. Host-side only. - baracuda_
kernels_ ⚠unary_ powi_ f16_ strided_ run - powi forward, f16, strided. - baracuda_
kernels_ ⚠unary_ powi_ f32_ can_ implement - Implementability check for
baracuda_kernels_unary_powi_f32. Host-side only. - baracuda_
kernels_ ⚠unary_ powi_ f32_ run - Unary elementwise
powi(x; n) = x^n (integer exponent), f32, contiguous. - baracuda_
kernels_ ⚠unary_ powi_ f32_ strided_ can_ implement - Implementability check for
baracuda_kernels_unary_powi_f32_strided. Host-side only. - baracuda_
kernels_ ⚠unary_ powi_ f32_ strided_ run - powi forward, f32, strided. - baracuda_
kernels_ ⚠unary_ powi_ f64_ can_ implement - Implementability check for
baracuda_kernels_unary_powi_f64. Host-side only. - baracuda_
kernels_ ⚠unary_ powi_ f64_ run - powi forward, f64. - baracuda_
kernels_ ⚠unary_ powi_ f64_ strided_ can_ implement - Implementability check for
baracuda_kernels_unary_powi_f64_strided. Host-side only. - baracuda_
kernels_ ⚠unary_ powi_ f64_ strided_ run - powi forward, f64, strided. - baracuda_
kernels_ ⚠unary_ reciprocal_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_reciprocal_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ reciprocal_ backward_ bf16_ run - Reciprocal backward, bf16.
- baracuda_
kernels_ ⚠unary_ reciprocal_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_reciprocal_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ reciprocal_ backward_ f16_ run - Reciprocal backward, f16.
- baracuda_
kernels_ ⚠unary_ reciprocal_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_reciprocal_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ reciprocal_ backward_ f32_ run - Reciprocal backward, f32.
dx = -dy / x². Domain: x != 0. - baracuda_
kernels_ ⚠unary_ reciprocal_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_reciprocal_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ reciprocal_ backward_ f64_ run - Reciprocal backward, f64.
- baracuda_
kernels_ ⚠unary_ reciprocal_ bf16_ can_ implement - Pre-launch implementability check for
unary_reciprocal_bf16. - baracuda_
kernels_ ⚠unary_ reciprocal_ bf16_ run - Unary elementwise
reciprocal, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ reciprocal_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_reciprocal_bf16_strided. - baracuda_
kernels_ ⚠unary_ reciprocal_ bf16_ strided_ run - Unary elementwise
reciprocal, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ reciprocal_ f16_ can_ implement - Pre-launch implementability check for
unary_reciprocal_f16. - baracuda_
kernels_ ⚠unary_ reciprocal_ f16_ run - Unary elementwise
reciprocal, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ reciprocal_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_reciprocal_f16_strided. - baracuda_
kernels_ ⚠unary_ reciprocal_ f16_ strided_ run - Unary elementwise
reciprocal, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ reciprocal_ f32_ can_ implement - Pre-launch implementability check for
unary_reciprocal_f32. - baracuda_
kernels_ ⚠unary_ reciprocal_ f32_ run - Unary elementwise
reciprocal, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ reciprocal_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_reciprocal_f32_strided. - baracuda_
kernels_ ⚠unary_ reciprocal_ f32_ strided_ run - Unary elementwise
reciprocal, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ reciprocal_ f64_ can_ implement - Pre-launch implementability check for
unary_reciprocal_f64. - baracuda_
kernels_ ⚠unary_ reciprocal_ f64_ run - Unary elementwise
reciprocal, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ reciprocal_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_reciprocal_f64_strided. - baracuda_
kernels_ ⚠unary_ reciprocal_ f64_ strided_ run - Unary elementwise
reciprocal, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ relu6_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_relu6_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ relu6_ backward_ bf16_ run - ReLU6 backward, bf16.
- baracuda_
kernels_ ⚠unary_ relu6_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_relu6_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ relu6_ backward_ f16_ run - ReLU6 backward, f16.
- baracuda_
kernels_ ⚠unary_ relu6_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_relu6_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ relu6_ backward_ f32_ run - ReLU6 backward, f32.
dx = (0 < x < 6) ? dy : 0. Saved-x. - baracuda_
kernels_ ⚠unary_ relu6_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_relu6_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ relu6_ backward_ f64_ run - ReLU6 backward, f64.
- baracuda_
kernels_ ⚠unary_ relu6_ bf16_ can_ implement - Pre-launch implementability check for
unary_relu6_bf16. - baracuda_
kernels_ ⚠unary_ relu6_ bf16_ run - Unary elementwise
relu6, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ relu6_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_relu6_bf16_strided. - baracuda_
kernels_ ⚠unary_ relu6_ bf16_ strided_ run - Unary elementwise
relu6, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ relu6_ f16_ can_ implement - Pre-launch implementability check for
unary_relu6_f16. - baracuda_
kernels_ ⚠unary_ relu6_ f16_ run - Unary elementwise
relu6, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ relu6_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_relu6_f16_strided. - baracuda_
kernels_ ⚠unary_ relu6_ f16_ strided_ run - Unary elementwise
relu6, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ relu6_ f32_ can_ implement - Pre-launch implementability check for
unary_relu6_f32. - baracuda_
kernels_ ⚠unary_ relu6_ f32_ run - Unary elementwise
relu6, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ relu6_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_relu6_f32_strided. - baracuda_
kernels_ ⚠unary_ relu6_ f32_ strided_ run - Unary elementwise
relu6, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ relu6_ f64_ can_ implement - Pre-launch implementability check for
unary_relu6_f64. - baracuda_
kernels_ ⚠unary_ relu6_ f64_ run - Unary elementwise
relu6, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ relu6_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_relu6_f64_strided. - baracuda_
kernels_ ⚠unary_ relu6_ f64_ strided_ run - Unary elementwise
relu6, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ relu_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_relu_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ relu_ backward_ bf16_ run - ReLU backward, bf16.
- baracuda_
kernels_ ⚠unary_ relu_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_relu_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ relu_ backward_ f16_ run - ReLU backward, f16.
- baracuda_
kernels_ ⚠unary_ relu_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_relu_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ relu_ backward_ f32_ run - ReLU backward, f32.
dx = (x > 0) ? dy : 0. Saved-x. This is the activation-BW trailblazer: its aliasing contract carries over to every other unary_<op>_backward_<dt>_run (gelu, silu, tanh, sigmoid, elu, leaky_relu, mish, hardswish, hardsigmoid, gelu_tanh, erf, erfc, etc.) across all dtypes, both saved-x and saved-y variants. - baracuda_
kernels_ ⚠unary_ relu_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_relu_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ relu_ backward_ f64_ run - ReLU backward, f64.
- baracuda_
kernels_ ⚠unary_ relu_ bf16_ can_ implement - Pre-launch implementability check for
unary_relu_bf16. - baracuda_
kernels_ ⚠unary_ relu_ bf16_ run - Unary elementwise
relu, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ relu_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_relu_bf16_strided. - baracuda_
kernels_ ⚠unary_ relu_ bf16_ strided_ run - Unary elementwise
relu, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ relu_ f16_ can_ implement - Pre-launch implementability check for
unary_relu_f16. - baracuda_
kernels_ ⚠unary_ relu_ f16_ run - Unary elementwise
relu, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ relu_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_relu_f16_strided. - baracuda_
kernels_ ⚠unary_ relu_ f16_ strided_ run - Unary elementwise
relu, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ relu_ f32_ can_ implement - Pre-launch implementability check for
unary_relu_f32. - baracuda_
kernels_ ⚠unary_ relu_ f32_ run - Unary elementwise
relu, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ relu_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_relu_f32_strided. - baracuda_
kernels_ ⚠unary_ relu_ f32_ strided_ run - Unary elementwise
relu, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ relu_ f64_ can_ implement - Pre-launch implementability check for
unary_relu_f64. - baracuda_
kernels_ ⚠unary_ relu_ f64_ run - Unary elementwise
relu, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ relu_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_relu_f64_strided. - baracuda_
kernels_ ⚠unary_ relu_ f64_ strided_ run - Unary elementwise
relu, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ round_ bf16_ can_ implement - Pre-launch implementability check for
unary_round_bf16. - baracuda_
kernels_ ⚠unary_ round_ bf16_ run - Unary elementwise
round, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ round_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_round_bf16_strided. - baracuda_
kernels_ ⚠unary_ round_ bf16_ strided_ run - Unary elementwise
round, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ round_ f16_ can_ implement - Pre-launch implementability check for
unary_round_f16. - baracuda_
kernels_ ⚠unary_ round_ f16_ run - Unary elementwise
round, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ round_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_round_f16_strided. - baracuda_
kernels_ ⚠unary_ round_ f16_ strided_ run - Unary elementwise
round, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ round_ f32_ can_ implement - Pre-launch implementability check for
unary_round_f32. - baracuda_
kernels_ ⚠unary_ round_ f32_ run - Unary elementwise
round, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ round_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_round_f32_strided. - baracuda_
kernels_ ⚠unary_ round_ f32_ strided_ run - Unary elementwise
round, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ round_ f64_ can_ implement - Pre-launch implementability check for
unary_round_f64. - baracuda_
kernels_ ⚠unary_ round_ f64_ run - Unary elementwise
round, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ round_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_round_f64_strided. - baracuda_
kernels_ ⚠unary_ round_ f64_ strided_ run - Unary elementwise
round, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ rsqrt_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_rsqrt_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ rsqrt_ backward_ bf16_ run - Rsqrt backward, bf16.
- baracuda_
kernels_ ⚠unary_ rsqrt_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_rsqrt_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ rsqrt_ backward_ f16_ run - Rsqrt backward, f16.
- baracuda_
kernels_ ⚠unary_ rsqrt_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_rsqrt_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ rsqrt_ backward_ f32_ run - Rsqrt backward, f32.
dx = -0.5 * dy * y³. Saved-y. - baracuda_
kernels_ ⚠unary_ rsqrt_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_rsqrt_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ rsqrt_ backward_ f64_ run - Rsqrt backward, f64.
- baracuda_
kernels_ ⚠unary_ rsqrt_ bf16_ can_ implement - Pre-launch implementability check for
unary_rsqrt_bf16. - baracuda_
kernels_ ⚠unary_ rsqrt_ bf16_ run - Unary elementwise
rsqrt, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ rsqrt_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_rsqrt_bf16_strided. - baracuda_
kernels_ ⚠unary_ rsqrt_ bf16_ strided_ run - Unary elementwise
rsqrt, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ rsqrt_ f16_ can_ implement - Pre-launch implementability check for
unary_rsqrt_f16. - baracuda_
kernels_ ⚠unary_ rsqrt_ f16_ run - Unary elementwise
rsqrt, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ rsqrt_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_rsqrt_f16_strided. - baracuda_
kernels_ ⚠unary_ rsqrt_ f16_ strided_ run - Unary elementwise
rsqrt, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ rsqrt_ f32_ can_ implement - Pre-launch implementability check for
unary_rsqrt_f32. - baracuda_
kernels_ ⚠unary_ rsqrt_ f32_ run - Unary elementwise
rsqrt, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ rsqrt_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_rsqrt_f32_strided. - baracuda_
kernels_ ⚠unary_ rsqrt_ f32_ strided_ run - Unary elementwise
rsqrt, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ rsqrt_ f64_ can_ implement - Pre-launch implementability check for
unary_rsqrt_f64. - baracuda_
kernels_ ⚠unary_ rsqrt_ f64_ run - Unary elementwise
rsqrt, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ rsqrt_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_rsqrt_f64_strided. - baracuda_
kernels_ ⚠unary_ rsqrt_ f64_ strided_ run - Unary elementwise
rsqrt, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ selu_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_selu_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ selu_ backward_ bf16_ run - SELU backward, bf16.
- baracuda_
kernels_ ⚠unary_ selu_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_selu_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ selu_ backward_ f16_ run - SELU backward, f16.
- baracuda_
kernels_ ⚠unary_ selu_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_selu_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ selu_ backward_ f32_ run - SELU backward, f32.
x > 0 → dy*scale; x <= 0 → dy*scale*alpha*exp(x). Saved-x. - baracuda_
kernels_ ⚠unary_ selu_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_selu_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ selu_ backward_ f64_ run - SELU backward, f64.
- baracuda_
kernels_ ⚠unary_ selu_ bf16_ can_ implement - Pre-launch implementability check for
unary_selu_bf16. - baracuda_
kernels_ ⚠unary_ selu_ bf16_ run - Unary elementwise
selu, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ selu_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_selu_bf16_strided. - baracuda_
kernels_ ⚠unary_ selu_ bf16_ strided_ run - Unary elementwise
selu, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ selu_ f16_ can_ implement - Pre-launch implementability check for
unary_selu_f16. - baracuda_
kernels_ ⚠unary_ selu_ f16_ run - Unary elementwise
selu, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ selu_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_selu_f16_strided. - baracuda_
kernels_ ⚠unary_ selu_ f16_ strided_ run - Unary elementwise
selu, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ selu_ f32_ can_ implement - Pre-launch implementability check for
unary_selu_f32. - baracuda_
kernels_ ⚠unary_ selu_ f32_ run - Unary elementwise
selu, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ selu_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_selu_f32_strided. - baracuda_
kernels_ ⚠unary_ selu_ f32_ strided_ run - Unary elementwise
selu, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ selu_ f64_ can_ implement - Pre-launch implementability check for
unary_selu_f64. - baracuda_
kernels_ ⚠unary_ selu_ f64_ run - Unary elementwise
selu, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ selu_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_selu_f64_strided. - baracuda_
kernels_ ⚠unary_ selu_ f64_ strided_ run - Unary elementwise
selu, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ sigmoid_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_sigmoid_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ sigmoid_ backward_ bf16_ run - Sigmoid backward, bf16.
- baracuda_
kernels_ ⚠unary_ sigmoid_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_sigmoid_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ sigmoid_ backward_ f16_ run - Sigmoid backward, f16.
- baracuda_
kernels_ ⚠unary_ sigmoid_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_sigmoid_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ sigmoid_ backward_ f32_ run - Sigmoid backward, f32.
dx = dy * y * (1 - y). Saved-y. - baracuda_
kernels_ ⚠unary_ sigmoid_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_sigmoid_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ sigmoid_ backward_ f64_ run - Sigmoid backward, f64.
- baracuda_
kernels_ ⚠unary_ sigmoid_ bf16_ can_ implement - Pre-launch implementability check for
unary_sigmoid_bf16. - baracuda_
kernels_ ⚠unary_ sigmoid_ bf16_ run - Unary elementwise
sigmoid, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ sigmoid_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_sigmoid_bf16_strided. - baracuda_
kernels_ ⚠unary_ sigmoid_ bf16_ strided_ run - Unary elementwise
sigmoid, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ sigmoid_ f16_ can_ implement - Pre-launch implementability check for
unary_sigmoid_f16. - baracuda_
kernels_ ⚠unary_ sigmoid_ f16_ run - Unary elementwise
sigmoid, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ sigmoid_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_sigmoid_f16_strided. - baracuda_
kernels_ ⚠unary_ sigmoid_ f16_ strided_ run - Unary elementwise
sigmoid, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ sigmoid_ f32_ can_ implement - Pre-launch implementability check for
unary_sigmoid_f32. - baracuda_
kernels_ ⚠unary_ sigmoid_ f32_ run - Unary elementwise
sigmoid, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ sigmoid_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_sigmoid_f32_strided. - baracuda_
kernels_ ⚠unary_ sigmoid_ f32_ strided_ run - Unary elementwise
sigmoid, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ sigmoid_ f64_ can_ implement - Pre-launch implementability check for
unary_sigmoid_f64. - baracuda_
kernels_ ⚠unary_ sigmoid_ f64_ run - Unary elementwise
sigmoid, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ sigmoid_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_sigmoid_f64_strided. - baracuda_
kernels_ ⚠unary_ sigmoid_ f64_ strided_ run - Unary elementwise
sigmoid, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ sign_ bf16_ can_ implement - Pre-launch implementability check for
unary_sign_bf16. - baracuda_
kernels_ ⚠unary_ sign_ bf16_ run - Unary elementwise
sign, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ sign_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_sign_bf16_strided. - baracuda_
kernels_ ⚠unary_ sign_ bf16_ strided_ run - Unary elementwise
sign, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ sign_ f16_ can_ implement - Pre-launch implementability check for
unary_sign_f16. - baracuda_
kernels_ ⚠unary_ sign_ f16_ run - Unary elementwise
sign, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ sign_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_sign_f16_strided. - baracuda_
kernels_ ⚠unary_ sign_ f16_ strided_ run - Unary elementwise
sign, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ sign_ f32_ can_ implement - Pre-launch implementability check for
unary_sign_f32. - baracuda_
kernels_ ⚠unary_ sign_ f32_ run - Unary elementwise
sign, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ sign_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_sign_f32_strided. - baracuda_
kernels_ ⚠unary_ sign_ f32_ strided_ run - Unary elementwise
sign, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ sign_ f64_ can_ implement - Pre-launch implementability check for
unary_sign_f64. - baracuda_
kernels_ ⚠unary_ sign_ f64_ run - Unary elementwise
sign, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ sign_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_sign_f64_strided. - baracuda_
kernels_ ⚠unary_ sign_ f64_ strided_ run - Unary elementwise
sign, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ silu_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_silu_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ silu_ backward_ bf16_ run - SiLU backward, bf16.
- baracuda_
kernels_ ⚠unary_ silu_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_silu_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ silu_ backward_ f16_ run - SiLU backward, f16.
- baracuda_
kernels_ ⚠unary_ silu_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_silu_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ silu_ backward_ f32_ run - SiLU (Swish) backward, f32.
dx = dy * s * (1 + x*(1-s)) with s = sigmoid(x). Saved-x. - baracuda_
kernels_ ⚠unary_ silu_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_silu_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ silu_ backward_ f64_ run - SiLU backward, f64.
- baracuda_
kernels_ ⚠unary_ silu_ bf16_ can_ implement - Pre-launch implementability check for
unary_silu_bf16. - baracuda_
kernels_ ⚠unary_ silu_ bf16_ run - Unary elementwise
silu, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ silu_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_silu_bf16_strided. - baracuda_
kernels_ ⚠unary_ silu_ bf16_ strided_ run - Unary elementwise
silu, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ silu_ f16_ can_ implement - Pre-launch implementability check for
unary_silu_f16. - baracuda_
kernels_ ⚠unary_ silu_ f16_ run - Unary elementwise
silu, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ silu_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_silu_f16_strided. - baracuda_
kernels_ ⚠unary_ silu_ f16_ strided_ run - Unary elementwise
silu, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ silu_ f32_ can_ implement - Pre-launch implementability check for
unary_silu_f32. - baracuda_
kernels_ ⚠unary_ silu_ f32_ run - Unary elementwise
silu, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ silu_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_silu_f32_strided. - baracuda_
kernels_ ⚠unary_ silu_ f32_ strided_ run - Unary elementwise
silu, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ silu_ f64_ can_ implement - Pre-launch implementability check for
unary_silu_f64. - baracuda_
kernels_ ⚠unary_ silu_ f64_ run - Unary elementwise
silu, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ silu_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_silu_f64_strided. - baracuda_
kernels_ ⚠unary_ silu_ f64_ strided_ run - Unary elementwise
silu, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ sin_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_sin_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ sin_ backward_ bf16_ run - Sin backward, bf16.
- baracuda_
kernels_ ⚠unary_ sin_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_sin_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ sin_ backward_ f16_ run - Sin backward, f16.
- baracuda_
kernels_ ⚠unary_ sin_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_sin_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ sin_ backward_ f32_ run - Sin backward, f32.
dx = dy * cos(x). Caller must pass the forward input x as saved. - baracuda_
kernels_ ⚠unary_ sin_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_sin_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ sin_ backward_ f64_ run - Sin backward, f64.
- baracuda_
kernels_ ⚠unary_ sin_ bf16_ can_ implement - Pre-launch implementability check for
unary_sin_bf16. - baracuda_
kernels_ ⚠unary_ sin_ bf16_ run - Unary elementwise
sin, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ sin_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_sin_bf16_strided. - baracuda_
kernels_ ⚠unary_ sin_ bf16_ strided_ run - Unary elementwise
sin, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ sin_ f16_ can_ implement - Pre-launch implementability check for
unary_sin_f16. - baracuda_
kernels_ ⚠unary_ sin_ f16_ run - Unary elementwise
sin, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ sin_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_sin_f16_strided. - baracuda_
kernels_ ⚠unary_ sin_ f16_ strided_ run - Unary elementwise
sin, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ sin_ f32_ can_ implement - Pre-launch implementability check for
unary_sin_f32. - baracuda_
kernels_ ⚠unary_ sin_ f32_ run - Unary elementwise
sin, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ sin_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_sin_f32_strided. - baracuda_
kernels_ ⚠unary_ sin_ f32_ strided_ run - Unary elementwise
sin, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ sin_ f64_ can_ implement - Pre-launch implementability check for
unary_sin_f64. - baracuda_
kernels_ ⚠unary_ sin_ f64_ run - Unary elementwise
sin, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ sin_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_sin_f64_strided. - baracuda_
kernels_ ⚠unary_ sin_ f64_ strided_ run - Unary elementwise
sin, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ sinh_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_sinh_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ sinh_ backward_ bf16_ run - Sinh backward, bf16.
- baracuda_
kernels_ ⚠unary_ sinh_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_sinh_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ sinh_ backward_ f16_ run - Sinh backward, f16.
- baracuda_
kernels_ ⚠unary_ sinh_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_sinh_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ sinh_ backward_ f32_ run - Sinh backward, f32.
dx = dy * cosh(x). Saved-x. - baracuda_
kernels_ ⚠unary_ sinh_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_sinh_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ sinh_ backward_ f64_ run - Sinh backward, f64.
- baracuda_
kernels_ ⚠unary_ sinh_ bf16_ can_ implement - Pre-launch implementability check for
unary_sinh_bf16. - baracuda_
kernels_ ⚠unary_ sinh_ bf16_ run - Unary elementwise
sinh, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ sinh_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_sinh_bf16_strided. - baracuda_
kernels_ ⚠unary_ sinh_ bf16_ strided_ run - Unary elementwise
sinh, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ sinh_ f16_ can_ implement - Pre-launch implementability check for
unary_sinh_f16. - baracuda_
kernels_ ⚠unary_ sinh_ f16_ run - Unary elementwise
sinh, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ sinh_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_sinh_f16_strided. - baracuda_
kernels_ ⚠unary_ sinh_ f16_ strided_ run - Unary elementwise
sinh, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ sinh_ f32_ can_ implement - Pre-launch implementability check for
unary_sinh_f32. - baracuda_
kernels_ ⚠unary_ sinh_ f32_ run - Unary elementwise
sinh, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ sinh_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_sinh_f32_strided. - baracuda_
kernels_ ⚠unary_ sinh_ f32_ strided_ run - Unary elementwise
sinh, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ sinh_ f64_ can_ implement - Pre-launch implementability check for
unary_sinh_f64. - baracuda_
kernels_ ⚠unary_ sinh_ f64_ run - Unary elementwise
sinh, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ sinh_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_sinh_f64_strided. - baracuda_
kernels_ ⚠unary_ sinh_ f64_ strided_ run - Unary elementwise
sinh, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ softplus_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_softplus_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ softplus_ backward_ bf16_ run - Softplus backward, bf16.
- baracuda_
kernels_ ⚠unary_ softplus_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_softplus_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ softplus_ backward_ f16_ run - Softplus backward, f16.
- baracuda_
kernels_ ⚠unary_ softplus_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_softplus_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ softplus_ backward_ f32_ run - Softplus backward, f32.
dx = dy / (1 + exp(-x)). Saved-x. - baracuda_
kernels_ ⚠unary_ softplus_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_softplus_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ softplus_ backward_ f64_ run - Softplus backward, f64.
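The softplus backward rule listed above, `dx = dy / (1 + exp(-x))` with `x` the saved forward input, can be checked against a small host-side reference. This is a minimal sketch only: `softplus_backward_ref` is a hypothetical helper written for illustration, not a function exported by this crate, and it makes no claim about how the device kernel is implemented.

```rust
/// CPU reference for the documented rule dx = dy / (1 + exp(-x)).
/// Hypothetical helper for illustration; not part of this crate.
fn softplus_backward_ref(dy: &[f32], x: &[f32]) -> Vec<f32> {
    assert_eq!(dy.len(), x.len(), "dy and x must have equal length");
    dy.iter()
        .zip(x.iter())
        // 1 / (1 + exp(-x)) is the logistic sigmoid of the saved input.
        .map(|(&g, &xi)| g / (1.0 + (-xi).exp()))
        .collect()
}
```

At `x = 0` the factor is exactly one half, so `dy = 2.0` yields `dx = 1.0`. For large positive `x` the gradient passes through unchanged, and for large negative `x` it goes to zero.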
- baracuda_
kernels_ ⚠unary_ softplus_ bf16_ can_ implement - Pre-launch implementability check for
unary_softplus_bf16. - baracuda_
kernels_ ⚠unary_ softplus_ bf16_ run - Unary elementwise
softplus, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ softplus_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_softplus_bf16_strided. - baracuda_
kernels_ ⚠unary_ softplus_ bf16_ strided_ run - Unary elementwise
softplus, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ softplus_ f16_ can_ implement - Pre-launch implementability check for
unary_softplus_f16. - baracuda_
kernels_ ⚠unary_ softplus_ f16_ run - Unary elementwise
softplus, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ softplus_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_softplus_f16_strided. - baracuda_
kernels_ ⚠unary_ softplus_ f16_ strided_ run - Unary elementwise
softplus, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ softplus_ f32_ can_ implement - Pre-launch implementability check for
unary_softplus_f32. - baracuda_
kernels_ ⚠unary_ softplus_ f32_ run - Unary elementwise
softplus, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ softplus_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_softplus_f32_strided. - baracuda_
kernels_ ⚠unary_ softplus_ f32_ strided_ run - Unary elementwise
softplus, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ softplus_ f64_ can_ implement - Pre-launch implementability check for
unary_softplus_f64. - baracuda_
kernels_ ⚠unary_ softplus_ f64_ run - Unary elementwise
softplus, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ softplus_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_softplus_f64_strided. - baracuda_
kernels_ ⚠unary_ softplus_ f64_ strided_ run - Unary elementwise
softplus, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ softshrink_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_softshrink_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ softshrink_ backward_ bf16_ run - Softshrink backward, bf16.
- baracuda_
kernels_ ⚠unary_ softshrink_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_softshrink_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ softshrink_ backward_ f16_ run - Softshrink backward, f16.
- baracuda_
kernels_ ⚠unary_ softshrink_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_softshrink_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ softshrink_ backward_ f32_ run - Softshrink backward, f32.
dx = (|x| > λ) ? dy : 0 with λ=0.5. Saved-x. - baracuda_
kernels_ ⚠unary_ softshrink_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_softshrink_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ softshrink_ backward_ f64_ run - Softshrink backward, f64.
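The softshrink backward rule above passes the gradient through only where `|x|` strictly exceeds λ. A host-side sketch of that rule follows, assuming the fixed λ = 0.5 stated in the entry. `softshrink_backward_ref` and `LAMBDA` are hypothetical names used for illustration and are not exported by this crate.

```rust
/// λ as documented for the softshrink kernels (assumed fixed at 0.5).
const LAMBDA: f32 = 0.5;

/// CPU reference for dx = (|x| > λ) ? dy : 0, with x the saved input.
/// Hypothetical helper for illustration; not part of this crate.
fn softshrink_backward_ref(dy: &[f32], x: &[f32]) -> Vec<f32> {
    assert_eq!(dy.len(), x.len(), "dy and x must have equal length");
    dy.iter()
        .zip(x.iter())
        // The comparison is strict, so |x| == λ yields a zero gradient.
        .map(|(&g, &xi)| if xi.abs() > LAMBDA { g } else { 0.0 })
        .collect()
}
```

Note the boundary case: an input of exactly `0.5` or `-0.5` falls in the zero region because the comparison is strict.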
- baracuda_
kernels_ ⚠unary_ softshrink_ bf16_ can_ implement - Pre-launch implementability check for
unary_softshrink_bf16. - baracuda_
kernels_ ⚠unary_ softshrink_ bf16_ run - Unary elementwise
softshrink(λ=0.5), bf16, contig. - baracuda_
kernels_ ⚠unary_ softshrink_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_softshrink_bf16_strided. - baracuda_
kernels_ ⚠unary_ softshrink_ bf16_ strided_ run - Unary elementwise
softshrink(λ=0.5), bf16, strided. - baracuda_
kernels_ ⚠unary_ softshrink_ f16_ can_ implement - Pre-launch implementability check for
unary_softshrink_f16. - baracuda_
kernels_ ⚠unary_ softshrink_ f16_ run - Unary elementwise
softshrink(λ=0.5), f16, contig. - baracuda_
kernels_ ⚠unary_ softshrink_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_softshrink_f16_strided. - baracuda_
kernels_ ⚠unary_ softshrink_ f16_ strided_ run - Unary elementwise
softshrink(λ=0.5), f16, strided. - baracuda_
kernels_ ⚠unary_ softshrink_ f32_ can_ implement - Pre-launch implementability check for
unary_softshrink_f32. - baracuda_
kernels_ ⚠unary_ softshrink_ f32_ run - Unary elementwise
softshrink(λ=0.5), f32, contig. - baracuda_
kernels_ ⚠unary_ softshrink_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_softshrink_f32_strided. - baracuda_
kernels_ ⚠unary_ softshrink_ f32_ strided_ run - Unary elementwise
softshrink(λ=0.5), f32, strided. - baracuda_
kernels_ ⚠unary_ softshrink_ f64_ can_ implement - Pre-launch implementability check for
unary_softshrink_f64. - baracuda_
kernels_ ⚠unary_ softshrink_ f64_ run - Unary elementwise
softshrink(λ=0.5), f64, contig. - baracuda_
kernels_ ⚠unary_ softshrink_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_softshrink_f64_strided. - baracuda_
kernels_ ⚠unary_ softshrink_ f64_ strided_ run - Unary elementwise
softshrink(λ=0.5), f64, strided. - baracuda_
kernels_ ⚠unary_ softsign_ bf16_ can_ implement - Pre-launch implementability check for
unary_softsign_bf16. - baracuda_
kernels_ ⚠unary_ softsign_ bf16_ run - Unary elementwise
softsign, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ softsign_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_softsign_bf16_strided. - baracuda_
kernels_ ⚠unary_ softsign_ bf16_ strided_ run - Unary elementwise
softsign, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ softsign_ f16_ can_ implement - Pre-launch implementability check for
unary_softsign_f16. - baracuda_
kernels_ ⚠unary_ softsign_ f16_ run - Unary elementwise
softsign, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ softsign_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_softsign_f16_strided. - baracuda_
kernels_ ⚠unary_ softsign_ f16_ strided_ run - Unary elementwise
softsign, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ softsign_ f32_ can_ implement - Pre-launch implementability check for
unary_softsign_f32. - baracuda_
kernels_ ⚠unary_ softsign_ f32_ run - Unary elementwise
softsign, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ softsign_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_softsign_f32_strided. - baracuda_
kernels_ ⚠unary_ softsign_ f32_ strided_ run - Unary elementwise
softsign, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ softsign_ f64_ can_ implement - Pre-launch implementability check for
unary_softsign_f64. - baracuda_
kernels_ ⚠unary_ softsign_ f64_ run - Unary elementwise
softsign, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ softsign_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_softsign_f64_strided. - baracuda_
kernels_ ⚠unary_ softsign_ f64_ strided_ run - Unary elementwise
softsign, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ sqrt_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_sqrt_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ sqrt_ backward_ bf16_ run - Sqrt backward, bf16.
- baracuda_
kernels_ ⚠unary_ sqrt_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_sqrt_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ sqrt_ backward_ f16_ run - Sqrt backward, f16.
- baracuda_
kernels_ ⚠unary_ sqrt_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_sqrt_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ sqrt_ backward_ f32_ run - Sqrt backward, f32.
dx = dy / (2 * y). Saved-y. Callers must ensure y[i] != 0. - baracuda_
kernels_ ⚠unary_ sqrt_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_sqrt_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ sqrt_ backward_ f64_ run - Sqrt backward, f64.
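The sqrt backward entries use the saved output `y = sqrt(x)` rather than the input, which is why callers must keep `y[i]` away from zero. A minimal host-side sketch of `dx = dy / (2 * y)` shows what happens otherwise. `sqrt_backward_ref` is a hypothetical helper for illustration and is not exported by this crate.

```rust
/// CPU reference for dx = dy / (2 * y), with y the saved forward output.
/// Hypothetical helper for illustration; not part of this crate.
fn sqrt_backward_ref(dy: &[f32], y: &[f32]) -> Vec<f32> {
    assert_eq!(dy.len(), y.len(), "dy and y must have equal length");
    dy.iter()
        .zip(y.iter())
        // No guard against y == 0: IEEE division produces inf (or NaN for 0/0).
        .map(|(&g, &yi)| g / (2.0 * yi))
        .collect()
}
```

With `y[i] == 0` and a non-zero `dy[i]`, this reference produces an infinite gradient, which is the case the documented precondition rules out.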
- baracuda_
kernels_ ⚠unary_ sqrt_ bf16_ can_ implement - Pre-launch implementability check for
unary_sqrt_bf16. - baracuda_
kernels_ ⚠unary_ sqrt_ bf16_ run - Unary elementwise
sqrt, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ sqrt_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_sqrt_bf16_strided. - baracuda_
kernels_ ⚠unary_ sqrt_ bf16_ strided_ run - Unary elementwise
sqrt, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ sqrt_ f16_ can_ implement - Pre-launch implementability check for
unary_sqrt_f16. - baracuda_
kernels_ ⚠unary_ sqrt_ f16_ run - Unary elementwise
sqrt, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ sqrt_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_sqrt_f16_strided. - baracuda_
kernels_ ⚠unary_ sqrt_ f16_ strided_ run - Unary elementwise
sqrt, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ sqrt_ f32_ can_ implement - Pre-launch implementability check for
unary_sqrt_f32. - baracuda_
kernels_ ⚠unary_ sqrt_ f32_ run - Unary elementwise
sqrt, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ sqrt_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_sqrt_f32_strided. - baracuda_
kernels_ ⚠unary_ sqrt_ f32_ strided_ run - Unary elementwise
sqrt, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ sqrt_ f64_ can_ implement - Pre-launch implementability check for
unary_sqrt_f64. - baracuda_
kernels_ ⚠unary_ sqrt_ f64_ run - Unary elementwise
sqrt, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ sqrt_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_sqrt_f64_strided. - baracuda_
kernels_ ⚠unary_ sqrt_ f64_ strided_ run - Unary elementwise
sqrt, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ square_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_square_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ square_ backward_ bf16_ run - Square backward, bf16.
- baracuda_
kernels_ ⚠unary_ square_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_square_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ square_ backward_ f16_ run - Square backward, f16.
- baracuda_
kernels_ ⚠unary_ square_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_square_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ square_ backward_ f32_ run - Square backward, f32.
dx = dy * 2 * x. - baracuda_
kernels_ ⚠unary_ square_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_square_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ square_ backward_ f64_ run - Square backward, f64.
- baracuda_
kernels_ ⚠unary_ square_ bf16_ can_ implement - Pre-launch implementability check for
unary_square_bf16. - baracuda_
kernels_ ⚠unary_ square_ bf16_ run - Unary elementwise
square, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ square_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_square_bf16_strided. - baracuda_
kernels_ ⚠unary_ square_ bf16_ strided_ run - Unary elementwise
square, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ square_ f16_ can_ implement - Pre-launch implementability check for
unary_square_f16. - baracuda_
kernels_ ⚠unary_ square_ f16_ run - Unary elementwise
square, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ square_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_square_f16_strided. - baracuda_
kernels_ ⚠unary_ square_ f16_ strided_ run - Unary elementwise
square, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ square_ f32_ can_ implement - Pre-launch implementability check for
unary_square_f32. - baracuda_
kernels_ ⚠unary_ square_ f32_ run - Unary elementwise
square, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ square_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_square_f32_strided. - baracuda_
kernels_ ⚠unary_ square_ f32_ strided_ run - Unary elementwise
square, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ square_ f64_ can_ implement - Pre-launch implementability check for
unary_square_f64. - baracuda_
kernels_ ⚠unary_ square_ f64_ run - Unary elementwise
square, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ square_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_square_f64_strided. - baracuda_
kernels_ ⚠unary_ square_ f64_ strided_ run - Unary elementwise
square, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ step_ bf16_ can_ implement - Pre-launch implementability check for
unary_step_bf16. - baracuda_
kernels_ ⚠unary_ step_ bf16_ run - unary_step, bf16, contig. - baracuda_
kernels_ ⚠unary_ step_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_step_bf16_strided. - baracuda_
kernels_ ⚠unary_ step_ bf16_ strided_ run - unary_step, bf16, strided. - baracuda_
kernels_ ⚠unary_ step_ f16_ can_ implement - Pre-launch implementability check for
unary_step_f16. - baracuda_
kernels_ ⚠unary_ step_ f16_ run - unary_step, f16, contig. - baracuda_
kernels_ ⚠unary_ step_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_step_f16_strided. - baracuda_
kernels_ ⚠unary_ step_ f16_ strided_ run - unary_step, f16, strided. - baracuda_
kernels_ ⚠unary_ step_ f32_ can_ implement - Pre-launch implementability check for
unary_step_f32. - baracuda_
kernels_ ⚠unary_ step_ f32_ run - unary_step, f32, contig. - baracuda_
kernels_ ⚠unary_ step_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_step_f32_strided. - baracuda_
kernels_ ⚠unary_ step_ f32_ strided_ run - unary_step, f32, strided. - baracuda_
kernels_ ⚠unary_ step_ f64_ can_ implement - Pre-launch implementability check for
unary_step_f64. - baracuda_
kernels_ ⚠unary_ step_ f64_ run - unary_step, f64, contig. - baracuda_
kernels_ ⚠unary_ step_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_step_f64_strided. - baracuda_
kernels_ ⚠unary_ step_ f64_ strided_ run - unary_step, f64, strided. - baracuda_
kernels_ ⚠unary_ tan_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_tan_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ tan_ backward_ bf16_ run - Tan backward, bf16.
- baracuda_
kernels_ ⚠unary_ tan_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_tan_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ tan_ backward_ f16_ run - Tan backward, f16.
- baracuda_
kernels_ ⚠unary_ tan_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_tan_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ tan_ backward_ f32_ run - Tan backward, f32.
dx = dy * (1 + tan(x)²). Saved-x. - baracuda_
kernels_ ⚠unary_ tan_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_tan_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ tan_ backward_ f64_ run - Tan backward, f64.
- baracuda_
kernels_ ⚠unary_ tan_ bf16_ can_ implement - Pre-launch implementability check for
unary_tan_bf16. - baracuda_
kernels_ ⚠unary_ tan_ bf16_ run - Unary elementwise
tan, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ tan_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_tan_bf16_strided. - baracuda_
kernels_ ⚠unary_ tan_ bf16_ strided_ run - Unary elementwise
tan, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ tan_ f16_ can_ implement - Pre-launch implementability check for
unary_tan_f16. - baracuda_
kernels_ ⚠unary_ tan_ f16_ run - Unary elementwise
tan, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ tan_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_tan_f16_strided. - baracuda_
kernels_ ⚠unary_ tan_ f16_ strided_ run - Unary elementwise
tan, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ tan_ f32_ can_ implement - Pre-launch implementability check for
unary_tan_f32. - baracuda_
kernels_ ⚠unary_ tan_ f32_ run - Unary elementwise
tan, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ tan_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_tan_f32_strided. - baracuda_
kernels_ ⚠unary_ tan_ f32_ strided_ run - Unary elementwise
tan, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ tan_ f64_ can_ implement - Pre-launch implementability check for
unary_tan_f64. - baracuda_
kernels_ ⚠unary_ tan_ f64_ run - Unary elementwise
tan, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ tan_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_tan_f64_strided. - baracuda_
kernels_ ⚠unary_ tan_ f64_ strided_ run - Unary elementwise
tan, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ tanh_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_tanh_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ tanh_ backward_ bf16_ run - Tanh backward, bf16.
- baracuda_
kernels_ ⚠unary_ tanh_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_tanh_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ tanh_ backward_ f16_ run - Tanh backward, f16.
- baracuda_
kernels_ ⚠unary_ tanh_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_tanh_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ tanh_ backward_ f32_ run - Tanh backward, f32.
dx = dy * (1 - y²). Saved-y. - baracuda_
kernels_ ⚠unary_ tanh_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_tanh_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ tanh_ backward_ f64_ run - Tanh backward, f64.
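Like sqrt, the tanh backward entries work from the saved output `y = tanh(x)`, so the gradient needs no transcendental call: `dx = dy * (1 - y²)`. The sketch below is a host-side reference under that reading. `tanh_backward_ref` is a hypothetical helper for illustration and is not exported by this crate.

```rust
/// CPU reference for dx = dy * (1 - y^2), with y the saved forward output.
/// Hypothetical helper for illustration; not part of this crate.
fn tanh_backward_ref(dy: &[f32], y: &[f32]) -> Vec<f32> {
    assert_eq!(dy.len(), y.len(), "dy and y must have equal length");
    dy.iter()
        .zip(y.iter())
        // Only a multiply and a subtract per element; tanh is not re-evaluated.
        .map(|(&g, &yi)| g * (1.0 - yi * yi))
        .collect()
}
```

A saturated output (`y = 1.0`) gives a zero gradient, while `y = 0.5` scales the incoming gradient by `0.75`.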
- baracuda_
kernels_ ⚠unary_ tanh_ bf16_ can_ implement - Pre-launch implementability check for
unary_tanh_bf16. - baracuda_
kernels_ ⚠unary_ tanh_ bf16_ run - Unary elementwise
tanh, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ tanh_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_tanh_bf16_strided. - baracuda_
kernels_ ⚠unary_ tanh_ bf16_ strided_ run - Unary elementwise
tanh, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ tanh_ f16_ can_ implement - Pre-launch implementability check for
unary_tanh_f16. - baracuda_
kernels_ ⚠unary_ tanh_ f16_ run - Unary elementwise
tanh, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ tanh_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_tanh_f16_strided. - baracuda_
kernels_ ⚠unary_ tanh_ f16_ strided_ run - Unary elementwise
tanh, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ tanh_ f32_ can_ implement - Pre-launch implementability check for
unary_tanh_f32. - baracuda_
kernels_ ⚠unary_ tanh_ f32_ run - Unary elementwise
tanh, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ tanh_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_tanh_f32_strided. - baracuda_
kernels_ ⚠unary_ tanh_ f32_ strided_ run - Unary elementwise
tanh, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ tanh_ f64_ can_ implement - Pre-launch implementability check for
unary_tanh_f64. - baracuda_
kernels_ ⚠unary_ tanh_ f64_ run - Unary elementwise
tanh, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ tanh_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_tanh_f64_strided. - baracuda_
kernels_ ⚠unary_ tanh_ f64_ strided_ run - Unary elementwise
tanh, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ tanhshrink_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_tanhshrink_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ tanhshrink_ backward_ bf16_ run - Tanhshrink backward, bf16.
- baracuda_
kernels_ ⚠unary_ tanhshrink_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_tanhshrink_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ tanhshrink_ backward_ f16_ run - Tanhshrink backward, f16.
- baracuda_
kernels_ ⚠unary_ tanhshrink_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_tanhshrink_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ tanhshrink_ backward_ f32_ run - Tanhshrink backward, f32.
dx = dy * tanh(x)². - baracuda_
kernels_ ⚠unary_ tanhshrink_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_tanhshrink_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ tanhshrink_ backward_ f64_ run - Tanhshrink backward, f64.
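The rule `dx = dy * tanh(x)²` follows if the forward function is `tanhshrink(x) = x - tanh(x)`, since its derivative is `1 - (1 - tanh(x)²)`. That forward definition is an assumption here (it is the usual one, but this listing does not spell it out). The sketch below compares the analytic rule with a central finite difference on the host. All three function names are hypothetical and not exported by this crate.

```rust
/// Assumed forward definition: tanhshrink(x) = x - tanh(x).
fn tanhshrink(x: f64) -> f64 {
    x - x.tanh()
}

/// Analytic backward rule as documented: dx = dy * tanh(x)^2.
/// Hypothetical helper for illustration; not part of this crate.
fn tanhshrink_backward_ref(dy: f64, x: f64) -> f64 {
    let t = x.tanh();
    dy * t * t
}

/// Central finite-difference estimate of d/dx tanhshrink(x).
fn tanhshrink_numeric_grad(x: f64, h: f64) -> f64 {
    (tanhshrink(x + h) - tanhshrink(x - h)) / (2.0 * h)
}
```

For a step such as `h = 1e-5` in f64, the two gradients should agree closely at ordinary inputs, which makes this a cheap sanity check when validating a device result.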
- baracuda_
kernels_ ⚠unary_ tanhshrink_ bf16_ can_ implement - Pre-launch implementability check for
unary_tanhshrink_bf16. - baracuda_
kernels_ ⚠unary_ tanhshrink_ bf16_ run - Unary elementwise
tanhshrink, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ tanhshrink_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_tanhshrink_bf16_strided. - baracuda_
kernels_ ⚠unary_ tanhshrink_ bf16_ strided_ run - Unary elementwise
tanhshrink, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ tanhshrink_ f16_ can_ implement - Pre-launch implementability check for
unary_tanhshrink_f16. - baracuda_
kernels_ ⚠unary_ tanhshrink_ f16_ run - Unary elementwise
tanhshrink, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ tanhshrink_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_tanhshrink_f16_strided. - baracuda_
kernels_ ⚠unary_ tanhshrink_ f16_ strided_ run - Unary elementwise
tanhshrink, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ tanhshrink_ f32_ can_ implement - Pre-launch implementability check for
unary_tanhshrink_f32. - baracuda_
kernels_ ⚠unary_ tanhshrink_ f32_ run - Unary elementwise
tanhshrink, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ tanhshrink_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_tanhshrink_f32_strided. - baracuda_
kernels_ ⚠unary_ tanhshrink_ f32_ strided_ run - Unary elementwise
tanhshrink, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ tanhshrink_ f64_ can_ implement - Pre-launch implementability check for
unary_tanhshrink_f64. - baracuda_
kernels_ ⚠unary_ tanhshrink_ f64_ run - Unary elementwise
tanhshrink, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ tanhshrink_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_tanhshrink_f64_strided. - baracuda_
kernels_ ⚠unary_ tanhshrink_ f64_ strided_ run - Unary elementwise
tanhshrink, f64 dtype, strided path. - baracuda_
kernels_ ⚠unary_ threshold_ backward_ bf16_ can_ implement - Pre-launch implementability check for
unary_threshold_backward_bf16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ threshold_ backward_ bf16_ run - threshold backward, bf16. - baracuda_
kernels_ ⚠unary_ threshold_ backward_ f16_ can_ implement - Pre-launch implementability check for
unary_threshold_backward_f16. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ threshold_ backward_ f16_ run - threshold backward, f16. - baracuda_
kernels_ ⚠unary_ threshold_ backward_ f32_ can_ implement - Pre-launch implementability check for
unary_threshold_backward_f32. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ threshold_ backward_ f32_ run - threshold backward: dx = (x > t) ? dy : 0, f32. Saved-x. - baracuda_
kernels_ ⚠unary_ threshold_ backward_ f64_ can_ implement - Pre-launch implementability check for
unary_threshold_backward_f64. Host-side validation; no kernel launch. - baracuda_
kernels_ ⚠unary_ threshold_ backward_ f64_ run - threshold backward, f64. - baracuda_
kernels_ ⚠unary_ threshold_ bf16_ can_ implement - Implementability check for
baracuda_kernels_unary_threshold_bf16. Host-side only. - baracuda_
kernels_ ⚠unary_ threshold_ bf16_ run - threshold forward, bf16. - baracuda_
kernels_ ⚠unary_ threshold_ f16_ can_ implement - Implementability check for
baracuda_kernels_unary_threshold_f16. Host-side only. - baracuda_
kernels_ ⚠unary_ threshold_ f16_ run - threshold forward, f16. - baracuda_
kernels_ ⚠unary_ threshold_ f32_ can_ implement - Implementability check for
baracuda_kernels_unary_threshold_f32. Host-side only. - baracuda_
kernels_ ⚠unary_ threshold_ f32_ run - Unary elementwise
threshold(x; t, v) = (x > t) ? x : v, f32, contig. - baracuda_
kernels_ ⚠unary_ threshold_ f64_ can_ implement - Implementability check for
baracuda_kernels_unary_threshold_f64. Host-side only. - baracuda_
kernels_ ⚠unary_ threshold_ f64_ run - threshold forward, f64. The f32 params widen to f64 losslessly. - baracuda_
kernels_ ⚠unary_ trunc_ bf16_ can_ implement - Pre-launch implementability check for
unary_trunc_bf16. - baracuda_
kernels_ ⚠unary_ trunc_ bf16_ run - Unary elementwise
trunc, bf16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ trunc_ bf16_ strided_ can_ implement - Pre-launch implementability check for
unary_trunc_bf16_strided. - baracuda_
kernels_ ⚠unary_ trunc_ bf16_ strided_ run - Unary elementwise
trunc, bf16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ trunc_ f16_ can_ implement - Pre-launch implementability check for
unary_trunc_f16. - baracuda_
kernels_ ⚠unary_ trunc_ f16_ run - Unary elementwise
trunc, f16 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ trunc_ f16_ strided_ can_ implement - Pre-launch implementability check for
unary_trunc_f16_strided. - baracuda_
kernels_ ⚠unary_ trunc_ f16_ strided_ run - Unary elementwise
trunc, f16 dtype, strided path. - baracuda_
kernels_ ⚠unary_ trunc_ f32_ can_ implement - Pre-launch implementability check for
unary_trunc_f32. - baracuda_
kernels_ ⚠unary_ trunc_ f32_ run - Unary elementwise
trunc, f32 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ trunc_ f32_ strided_ can_ implement - Pre-launch implementability check for
unary_trunc_f32_strided. - baracuda_
kernels_ ⚠unary_ trunc_ f32_ strided_ run - Unary elementwise
trunc, f32 dtype, strided path. - baracuda_
kernels_ ⚠unary_ trunc_ f64_ can_ implement - Pre-launch implementability check for
unary_trunc_f64. - baracuda_
kernels_ ⚠unary_ trunc_ f64_ run - Unary elementwise
trunc, f64 dtype, contiguous fast path. - baracuda_
kernels_ ⚠unary_ trunc_ f64_ strided_ can_ implement - Pre-launch implementability check for
unary_trunc_f64_strided. - baracuda_
kernels_ ⚠unary_ trunc_ f64_ strided_ run - Unary elementwise
trunc, f64 dtype, strided path. - baracuda_
kernels_ ⚠unique_ consecutive_ f32_ can_ implement - Implementability check for unique_consecutive_f32. - baracuda_
kernels_ ⚠unique_ consecutive_ f32_ run - Unique-consecutive, f32. Emits one cell per run-start; output
slot order is atomic-counter race order.
counter[row] holds the actual unique count post-launch. - baracuda_
kernels_ ⚠unique_ consecutive_ f64_ can_ implement - Implementability check for unique_consecutive_f64. - baracuda_
kernels_ ⚠unique_ consecutive_ f64_ run - Unique-consecutive, f64.
- baracuda_
kernels_ ⚠unique_ consecutive_ i32_ can_ implement - Implementability check for unique_consecutive_i32. - baracuda_
kernels_ ⚠unique_ consecutive_ i32_ run - Unique-consecutive, i32.
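The unique-consecutive entries emit one value per run start and report the count separately. A sequential host-side reference for a single row might look like the sketch below. It is an illustration only: `unique_consecutive_ref` is a hypothetical helper, not a crate function, and because it runs sequentially its output is in input order, whereas the device kernels document their output slot order as atomic-counter race order.

```rust
/// CPU reference for one row: keep the first element of every run of
/// equal adjacent values, and return the number of runs alongside.
/// Hypothetical helper for illustration; not part of this crate.
fn unique_consecutive_ref(row: &[i32]) -> (Vec<i32>, usize) {
    let mut out = Vec::new();
    for (i, &v) in row.iter().enumerate() {
        // A run starts at index 0 or wherever the value differs from its predecessor.
        if i == 0 || row[i - 1] != v {
            out.push(v);
        }
    }
    let count = out.len();
    (out, count)
}
```

When comparing against a device result, it is safer to compare the count directly and the values as sorted multisets, since the device slot order need not match input order.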
- baracuda_
kernels_ ⚠unsorted_ segment_ max_ backward_ f32_ can_ implement - Implementability check for
unsorted_segment_max_backward_f32. - baracuda_
kernels_ ⚠unsorted_ segment_ max_ backward_ f32_ run - unsorted_segment_max_backward, f32. - baracuda_
kernels_ ⚠unsorted_ segment_ max_ backward_ f64_ can_ implement - Implementability check for
unsorted_segment_max_backward_f64. - baracuda_
kernels_ ⚠unsorted_ segment_ max_ backward_ f64_ run - unsorted_segment_max_backward, f64. - baracuda_
kernels_ ⚠unsorted_ segment_ max_ f32_ can_ implement - Implementability check for
unsorted_segment_max_f32. - baracuda_
kernels_ ⚠unsorted_ segment_ max_ f32_ run out[s, d] = max_{n : seg[n] == s} input[n, d], for unsorted segment ids; atomicMax-via-CAS. Output pre-initialized to -inf by the launcher. f32. - baracuda_
kernels_ ⚠unsorted_ segment_ max_ f64_ can_ implement - Implementability check for
unsorted_segment_max_f64. - baracuda_
kernels_ ⚠unsorted_ segment_ max_ f64_ run unsorted_segment_max— f64.- baracuda_
kernels_ ⚠unsorted_ segment_ max_ i64idx_ f32_ can_ implement baracuda_kernels_unsorted_segment_max_i64idx_f32_can_implement(baracuda kernels unsorted segment max i64idx f32 can implement).- baracuda_
kernels_ ⚠unsorted_ segment_ max_ i64idx_ f32_ run baracuda_kernels_unsorted_segment_max_i64idx_f32_run(baracuda kernels unsorted segment max i64idx f32 run).- baracuda_
kernels_ ⚠unsorted_ segment_ max_ i64idx_ f64_ can_ implement baracuda_kernels_unsorted_segment_max_i64idx_f64_can_implement(baracuda kernels unsorted segment max i64idx f64 can implement).- baracuda_
kernels_ ⚠unsorted_ segment_ max_ i64idx_ f64_ run baracuda_kernels_unsorted_segment_max_i64idx_f64_run(baracuda kernels unsorted segment max i64idx f64 run).- baracuda_
kernels_ ⚠unsorted_ segment_ mean_ backward_ f32_ can_ implement - Implementability check for
unsorted_segment_mean_backward_f32. - baracuda_
kernels_ ⚠unsorted_ segment_ mean_ backward_ f32_ run unsorted_segment_mean_backward— f32.- baracuda_
kernels_ ⚠unsorted_ segment_ mean_ backward_ f64_ can_ implement - Implementability check for
unsorted_segment_mean_backward_f64. - baracuda_
kernels_ ⚠unsorted_ segment_ mean_ backward_ f64_ run unsorted_segment_mean_backward— f64.- baracuda_
kernels_ ⚠unsorted_ segment_ mean_ backward_ i64idx_ f32_ can_ implement baracuda_kernels_unsorted_segment_mean_backward_i64idx_f32_can_implement(baracuda kernels unsorted segment mean backward i64idx f32 can implement).- baracuda_
kernels_ ⚠unsorted_ segment_ mean_ backward_ i64idx_ f32_ run baracuda_kernels_unsorted_segment_mean_backward_i64idx_f32_run(baracuda kernels unsorted segment mean backward i64idx f32 run).- baracuda_
kernels_ ⚠unsorted_ segment_ mean_ backward_ i64idx_ f64_ can_ implement baracuda_kernels_unsorted_segment_mean_backward_i64idx_f64_can_implement(baracuda kernels unsorted segment mean backward i64idx f64 can implement).- baracuda_
kernels_ ⚠unsorted_ segment_ mean_ backward_ i64idx_ f64_ run baracuda_kernels_unsorted_segment_mean_backward_i64idx_f64_run(baracuda kernels unsorted segment mean backward i64idx f64 run).- baracuda_
kernels_ ⚠unsorted_ segment_ mean_ f32_ can_ implement - Implementability check for
unsorted_segment_mean_f32. - baracuda_
kernels_ ⚠unsorted_ segment_ mean_ f32_ run out[s, d] = mean_{n : seg[n] == s} input[n, d], for unsorted segment ids. Workspace: num_segments * sizeof(i32) for per-segment counts. f32. - baracuda_
kernels_ ⚠unsorted_ segment_ mean_ f64_ can_ implement - Implementability check for
unsorted_segment_mean_f64. - baracuda_
kernels_ ⚠unsorted_ segment_ mean_ f64_ run unsorted_segment_mean— f64.- baracuda_
kernels_ ⚠unsorted_ segment_ mean_ i64idx_ f32_ can_ implement baracuda_kernels_unsorted_segment_mean_i64idx_f32_can_implement(baracuda kernels unsorted segment mean i64idx f32 can implement).- baracuda_
kernels_ ⚠unsorted_ segment_ mean_ i64idx_ f32_ run baracuda_kernels_unsorted_segment_mean_i64idx_f32_run(baracuda kernels unsorted segment mean i64idx f32 run).- baracuda_
kernels_ ⚠unsorted_ segment_ mean_ i64idx_ f64_ can_ implement baracuda_kernels_unsorted_segment_mean_i64idx_f64_can_implement(baracuda kernels unsorted segment mean i64idx f64 can implement).- baracuda_
kernels_ ⚠unsorted_ segment_ mean_ i64idx_ f64_ run baracuda_kernels_unsorted_segment_mean_i64idx_f64_run(baracuda kernels unsorted segment mean i64idx f64 run).- baracuda_
kernels_ ⚠unsorted_ segment_ min_ backward_ f32_ can_ implement - Implementability check for
unsorted_segment_min_backward_f32. - baracuda_
kernels_ ⚠unsorted_ segment_ min_ backward_ f32_ run unsorted_segment_min_backward— f32.- baracuda_
kernels_ ⚠unsorted_ segment_ min_ backward_ f64_ can_ implement - Implementability check for
unsorted_segment_min_backward_f64. - baracuda_
kernels_ ⚠unsorted_ segment_ min_ backward_ f64_ run unsorted_segment_min_backward— f64.- baracuda_
kernels_ ⚠unsorted_ segment_ min_ f32_ can_ implement - Implementability check for
unsorted_segment_min_f32. - baracuda_
kernels_ ⚠unsorted_ segment_ min_ f32_ run out[s, d] = min_{n : seg[n] == s} input[n, d], for unsorted segment ids. f32. - baracuda_
kernels_ ⚠unsorted_ segment_ min_ f64_ can_ implement - Implementability check for
unsorted_segment_min_f64. - baracuda_
kernels_ ⚠unsorted_ segment_ min_ f64_ run unsorted_segment_min— f64.- baracuda_
kernels_ ⚠unsorted_ segment_ min_ i64idx_ f32_ can_ implement baracuda_kernels_unsorted_segment_min_i64idx_f32_can_implement(baracuda kernels unsorted segment min i64idx f32 can implement).- baracuda_
kernels_ ⚠unsorted_ segment_ min_ i64idx_ f32_ run baracuda_kernels_unsorted_segment_min_i64idx_f32_run(baracuda kernels unsorted segment min i64idx f32 run).- baracuda_
kernels_ ⚠unsorted_ segment_ min_ i64idx_ f64_ can_ implement baracuda_kernels_unsorted_segment_min_i64idx_f64_can_implement(baracuda kernels unsorted segment min i64idx f64 can implement).- baracuda_
kernels_ ⚠unsorted_ segment_ min_ i64idx_ f64_ run baracuda_kernels_unsorted_segment_min_i64idx_f64_run(baracuda kernels unsorted segment min i64idx f64 run).- baracuda_
kernels_ ⚠unsorted_ segment_ prod_ backward_ f32_ can_ implement - Implementability check for
unsorted_segment_prod_backward_f32. - baracuda_
kernels_ ⚠unsorted_ segment_ prod_ backward_ f32_ run unsorted_segment_prod_backward— f32. Shares the kernel with the sorted variant; distinct symbol for SKU tagging.- baracuda_
kernels_ ⚠unsorted_ segment_ prod_ backward_ f64_ can_ implement - Implementability check for
unsorted_segment_prod_backward_f64. - baracuda_
kernels_ ⚠unsorted_ segment_ prod_ backward_ f64_ run unsorted_segment_prod_backward— f64.- baracuda_
kernels_ ⚠unsorted_ segment_ prod_ f32_ can_ implement - Implementability check for
unsorted_segment_prod_f32. - baracuda_
kernels_ ⚠unsorted_ segment_ prod_ f32_ run unsorted_segment_prodFW — f32.- baracuda_
kernels_ ⚠unsorted_ segment_ prod_ f64_ can_ implement - Implementability check for
unsorted_segment_prod_f64. - baracuda_
kernels_ ⚠unsorted_ segment_ prod_ f64_ run unsorted_segment_prodFW — f64.- baracuda_
kernels_ ⚠unsorted_ segment_ sum_ backward_ f32_ can_ implement - Implementability check for
unsorted_segment_sum_backward_f32. - baracuda_
kernels_ ⚠unsorted_ segment_ sum_ backward_ f32_ run - Same kernel as
segment_sum_backward_f32; distinct symbol for SKU-tagging differentiation. - baracuda_
kernels_ ⚠unsorted_ segment_ sum_ backward_ f64_ can_ implement - Implementability check for
unsorted_segment_sum_backward_f64. - baracuda_
kernels_ ⚠unsorted_ segment_ sum_ backward_ f64_ run unsorted_segment_sum_backward— f64.- baracuda_
kernels_ ⚠unsorted_ segment_ sum_ backward_ i64idx_ f32_ can_ implement baracuda_kernels_unsorted_segment_sum_backward_i64idx_f32_can_implement(baracuda kernels unsorted segment sum backward i64idx f32 can implement).- baracuda_
kernels_ ⚠unsorted_ segment_ sum_ backward_ i64idx_ f32_ run baracuda_kernels_unsorted_segment_sum_backward_i64idx_f32_run(baracuda kernels unsorted segment sum backward i64idx f32 run).- baracuda_
kernels_ ⚠unsorted_ segment_ sum_ backward_ i64idx_ f64_ can_ implement baracuda_kernels_unsorted_segment_sum_backward_i64idx_f64_can_implement(baracuda kernels unsorted segment sum backward i64idx f64 can implement).- baracuda_
kernels_ ⚠unsorted_ segment_ sum_ backward_ i64idx_ f64_ run baracuda_kernels_unsorted_segment_sum_backward_i64idx_f64_run(baracuda kernels unsorted segment sum backward i64idx f64 run).- baracuda_
kernels_ ⚠unsorted_ segment_ sum_ f32_ can_ implement - Implementability check for
unsorted_segment_sum_f32. - baracuda_
kernels_ ⚠unsorted_ segment_ sum_ f32_ run out[s, d] = Σ_{n : seg[n] == s} input[n, d], for unsorted segment ids; atomicAdd into output. Output pre-zeroed by the launcher. f32. - baracuda_
kernels_ ⚠unsorted_ segment_ sum_ f64_ can_ implement - Implementability check for
unsorted_segment_sum_f64. - baracuda_
kernels_ ⚠unsorted_ segment_ sum_ f64_ run unsorted_segment_sum— f64.- baracuda_
kernels_ ⚠unsorted_ segment_ sum_ i64idx_ f32_ can_ implement baracuda_kernels_unsorted_segment_sum_i64idx_f32_can_implement(baracuda kernels unsorted segment sum i64idx f32 can implement).- baracuda_
kernels_ ⚠unsorted_ segment_ sum_ i64idx_ f32_ run baracuda_kernels_unsorted_segment_sum_i64idx_f32_run(baracuda kernels unsorted segment sum i64idx f32 run).- baracuda_
kernels_ ⚠unsorted_ segment_ sum_ i64idx_ f64_ can_ implement baracuda_kernels_unsorted_segment_sum_i64idx_f64_can_implement(baracuda kernels unsorted segment sum i64idx f64 can implement).- baracuda_
kernels_ ⚠unsorted_ segment_ sum_ i64idx_ f64_ run baracuda_kernels_unsorted_segment_sum_i64idx_f64_run(baracuda kernels unsorted segment sum i64idx f64 run).- baracuda_
kernels_ ⚠upsample_ bilinear_ 2d_ bw_ bf16_ run - Alias for
baracuda_kernels_interpolate_bilinear_2d_backward_bf16_run. - baracuda_
kernels_ ⚠upsample_ bilinear_ 2d_ bw_ f16_ run - Alias for
baracuda_kernels_interpolate_bilinear_2d_backward_f16_run. - baracuda_
kernels_ ⚠upsample_ bilinear_ 2d_ bw_ f32_ run - Alias for
baracuda_kernels_interpolate_bilinear_2d_backward_f32_run. - baracuda_
kernels_ ⚠upsample_ bilinear_ 2d_ bw_ f64_ run - Alias for
baracuda_kernels_interpolate_bilinear_2d_backward_f64_run. - baracuda_
kernels_ ⚠upsample_ bilinear_ 2d_ fw_ bf16_ run - Alias for
baracuda_kernels_interpolate_bilinear_2d_bf16_run. - baracuda_
kernels_ ⚠upsample_ bilinear_ 2d_ fw_ f16_ run - Alias for
baracuda_kernels_interpolate_bilinear_2d_f16_run. - baracuda_
kernels_ ⚠upsample_ bilinear_ 2d_ fw_ f32_ run - Alias for
baracuda_kernels_interpolate_bilinear_2d_f32_run under the new Phase 19.2 upsample_* naming convention. - baracuda_
kernels_ ⚠upsample_ bilinear_ 2d_ fw_ f64_ run - Alias for
baracuda_kernels_interpolate_bilinear_2d_f64_run. - baracuda_
kernels_ ⚠upsample_ nearest_ 2d_ bw_ bf16_ can_ implement baracuda_kernels_upsample_nearest_2d_bw_bf16_can_implement(baracuda kernels upsample nearest 2d bw bf16 can implement).- baracuda_
kernels_ ⚠upsample_ nearest_ 2d_ bw_ bf16_ run upsample_nearest_2d BW, bf16. Safety: same contract as the f32 BW variant. - baracuda_
kernels_ ⚠upsample_ nearest_ 2d_ bw_ f16_ can_ implement baracuda_kernels_upsample_nearest_2d_bw_f16_can_implement(baracuda kernels upsample nearest 2d bw f16 can implement).- baracuda_
kernels_ ⚠upsample_ nearest_ 2d_ bw_ f16_ run upsample_nearest_2d BW, f16. Safety: same contract as the f32 BW variant. Uses the baracuda::atomic::add<__half> (CAS-based) helper. - baracuda_
kernels_ ⚠upsample_ nearest_ 2d_ bw_ f32_ can_ implement baracuda_kernels_upsample_nearest_2d_bw_f32_can_implement(baracuda kernels upsample nearest 2d bw f32 can implement).- baracuda_
kernels_ ⚠upsample_ nearest_ 2d_ bw_ f32_ run upsample_nearest_2d BW, f32. Caller pre-zeros dinput. - baracuda_
kernels_ ⚠upsample_ nearest_ 2d_ bw_ f64_ can_ implement baracuda_kernels_upsample_nearest_2d_bw_f64_can_implement(baracuda kernels upsample nearest 2d bw f64 can implement).- baracuda_
kernels_ ⚠upsample_ nearest_ 2d_ bw_ f64_ run upsample_nearest_2d BW, f64. Safety: same contract as the f32 BW variant. - baracuda_
kernels_ ⚠upsample_ nearest_ 2d_ fw_ bf16_ can_ implement baracuda_kernels_upsample_nearest_2d_fw_bf16_can_implement(baracuda kernels upsample nearest 2d fw bf16 can implement).- baracuda_
kernels_ ⚠upsample_ nearest_ 2d_ fw_ bf16_ run upsample_nearest_2d FW, bf16. Safety: same contract as the f32 FW variant. - baracuda_
kernels_ ⚠upsample_ nearest_ 2d_ fw_ f16_ can_ implement baracuda_kernels_upsample_nearest_2d_fw_f16_can_implement(baracuda kernels upsample nearest 2d fw f16 can implement).- baracuda_
kernels_ ⚠upsample_ nearest_ 2d_ fw_ f16_ run upsample_nearest_2d FW, f16. Safety: same contract as the f32 FW variant. - baracuda_
kernels_ ⚠upsample_ nearest_ 2d_ fw_ f32_ can_ implement baracuda_kernels_upsample_nearest_2d_fw_f32_can_implement(baracuda kernels upsample nearest 2d fw f32 can implement).- baracuda_
kernels_ ⚠upsample_ nearest_ 2d_ fw_ f32_ run upsample(x, mode='nearest') FW, f32. input: [N, C, IH, IW]; output: [N, C, OH, OW]. NCHW. Coordinate mapping: nearest under align_corners=false. - baracuda_
kernels_ ⚠upsample_ nearest_ 2d_ fw_ f64_ can_ implement baracuda_kernels_upsample_nearest_2d_fw_f64_can_implement(baracuda kernels upsample nearest 2d fw f64 can implement).- baracuda_
kernels_ ⚠upsample_ nearest_ 2d_ fw_ f64_ run upsample_nearest_2d FW, f64. Safety: same contract as the f32 FW variant. - baracuda_
kernels_ ⚠where_ backward_ bf16_ can_ implement - Pre-launch check companion.
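Each *_can_implement / *_run pair in this listing reports its result as one of the i32 status codes from the crate-level "Status codes" section (0 through 5). The sketch below shows one way a caller might decode those values before deciding whether to launch. The `KernelStatus` enum and the `decode_status` / `check` helpers are hypothetical names introduced for illustration; only the numeric meanings come from the crate documentation.

```rust
/// Hypothetical decoding of the documented i32 status codes.
/// Only the numeric meanings are taken from the crate docs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum KernelStatus {
    Success,           // 0
    Misaligned,        // 1: misaligned operand
    InvalidProblem,    // 2: e.g. a non-positive dimension
    NotSupported,      // 3: kernel does not implement the requested shape
    WorkspaceTooSmall, // 4: workspace too small, or null when required
    InternalError,     // 5: typically a launch failure
    Unknown(i32),      // anything outside the documented range
}

fn decode_status(code: i32) -> KernelStatus {
    match code {
        0 => KernelStatus::Success,
        1 => KernelStatus::Misaligned,
        2 => KernelStatus::InvalidProblem,
        3 => KernelStatus::NotSupported,
        4 => KernelStatus::WorkspaceTooSmall,
        5 => KernelStatus::InternalError,
        other => KernelStatus::Unknown(other),
    }
}

/// Turns a raw status into a Result so `?` can be used between the
/// pre-launch check and the launch itself.
fn check(code: i32) -> Result<(), KernelStatus> {
    match decode_status(code) {
        KernelStatus::Success => Ok(()),
        err => Err(err),
    }
}
```

A caller would typically pass the return value of the *_can_implement function through `check` first and only then call the matching *_run function.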
- baracuda_
kernels_ ⚠where_ backward_ bf16_ run wherebackward, bf16.- baracuda_
kernels_ ⚠where_ backward_ f16_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ backward_ f16_ run wherebackward, f16.- baracuda_
kernels_ ⚠where_ backward_ f32_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ backward_ f32_ run where backward, f32. Writes da = cond ? dy : 0 and db = cond ? 0 : dy. - baracuda_
kernels_ ⚠where_ backward_ f64_ can_ implement - Pre-launch check companion.
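The f32 where-backward entry states the gradient rule da = cond ? dy : 0 and db = cond ? 0 : dy. A small host-side sketch of that rule follows, which may be useful as a reference when checking device output. `where_backward_ref` is a hypothetical helper, and treating any nonzero u8 as "true" is an assumption of this sketch rather than something the listing states.

```rust
/// Hypothetical host-side reference for where-backward (f32 values, u8 cond).
/// Assumption: a nonzero cond byte selects the `a` branch.
fn where_backward_ref(cond: &[u8], dy: &[f32]) -> (Vec<f32>, Vec<f32>) {
    assert_eq!(cond.len(), dy.len(), "cond and dy must have equal length");
    let mut da = vec![0.0f32; dy.len()];
    let mut db = vec![0.0f32; dy.len()];
    for i in 0..dy.len() {
        if cond[i] != 0 {
            da[i] = dy[i]; // gradient flows to `a` where cond is true
        } else {
            db[i] = dy[i]; // gradient flows to `b` where cond is false
        }
    }
    (da, db)
}
```

Elementwise, da + db reproduces dy, which gives a cheap sanity check on either implementation.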
- baracuda_
kernels_ ⚠where_ backward_ f64_ run wherebackward, f64.- baracuda_
kernels_ ⚠where_ bf16_ can_ implement - Pre-launch check for
where_bf16. - baracuda_
kernels_ ⚠where_ bf16_ run where(cond, a, b), bf16 values + u8 cond, contig fast path.- baracuda_
kernels_ ⚠where_ bf16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ bf16_ strided_ run where(cond, a, b), bf16 values, strided / broadcast path.- baracuda_
kernels_ ⚠where_ f16_ can_ implement - Pre-launch check for
where_f16. - baracuda_
kernels_ ⚠where_ f16_ run where(cond, a, b), f16 values + u8 cond, contig fast path.- baracuda_
kernels_ ⚠where_ f16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ f16_ strided_ run where(cond, a, b), f16 values, strided / broadcast path.- baracuda_
kernels_ ⚠where_ f32_ can_ implement - Pre-launch check for
where_f32. - baracuda_
kernels_ ⚠where_ f32_ run where(cond, a, b), f32 values + u8 cond, contig fast path. This is the where-ternary trailblazer: its safety and aliasing contract carries over to every other where-family launcher across all value dtypes and cond-dtype variants (where_u32cond_*, where_i64cond_*). - baracuda_
kernels_ ⚠where_ f32_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ f32_ strided_ run where(cond, a, b), f32 values, strided / broadcast path.- baracuda_
kernels_ ⚠where_ f64_ can_ implement - Pre-launch check for
where_f64. - baracuda_
kernels_ ⚠where_ f64_ run where(cond, a, b), f64 values + u8 cond, contig fast path.- baracuda_
kernels_ ⚠where_ f64_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ f64_ strided_ run where(cond, a, b), f64 values, strided / broadcast path.- baracuda_
kernels_ ⚠where_ i64cond_ bf16_ can_ implement - Pre-launch check for
where_i64cond_bf16. - baracuda_
kernels_ ⚠where_ i64cond_ bf16_ run where(cond, a, b), i64 cond + bf16 values, contig fast path.- baracuda_
kernels_ ⚠where_ i64cond_ bf16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ i64cond_ bf16_ strided_ run where(cond, a, b), i64 cond + bf16 values, strided / broadcast.- baracuda_
kernels_ ⚠where_ i64cond_ f16_ can_ implement - Pre-launch check for
where_i64cond_f16. - baracuda_
kernels_ ⚠where_ i64cond_ f16_ run where(cond, a, b), i64 cond + f16 values, contig fast path.- baracuda_
kernels_ ⚠where_ i64cond_ f16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ i64cond_ f16_ strided_ run where(cond, a, b), i64 cond + f16 values, strided / broadcast.- baracuda_
kernels_ ⚠where_ i64cond_ f32_ can_ implement - Pre-launch check for
where_i64cond_f32. - baracuda_
kernels_ ⚠where_ i64cond_ f32_ run where(cond, a, b), i64 cond + f32 values, contig fast path.- baracuda_
kernels_ ⚠where_ i64cond_ f32_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ i64cond_ f32_ strided_ run where(cond, a, b), i64 cond + f32 values, strided / broadcast.- baracuda_
kernels_ ⚠where_ i64cond_ f64_ can_ implement - Pre-launch check for
where_i64cond_f64. - baracuda_
kernels_ ⚠where_ i64cond_ f64_ run where(cond, a, b), i64 cond + f64 values, contig fast path.- baracuda_
kernels_ ⚠where_ i64cond_ f64_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ i64cond_ f64_ strided_ run where(cond, a, b), i64 cond + f64 values, strided / broadcast.- baracuda_
kernels_ ⚠where_ i64cond_ fp8e4m3_ can_ implement baracuda_kernels_where_i64cond_fp8e4m3_can_implement(baracuda kernels where i64cond fp8e4m3 can implement).- baracuda_
kernels_ ⚠where_ i64cond_ fp8e4m3_ run where(cond, a, b), i64 cond + Fp8E4M3 values, contig fast path.- baracuda_
kernels_ ⚠where_ i64cond_ fp8e4m3_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ i64cond_ fp8e4m3_ strided_ run baracuda_kernels_where_i64cond_fp8e4m3_strided_run(baracuda kernels where i64cond fp8e4m3 strided run).- baracuda_
kernels_ ⚠where_ i64cond_ i8_ can_ implement baracuda_kernels_where_i64cond_i8_can_implement(baracuda kernels where i64cond i8 can implement).- baracuda_
kernels_ ⚠where_ i64cond_ i8_ run where(cond, a, b), i64 cond + i8 values, contig fast path.- baracuda_
kernels_ ⚠where_ i64cond_ i8_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ i64cond_ i8_ strided_ run baracuda_kernels_where_i64cond_i8_strided_run(baracuda kernels where i64cond i8 strided run).- baracuda_
kernels_ ⚠where_ i64cond_ i16_ can_ implement baracuda_kernels_where_i64cond_i16_can_implement(baracuda kernels where i64cond i16 can implement).- baracuda_
kernels_ ⚠where_ i64cond_ i16_ run where(cond, a, b), i64 cond + i16 values, contig fast path.- baracuda_
kernels_ ⚠where_ i64cond_ i16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ i64cond_ i16_ strided_ run baracuda_kernels_where_i64cond_i16_strided_run(baracuda kernels where i64cond i16 strided run).- baracuda_
kernels_ ⚠where_ i64cond_ i32_ can_ implement baracuda_kernels_where_i64cond_i32_can_implement(baracuda kernels where i64cond i32 can implement).- baracuda_
kernels_ ⚠where_ i64cond_ i32_ run where(cond, a, b), i64 cond + i32 values, contig fast path.- baracuda_
kernels_ ⚠where_ i64cond_ i32_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ i64cond_ i32_ strided_ run baracuda_kernels_where_i64cond_i32_strided_run(baracuda kernels where i64cond i32 strided run).- baracuda_
kernels_ ⚠where_ i64cond_ i64_ can_ implement baracuda_kernels_where_i64cond_i64_can_implement(baracuda kernels where i64cond i64 can implement).- baracuda_
kernels_ ⚠where_ i64cond_ i64_ run where(cond, a, b), i64 cond + i64 values, contig fast path.- baracuda_
kernels_ ⚠where_ i64cond_ i64_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ i64cond_ i64_ strided_ run baracuda_kernels_where_i64cond_i64_strided_run(baracuda kernels where i64cond i64 strided run).- baracuda_
kernels_ ⚠where_ i64cond_ u8_ can_ implement baracuda_kernels_where_i64cond_u8_can_implement(baracuda kernels where i64cond u8 can implement).- baracuda_
kernels_ ⚠where_ i64cond_ u8_ run where(cond, a, b), i64 cond + u8 values, contig fast path.- baracuda_
kernels_ ⚠where_ i64cond_ u8_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ i64cond_ u8_ strided_ run baracuda_kernels_where_i64cond_u8_strided_run(baracuda kernels where i64cond u8 strided run).- baracuda_
kernels_ ⚠where_ i64cond_ u32_ can_ implement baracuda_kernels_where_i64cond_u32_can_implement(baracuda kernels where i64cond u32 can implement).- baracuda_
kernels_ ⚠where_ i64cond_ u32_ run where(cond, a, b), i64 cond + u32 values, contig fast path.- baracuda_
kernels_ ⚠where_ i64cond_ u32_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ i64cond_ u32_ strided_ run baracuda_kernels_where_i64cond_u32_strided_run(baracuda kernels where i64cond u32 strided run).- baracuda_
kernels_ ⚠where_ u8cond_ fp8e4m3_ can_ implement baracuda_kernels_where_u8cond_fp8e4m3_can_implement(baracuda kernels where u8cond fp8e4m3 can implement).- baracuda_
kernels_ ⚠where_ u8cond_ fp8e4m3_ run where(cond, a, b), u8 cond + Fp8E4M3 values, contig fast path.- baracuda_
kernels_ ⚠where_ u8cond_ fp8e4m3_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ u8cond_ fp8e4m3_ strided_ run baracuda_kernels_where_u8cond_fp8e4m3_strided_run(baracuda kernels where u8cond fp8e4m3 strided run).- baracuda_
kernels_ ⚠where_ u8cond_ i8_ can_ implement - Pre-launch check for
where_u8cond_i8. - baracuda_
kernels_ ⚠where_ u8cond_ i8_ run where(cond, a, b), u8 cond + i8 values, contig fast path.- baracuda_
kernels_ ⚠where_ u8cond_ i8_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ u8cond_ i8_ strided_ run where(cond, a, b), u8 cond + i8 values, strided / broadcast.- baracuda_
kernels_ ⚠where_ u8cond_ i16_ can_ implement - Pre-launch check for
where_u8cond_i16. - baracuda_
kernels_ ⚠where_ u8cond_ i16_ run where(cond, a, b), u8 cond + i16 values, contig fast path.- baracuda_
kernels_ ⚠where_ u8cond_ i16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ u8cond_ i16_ strided_ run where(cond, a, b), u8 cond + i16 values, strided / broadcast.- baracuda_
kernels_ ⚠where_ u8cond_ i32_ can_ implement - Pre-launch check for
where_u8cond_i32. - baracuda_
kernels_ ⚠where_ u8cond_ i32_ run where(cond, a, b), u8 cond + i32 values, contig fast path.- baracuda_
kernels_ ⚠where_ u8cond_ i32_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ u8cond_ i32_ strided_ run where(cond, a, b), u8 cond + i32 values, strided / broadcast.- baracuda_
kernels_ ⚠where_ u8cond_ i64_ can_ implement - Pre-launch check for
where_u8cond_i64. - baracuda_
kernels_ ⚠where_ u8cond_ i64_ run where(cond, a, b), u8 cond + i64 values, contig fast path.- baracuda_
kernels_ ⚠where_ u8cond_ i64_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ u8cond_ i64_ strided_ run where(cond, a, b), u8 cond + i64 values, strided / broadcast.- baracuda_
kernels_ ⚠where_ u8cond_ u8_ can_ implement - Pre-launch check for
where_u8cond_u8. - baracuda_
kernels_ ⚠where_ u8cond_ u8_ run where(cond, a, b), u8 cond + u8 values, contig fast path.- baracuda_
kernels_ ⚠where_ u8cond_ u8_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ u8cond_ u8_ strided_ run where(cond, a, b), u8 cond + u8 values, strided / broadcast.- baracuda_
kernels_ ⚠where_ u8cond_ u32_ can_ implement - Pre-launch check for
where_u8cond_u32. - baracuda_
kernels_ ⚠where_ u8cond_ u32_ run where(cond, a, b), u8 cond + u32 values, contig fast path.- baracuda_
kernels_ ⚠where_ u8cond_ u32_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ u8cond_ u32_ strided_ run where(cond, a, b), u8 cond + u32 values, strided / broadcast.- baracuda_
kernels_ ⚠where_ u32cond_ bf16_ can_ implement - Pre-launch check for
where_u32cond_bf16. - baracuda_
kernels_ ⚠where_ u32cond_ bf16_ run where(cond, a, b), u32 cond + bf16 values, contig fast path.- baracuda_
kernels_ ⚠where_ u32cond_ bf16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ u32cond_ bf16_ strided_ run where(cond, a, b), u32 cond + bf16 values, strided / broadcast.- baracuda_
kernels_ ⚠where_ u32cond_ f16_ can_ implement - Pre-launch check for
where_u32cond_f16. - baracuda_
kernels_ ⚠where_ u32cond_ f16_ run where(cond, a, b), u32 cond + f16 values, contig fast path.- baracuda_
kernels_ ⚠where_ u32cond_ f16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ u32cond_ f16_ strided_ run where(cond, a, b), u32 cond + f16 values, strided / broadcast.- baracuda_
kernels_ ⚠where_ u32cond_ f32_ can_ implement - Pre-launch check for
where_u32cond_f32. - baracuda_
kernels_ ⚠where_ u32cond_ f32_ run where(cond, a, b), u32 cond + f32 values, contig fast path.- baracuda_
kernels_ ⚠where_ u32cond_ f32_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ u32cond_ f32_ strided_ run where(cond, a, b), u32 cond + f32 values, strided / broadcast path. Each operand carries its own stride array.- baracuda_
kernels_ ⚠where_ u32cond_ f64_ can_ implement - Pre-launch check for
where_u32cond_f64. - baracuda_
kernels_ ⚠where_ u32cond_ f64_ run where(cond, a, b), u32 cond + f64 values, contig fast path.- baracuda_
kernels_ ⚠where_ u32cond_ f64_ strided_ can_ implement - Pre-launch check companion.
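The strided / broadcast where entries above note that each operand carries its own stride array. The sketch below illustrates how per-operand strides could drive a broadcasted select on the host. It is a reference under stated assumptions, not the kernel's implementation: `where_strided_ref` and `offset` are hypothetical helpers, strides are assumed to be in elements, a stride of 0 is assumed to broadcast that axis, and the output is assumed to be contiguous row-major.

```rust
/// Dot product of a multi-index with an operand's element strides.
fn offset(idx: &[usize], strides: &[usize]) -> usize {
    idx.iter().zip(strides).map(|(i, s)| i * s).sum()
}

/// Hypothetical host-side reference for strided where (u32 cond, f32 values).
/// Each operand is a (data, strides) pair; stride 0 broadcasts along that axis.
fn where_strided_ref(
    shape: &[usize],
    cond: (&[u32], &[usize]),
    a: (&[f32], &[usize]),
    b: (&[f32], &[usize]),
) -> Vec<f32> {
    let total: usize = shape.iter().product();
    let mut out = Vec::with_capacity(total);
    let mut idx = vec![0usize; shape.len()];
    for mut lin in 0..total {
        // Unravel the row-major linear index into a multi-index.
        for d in (0..shape.len()).rev() {
            idx[d] = lin % shape[d];
            lin /= shape[d];
        }
        let c = cond.0[offset(&idx, cond.1)];
        out.push(if c != 0 {
            a.0[offset(&idx, a.1)]
        } else {
            b.0[offset(&idx, b.1)]
        });
    }
    out
}
```

With shape [2, 3], a cond of three elements with strides [0, 1] is reused for both rows, and a one-element `b` with strides [0, 0] acts as a scalar.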
- baracuda_
kernels_ ⚠where_ u32cond_ f64_ strided_ run where(cond, a, b), u32 cond + f64 values, strided / broadcast.- baracuda_
kernels_ ⚠where_ u32cond_ fp8e4m3_ can_ implement baracuda_kernels_where_u32cond_fp8e4m3_can_implement(baracuda kernels where u32cond fp8e4m3 can implement).- baracuda_
kernels_ ⚠where_ u32cond_ fp8e4m3_ run where(cond, a, b), u32 cond + Fp8E4M3 values, contig fast path.- baracuda_
kernels_ ⚠where_ u32cond_ fp8e4m3_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ u32cond_ fp8e4m3_ strided_ run baracuda_kernels_where_u32cond_fp8e4m3_strided_run(baracuda kernels where u32cond fp8e4m3 strided run).- baracuda_
kernels_ ⚠where_ u32cond_ i8_ can_ implement baracuda_kernels_where_u32cond_i8_can_implement(baracuda kernels where u32cond i8 can implement).- baracuda_
kernels_ ⚠where_ u32cond_ i8_ run where(cond, a, b), u32 cond + i8 values, contig fast path.- baracuda_
kernels_ ⚠where_ u32cond_ i8_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ u32cond_ i8_ strided_ run baracuda_kernels_where_u32cond_i8_strided_run(baracuda kernels where u32cond i8 strided run).- baracuda_
kernels_ ⚠where_ u32cond_ i16_ can_ implement baracuda_kernels_where_u32cond_i16_can_implement(baracuda kernels where u32cond i16 can implement).- baracuda_
kernels_ ⚠where_ u32cond_ i16_ run where(cond, a, b), u32 cond + i16 values, contig fast path.- baracuda_
kernels_ ⚠where_ u32cond_ i16_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ u32cond_ i16_ strided_ run baracuda_kernels_where_u32cond_i16_strided_run(baracuda kernels where u32cond i16 strided run).- baracuda_
kernels_ ⚠where_ u32cond_ i32_ can_ implement baracuda_kernels_where_u32cond_i32_can_implement(baracuda kernels where u32cond i32 can implement).- baracuda_
kernels_ ⚠where_ u32cond_ i32_ run where(cond, a, b), u32 cond + i32 values, contig fast path.- baracuda_
kernels_ ⚠where_ u32cond_ i32_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ u32cond_ i32_ strided_ run baracuda_kernels_where_u32cond_i32_strided_run(baracuda kernels where u32cond i32 strided run).- baracuda_
kernels_ ⚠where_ u32cond_ i64_ can_ implement baracuda_kernels_where_u32cond_i64_can_implement(baracuda kernels where u32cond i64 can implement).- baracuda_
kernels_ ⚠where_ u32cond_ i64_ run where(cond, a, b), u32 cond + i64 values, contig fast path.- baracuda_
kernels_ ⚠where_ u32cond_ i64_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ u32cond_ i64_ strided_ run baracuda_kernels_where_u32cond_i64_strided_run(baracuda kernels where u32cond i64 strided run).- baracuda_
kernels_ ⚠where_ u32cond_ u8_ can_ implement baracuda_kernels_where_u32cond_u8_can_implement(baracuda kernels where u32cond u8 can implement).- baracuda_
kernels_ ⚠where_ u32cond_ u8_ run where(cond, a, b), u32 cond + u8 values, contig fast path.- baracuda_
kernels_ ⚠where_ u32cond_ u8_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ u32cond_ u8_ strided_ run baracuda_kernels_where_u32cond_u8_strided_run(baracuda kernels where u32cond u8 strided run).- baracuda_
kernels_ ⚠where_ u32cond_ u32_ can_ implement baracuda_kernels_where_u32cond_u32_can_implement(baracuda kernels where u32cond u32 can implement).- baracuda_
kernels_ ⚠where_ u32cond_ u32_ run where(cond, a, b), u32 cond + u32 values, contig fast path.- baracuda_
kernels_ ⚠where_ u32cond_ u32_ strided_ can_ implement - Pre-launch check companion.
- baracuda_
kernels_ ⚠where_ u32cond_ u32_ strided_ run baracuda_kernels_where_u32cond_u32_strided_run(baracuda kernels where u32cond u32 strided run).- baracuda_
kernels_ ⚠write_ slice_ b1_ can_ implement - Implementability check for
baracuda_kernels_write_slice_b1. Host-side only. - baracuda_
kernels_ ⚠write_ slice_ b1_ run - WriteSlice, 1-byte element (i8 / u8 / S8 / U8 / Bool / Fp8E4M3 / Fp8E5M2). Generic per-slab-element memcpy kernel.
- baracuda_
kernels_ ⚠write_ slice_ b2_ can_ implement - Implementability check for
baracuda_kernels_write_slice_b2. Host-side only. - baracuda_
kernels_ ⚠write_ slice_ b2_ run - WriteSlice, 2-byte element (f16 / bf16). See
b1 variant for the contract. - baracuda_
kernels_ ⚠write_ slice_ b4_ can_ implement - Implementability check for
baracuda_kernels_write_slice_b4. Host-side only. - baracuda_
kernels_ ⚠write_ slice_ b4_ run - WriteSlice, 4-byte element (f32 / F32Strict / i32).
- baracuda_
kernels_ ⚠write_ slice_ b8_ can_ implement - Implementability check for
baracuda_kernels_write_slice_b8. Host-side only. - baracuda_
kernels_ ⚠write_ slice_ b8_ run - WriteSlice, 8-byte element (f64 / i64 / Complex32).
- baracuda_
kernels_ ⚠write_ slice_ b16_ can_ implement - Implementability check for
baracuda_kernels_write_slice_b16. Host-side only. - baracuda_
kernels_ ⚠write_ slice_ b16_ run - WriteSlice, 16-byte element (Complex64).
- baracuda_
kernels_ ⚠write_ slice_ nibble_ can_ implement - Implementability check for
baracuda_kernels_write_slice_nibble. Host-side only. - baracuda_
kernels_ ⚠write_ slice_ nibble_ run - WriteSlice, nibble-packed (S4 / U4 — two elements per byte).
Constraint: range_start[rank-1] and range_end[rank-1] must both be even so no read-modify-write straddles a byte boundary. Shape / range_start arrays passed in are byte-counted on the innermost axis (the Rust side halves them before calling). - cublas
Cgemm ⚠Strided Batched cublasCgemmStridedBatched: single-precision complex strided-batched matrix-matrix multiply. Complex32 (== cuComplex == cuFloatComplex) analogue of cublasSgemmStridedBatched. Used by the WY-blocked batched-unmqr plan (crate’s BatchedOrmqrWyPlan<Complex32>): transa = CUBLAS_OP_C selects V^H for the first GEMM and T^H for the second GEMM when applying Q^H. - cublas
Cgeqrf ⚠Batched cublasCgeqrfBatched. Complex32 analogue. tau_array[b] is cuComplex (NOT real-typed, even though tau is real-magnitude for real Householder; cuBLAS uses complex tau across the complex family so the same apply routines can dispatch uniformly). - cublas
Create_ ⚠v2 cublasCreate_v2— create a cuBLAS handle.- cublas
Destroy_ ⚠v2 cublasDestroy_v2— destroy a cuBLAS handle.- cublas
Dgemm ⚠Strided Batched cublasDgemmStridedBatched— double-precision strided-batched matrix-matrix multiply. f64 analogue ofcublasSgemmStridedBatched.- cublas
Dgeqrf ⚠Batched cublasDgeqrfBatched. f64 analogue.- cublas
Dtrsm ⚠ cublasDtrsm— double-precision triangular solve. f64 analogue ofcublasStrsm.- cublas
Gemm ⚠Ex cublasGemmEx— mixed-precision GEMM with explicit dtype tags.- cublas
Gemm ⚠Strided Batched Ex cublasGemmStridedBatchedEx: mixed-precision strided-batched GEMM with explicit dtype tags (Phase 74). The Ex sibling of cublasSgemmStridedBatched: each batch slot i computes C[i] := α · op(A[i]) · op(B[i]) + β · C[i], where the slot-i operand is reached by adding i * stride_* (in elements) to the base pointer. stride_a / stride_b may be 0 to broadcast one matrix across all slots; stride_c must step disjoint output regions. - cublas
SetStream_ ⚠v2 cublasSetStream_v2— bind a CUDA stream to the cuBLAS handle.- cublas
Sgemm ⚠Strided Batched cublasSgemmStridedBatched: single-precision strided-batched matrix-matrix multiply. Each slot computes C[i] := α · op(A[i]) · op(B[i]) + β · C[i], where A[i], B[i], C[i] are reached by stepping i * stride{A,B,C} elements from the respective base pointers. - cublas
Sgeqrf ⚠Batched cublasSgeqrfBatched: batched QR factorization (single precision). Each Aarray[b] is overwritten in place with the geqrf-packed R (upper) + Householder reflectors (strict lower); TauArray[b] receives the Householder scalars. - cublas
Strsm ⚠ cublasStrsm— single-precision triangular solve.- cublas
Zgemm ⚠Strided Batched cublasZgemmStridedBatched: double-precision complex strided-batched matrix-matrix multiply. Complex64 analogue of cublasCgemmStridedBatched. - cublas
Zgeqrf ⚠Batched cublasZgeqrfBatched. Complex64 analogue.- cufft
Destroy ⚠ cufftDestroy(plan). Frees the plan’s internal workspace.- cufft
Exec ⚠C2C cufftExecC2C(plan, idata, odata, direction): complex-to-complex single-precision exec. direction is CUFFT_FORWARD or CUFFT_INVERSE. Inverse is unnormalized. - cufft
Exec ⚠C2R cufftExecC2R(plan, idata, odata): complex-to-real single precision. Input length is nx/2 + 1, output length is nx. Unnormalized; the caller must scale by 1/nx. - cufft
Exec ⚠D2Z cufftExecD2Z(plan, idata, odata): real-to-complex double precision. Same semantics as cufftExecR2C. - cufft
Exec ⚠R2C cufftExecR2C(plan, idata, odata): real-to-complex single precision. Input length is nx, output length is nx/2 + 1 (Hermitian-half). - cufft
Exec ⚠Z2D cufftExecZ2D(plan, idata, odata): complex-to-real double precision. Unnormalized. - cufft
Exec ⚠Z2Z cufftExecZ2Z(plan, idata, odata, direction): complex-to-complex double precision. Same semantics as cufftExecC2C. - cufft
Plan1d ⚠ cufftPlan1d(plan, nx, type, batch). Allocates a 1-D plan (single FFT of length nx, or batch independent FFTs each of length nx laid out contiguously). cuFFT’s plan struct owns its own workspace internally, so no caller-supplied workspace is required for the basic 1-D APIs. - cufft
Plan ⚠Many cufftPlanMany(plan, rank, n, inembed, istride, idist, onembed, ostride, odist, type, batch).- cufft
SetStream ⚠ cufftSetStream(plan, stream). Binds subsequent exec calls on this plan to the given CUDA stream. Returns 0 on success.- curand
Create ⚠Generator curandCreateGenerator(generator, rng_type). Returns 0 on success.- curand
Destroy ⚠Generator curandDestroyGenerator(generator). Returns 0 on success.- curand
Generate ⚠Normal curandGenerateNormal(generator, ptr, n, mean, stddev): writes n normally-distributed float samples to ptr. Note: cuRAND requires n to be even for the Box-Muller pair generator. Returns 0 on success. - curand
Generate ⚠Normal Double curandGenerateNormalDouble(generator, ptr, n, mean, stddev). Same parity contract as curandGenerateNormal. Returns 0 on success. - curand
Generate ⚠Uniform curandGenerateUniform(generator, ptr, n): writes n float samples in (0, 1] to ptr. Returns 0 on success. - curand
Generate ⚠Uniform Double curandGenerateUniformDouble(generator, ptr, n): writes n double samples in (0, 1] to ptr. Returns 0 on success. - curand
SetPseudo ⚠Random Generator Seed curandSetPseudoRandomGeneratorSeed(generator, seed). Returns 0 on success.- curand
SetStream ⚠ curandSetStream(generator, stream). Binds subsequent generator calls to the given CUDA stream. Returns 0 on success.- cusolver
DnCgeqrf ⚠ cusolverDnCgeqrf: single-precision complex QR factorization, in place. The packed output uses the same convention as the real variant: strict lower triangle + tau encode the Householder reflectors; the upper triangle holds R. - cusolver
DnCgeqrf_ ⚠buffer Size cusolverDnCgeqrf_bufferSize: workspace query for single-precision complex QR factorization. Mirrors cusolverDnSgeqrf_bufferSize. - cusolver
DnCheevd ⚠ cusolverDnCheevd: complex-Hermitian eigh (Complex32). A is overwritten in place with the eigenvectors (column-major); W receives the n real eigenvalues sorted ascending. - cusolver
DnCheevd_ ⚠buffer Size cusolverDnCheevd_bufferSize: complex-Hermitian divide-and-conquer eigh, single precision (Complex32). Eigenvalues are real-valued float. - cusolver
DnCreate ⚠ cusolverDnCreate(handle). Returns 0 on success.- cusolver
DnCreate ⚠Gesvdj Info cusolverDnCreateGesvdjInfo: allocate a Jacobi-SVD params object with cuSOLVER’s defaults (tol = 1e-7 for f32 / 1e-12 for f64, max_sweeps = 100, sort_eig = 1). - cusolver
DnCreate ⚠Params cusolverDnCreateParams: allocate the opaque params struct used by all 64-bit cuSOLVER APIs. The plan layer creates one lazily on first run (mirroring the handle lifecycle). - cusolver
DnCunmqr ⚠ cusolverDnCunmqr: apply Q, Q^T, or Q^H from a complex geqrf factorization to a complex C in place. - cusolver
DnCunmqr_ ⚠buffer Size cusolverDnCunmqr_bufferSize.- cusolver
DnDDgels ⚠ cusolverDnDDgels. f64 analogue.- cusolver
DnDDgels_ ⚠buffer Size cusolverDnDDgels_bufferSize. f64 analogue.- cusolver
DnDestroy ⚠ cusolverDnDestroy(handle). Returns 0 on success.- cusolver
DnDestroy ⚠Gesvdj Info cusolverDnDestroyGesvdjInfo. Returns 0 on success.- cusolver
DnDestroy ⚠Params cusolverDnDestroyParams. Returns 0 on success.- cusolver
DnDgeqrf ⚠ cusolverDnDgeqrf. f64 analogue.- cusolver
DnDgeqrf_ ⚠buffer Size cusolverDnDgeqrf_bufferSize. f64 analogue.- cusolver
DnDgesvd ⚠ cusolverDnDgesvd. f64 analogue.- cusolver
DnDgesvd_ ⚠buffer Size cusolverDnDgesvd_bufferSize. f64 analogue.- cusolver
DnDgesvda ⚠Strided Batched cusolverDnDgesvdaStridedBatched. f64 analogue.- cusolver
DnDgesvda ⚠Strided Batched_ buffer Size cusolverDnDgesvdaStridedBatched_bufferSize. f64 analogue.- cusolver
DnDgesvdj ⚠Batched cusolverDnDgesvdjBatched. f64 analogue.- cusolver
DnDgesvdj ⚠Batched_ buffer Size cusolverDnDgesvdjBatched_bufferSize. f64 analogue.- cusolver
DnDgetrf ⚠ cusolverDnDgetrf. f64 analogue.- cusolver
DnDgetrf_ ⚠buffer Size cusolverDnDgetrf_bufferSize. f64 analogue.- cusolver
DnDgetrs ⚠ cusolverDnDgetrs. f64 analogue.- cusolver
DnDormqr ⚠ cusolverDnDormqr. f64 analogue.- cusolver
DnDormqr_ ⚠buffer Size cusolverDnDormqr_bufferSize. f64 analogue.- cusolver
DnDpotrf ⚠ cusolverDnDpotrf. f64 analogue.- cusolver
DnDpotrf ⚠Batched cusolverDnDpotrfBatched. f64 analogue.- cusolver
DnDpotrf_ ⚠buffer Size cusolverDnDpotrf_bufferSize. f64 analogue.- cusolver
DnDsyevd ⚠ cusolverDnDsyevd. f64 analogue.- cusolver
DnDsyevd_ ⚠buffer Size cusolverDnDsyevd_bufferSize. f64 analogue.- cusolver
DnSSgels ⚠ cusolverDnSSgels: least-squares solve min ||A·x - b||² for m ≥ n full-rank A. Uses iterative refinement; reports niters ≥ 0 on convergence, or -N when a fallback is needed. Single precision. - cusolver
DnSSgels_ ⚠buffer Size cusolverDnSSgels_bufferSize: query the workspace size in bytes (the routine’s workspace is supplied as a raw byte buffer, not a typed element count, unlike the other legacy *_bufferSize entries). - cusolver
DnSet ⚠Stream cusolverDnSetStream(handle, stream). Binds subsequent cuSOLVER calls to the given CUDA stream. Returns 0 on success.- cusolver
DnSgeqrf ⚠ cusolverDnSgeqrf: QR factorization in place. A is overwritten: upper triangle = R, strict lower triangle + tau = Householder reflectors that encode Q. To materialize Q as a dense matrix, follow with ormqr against an identity. - cusolver
DnSgeqrf_ ⚠buffer Size cusolverDnSgeqrf_bufferSize. - cusolver
DnSgesvd ⚠ cusolverDnSgesvd: SVD, A = U · diag(S) · V^T. The jobu / jobv characters are ASCII bytes: 'A' (full U/V^T), 'S' (thin U/V^T), 'O' (overwrite A; disallowed at plan layer), 'N' (skip). - cusolver
DnSgesvd_ ⚠buffer Size cusolverDnSgesvd_bufferSize.- cusolver
DnSgesvda ⚠Strided Batched cusolverDnSgesvdaStridedBatched: f32 rectangular-batched approximate-SVD. Each batch slot factors a [m, n] matrix into U: [m, rank], S: [rank], V: [n, rank] (column-major; cuSOLVER returns V, not V^T). The host array h_R_nrmF (size batch_size) receives per-slot residual Frobenius norms. - cusolver
DnSgesvda ⚠Strided Batched_ buffer Size cusolverDnSgesvdaStridedBatched_bufferSize: query the device workspace size (in elements; multiply by sizeof(f32) for bytes) for the f32 rectangular-batched approximate-SVD. - cusolver
DnSgesvdj ⚠Batched cusolverDnSgesvdjBatched: batched Jacobi SVD A = U · diag(S) · V^T (single precision). Each matrix is square [m, m] (cuSOLVER’s Jacobi-batched API requires square input; thin rectangular is achievable via gesvdaStridedBatched, which is deferred). The plan surfaces V (not V^T); callers apply the transpose if needed. - cusolver
DnSgesvdj ⚠Batched_ buffer Size cusolverDnSgesvdjBatched_bufferSize. jobz is 0 (no vectors) or 1 (compute U / V). For batched, each matrix in A is independently SVD’d; outputs are packed [batch * m * m] etc. - cusolver
DnSgetrf ⚠ cusolverDnSgetrf: LU factorization with partial pivoting, in place. Computes P · A = L · U and overwrites A with the factors (L unit-diagonal, stored in the strict lower triangle; U in the upper triangle). ipiv[i] is the row swap performed at step i (1-based per LAPACK convention). - cusolver
DnSgetrf_ ⚠buffer Size cusolverDnSgetrf_bufferSize— query workspace element count.- cusolver
DnSgetrs ⚠ cusolverDnSgetrs: solve op(A) · X = B using the packed LU - cusolver
DnSormqr ⚠ cusolverDnSormqr: apply Q (or Q^T) from geqrf output to a matrix C in place. With C = I this materializes Q as a dense matrix for the “thin” or “full” QR. - cusolver
DnSormqr_ ⚠buffer Size cusolverDnSormqr_bufferSize. trans selects Q vs Q^T; side selects left vs right multiply. - cusolver
DnSpotrf ⚠ cusolverDnSpotrf: Cholesky factorization in place (A := L or A := U). Leaves the unused triangle untouched. dev_info returns 0 on success, or k > 0 if the leading minor of order k is not positive definite (factorization halted at step k). - cusolver
DnSpotrf ⚠Batched cusolverDnSpotrfBatched(handle, uplo, n, Aarray, lda, infoArray, batchSize). Each matrix in Aarray[batch_size] is factored independently in place. Returns 0 on success; per-matrix factor info lands in infoArray[i]. - cusolver
DnSpotrf_ ⚠buffer Size cusolverDnSpotrf_bufferSize: query the workspace size as an element count (multiply by sizeof(T) to get the byte count for cudaMalloc). - cusolver
DnSsyevd ⚠ cusolverDnSsyevd: real-symmetric eigh, f32. A is overwritten in place with the eigenvectors (column-major) when jobz == VECTOR. W receives the n eigenvalues sorted ascending. - cusolver
DnSsyevd_ ⚠buffer Size cusolverDnSsyevd_bufferSize— query workspace element count for real-symmetric divide-and-conquer eigh, f32.- cusolver
DnXgeev ⚠ cusolverDnXgeev: general (non-symmetric) eigendecomposition. A is destroyed in place (used as scratch by the LAPACK-equivalent algorithm). W receives the n complex eigenvalues; VL / VR (when requested) receive the column-major left / right complex eigenvectors. For non-Hermitian input the eigenvalues can be complex even when the input is real, hence the always-complex W storage. - cusolver
DnXgeev_ ⚠buffer Size cusolverDnXgeev_bufferSize: query the host + device byte counts for cusolverDnXgeev at the given problem size and element types. The two output pointers receive byte counts (NOT element counts, unlike the legacy _bufferSize APIs). - cusolver
DnZgeqrf ⚠ cusolverDnZgeqrf— double-precision complex QR factorization.- cusolver
DnZgeqrf_ ⚠buffer Size cusolverDnZgeqrf_bufferSize. f64-complex analogue of the C variant.- cusolver
DnZheevd ⚠ cusolverDnZheevd. Complex64 analogue. - cusolver
DnZheevd_ ⚠buffer Size cusolverDnZheevd_bufferSize. Complex64 analogue. - cusolver
DnZunmqr ⚠ cusolverDnZunmqr. f64-complex analogue.- cusolver
DnZunmqr_ ⚠buffer Size cusolverDnZunmqr_bufferSize. f64-complex analogue.
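The _bufferSize entries above report sizes in two different units: the legacy queries (for example Sgetrf, Spotrf, Ssyevd, SgesvdaStridedBatched) report an element count, while SSgels and Xgeev report bytes. The following is a minimal sketch of the conversion a caller might perform before allocating a device workspace. The helper name legacy_workspace_bytes is hypothetical and is not part of this crate; it only illustrates the arithmetic.

```rust
use std::mem::size_of;

/// Hypothetical helper (not part of this crate): convert the element count
/// reported by a legacy `*_bufferSize` query into a byte count suitable for
/// a device allocation. Returns `None` for a negative count or on overflow.
pub fn legacy_workspace_bytes<T>(lwork: i32) -> Option<usize> {
    if lwork < 0 {
        return None;
    }
    // Element count times element size gives the allocation size in bytes.
    (lwork as usize).checked_mul(size_of::<T>())
}
```

For the byte-counted queries (SSgels, Xgeev) no multiplication applies; the reported value is already the allocation size.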
Type Aliases§
- cuFloat
Complex cuFloatComplex is the canonical CUDA name for the single-precision complex struct, an alias for cuComplex. Surfaced so cuSOLVER’s complex APIs (cusolverDn{C,Z}unmqr, …) can spell their signatures in the same vocabulary as the NVIDIA headers. - cublas
Diag Type_ t - cuBLAS diag-type tag for triangular solves (trsm). CUBLAS_DIAG_NON_UNIT = 0, CUBLAS_DIAG_UNIT = 1. - cublas
Fill Mode_ t - cuBLAS fill-mode tag re-used by cuSOLVER for triangular factorizations. CUBLAS_FILL_MODE_LOWER = 0, CUBLAS_FILL_MODE_UPPER = 1. - cublas
Handle_ t - Opaque cuBLAS handle. Used by cublas*geqrfBatched (which lives in cuBLAS, not cuSOLVER) and any future cuBLAS-routed linalg paths. - cuda
Data Type cudaDataType tag used by the 64-bit cuSOLVER APIs (Xgeev, Xgesvd, …) to identify tensor element types. These constants originate in <library_types.h> and are stable across CUDA versions. - cufft
Handle - Opaque cuFFT plan handle. Unusually for CUDA libraries this is an integer ID (int), not a pointer. A value of -1 is reserved at the safe-plan layer as the “not yet created” sentinel; cuFFT itself returns small non-negative integers for live handles. - cufft
Result - cuFFT result code type. CUFFT_SUCCESS = 0. Any non-zero return is mapped to a negative status at the safe-plan layer for distinct error reporting. - curand
Generator_ t - Opaque cuRAND generator handle. Treated as a stateful object owned by safe Rust at the plan layer — never inspect its internals here.
- cusolver
DnHandle_ t - Opaque cuSOLVER dense handle. Stateful object; the plan layer creates one lazily on first run and reuses it across launches. - cusolver
DnParams_ t - Opaque parameter struct used by the 64-bit cuSOLVER APIs (Xgeev, Xpotrf, …). The struct holds advanced configuration (algorithm choice, precision modes); for the trailblazer the plan layer leaves it at defaults. Created via cusolverDnCreateParams and destroyed via cusolverDnDestroyParams. - cusolver
EigMode_ t - cuSOLVER eig-mode enum tag (used by syevd / heevd / Xgeev). 0 = NOVECTOR (compute eigenvalues only), 1 = VECTOR (eigenvalues + eigenvectors). Routed through as an i32 for the legacy syevd / heevd APIs. The CUSOLVER_EIG_MODE_NOVECTOR / _VECTOR constants live further down (originally introduced for gesvdjBatched’s jobz argument; reused verbatim here for the eig family). - gesvdj
Info_ t - Opaque cuSOLVER Jacobi-SVD parameter object. Stateful; created once per plan, reused across launches, destroyed on plan drop. Used by cusolverDn*gesvdjBatched for the batched-SVD path.
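The cuFFT handle and result conventions described above (an integer handle with -1 reserved as the "not yet created" sentinel, and non-zero results mapped to a negative status) can be sketched as follows. This is an illustration under stated assumptions, not the crate's actual code: the alias names, the constant, and both functions are hypothetical, and the exact negative value the safe-plan layer produces is not specified in this listing, so simple negation is assumed.

```rust
// Hypothetical local aliases; the listing only says the handle is a C `int`
// and that the result code uses 0 for success.
pub type CufftHandle = i32;
pub type CufftResult = i32;

/// Sentinel reserved at the safe-plan layer for "not yet created".
pub const HANDLE_NOT_CREATED: CufftHandle = -1;

/// A handle counts as live once it no longer holds the sentinel value.
pub fn is_live(handle: CufftHandle) -> bool {
    handle != HANDLE_NOT_CREATED
}

/// Assumed mapping: success stays 0, any positive cuFFT code becomes its
/// negation, and an already-negative value is passed through unchanged.
pub fn map_cufft_result(raw: CufftResult) -> i32 {
    if raw > 0 {
        -raw
    } else {
        raw
    }
}
```

Keeping the sentinel outside the range cuFFT hands out (small non-negative integers) means a plan can defer creation without needing a separate flag.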