baracuda 0.0.1-alpha.68

Idiomatic Rust wrappers for the NVIDIA CUDA stack (Driver API, Runtime API, NVRTC, cuBLAS, cuDNN, NCCL, NVML, ...). Umbrella crate.
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
# baracuda

![A great barracuda — the project's namesake, minus one letter.](https://raw.githubusercontent.com/ciresnave/baracuda/refs/heads/main/assets/barracuda.png)

> **About the name.** Yes, we know — it's spelled **barracuda** (two Rs). That
> name was taken on crates.io, so we dropped one R and kept swimming.

A unified Rust ML-op facade over the NVIDIA CUDA ecosystem.

![License](https://img.shields.io/badge/license-MIT%2FApache--2.0-blue)
![Status](https://img.shields.io/badge/status-alpha.67-orange)
![CUDA](https://img.shields.io/badge/CUDA-12.x-76b900)
![Tests](https://img.shields.io/badge/regression-2280%2B%2F0-success)

## What baracuda is

baracuda is a Rust workspace that exposes every primitive an ML framework
expects — the union of PyTorch (`torch.*` + `nn.functional`) and JAX
(`jax.lax.*` + `jax.numpy.*`) — through a single `Plan`-based crate surface
called [`baracuda-kernels`]. Internally each plan dispatches to:

1. The appropriate NVIDIA-library wrapper crate (cuBLAS, cuDNN, cuFFT,
   cuSOLVER, cuRAND, cuSPARSE, cuTENSOR, NPP, CV-CUDA, CUTLASS) when one
   already covers the op well, or
2. A bespoke hand-rolled `.cu` kernel shipped in [`baracuda-kernels-sys`]
   when no NVIDIA library covers the op (or covers it poorly at the shapes
   that matter for modern transformer / vision / GNN workloads).

Callers import **one** crate (`baracuda-kernels`) and reach for **one** API
style. The dispatch decision — which is observable through
`Plan::sku()` for telemetry — is otherwise invisible. Switching from a
CUTLASS-backed SKU to a bespoke-backed SKU is a layout flag, not an import
change.

baracuda is for downstream Rust ML / inference / training frameworks that
need access to the full CUDA stack without re-vendoring it themselves. The
workspace also ships idiomatic stand-alone wrappers for every CUDA library
under `crates/baracuda-<lib>` if you want to skip the kernel facade and
talk to one library directly.

## Status

**In active development — alpha.67.** **2280+ GPU tests passing,
zero failures** across the 6 critical test crates on an RTX 4070
(sm_89; the `baracuda-kernels` suite alone is 2180/0 across 513 test
binaries). alpha.67 (Phase 74, Fuel-ask) ships the **plain dense FP
GEMM family**: 12 cuBLAS-backed flat C symbols
(`baracuda_kernels_gemm_dense_{f32, f64, f16, bf16}_*`) with
RRR / RCR / **CRR** layouts, flexible leading dims, and
strided-batch folded into the base signature — plus the
`DenseGemmPlan<T>` typed plan, the `ReduceToPlan<T, N>`
broadcast-reverse reduction facade, `UnaryKind::Step`, and gelu
flavor-disambiguation docs. This closes the last non-baracuda CUDA
surface in Fuel (its own cuBLAS MatMul wrapper). alpha.66 (Fuel-ask)
exposed **per-device VRAM queries** on the Driver-API `Device` —
`vram_free()` / `vram_total()` / `vram_info()` wrap
`cuMemGetInfo_v2` so downstream optimizers can read runtime memory
pressure without round-tripping through a failed allocation. Phase 73 follow-up (alpha.65) lands a **17-33× decode
speedup** via a focused FlashDecodingPlan (split-K, seq_q=1),
**4× win at GQA shapes** via the new `num_kv_heads` descriptor
field, the long-awaited `FlashSdpaPlan` GQA-broadcast routing fix,
and a SDPA gap closure that makes baracuda's standard MHA shape
**50% faster than PyTorch** by making `fa2` a default feature. The
ConcatPlan (KV-cache decode) and reduce_axis (small-shape rows)
kernels also got significant rewrites — 13× and 2.6-15.4×
respectively. Phase 63 (alpha.63, Fuel-ask) closes the FlashAttention
saved-tensor wiring gap: **NEW
`baracuda_kernels_fa2_sdpa_lse_size(batch, num_heads, seq_q) -> usize`**
dense LSE pre-allocation helper (sibling of the existing varlen
`_varlen_lse_size`); **load-bearing "LSE saved-tensor contract"
docs** on the FW + BW trailblazers naming the exact FW→saved-LSE→BW
handoff; **new docs guide** at `docs/guides/fa2-saved-tensor-contract.md`
showing the wiring pattern downstream autograd frameworks should
use. **Test investment**: 8 new tests — 3 host-only `lse_size`
helper sanity tests + 5 GPU FW→BW roundtrip tests (f16/bf16 ×
d128 causal/noncausal + d64 causal) + 4 BW feature-surface tests
backfilling the gaps left by `fa2_backward_smoke.rs` (BW with
sliding window, softcap, ALiBi, and all-features composed). The
existing `softmax_lse` output on `baracuda_kernels_fa2_sdpa_<dt>_run`
(v1) and `..._run_v2` (Phase 59a) — turns out Fuel didn't need a
new "v3 with lse" variant; the FW has been writing LSE since
alpha.56, just lacked the size helper + saved-tensor wiring
documentation. Phase 62 (alpha.62, Fuel-ask) lifts the in-place
op contract from contig-only (Phase 61) to **strided** by shipping
11 new affine in-place FFI symbols (4 contig int dtype backfill —
`i32`/`i64`/`u8`/`i8` matching forward affine; 7 strided variants
across the full forward-strided dtype matrix — `f32`/`f64`/`i32`/`i64`/`u8`/`bf16`/`f16`)
and documenting the **stride-equality precondition** for
same-pointer aliasing on the unary / binary / ternary strided
trailblazers as a stable public contract. NEW
`baracuda_kernels_types::strides_equal` host helper for callers
to validate the precondition before dispatch. Zero new bespoke
kernels for the elementwise unary/binary/ternary families —
their existing strided launchers are aliasing-safe under the
contract. Test investment: 14 host-only unit tests +
17 GPU smoke tests for the new FFI surface + 7 aliasing-contract
proof tests across contig + strided trailblazers (also backfills
the Phase 61 contig contract that shipped without test coverage).
Phase 61 (alpha.61) completed the alpha.55 baseline with bf16/f16
in-place affine + the original contig same-pointer contract doc.
Phase 60 lifted FA2 FW to the full Candle-fork-extended
9-head_dim set ({32, 64, 96, 128, 160, 192, 224, 256, 512}) via
12 new vendored `.cu` files + 32 FW smoke tests. Phase 59a + 59b
added the full FA2 v2.8.3 surface (FW + BW + varlen across
head_dims 32-256, GQA, ALiBi, sliding window, softcap) plus 48
new smoke tests, closing Fuel's FA2-retirement requirements. Phase 59c (consolidation pass, alpha.59) fixed a
pre-existing parallel-test race in the bespoke flash kernel's
SMEM-carveout call surfaced by Phase 59a's 5-head_dim fanout, plus
updated `flash_sdpa_backward_smoke` to force the bespoke backend on
f16/bf16 (Phase 59b made FA2 the new default BW backend, breaking
source-compat for the existing bespoke BW smoke tests).
Phase 42-44
add three opt-in backends (FA2, mHC.cu, ozIMMU); none are on the
default build path. Phase 44b internalized the ozIMMU sources
(clean-fork; cutf submodule retired; Linux + Windows both build).
Phase 49 adds the `baracuda-optim` sibling crate (Adam / LAMB / SGD
via vendored Apex `multi_tensor_apply`) — gated behind the `optim`
feature so inference-only consumers don't pay the FFI surface cost.
Phase 55 adds the `baracuda-transformer-engine` sibling crate
(NVIDIA TransformerEngine FP8 cast + delayed-scaling recipe,
Apache-2.0) — gated behind the `tensor_engine` feature. On Ada
(sm_89) the FP8 wins are bandwidth-saving only (KV cache, weights);
the recipe machinery is forward-compatible with Hopper / Blackwell
where the MMA throughput win also materializes.

Phase coverage (see [`ARCHITECTURE.md`](ARCHITECTURE.md) for the phase
matrix):

| Phase | Scope | Status |
| --- | --- | --- |
| 59a | FA2 FW expansion (alpha.59) — full upstream feature parity (head_dim fanout {32,64,96,192,256}; GQA; ALiBi; sliding window; softcap): vendored 20 new `.cu` files from Dao-AILab FA2 v2.8.3 (head_dims 32/64/96/192/256 × {fp16, bf16} × {causal, non-causal}) bringing the FW vendor coverage to the full upstream set of {32, 64, 96, 128, 192, 256}. Upstream FA2 v2.8.3 does NOT ship head_dims 160/224/512 — those are permanently out-of-scope (no source). Launcher (`kernels/attention/fa2_launcher.cu`) rewritten to dispatch all 6 supported head_dims via runtime switch. NEW `..._run_v2` + `..._can_implement_v2` FFI entry points (+4 symbols) carrying ALiBi slopes + per-head-or-per-batch layout selector + sliding window left/right bounds + Gemma-2-style softcap. v1 entry points preserved for backwards-compat. GQA-divisible head counts (`num_heads % num_heads_k == 0`) now accepted on the FA2 path. `FlashSdpaDescriptor` is now `#[non_exhaustive]` with `::new(...)` + chainable `with_window_size_left`/`with_window_size_right`/`with_softcap` builders (Phase 32 convention). `FlashSdpaArgs` gained `alibi_slopes: Option<TensorRef<f32, 2>>`. Bespoke backend rejects sliding-window/softcap/ALiBi at select-time with clear errors. ~33 descriptor + ~30 args callsites migrated to the builder pattern. 4 new smoke test files (26 new test functions): `fa2_hdim_fanout_smoke` (20), `fa2_gqa_smoke` (1), `fa2_alibi_smoke` (3), `fa2_sliding_window_smoke` (3), `fa2_softcap_smoke` (4). Out of scope (Phase 59b territory): BW path, varlen, split-KV. | done |
| 59c | Bespoke flash SMEM-carveout race fix + flash_sdpa_backward smoke test routing fix (alpha.59 consolidation pass): added `std::mutex`-serialized helper `set_dynamic_smem_serialized` around all `cudaFuncSetAttribute(MaxDynamicSharedMemorySize)` calls in `baracuda_flash_sdpa.cuh` + `baracuda_flash_sdpa_sm89.cuh` (5 call sites total: FW + BW dQ + BW dKdV + sm_89 FW + sm_89 strided FW). Pre-existing flake (root cause: Phase 6 / Milestone 6.6 host wrapper) that surfaced as `CutlassInternal(1001)` (= `cudaErrorMissingConfiguration`) at ~33% rate on Phase 59a's 20-test hdim fanout, specifically for d_k=96 + fp16 (smem ~50 KiB, just past the 48 KiB cudaFuncSetAttribute trigger). Confirmed fix via 3 stress runs after fix: 60/60 tests pass. Also fixed `flash_sdpa_backward_smoke`'s f16/bf16 paths to explicitly request `BackendKind::Bespoke` — Phase 59b made FA2 the default BW backend for f16/bf16 (more permissive heuristic), which broke source-compat for the existing bespoke BW smoke tests (they fed `lse: f16` not `lse_f32`). | done |
| 59b | FA2 BW + varlen (alpha.59; closes Fuel's FA2-retirement requirements): vendored 24 new BW `.cu` files (`flash_bwd_hdim{32,64,96,128,192,256}_{fp16,bf16}_{,causal}_sm80.cu` — full FA2 v2.8.3 BW set, mirrors 59a FW vendor 1:1) plus 3 new BW headers (`flash_bwd_kernel.h`, `flash_bwd_launch_template.h`, `flash_bwd_preprocess_kernel.h`). **Key finding**: varlen does NOT have a separate .cu file family upstream — FA2 v2.8.3 dispatches varlen via a runtime `cu_seqlens_q != nullptr` check inside the existing FW/BW launch templates, so the same per-(headdim, dtype, causal) instantiations serve dense and varlen callers. NEW `kernels/attention/fa2_backward_launcher.cu` (BW dispatch, supports dense + varlen via two `fill_*_params` helpers) + `fa2_varlen_launcher.cu` (varlen FW). +12 new FFI symbols (BW dense ×2 + can_implement ×2 + workspace_size; varlen FW ×2 + can_implement ×2 + lse_size; varlen BW ×2 + can_implement ×2 + workspace_size). API: `FlashSdpaBackwardDescriptor` is now `#[non_exhaustive]` with `::new(...)` + sliding-window/softcap builders. `FlashSdpaBackwardArgs` gained `lse_f32: Option<TensorRef<f32, 3>>` (FA2 stores LSE in f32 regardless of T) + `alibi_slopes`. `FlashSdpaBackwardPlan` extended with `BackendChoice::FlashAttentionV2` arm (additive — bespoke path source-compat preserved). NEW `FlashSdpaVarlenPlan` / `FlashSdpaVarlenBackwardPlan` plan families with packed-batch `[total_q, H, D]` layout + `cu_seqlens_q`/`cu_seqlens_k` index tensors + f32 LSE `[H, total_q + 128*B]`. BW workspace = `dq_accum + dsoftmax_d` (sizes via `..._backward_workspace_size`); launcher zero-fills via `cudaMemsetAsync`. Determinism: FA2 BW uses atomicAdd into dq_accum, so NOT bit-stable run-to-run (precision SKU tags this honestly). 2 new smoke test files: `fa2_backward_smoke.rs` (12 tests: workspace sizing + eligibility + e2e BW for d ∈ {64,128,192,256} × {f16,bf16} × {causal,non-causal}), `fa2_varlen_smoke.rs` (5 tests: plan selection, lse_size formula, varlen FW with 3 packed sequences, varlen BW with 2 sequences, varlen × GQA). | done |
| 60 | FA2 head_dim {160, 224, 512} FW expansion (alpha.60) — **corrects Phase 59a's incorrect "permanently out-of-scope" claim**. The Candle fork (`EricLBuehler/candle`) has carried hd160/192/224/256 since 2023-07 (PR #245 by Laurent Mazare); hd224 was restored by PR #2688 (Michael Feil, 2024-12-31); hd512 was added by PR #3417 (Eric Buehler, merged 2026-03-28 — adds the `cudaDeviceGetAttribute(cudaDevAttrMaxSharedMemoryPerBlockOptin)` SMEM opt-in path and updates the splitkv block-size formula). Phase 60 vendors the 12 missing FW `.cu` files from those PRs into baracuda's FA2 tree (8 hd160/224 from `EricLBuehler/candle@main`; 4 hd512 from `huggingface/candle@5430d32c`) plus the corresponding `flash_fwd_launch_template.h` + `static_switch.h` patches. **BW path NOT extended** — hd160/224 fall on FA2 BW kernel's `kBlockKSmem = (kHeadDim % 64 == 0) ? 64 : 32` constraint (BW atom_layout assumes 64); hd512 needs `kBlockM = 32` to fit any SMEM budget but BW kernel_traits static-asserts `kBlockM >= 64`. Upstream FA2 and the Candle fork ship no BW for these three either — limitation is fundamental to FA2's BW algorithm, not an oversight. Phase 60 attempted both paths; the experiment + reasoning is documented in `VENDOR.md`, in code comments at the dropped registration sites, and in `FA2_BW_SUPPORTED_HEAD_DIMS` (kept at `{32, 64, 96, 128, 192, 256}`). Callers needing BW at hd160/224/512 transparently fall back to the bespoke 3-kernel SDPA BW pipeline (the only path that was supporting them previously, anyway). 12 new FW smoke test functions in `fa2_hdim_fanout_smoke`. `FA2_SUPPORTED_HEAD_DIMS` (FW) lifted to `{32, 64, 96, 128, 160, 192, 224, 256, 512}` — full Candle-fork-extended set. | done |
| 63 | FlashAttention saved-tensor contract — dense LSE size helper + FW/BW wiring docs (alpha.63, Fuel-ask). Closes the wiring gap for downstream autograd integration of baracuda's FA2 backward. **1 new FFI symbol**: `baracuda_kernels_fa2_sdpa_lse_size(batch, num_heads, seq_q) -> usize` returning f32 element count — sibling of the existing `_varlen_lse_size` (Phase 59b). FW v1 + v2 have written `softmax_lse` since alpha.56 — turns out Fuel didn't need a new "v3 with lse" forward variant as their ask suggested; just the size helper for pre-allocation + clarity on the saved-tensor pattern. **Documented "LSE saved-tensor contract"** as a stable public contract on the FW + BW trailblazers (`baracuda_kernels_fa2_sdpa_f16_run` / `..._backward_f16_run`): pre-allocate via `_lse_size`, pass same f32 buffer to FW as output and BW as input, ALiBi/sliding-window/softcap params must match between FW and BW. **NEW docs guide** at `docs/guides/fa2-saved-tensor-contract.md` (~180 lines, ASCII handoff diagram + side-by-side FW/BW code samples). BW head_dim cap confirmed at 256 (matches Fuel's Vulkan limit); hd160/224/512 BW remains structurally not supported by FA2 (per Phase 60), callers fall back to bespoke `SdpaBackwardPlan`. **Test investment**: 3 host-only `lse_size` helper tests + 5 GPU FW→BW roundtrip tests (the load-bearing wiring proof — pre-allocate LSE, FW writes it, BW reads it, verify dQ/dK/dV finite + non-zero) + 4 BW feature-surface tests (sliding window, softcap, ALiBi, all-features composed — backfills the gaps in the existing `fa2_backward_smoke.rs` which only tested the base path with `alibi_slopes: None`). All 12 new tests pass on RTX 4070. Option B (recompute-LSE backward variant) explicitly rejected: 2× backward compute for zero functional benefit when the saved-tensor pattern already works. PagedAttention backward filed as "ask if needed" per Fuel — not preemptively built. | done |
| 62 | Strided in-place op support + comprehensive test investment (alpha.62, Fuel-ask). Lifts the in-place contract from contig-only (Phase 61) to strided. **11 new FFI symbols** on the affine in-place family: 4 contig int dtype backfill (`i32`/`i64`/`u8`/`i8` matching forward affine matrix) + 7 strided variants across the full forward-strided dtype set (`f32`/`f64`/`i32`/`i64`/`u8`/`bf16`/`f16`). Half-precision strided uses the same f32-scalar / upcast-to-f32 / downcast pattern as the forward strided f16/bf16 kernels. **Same-pointer aliasing contract documented for the strided trailblazers** (`unary_neg_f32_strided_run`, `binary_add_f32_strided_run`, `ternary_clamp_f32_strided_run`) as a stable public contract: aliasing is safe IFF the aliased input's stride array equals `stride_y` element-for-element. NEW `baracuda_kernels_types::strides_equal(a, b)` host helper for callers to validate the precondition before dispatching. **Zero new bespoke CUDA kernels** for the elementwise unary/binary/ternary families — their existing strided launchers are structurally aliasing-safe under the contract (each thread reads its own stride-offset cell before writing, same per-thread pattern as the contig case). Unblocks Fuel's strided in-place op fanout (every existing strided unary/binary/ternary forward kernel becomes an in-place candidate with `x_ptr == y_ptr` + equal strides). **Test investment**: 14 new host-only unit tests for `strides_equal` + `contiguous_stride` in `baracuda-kernels-types`; 17 new GPU direct-FFI smoke tests covering all 11 new affine in-place symbols + backfill tests for the alpha.55 baseline (f32/f64) and alpha.61 half-precision (bf16/f16) contig in-place; 7 aliasing-contract proof tests across contig + strided trailblazers (also backfills the Phase 61 contig contract that shipped without test coverage). Multi-pass families (Softmax / LayerNorm / RMSNorm / etc.) explicitly out of scope — same as Phase 61. | done |
| 61 | In-place op infrastructure completion + same-pointer aliasing contract (alpha.61, Fuel-ask) — 2 new bf16/f16 FFI symbols + docstring tightening. (1) `baracuda_kernels_affine_inplace_{bf16,f16}_run` complete the 4-dtype matrix on top of the alpha.60 f32/f64 in-place affine helper, with f32-scalar ABI matching the forward `affine_{bf16,f16}_run` convention (avoids passing `__nv_bfloat16`/`__half` by value through the C ABI). Kernels reuse the forward upcast-to-f32 / downcast-to-storage pattern from `affine_contig_kernel_{f16,bf16}`. Unblocks Fuel's `Op::AddScalar`/`Op::MulScalar` in-place rewrites + weight-decay scaling on bf16/f16 model weights without the previous Cast → Affine → Cast scratch-buffer round-trip. (2) Documented same-pointer aliasing safety as a stable public contract on the three contig elementwise trailblazers — `unary_neg_f32_run` (covers ~30 plain unary launchers + `unary_param_*` family across all dtypes via the existing "Same device-pointer contract..." inheritance line), `binary_add_f32_run` (covers ~20 binary launchers), `ternary_clamp_f32_run` (already documented since alpha.36). Unblocks Fuel's planned in-place expansion (16+ unary in-place op families + 4 binary in-place op families + ClampInplace + PowIInplace) with zero new baracuda symbols for the elementwise case — Fuel dispatches the forward symbol with `x_ptr == y_ptr` (or `a_ptr == y_ptr` for binary). Strided in-place variants (Phase 62 candidate) deferred — v1 contract from Fuel's executor is contiguous + zero-offset. | done |
| 0 | Crate scaffolding, shared type vocabulary | done |
| 1 | int8 GEMM RRR (Fuel-blocking, 18 SKUs) | done |
| 2 | FP8 / int4 / bin GEMM completion | done |
| 3 | Elementwise + shape / layout (Categories B, B', C, C', D, N) | done |
| 4 | Reductions + scans + random (Categories E, F, Q) | done |
| 5 | Normalization + softmax + loss (Categories G, H, R) | done |
| 6 | Attention + linalg + FFT (Categories K, Linalg, U) | done |
| 7 | Convolution + pooling + indexing + embedding + segment (Categories I, J, L, M, S) | done |
| 8 | Quantization helpers + GGUF + MoE (Category P, V) | done |
| 9 | Sort / topk / image / NMS (Categories O, T) | done |
| 10 | sm_89 (Ada Lovelace) tuning sweep | done |
| 11 | Fuel feedback integration (alpha.27) — ScalarType ergonomics, Conv/Pool fanout, GGUF Q8_K MMVQ, i64 indices, Sparsemax cap lift, atomicAdd-via-CAS, build-env probe | done |
| 12 | PowI + ArgMax/Min u32/i32 outputs (alpha.28) — `IndexOutputElement` sealed trait | done |
| 13 | WriteSlice + Contiguize + sub-byte casts + Triu/Tril (alpha.29) — KV-cache fast path, retires Fuel's D2H/CPU/H2D fallback, plus `DeviceBuffer::zero()` (alpha.30) | done |
| 14 | Strided FFI siblings (alpha.31) — Affine, PowI, Triu/Tril, RoPE+SDPA, GGUF MMVQ activation-strided + W byte offset; 56 new FFI symbols | done |
| 15 | Quick wins + correctness cleanup (alpha.32) — MMVQ alignment guard, OneHot/Nonzero i64 wrappers, MoE fixture race fix | done |
| 16 | Pool completion (alpha.33) — bit-exact adaptive pool {1,2,3}d, bespoke LpPool {1,2}d, bespoke FractionalMaxPool {2,3}d; 48 new FFI symbols | done |
| 17 | SDPA / attention completion (alpha.34) — Flash SDPA sm_89 strided FW + SDPA BW GQA-broadcast atomicAdd | done |
| 18 | Sub-byte / quantized completeness (alpha.35) — f16/bf16 activations for `GgufMmvqPlan` across all 11 block formats × contig + strided; 44 new FFI symbols | done |
| 19 | Fuel retirement asks (alpha.36) — pool/conv FFI facade for cuDNN-backed plans + Upsample Nearest2d + NEW im2col/im2col1d/col2im1d bespoke; vendored Fuel Q8_1 for inspection; 140 new FFI symbols. Surfaced 1.0-freeze prereq for broader library-backed FFI facade audit | done |
| 20 | MoE — Item 4 from Fuel retirement (alpha.37): batched MMVQ × N-experts (36 new FFI symbols across 11 GGUF block formats × 3 activation dtypes + 3 pure-FP); MoE absorb-and-expose proved to be a no-op (Fuel hadn't evolved their kernels since Phase 8.5 vendor; 5 baracuda-side symbols already match) + 2 direct-FFI smoke tests | done |
| 21 | Bilinear interpolate expansion (alpha.38): `align_corners` + scale-factor overrides + f16/bf16 fanout (FW+BW). Breaking change to existing f32/f64 signatures. | done |
| 22 | MMVQ ncols≥64 debug assertion + cuSOLVER FFI facade (alpha.39): 10 cuSOLVER-backed plan families (Cholesky, LU, QR+ormqr, SVD/SvdBatched/SvdaBatched, eigh real+complex, eig, lstsq, solve, inverse) wrapped behind ~50 flat C symbols in `baracuda-kernels-sys/src/cusolver_facade.rs`; closes the Phase 19 library-backed FFI facade gap for cuSOLVER. No feature gate (cuSOLVER ships with the CUDA toolkit). | done |
| 23 | cuFFT + cuRAND FFI facade (alpha.40): 6 cuFFT plan families (FFT 1d/Nd C2C, R2C, C2R) × c32/c64 + f32/f64 + 2 cuRAND families (Uniform, Normal) × f32/f64 = 32 flat C symbols in `baracuda-kernels-sys/src/{cufft,curand}_facade.rs`. cuSPARSE skipped — no baracuda-kernels plans wrap it today. | done |
| 24 | Cutlass GEMM re-export FFI facade (alpha.41): 210 trampolines (70 SKU families × {run, workspace_size, can_implement}) in `baracuda-kernels-sys/src/cutlass_reexport.rs` exposing the full Cutlass GEMM surface (fp16/bf16/tf32/f32_simt/f64/s8/u8 × {rcr, rrr} × {plain, bias, bias+relu/gelu/silu} + strided-batched fp16/bf16). cuTENSOR / NPP / CV-CUDA skipped — no baracuda-kernels plans wrap them. Completes the Phase 19 library-backed FFI facade 1.0-freeze prereq. | done |
| 25-26 | Segment/EmbeddingBag BW completion + BatchedOrmqrWy complex (alpha.42): 9 new Rust plans + 24 new FFI symbols for Segment Max/Min/Prod BW (sorted + unsorted, f32/f64), Unsorted Segment Prod FW (`atomicCAS`-retry mul), EmbeddingBag Max FW+BW (f32/f64/f16/bf16 × i32/i64). Plus BatchedOrmqrWy complex (Complex32, Complex64) via the bespoke WY-block kernels + cuBLAS C/Z gemmStridedBatched (4 new bespoke FFI + 2 cuBLAS symbols). | done |
| 27 | Q8_1 perf inspection (alpha.42 doc-only): Multi-M MMVQ opportunity identified, kept doc-only — bigger ROI than reformatting Q8_1. | done |
| 28 | API hygiene for 1.0 prep (alpha.43): new `KernelDtype` umbrella marker trait extending `Element`/`IntElement`/`FpElement`/`BinElement`; `#[non_exhaustive]` audit across the op-family `*Kind` enums + auxiliary tag enums + `Error` types. `ElementKind` / `LayoutSku` / `ArchSku` / `EpilogueKind` / `ActivationKind` / `Workspace` intentionally left exhaustive (hot-path-dispatched). | done |
| 29 | Cross-implementation benchmark suite (alpha.44): 10 new criterion+CUDA-event benches comparing baracuda against cuBLAS / cuDNN at LLM-typical shapes (GEMM f32/f16/bf16, MMVQ all qtypes, Softmax, LayerNorm, RMSNorm, Conv2d, MaxPool2d, Reductions, Elementwise, Flash SDPA+GQA). ~2,750 LOC of bench code + 13 bench binaries total. Critical finding: baracuda f16/bf16 GEMM is **2-4× slower than cuBLAS at M=1/M=32** (decode regime); validates the deferred Phase 27 multi-M MMVQ port. See [`BENCHMARKS.md`](crates/baracuda-kernels-bench/BENCHMARKS.md) for the methodology + sample run. | done |
| 30 | f16/bf16 GEMM cuBLAS fast-path (alpha.45): adds `PlanPreference::prefer_backend: Option<BackendKind>` + thread-local cuBLAS-handle cache to `GemmPlan`. Heuristic: cuBLAS for f16/bf16 at `2 ≤ M < 128` (decode batch); CUTLASS otherwise. **3× speedup at M=32 f16** (55.6µs → 19.0µs, parity with cuBLAS direct). M=1 stays on CUTLASS (cuBLAS RCR→col-major transa=T mapping slower than CUTLASS sm_80 GEMV-tile at K=N≥2048). Capture-mode auto-fallback to CUTLASS (cuBLAS-classic not capture-safe). 9 new smoke tests. | done |
| 31 | Fuel Phase 6c.2 storage.rs unblock (alpha.46): 5 gaps closed — ELU α parameter (breaking; 8 sigs modified), `powf` (8 new), `step` + `gelu_erf` (16 new), cast `u32`/`i16` (36 new × 2 directions), `reduce_sum_to`/`reduce_max_to` broadcast-reverse reductions (8 new). **~76 new/modified FFI symbols + 17 new smoke tests.** Unblocks Fuel's full PTX retirement (AFFINE/UNARY/BINARY/CAST/REDUCE/INDEXING/TERNARY/FILL/SORT modules). | done |
| 32 | Descriptor `#[non_exhaustive]` + builder pattern (alpha.47): 18 descriptors retrofitted with `::new()` builders + chainable setters (`with_stride`/`with_padding`/`with_dilation`/etc.). Conv {1,2,3}d + ConvTranspose {1,2,3}d + Pool {1,2,3}d + AdaptivePool {1,2,3}d + LpPool {1,2}d + FractionalMaxPool {2,3}d + Interpolate + InterpolateBackward. **Breaking change for downstream struct-literal callers** — pre-1.0 hardening. Migration: `Conv2dDescriptor { ... }` → `Conv2dDescriptor::new(input_shape, filter_shape, element).with_stride(...)`. | done |
| 33 | Multi-M MMVQ via Q8_1 staging (alpha.48): closes Phase 27's deferred opportunity. NEW `GgufMmvqMultiMPlan` + `quantize_q8_1` staging kernel + 4 Q8_0 multi-M launchers (M ∈ {1, 2, 4, 8}). **Bench: 7.29-7.96× speedup at M=8** on Llama-2 7B layer shapes (4096²; 11008×4096; 32000×4096). Q8_0 only this phase (clean partial); 9 remaining block formats (Q4_0/Q4_1/Q5_0/Q5_1/Q2_K..Q6_K) are mechanical fanout for a follow-up. 8 new FFI symbols (3 staging + 4 multi-M + 1 workspace). | done |
| 34 | Multi-M MMVQ block format fanout (alpha.49): 9 remaining GGUF formats shipped — Q4_0, Q4_1, Q5_0, Q5_1, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K. 36 new FFI symbols (9 fmts × 4 M-sizes). **Bench at N=K=4096 M=8**: Q5_0 **17.32×**, Q5_1 15.05×, Q4_0 12.78×, Q4_1 12.15×, Q8_0 8.79× — type-0/1 formats massively exceeded Phase 27's 3-7× target. K-quants (Q2_K..Q6_K) hit 3-7× at M=8 (larger 256-elem super-blocks dilute weight-reuse savings). Q8_K MMVQ correctly rejected at select() — bespoke per Phase 11.4. | done |
| 35 | Test-infra hardening (alpha.50): **first zero-failure regression** in the entire Phase 22-35 sweep (2229/0 across 638 binaries). Five fixes: (a) `mmvq_w_offset_alignment_misaligned_rejected_debug` `#[cfg(debug_assertions)]` gate; (b) cuBLAS handle retry with 5× linear backoff (Phase 30 parallel-init race); (c) cuDNN handle retry on CTC path (1001 NOT_INITIALIZED race); (d) `Stream::capture` panic-safe Drop guard (ThreadLocal capture state leak under cargo's thread reuse → cudaErrorStreamCaptureImplicit on subsequent tests); (e) **`cudaResourceDesc` 48→128 byte expansion + `repr(align(8))`** (Rust struct under-allocated by 16+ bytes AND missing 8-byte alignment that the union's `void*`/`size_t` arms require — caused release-only STATUS_ACCESS_VIOLATION in wave5_smoke). | done |
| 36 | Fuel 6c.4 unblock — Phase 1/3 (alpha.51): RoPE apply with precomputed cos/sin tables (FW+BW × 4 fp dtypes; 16 symbols) + Fill missing dtypes & strided variant (3 new contig + 11 strided; 28 symbols) + Argsort dtype fanout (u8/i8/u32/i16/bf16/f16/fp8e4m3; 14 symbols). 58 new FFI declarations total. | done |
| 37 | Fuel 6c.4 part 2/4 (alpha.52): Reduce family Gap 1 — `reduce_min_to`/`prod_to` broadcast-reverse for 4 fp dtypes (16 symbols) + integer-dtype single-axis sum/min/max/prod + argmin/argmax for U8/I8/U32/I16/I32/I64 (48 symbols, with U64/I64 widened accumulator + store-time narrow on Sum/Prod). 64 new FFI declarations total. Documented bit-exact wrap-on-overflow contract for u8/u32 sum/prod. | done |
| 38 | Fuel 6c.4 part 3/4 (alpha.53): Ternary `where_cond` dtype-matrix fanout — Cond lifted to template parameter; U8 (existing, untouched) + U32 + I64 cond × {f32/f64/f16/bf16, u8/i8/u32/i16/i32/i64, fp8e4m3} value × {contig, strided}. 87 new FFI declarations (58 `_run` + 29 `_can_implement`). Existing `where_<value>_run` family preserved bit-identically (default Cond=uint8_t). | done |
| 39 | Fuel 6c.4 part 4/4 (alpha.53, bundled with Phase 38): Indexing Tier 1 — NEW scatter (pure assign) + index_add for {f32/f64/f16/bf16} × {i32, i64idx} (16 syms) + gather u8idx extras for {f32, f64} (2 syms). 18 new FFI symbols total. Existing per-axis stride arrays meant no separate contig/strided split needed. f16/bf16 index_add uses the Phase 11.3 `atomic::add<T>` atomicCAS helper. Scatter documented + tested with disjoint-target indices (last-writer-wins on collisions, caller-aware non-determinism). | done |
| 40 | Fuel 6c.4 final cleanup (alpha.54): multi-block radix argsort via CUB `DeviceSegmentedRadixSort` for `row_len > 1024` (4 dtypes × 3 entries = 12 syms; bitonic stays for ≤1024) + Indexing Tier 2 integer value-dtype matrix (gather/index_select/scatter for u8/i8/u16/i16/u32/i32/i64 × i32/i64idx = 38 syms; index_add for i32/u32/i64 only = 6 syms). 56 new C symbols total. New `atomic::add<int64_t>` specialization via `unsigned long long*` reinterpret. Tier 3 (fp8e4m3 + sub-32-bit ints for index_add) deferred — no concrete caller. | done |
| 41 | Fuel 6c.5 final unblock (alpha.55): RoPE interleaved-pair (Gap 7) + RoPE THD-layout (Gap 8) variants. 28 new FFI symbols (FW+BW × 4 fp dtypes × 2 variants + `_can_implement` companions). **Closes the entire Fuel 6c.4/6c.5 batch ask** — Fuel can now drop the last `Id::Reduce` PTX module + retire `fuel-cuda-kernels` workspace member + drop the `cudaforge` build dep. Discovery: existing `rope_apply_*` was already using `(2k, 2k+1)` pairing (not `(i, i+d/2)` as the brief stated) → interleaved symbols are name-aliases on the same kernel; THD is genuinely new. | done |
| 42 | Flash Attention v2 vendor + `FlashSdpaPlan` backend (alpha.56): Tri Dao's FA2 v2.8.3 (BSD-3) vendored under `crates/baracuda-kernels-sys/vendor/flash-attention/` — Tier 1 (head_dim=128, fp16+bf16, sm_80, FW only) — wired as `BackendKind::FlashAttentionV2` on `FlashSdpaPlan` behind the `fa2` cargo feature. Heuristic routes long-context (seq_q×seq_k ≥ 1024×1024) shapes to FA2, bespoke otherwise; `PlanPreference::prefer_backend` overrides. PyTorch shim headers (at::PhiloxCudaState + C10_CUDA_CHECK) decouple the vendor from torch deps. Tier 2 (BW, varlen, paged, other head_dims) deferred. | done |
| 43 | mHC.cu vendor + `HyperConnectionPlan` family (alpha.56): DeepSeek-AI's Manifold-Constrained Hyper-Connections residual-mixing op (arXiv:2512.24880) from AndreSlavescu/mHC.cu (MIT) vendored under `crates/baracuda-kernels-sys/vendor/mhc/` — Tier 1 (static-H, bf16 only) — exposed as `HyperConnectionPlan` behind the `mhc` cargo feature. Replaces bare `y = x + sublayer(x)` residual with a learned `n×n` Sinkhorn-Knopp doubly-stochastic mixing matrix. Tier 2 (BW, dynamic-H, fp16/f32) deferred. Requires cuBLAS-Lt (already linked). | done |
| 44 | ozIMMU FP64-via-Int8-TC backend (alpha.56): enp1s0/ozIMMU (MIT) — Ootomo/Ozaki/Yokota's Ozaki-scheme DGEMM that synthesizes FP64 from S² int8 tensor-core matmuls — vendored under `crates/baracuda-ozimmu-sys/vendor/ozimmu/` with `cutf` submodule pinned alongside. NEW `baracuda-ozimmu-sys` + `baracuda-ozimmu` sibling crates. Wired into `GemmPlan` f64 path as opt-in `BackendKind::Ozaki { slices }` (default stays on CUTLASS/cuBLAS DGEMM — Ozaki is NOT bit-equivalent). Two patches: direct-link mode (no LD_PRELOAD), exclude `cublas.cu`/`culip.cu`. | done |
| 46 | FlashInfer cherry-pick — paged-KV decode + sort-free sampling + cascade attention (alpha.57 Checkpoint A, **closed in the alpha.58 consolidation pass**): surgical extraction of three FlashInfer kernel families (Apache-2.0, v0.6.12, commit `eee0d75f`) vendored under `crates/baracuda-kernels-sys/vendor/flashinfer/` (~12 kLOC across 25 headers, no wholesale wrap). NEW plan families: `BatchPagedDecodePlan` + `PagedKvAppendPlan` (vLLM-style paged KV cache decode), `TopKTopPSamplingPlan` (sort-free TopK/TopP/MinP/combined samplers), `CascadeAttentionPlan` (LSE-merge for prefix-cache sharing). NEW `BackendKind::FlashInfer` + `RandomKind::Multinomial` discriminants. NEW `flashinfer` cargo feature on both `baracuda-kernels-sys` and `baracuda-kernels` (default OFF). 7 MSVC-portability patches to vendored headers (see `vendor/flashinfer/VENDOR.md`). **Checkpoint B (alpha.58 consolidation)**: `flashinfer_paged_decode_launcher.cu` now compiles cleanly under MSVC nvcc — root cause was `std::max(unsigned long, size_t)` type mismatch inside `decode.cuh` (the earlier hypothesis about `cudaLaunchKernel_ptsz` was incorrect). Patched via `static_cast<size_t>(...)` on both arguments; launcher TU also carries a defensive `cudaLaunchKernel` shim macro. All 4 launchers now build under the `flashinfer` feature. | done |
| 44b | ozIMMU clean-fork + cutf elimination + Windows port (alpha.57): full internalize of ozIMMU sources (no longer vendored — we own them at `crates/baracuda-ozimmu-sys/cuda/`). `cutf` submodule eliminated entirely (upstream went offline); ~360 LOC of useful FP / cp_async utilities preserved as native `baracuda_fp_bits.cuh` + `baracuda_cp_async.cuh`; ~2,200 LOC of cutf duplicates deleted. Portable `baracuda::Uint128` replaces `__uint128_t` for Windows compile (typedef alias on Linux — bit-for-bit preservation). LD_PRELOAD path removed entirely. Linux + Windows both build clean. | done |
| 44c | ozIMMU RIKEN-RCCS perf-enhancement variants (alpha.57, no version bump): folds in `accelerator_for_ozIMMU` (Uchino/Ozaki/Imamura 2024, arXiv:2409.13313) — three new variants `EF` (group-wise error-free summation; chains int8 cublasGemmEx with `beta_i=1` to delay int32→f64 materialization), `RN` (nearest-rounding `(a+t)-t` split; ~2 extra effective bits per slice), `H` (= EF + RN), plus n-blocking (chunk `n > 12288` into 8192-wide pieces on the int8 GEMM call). Variant selected via `BackendKind::Ozaki { slices }` discriminant's high-3-bits field; `ozaki_slices::{base,ef,rn,h}(s)` helper constructors in `baracuda-kernels-types::sku`. NEW `OzakiVariant` enum + `Handle::dgemm_with_variant` on `baracuda-ozimmu`. Source-compatible with Phase 44b callers (`slices: 8` decodes as Base/S=8). **Discovered + fixed a pre-existing Phase 44b MSVC bug** in `axby` / `axy_complex`: upstream's `(1l << 44)` overflows on Windows (where `long` is 32-bit, LLP64) → silent `inf` output. Fixed by switching to `static_cast<double>(1ull << 44)`. 9 new accuracy/variant/n-blocking smoke tests, all green on RTX 4070; the pre-existing Phase 44b accuracy_smoke tests (4 cases) also unbreak. | done |
| 47 | Fused Linear Cross-Entropy (alpha.56, single-kernel port from LinkedIn's Liger-Kernel BSD-2): NEW `FusedLinearCrossEntropyPlan` family that fuses lm_head GEMM + CE loss in a chunked outer loop, never materializing the `[BT, V]` logits tensor. At BT=16K, V=128K, bf16 (Llama-3-class) saves **5-10 GiB of activation memory**. Bespoke per-chunk fused softmax+CE+gradient kernel (FP32 accumulator across 4 fp dtypes — f16/bf16/f32/f64); GEMMs dispatched via `cublasGemmEx`. Backward produces `grad_input`+`grad_weight` during the FW pass (chunked loop); BW call just scales by `dy_scalar` (no-op when `dy=1.0`, the typical "CE is the last layer" case). 16 new bespoke FFI symbols (per_row + per_row_cast + scalar_finalize + inplace_scale, each × 4 dtypes) + 1 count-non-ignore helper + `cublasGemmEx` binding. NEW `LossKind::FusedLinearCrossEntropy` variant. **Algorithm credit**: LinkedIn Liger-Kernel (BSD-2-Clause, clean-room CUDA reimplementation — no source vendored). | done |
| 45 | SmoothQuant compose + YaRN/LongRoPE Rust helper (alpha.56, no version bump — consolidation phase will bump): **two zero-new-CUDA pure-Rust additions**. (a) `SmoothQuantLinearPlan<TIn, TWQ>` (in `crates/baracuda-kernels/src/quantize/smoothquant.rs`) composes the existing Phase 8.3 `quantized_linear_w8a8` kernel + `fill_<dt>` broadcast for the per-tensor activation scale. Caller supplies pre-smoothed-and-quantized int8 activations + int8 weights (smoothing itself is offline Python per the SmoothQuant paper — mit-han-lab/smoothquant MIT, Xiao et al. ICML 2023; not in scope). (b) `RopeScaledTableBuilder` + `RopeScaling` enum (Linear / YaRN / LongRoPE, in `crates/baracuda-kernels/src/attention/rope_scaling.rs`) — host-side cos/sin table builder feeding the Phase 36 `rope_apply_<dt>_run` kernel. YaRN (jquesnelle/yarn MIT, Peng et al. arXiv:2309.00071) implements §3.2 NTK-by-parts frequency interpolation + §3.3 attention-temperature absorption into cos/sin. LongRoPE (microsoft/LongRoPE MIT, Ding et al. arXiv:2402.13753) multiplies inv-freq by caller-supplied per-dim factors (evolutionary search itself is offline + out of scope). Existing Phase 36 `RopeApply*` types source-compat preserved. | done |
| 51 | Arbitrary-mask `FlashSdpaPlan` + spec-decode composition doc (alpha.57, no version bump — consolidation phase will bump): NEW optional `mask: TensorRef<f32, 4>` field on `FlashSdpaArgs` routing to a bespoke arbmask SDPA kernel that adds an f32 `[B, H, Q, K]` additive bias to `S = Q·K^T·scale` before softmax. Unlocks spec-decode tree masks (EAGLE / Medusa / lookahead), MoE expert masking, prefix-LM, sliding-window with attention sinks — all entirely from caller-side composition. 4 dtypes (f32/f16/bf16/f64) × `_run` + `_can_implement` = 8 new FFI symbols. `is_causal` composes with the mask correctly (`-INF + finite == -INF`). New header `baracuda_attn_arbmask.cuh` reuses Phase 6.6's online-softmax tile pipeline; 1 new .cu instantiation file. FA2 vendor untouched (FA2 v2.8.3's `Mask` template has no arbitrary-mask hook). Runnable example at `crates/baracuda-kernels/examples/speculative_decode_compose.rs`; design doc at [`docs/guides/spec-decode.md`](docs/guides/spec-decode.md). FW only; BW deferred. | done |
| 50 | Mamba-2 SSD chunk-scan + Dao-AILab causal-conv1d (alpha.57, gated behind `mamba` cargo feature): **opens the state-space LLM class (Mamba-2 8B, Codestral-Mamba, Falcon-Mamba, Zamba2 — Mamba-1 selective_scan deferred to Phase 50b).** NEW `SsdChunkScanPlan` + `SsdChunkScanBackwardPlan` (lives under `attention` because of the SSD-as-attention duality) and `CausalConv1dPlan` + `CausalConv1dBackwardPlan` (top-level module — bespoke kernels, no cuDNN dep). Vendor attribution + LICENSE at `crates/baracuda-kernels-sys/vendor/causal-conv1d/` (Tri Dao, BSD-3) and `crates/baracuda-kernels-sys/vendor/mamba/` (state-spaces/mamba, Apache-2.0). Hand-port of the upstream Triton SSD reference + causal-conv1d primitive. **Dtypes**: causal-conv1d f32/f16/bf16/f64 × widths 2/3/4 × {SiLU, identity}; SSD f32/f16/bf16 (no f64 upstream). FW caps state at D,N ≤ 256; BW tighter at 64 (SMEM budget). 30 new FFI symbols (8 causal-conv1d FW + 8 BW + 6 SSD FW + 6 SSD BW + 2 can_implement extras). 5 new smoke tests (causal_conv1d_smoke/bw + ssd_chunk_scan_smoke/bw + mamba2_block_smoke). | done |
| 50b | Mamba-1 `selective_scan` (alpha.57, gated behind the same `mamba` cargo feature as Phase 50): **completes the state-space LLM coverage by adding the original Mamba-1 op family that powers Mamba-7B, Falcon-Mamba, and Codestral-Mamba** — Phase 50's SSD covers Mamba-2 / Codestral-Mamba / Falcon-Mamba / Zamba2, but every Mamba-1-shipping model still uses v1's `selective_scan`, not v2's SSD reformulation. NEW `SelectiveScanPlan` + `SelectiveScanBackwardPlan` (sibling to `SsdChunkScanPlan` under `attention/`). Shape `(B, L, D, N)` with the full Mamba-1 surface: optional `D[d]` skip, optional SiLU-gated `z[t, d]` tail, optional `delta_bias[d]` + optional `softplus(delta)` mapping (all 9 args of upstream `selective_scan_fn` wired). Dtypes f32/f16/bf16 (complex deferred — no shipping LLM uses it). Hand-port of `state-spaces/mamba`'s `csrc/selective_scan/` under Apache-2.0; same `vendor/mamba/` directory as Phase 50 (VENDOR.md updated, no new LICENSE file). FW caps state at `N ≤ 256`; BW uses two-pass record-then-reverse with `B*D*L*N*sizeof(T)` workspace. NEW `AttentionKind::SelectiveScan = 8` variant (`#[non_exhaustive]` so source-compat). 17 new FFI symbols (3 FW + 3 FW-can-impl + 3 BW + 1 workspace-bytes + module-internal launchers). 3 new smoke tests (selective_scan_smoke covering 4 option-combinations + f16/bf16 loose-tol, selective_scan_bw_smoke with FD checks on du/ddelta/dA + topology rejection, mamba1_block_smoke end-to-end). | done |
| 52 | NCCL foundation crate pair (alpha.57, no version bump — consolidation phase will bump): `baracuda-nccl-sys` (raw FFI types + libloading lazy-resolve, NO bindgen / NO link-time dep) + `baracuda-nccl` (safe `Communicator` with full collective surface — `all_reduce` / `reduce` / `reduce_scatter` / `all_gather` / `broadcast` / `send` / `recv` + group API + `NcclMem` + custom `pre_mul_sum` reduction op + `register` / `deregister` for zero-copy). **The distributed-roadmap prerequisite** for Ring Attention, distributed MoE, Megatron-LM tensor parallelism, FSDP-style shard collectives — Phase 52 only ships the substrate; consumer plans land in Phase 53+. Spec-named API: `Communicator::new_single_gpu` / `new_with_id`, cached infallible `rank()` / `world_size()`, `NcclReduceOp` / `NcclUniqueId` / `NcclDataType` aliases, `NcclUniqueId::generate()`. Linux-primary (NCCL ships with the CUDA toolkit there); Windows builds clean and defers the "is NCCL installed?" question to first `nccl()` call (loader fails with `LoaderError::LibraryNotFound`). 20 new smoke tests (10 dtype mapping — runs on every host; 10 `#[ignore]` NCCL-required). No baracuda-kernels integration in this phase. | done |
| 49 | Apex optimizer subset (alpha.57, gated behind `optim` cargo feature): **deliberate scope expansion — training-framework-adjacent.** NEW sibling crate `baracuda-optim` (~600 LOC Rust + ~750 LOC CUDA) vendoring the NVIDIA Apex (BSD-3-Clause) `multi_tensor_apply` idiom + fused Adam / LAMB / SGD functors. Single launch over thousands of parameter tensors (Apex `MAX_TENSORS_PER_LAUNCH = 110` per batch, multi-launch transparent) — eliminates the ~10,000-launch optimizer step overhead on 32B-param models. Plans: `AdamStepPlan<T>` (f32/f16/bf16 + AdamW mode), `LambStepPlan` (f32; two-stage with atomicAdd-fused L2-norm + sqrt + trust-ratio scaling), `SgdStepPlan<T>` (f32/f16/bf16 + momentum + Nesterov + weight-decay). Inference-only consumers (e.g. Fuel) don't pay the FFI surface cost — the vendored sources only build / link when the feature is enabled. Re-exported under `baracuda_kernels::optim` when enabled. **Measured 41× speedup at 1000-tensor multi-tensor Adam vs 1000 individual launches on RTX 4070** (0.173 ms vs 7.096 ms; smoke test in `crates/baracuda-optim/tests/multi_tensor_dispatch_smoke.rs`). 4 smoke tests, 6 GPU tests total, all green. | done |
| 53 | bitsandbytes NF4 dequant + GEMV vendor — QLoRA inference (alpha.57, gated behind `bnb_nf4` cargo feature): **opens the QLoRA-trained Llama / Mistral / Qwen inference class** by vendoring the bitsandbytes (Dettmers et al. arXiv:2305.14314, MIT) NF4 (NormalFloat 4-bit) dequant + GEMV kernels. NF4 is the dominant 4-bit format for QLoRA-trained prebuilts on the HuggingFace Hub — **distinct from GGUF Q4_0** (symmetric int4*scale, llama.cpp, Phase 8) and AWQ int4 (asymmetric int4 + zero-points). NF4 uses a 16-entry **non-uniform quantile codebook** derived from the inverse CDF of `N(0, 1)` — dequant is a 16-entry lookup, not arithmetic; better accuracy than symmetric int4 for normally-distributed weights. NEW plan trio: `Nf4DequantizePlan<T>` (bulk unpack `[N/2, K]` u8 → `[N, K]` T), `Nf4MmvqPlan<T>` (M=1 single-vector decode GEMV), `Nf4MmvqMultiMPlan<T>` (M ∈ {1, 2, 4, 8} batched-decode with weight gmem reuse, Phase 33 pattern applied to NF4). 11 new FFI symbols (3 dequant + 2 M=1 + 6 multi-M). Pack layout matches bitsandbytes upstream `Linear4bit`: pair-packed nibbles in `[N/2, K]` u8 (N must be even) + `[N * (K/block_size)]` f32 per-block absmax (block_size typically 64). Activation/output dtypes f16+bf16 (PyTorch convention); f32 accumulator. Codebook reproduced bit-identical to upstream as device-side switch + host-side `NF4_CODEBOOK: [f32; 16]` const + `nf4_pack_weight` host helper. Vendor metadata at `crates/baracuda-kernels-sys/vendor/bitsandbytes/{LICENSE,AUTHORS,VENDOR.md}`. 3 smoke test files (dequant roundtrip, M=1 GEMV f16+bf16, multi-M f16 vs M=1-looped). Out of scope: 8-bit optimizers (Phase 49 overlap), LLM.int8 (Phase 45 obsoletes), FP4 (different codebook — separate phase if asked), double quantization (Tier 2). | done |
| 54 | xFormers BlockSparseAttention + 2:4 sparse GEMM (alpha.57, no version bump; clean-room hand-port of facebookresearch/xformers BSD-3-Clause algorithmic reference): NEW `SdpaBlockSparsePlan` (`xformers_blocksparse` cargo feature) — block-sparse SDPA FW where the attention mask is a per-block boolean pattern `[B, H, num_blocks_q × num_blocks_k]` (uint8); only the active (q_block, k_block) pairs participate in the QK^T matmul + online-softmax accumulation. Different from Phase 51's arbitrary additive-mask path (which still computes every cell) — block-sparse actually SKIPS compute on masked blocks → real wall-clock speedup on long-context attention with known sparse patterns. NEW `GemmSparse24Plan` (`xformers_sparse24` cargo feature) — 2:4 structured sparsity GEMM consuming pre-compressed `[M, K/2]` weights + `[M, K/8]` u16 metadata. **Tier-1 implementation**: inflate-then-dense reference matmul (correctness first; sparse-tensor-core `mma.sp.sync` / cuSPARSELt backend deferred to Tier 2). 16 new FFI symbols (8 block-sparse SDPA × 4 dtypes × 2 entries; 11 sparse24 × 3 dtypes × 4 entries with workspace_bytes helper). NEW `AttentionKind::BlockSparseAttention = 9` variant (`#[non_exhaustive]` so source-compat). 3 smoke tests (block-sparse all-ones-matches-dense + diagonal-band + empty-pattern; sparse24 matches host reference + K-rejection + throughput timing). Vendor attribution at `crates/baracuda-kernels-sys/vendor/xformers/` (no upstream sources vendored verbatim — algorithmic reference only). NOT vendored from xFormers: memory-efficient attention (overlaps with FA2 vendor); fused biases / RoPE / norm (overlaps with existing baracuda phases); Triton kernel paths (no Triton toolchain). | done (Tier 1) |
| 55 | TransformerEngine FP8 cast + delayed-scaling recipe (alpha.57, gated behind `tensor_engine` cargo feature; clean-room hand-port of NVIDIA TransformerEngine Apache-2.0 algorithm — only the cast + recipe subset). NEW sibling crate pair `baracuda-transformer-engine-sys` + `baracuda-transformer-engine`. The differentiated value of TE is the **per-tensor delayed-scaling recipe with amax history** for stable FP8 training; that's the load-bearing piece this phase ships. Public API: `Fp8Recipe` (RAII handle holding amax_history ring + scale + scale_inv device scalars), `Fp8CastPlan<TIn>` (fused FP8 cast + `max(\|x\|)` amax reduction in one kernel — atomicMax into `amax_history[write_pos]`), `Fp8DequantPlan<TOut>` (symmetric dequant via `scale_inv`). Both formats: E4M3 (max=448) for fwd/weights, E5M2 (max=57344) for grads. Wide dtypes: f32/f16/bf16. 4 new C-ABI symbols (`baracuda_te_fused_cast_amax_run` / `baracuda_te_dequant_run` / `baracuda_te_recipe_update_run` / `baracuda_te_recipe_init_run`) + format/dtype id helpers. **NO cuDNN dep** (cast/recipe paths don't need it — cuDNN is only needed for `fused_attn`, which we skip). **NO pybind11** (raw C ABI, not Python). 10 GPU smoke tests, all green on RTX 4070. **Sm_89 reality check**: FP8 storage + cast intrinsics work natively on Ada, but tensor-core FP8 MMA throughput equals BF16 — so the wins here are bandwidth-saving (KV cache, weight storage, activation memory) not compute. Recipe machinery is forward-compatible with Hopper (sm_90a) / Blackwell (sm_100) where the MMA throughput win also materializes. Deliberately NOT lifted: `normalization` (Phase 5), `fused_rope` (Phase 14/36/41), `fused_attn` (Phase 17/42; cuDNN dep), `fused_softmax` (Phase 5), `activation` (Phase 3/31), `gemm` (Phase 1+24+30), `comm_gemm_overlap` (Hopper TMA), `fused_router` (Phase 8+20), `hadamard`/`newton_schulz`/`swizzle`/`permutation` (niche), `multi_tensor` (Phase 49), `dropout` (composable), Python bindings (out of scope). Algorithmic reference: `transformer_engine/common/{cast,recipe}/*.cu` upstream + FP8 spec Micikevicius et al. 2022 (arXiv:2209.05433). Vendor attribution + full Apache-2.0 text at `crates/baracuda-transformer-engine-sys/ATTRIBUTION.md`. | done |
| 57 | Megatron-LM tensor-parallel primitives (alpha.57, no version bump; gated behind `megatron_tp` cargo feature). **NEW sibling crate `baracuda-megatron`** — pure-composition over `baracuda-cublas` (local GEMM via `cublasSgemm` for f32, `cublasGemmEx` with `Compute32F` accumulator for f16/bf16) + `baracuda-nccl` (cross-rank `all_gather` / `all_reduce` collectives). **NO new CUDA kernels** — foundational TP primitives for Megatron-style models are pure orchestration; the kernel substrate already exists in baracuda. NEW plans: `ColumnParallelLinearPlan<T>` (splits W along output dim; FW `Y_local = X @ W_local^T` + `all_gather`; BW `dX_partial = dY_local @ W_local` + `all_reduce(Sum)`, `dW_local = dY_local^T @ X` local) and `RowParallelLinearPlan<T>` (splits W along input dim; FW `Y_partial = X_local @ W_local^T` + `all_reduce(Sum)`; BW `dX_local = dY @ W_local` local, `dW_local = dY^T @ X_local` local — **no BW collective**, the Megatron pairing only needs one collective per layer-pair). NEW `TensorParallelContext` borrow type holding `&Communicator` + `in_features` / `out_features` / cached `rank` / `world_size`. Dtypes: f32 always; f16 + bf16 behind the crate-level `half-crate` feature (which the kernel-facade `megatron_tp` feature pulls in). Tier 1 scope — bias rejected at call site with a Tier-2 marker error (caller can Affine-add post-FW; matters for RowParallel where the bias must be added **after** the all_reduce so it doesn't get summed N times). 5 smoke tests across 3 files: `column_parallel_smoke` (FW + BW, single-rank, matches CPU `Linear` ref), `row_parallel_smoke` (same), `multi_rank_scaffold` (`#[ignore]`-gated 2-GPU scaffold — exits cleanly on single-GPU dev boxes). Algorithmic reference: Shoeybi et al. arXiv:1909.08053 (NVIDIA Megatron-LM, Apache-2.0); no source vendored. Out of scope: async overlap (Hopper TMA); sequence parallelism (Phase 56's domain); pipeline parallelism (future phase); VocabParallelEmbedding (future polish); distributed gradient accumulation (Phase 58's domain); expert parallelism (separate phase). | done |
| 58 | DistributedAdam — ZeRO-1-style sharded optimizer state (alpha.57, no version bump; gated behind the new `distributed_optim` cargo feature on `baracuda-optim`, pulls `baracuda-nccl` as optional dep). **Pure-Rust composition** over Phase 49 [`AdamStepPlan`] + Phase 52 NCCL collectives — **NO new CUDA kernels**, **NO new `baracuda-kernels-sys` FFI**. NEW `DistributedAdamStepPlan<T>` wrapping the inner Adam plan + a borrowed `&Communicator`; orchestrates the canonical ZeRO-1 protocol: `all_reduce(grads, Sum, in-place)` → local Adam step on this rank's `1/world_size` shard → `all_gather(updated_params, in-place)`. **Single-rank degenerate case** (`world_size == 1`) elides both collectives and reduces to `AdamStepPlan::step` bit-exactly (smoke test verifies this on single-GPU dev hardware). f32 + f16 + bf16 dtypes, AdamW + classic mode, mixed-precision `step_with_f32_state` variant. NEW `shard_range(n, rank, world_size)` helper matching `torch.chunk` semantics. Phase 58 constraint: tensor element counts must be `world_size`-multiples (ring all_gather symmetry); per-tensor broadcast fallback for ragged shards is future work. Out of scope: ZeRO-2 (gradient sharding); ZeRO-3 (parameter sharding during FW/BW); DistributedLamb / DistributedSGD (same pattern, defer until concrete demand); CPU-offload optimizer state; 8-bit distributed optimizer state. 3 smoke test files (4 pure-Rust shard_range tests run unconditionally; 2 single-rank GPU smokes `#[ignore]`-gated for NCCL; 1 multi-rank scaffold `#[ignore]`-gated for 2+ GPU validation). Algorithmic reference: Rajbhandari et al. SC20 "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models", Microsoft DeepSpeed (Apache-2.0; no source vendored — pure Rust composition). | done |
| 56 | Ring Attention — sequence-parallel attention (alpha.57, no version bump — consolidation phase will bump; gated behind the new `ring_attention` cargo feature, pulls `baracuda-nccl` + `baracuda-nccl-sys` as optional deps). **First Phase 52 NCCL consumer** — proves the substrate. Hand-port of Liu/Yan/Abbeel 2023 (arXiv:2310.01889; algorithmic reference at `https://github.com/lhao499/RingAttention`, Apache-2.0 — no JAX source vendored, clean-room CUDA implementation). NEW `RingAttentionPlan<T>` + `RingAttentionDescriptor` + `RingAttentionArgs` in `crates/baracuda-kernels/src/attention/ring_attention.rs`; bespoke `kernels/attention/ring_attention_kernel.cu` (~480 LOC kernel + ~390 LOC plan). Per-rank online-softmax fold of resident K/V chunk into persistent `(o_acc, m_acc, l_acc)` f32 state; ring rotation via `comm.send`/`recv` inside `group_start`/`group_end`; finalize kernel emits `y = o_acc / l_acc` (+ optional `lse`). **Tier 1 scope**: f16/bf16 (f32/f64 deferred), `head_dim == 128`, FW only (BW Tier 2), no GQA broadcast, no arbitrary additive mask. Causal masking applied on **global** indices (each step kernel takes `q_global_base` + `k_global_base` so masking is consistent across rotation steps). 12 new FFI symbols (`workspace_bytes` + dtype-independent `init_run` + 5 per-dtype × 2 dtypes: `step_run` / `step_can_implement` / `finalize_run` / `finalize_can_implement`). Unlocks **million-token context length** across N GPUs with O(N/P) memory where N = total seq len, P = ring size. Complementary to Phase 57's tensor-parallelism (sequence-dim sharding vs head-dim sharding compose). 4 smoke tests (3 single-rank degenerate cases validating against `FlashSdpaPlan` ground truth: f16 + bf16 + f16 causal — all pass on RTX 4070; 1 multi-rank scaffold `#[ignore]`-gated for 2+ GPU validation). Single-rank `world_size == 1` reduces to standard FlashAttention math (the validation path on single-GPU hardware). | done |
| 48 | Marlin + AWQ 4-bit GEMM vendor + GPTQ→Marlin repack utility (alpha.57, no version bump — consolidation phase will bump; gated behind the new `marlin` + `awq` cargo features on `baracuda-kernels-sys` and `baracuda-kernels`). **Two complementary 4-bit GEMM vendors** completing the "4-bit hub coverage" started in Phase 53 (NF4). **Marlin** (IST-DASLab, Apache-2.0 + §3 patent grant, vendored at `crates/baracuda-kernels-sys/vendor/marlin/`) — state-of-the-art W4A16 GEMM for the decode-batch regime, ~3.87× speedup over FP16 GEMM at M ∈ [1, 32] on Ampere / Ada per the paper. **Symmetric** int4 (zero-point fused into dequant as `q - 8`); group size 128 or per-channel; sm_80/86/89 only (sm_90 needs WGMMA rewrite — Marlin v2 territory, deferred). NEW `Int4MarlinGemmPlan<f16>`. **AWQ** (mit-han-lab, MIT — no patent grant, vendored at `crates/baracuda-kernels-sys/vendor/awq/`) — natively supports the **most-deployed 4-bit format on the Hugging Face Hub** (Llama / Mistral / Qwen `*-AWQ`). **Asymmetric** int4 with explicit per-group zero-points; group size 64 or 128; loads directly from HF checkpoints without repack. NEW `Int4AwqGemmPlan<f16>`. **GPTQ→Marlin repack utility** — pure-Rust host-side `gptq_to_marlin_repack` bridging GPTQ asymmetric checkpoints into Marlin's symmetric layout via zero-point absorption (trailblazer implementation; act_order=True deferred, the upstream Marlin intra-fragment permutation table is documented but uses identity permutation in the trailblazer). 4 new FFI symbols total (2 Marlin: `_run` + `_can_implement`; 4 AWQ: `_run` + `_workspace_bytes` + `_can_implement` + dequant stub). AWQ vendor source patched to strip the upstream `<torch/extension.h>` host wrapper (`__asm__ __volatile__` → `asm volatile` for MSVC nvcc portability) and re-export only the device-side `__global__` template kernel. Marlin needs `--expt-relaxed-constexpr` (constexpr `ceildiv` called from `__global__`). Both kernels build clean on RTX 4070 with the gated features. 3 smoke test files (marlin_smoke `#[ignore]` GPU + descriptor validation; awq_smoke `#[ignore]` GPU + descriptor validation; gptq_to_marlin_smoke pure-Rust roundtrip + zp-fold verification + clamp-at-extremes). | done |
| 46+ | Phase 46-51 mainstream-techniques roadmap (FlashInfer cherry-pick, Marlin/AWQ); Hopper sm_90a / Blackwell sm_100; 1.0 freeze. | pending (see [`ROADMAP.md`](ROADMAP.md)) |

API stability is **not** promised before beta.0. Breaking changes ship in
each alpha bump and are documented in the workspace `CHANGELOG.md`.

## Quick start

Add the kernel facade and the driver crate:

```toml
[dependencies]
baracuda-kernels = { version = "0.0.1-alpha.64", features = ["sm89", "cudnn"] }
baracuda-driver  = "0.0.1-alpha.64"
```

A representative example — single-axis numerically stable softmax over a
device-resident tensor:

```rust,no_run
use baracuda_driver::{Context, Device, DeviceBuffer, Stream};
use baracuda_kernels::{
    PlanPreference, SoftmaxArgs, SoftmaxDescriptor, SoftmaxKind, SoftmaxPlan,
    TensorMut, TensorRef, Workspace,
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // 1. Standard CUDA bring-up via baracuda-driver.
    let ctx = Context::new(&Device::get(0)?)?;
    let stream = Stream::new(&ctx)?;

    // 2. Allocate device input + output buffers (rank-2: rows × cols).
    let rows = 32i32;
    let cols = 1024i32;
    let n_elems = (rows * cols) as usize;
    let dev_x: DeviceBuffer<f32> = DeviceBuffer::zeros(&ctx, n_elems)?;
    let mut dev_y: DeviceBuffer<f32> = DeviceBuffer::zeros(&ctx, n_elems)?;

    // 3. Build the descriptor — pure shape + dtype + op-kind, no handles.
    let desc = SoftmaxDescriptor::<2> {
        kind: SoftmaxKind::Softmax,
        input_shape: [rows, cols],
        softmax_axis: 1,
        element: <f32 as baracuda_kernels::KernelDtype>::KIND,
    };

    // 4. Plan selection — picks a kernel SKU (bespoke softmax kernel here).
    let plan = SoftmaxPlan::<f32, 2>::select(&stream, &desc, PlanPreference::default())?;

    // 5. Args carry the per-call tensor handles + strides.
    let args = SoftmaxArgs {
        x: TensorRef { data: dev_x.as_slice(), shape: [rows, cols], stride: [cols as i64, 1] },
        y: TensorMut { data: dev_y.as_slice_mut(), shape: [rows, cols], stride: [cols as i64, 1] },
    };

    // 6. Launch. Workspace::None for plans that need no scratch.
    plan.run(&stream, Workspace::None, args)?;
    stream.synchronize()?;
    Ok(())
}
```

The same `select` → `run` shape applies to every op. GEMM, attention,
conv2d, FFT, scatter — the descriptor / args fields differ per family but
the lifecycle is identical. See the [`crates/baracuda-kernels`
README](crates/baracuda-kernels/README.md) for the int8-GEMM variant of
the same example.

## Workspace layout

The user-facing crates a typical caller will reach for:

```text
baracuda-kernels             # the unified Plan-based ML op facade
baracuda-kernels-types       # shared type vocabulary (Element, TensorRef, KernelSku, ...)
baracuda-kernels-sys         # raw FFI to bespoke .cu kernels
baracuda-kernels-bench       # criterion harness for sm_89 perf sweeps (not published)
baracuda-cutlass             # safe wrapper for CUTLASS GEMM (float, int8 RCR, batched, grouped)
baracuda-driver              # safe wrapper for the CUDA Driver API
baracuda-runtime             # safe wrapper for the CUDA Runtime API
```

The per-library wrappers used internally by the facade (you can also use
them stand-alone):

```text
baracuda-cublas{,-sys}       # cuBLAS + cuBLASLt + cuBLASXt
baracuda-cudnn{,-sys}        # cuDNN classic + Graph API
baracuda-cudf{,-sys}         # cuDF (RAPIDS dataframe; Linux-only)
baracuda-cufft{,-sys}        # cuFFT
baracuda-cusolver{,-sys}     # cuSOLVER dense + sparse + Rf + Mg
baracuda-cusparse{,-sys}     # cuSPARSE
baracuda-curand{,-sys}       # cuRAND
baracuda-cutensor{,-sys}     # cuTENSOR
baracuda-cutlass{,-sys}      # CUTLASS GEMM kernel templates
baracuda-cutlass-kernels-sys # CUTLASS kernel-only compile target
baracuda-cuvs{,-sys}         # RAPIDS cuVS GPU vector search (Phase 71)
baracuda-cvcuda{,-sys}       # CV-CUDA image processing
baracuda-flashinfer{,-sys}   # FlashInfer paged-KV + cascade + sampling (Phase 46/66)
baracuda-npp{,-sys}          # NPP
baracuda-nccl{,-sys}         # NCCL
baracuda-nvcomp{,-sys}       # nvCOMP
baracuda-nvimagecodec{,-sys} # nvImageCodec (Phase 70 — supersedes nvJPEG)
baracuda-nvjpeg{,-sys}       # nvJPEG (kept for back-compat)
baracuda-nvshmem{,-sys}      # NVSHMEM symmetric-heap RDMA (Phase 69)
baracuda-ozimmu{,-sys}       # ozIMMU Ozaki-scheme DGEMM (Phase 44)
baracuda-transformer-engine{,-sys}  # TransformerEngine FP8 (Phase 55)
```

And the supporting low-level crates (FFI, build infrastructure, profiling):

```text
baracuda-cuda-sys            # Driver + Runtime FFI
baracuda-nvrtc{,-sys}        # runtime CUDA C++ → PTX
baracuda-nvjitlink{,-sys}    # CUDA 12+ JIT linker
baracuda-cupti{,-sys}        # profiling APIs
baracuda-nvml{,-sys}         # device monitoring
baracuda-cufile{,-sys}       # GPUDirect Storage (Linux-only)
baracuda-tensorrt{,-sys}     # TensorRT inference runtime (Phase 68 — vtable-dispatch C++ shim)
baracuda-forge              # build-time .cu → PTX compiler driver
baracuda-build              # build.rs helpers
baracuda-core                # loader + Error plumbing
baracuda-types{,-derive}     # pure-data types: Half, BFloat16, Complex, DeviceRepr
```

Vendor-track training / optimization crates (opt-in via feature gates on
`baracuda-kernels`):

```text
baracuda-megatron            # Megatron-LM TP primitives (Phase 57; composition, no kernels)
baracuda-optim               # NVIDIA Apex multi-tensor optimizers (Phase 49)
```

The full umbrella crate (`baracuda`) re-exports everything behind cargo
features — convenient when you want everything; overkill when you don't.

## Hardware support

baracuda targets **Ampere and newer** by design. Pre-Ampere GPUs lack the
tensor-core instructions and async-copy primitives the bespoke kernels are
written against (`mma.sync.m16n8k*`, `cp.async`, `ldmatrix`), and we have
no desire to ship a slower SIMT fallback for hardware that's eight years
old.

| Compute capability | NVIDIA marketing names | baracuda support |
| --- | --- | --- |
| sm_80 | Ampere (A100, A40, A30, RTX 30xx) | **default baseline** |
| sm_89 | Ada Lovelace (RTX 40xx, L40, L4) | feature-gated specialized kernels (FP8, larger Flash Attention tiles) |
| sm_90a | Hopper async (H100, H200) | stubs in place; full specialization pending Phase 11 |
| sm_100 | Blackwell | post-Phase-11 |
| ≤ sm_75 (Turing, Volta, Pascal, …) | — | **unsupported** |

The default `sm80` build runs forward-compatibly on Ada and Hopper through
JIT-compiled PTX; turn on `sm89` to pick up the FP8 and Flash-Attention
sibling plans tuned for Ada's larger register file.

## Cargo features

The kernel facade exposes a broad opt-in feature set. Architecture and
library-integration features are off by default — pick what your
deployment needs. **Important:** unless a feature is enabled, its
plans are not built, not linked, and not present in the public API.
Always check the feature gate when scanning for an op family.

### Architecture targets

| Feature | Default | Effect |
| --- | --- | --- |
| `sm80` | yes | Ampere-baseline kernel set (RTX 30xx, A100). |
| `sm89` | no | Ada Lovelace specializations (FP8 GEMM, `FlashSdpaSm89Plan`). |
| `sm90a` | no | Hopper-specialized kernels (stubs today; tracked for Phase 7x). |

### NVIDIA library integrations

| Feature | Default | Phase | Effect |
| --- | --- | --- | --- |
| `cudnn` | no | 7 | Link cuDNN. Enables Conv2d / Pool2d / `CtcLossCudnnPlan`. Separate NVIDIA download. |

### Vendored kernel families

| Feature | Default | Phase | Effect |
| --- | --- | --- | --- |
| `fa2` | no | 42 | Vendored Tri Dao FlashAttention v2 (BSD-3). Long-context routing on `FlashSdpaPlan`. |
| `mhc` | no | 43 | Vendored DeepSeek-AI mHC.cu (MIT). `HyperConnectionPlan` (learned residuals). |
| `ozimmu` | no | 44 | Vendored ozIMMU (MIT). Ozaki-scheme DGEMM via S² int8 tensor-core matmuls. |
| `flashinfer` | no | 46/66 | Vendored FlashInfer (Apache-2.0). Paged-KV decode/prefill, cascade, sampling. |
| `marlin` | no | 48 | Vendored IST-DASLab Marlin (Apache-2.0). Symmetric int4 W4A16 GEMM. |
| `awq` | no | 48 | Vendored mit-han-lab llm-awq (MIT). Asymmetric int4 W4A16 GEMM. |
| `optim` | no | 49 | Vendored NVIDIA Apex (BSD-3). Multi-tensor Adam/Lamb/SGD step kernels. |
| `mamba` | no | 50 | Vendored Mamba-2 SSD + Dao causal-conv1d. State-space LLM ops. |
| `bnb_nf4` | no | 53 | Vendored bitsandbytes NF4 (MIT). 4-bit non-uniform quantile QLoRA. |
| `xformers_blocksparse` | no | 54 | Clean-room port of xFormers block-sparse SDPA (BSD-3). |
| `xformers_sparse24` | no | 54 | Clean-room port of xFormers 2:4 structured sparsity GEMM (BSD-3). |
| `tensor_engine` | no | 55 | Vendored NVIDIA TransformerEngine (Apache-2.0). FP8 cast/dequant + delayed-scaling recipe. |
| `ring_attention` | no | 56 | Sequence-parallel ring attention (Apache-2.0 reference). Pulls in NCCL. |
| `megatron_tp` | no | 57 | Megatron-LM tensor-parallel primitives (composition, no new kernels). |
| `nvshmem` | no | 69 | NVSHMEM host-side wrapper (one-sided RDMA, sibling to NCCL). Linux-only. |

Sibling crates also have their own feature gates:

- `baracuda-cuvs` — `cuvs` feature in `baracuda-kernels-sys` ecosystem
  pulls in RAPIDS cuVS vector-search (Phase 71, Linux-only). Not yet
  re-exported by `baracuda-kernels` itself.
- `baracuda-tensorrt` — `shim` feature builds the vtable-dispatch C++
  shim required to call TensorRT (Phase 68 — no flat C ABI upstream).
- `baracuda-flashinfer` — same `flashinfer` feature as above.

**Notes:**

- `cudnn` is off by default because cuDNN is a separate NVIDIA download
  not bundled with the stock CUDA toolkit installer. Enabling it without
  cuDNN installed produces a linker error on `cudnn.lib` / `libcudnn.so`
  — see the [Building](#building) section for auto-discovery paths.
- Most vendored features add 30s–5 min to first build (template-heavy
  CUDA). Subsequent builds incremental.
- Features that pull in optional sibling crates (`optim`, `tensor_engine`,
  `megatron_tp`, `nvshmem`, `ring_attention`) only compile the sibling
  when the feature is enabled — inference-only consumers don't pay the
  surface cost.

## Building

Requirements:

- **CUDA Toolkit ≥ 12.0** with `nvcc` on `PATH`. baracuda is tested on
  12.x and 13.x.
- **cuDNN 9.x** (only if you enable the `cudnn` feature) — separate
  NVIDIA download, not bundled with the toolkit.
- **A working Rust toolchain ≥ 1.85** (workspace MSRV pinned in
  `rust-toolchain.toml`).
- **Windows users**: `lld-link.exe` somewhere on `PATH`. The CUDA `nvcc`
  invocation links through it; the install location is typically
  `C:\Program Files\LLVM\bin`. Install the LLVM Windows package and add
  that directory to `PATH` if `cargo build` complains about
  `lld-link.exe` not being found.

A typical full build with all GPU-side features (CUDA toolkit + cuDNN
present):

```bash
cargo build -p baracuda-kernels --features sm89,cudnn --release
```

Or, to verify the public API surface compiles without the full kernel
build (fast — type-check only):

```bash
cargo check -p baracuda-kernels --features sm89,cudnn
```

The `baracuda-kernels-sys` build script auto-discovers cuDNN at the
following paths in order: `CUDNN_PATH` / `CUDNN_ROOT` / `CUDNN_HOME` env
vars, then `C:\Program Files\NVIDIA\CUDNN\v<X.Y>\` on Windows, then the
CUDA toolkit's own `lib/` directory (pre-cuDNN-9 layout), then the
standard Linux distro paths under `/usr/lib/`.

## Troubleshooting

### Windows: Git-for-Windows fake `link.exe` shadowing the MSVC linker

Git-for-Windows ships a GNU coreutils binary named `link.exe` at
`C:\Program Files\Git\usr\bin\link.exe` — its job is to create a hard
link, **not** to link object files. If that directory appears on `PATH`
ahead of the MSVC linker (or LLVM's `lld-link.exe`), `cargo build`
invokes the coreutils binary instead of the real linker and fails with a
cryptic error (it doesn't understand `/OUT:` and friends).

baracuda's `baracuda-kernels-sys` and `baracuda-cutlass-sys` build
scripts probe `PATH` on Windows and emit a `cargo:warning` if they
detect this shadowing. **Fix:** re-order `PATH` so the MSVC linker
(typically reached via the Visual Studio "x64 Native Tools Command
Prompt") or LLVM's `lld-link.exe` (`C:\Program Files\LLVM\bin\`) appears
before `C:\Program Files\Git\usr\bin\`. Building from the VS x64 Native
Tools prompt is the most reliable option; alternatively, install LLVM
and put its `bin` directory ahead of Git's on the user/system `PATH`.

## Testing

baracuda's GPU integration tests are gated behind `#[ignore]` so a
host-only `cargo test` doesn't try to launch a kernel on a machine
without an NVIDIA driver. To run them you need a working GPU plus the
`--ignored` flag:

```bash
# Host-only tests (compile + reference logic; no GPU access):
cargo test -p baracuda-kernels --lib

# Full GPU integration sweep — RTX 30xx / 40xx / 50xx required:
cargo test -p baracuda-kernels --release -- --ignored

# Verify the workspace-level API surface compiles (no GPU needed):
cargo check -p baracuda-kernels --features sm89,cudnn
```

The full regression on an RTX 4070 covers 324 binary targets at
~1630 tests passing. Individual op-family suites take 30–90 seconds;
the full sweep is 25–40 minutes.

## Benchmarks

The `baracuda-kernels-bench` crate is a criterion-based harness with
CUDA-event-timed throughput sweeps across GEMM, Flash Attention, and
Conv2d at LLM-typical and ResNet-typical shapes. It is **not** published
to crates.io (it depends on a working GPU).

```bash
cargo bench -p baracuda-kernels-bench --features sm89,cudnn
```

The full sweep takes ~30 minutes on an RTX 4070. Scope to a single family
with `--bench gemm` / `--bench flash_attention` / `--bench conv2d`. See
[`crates/baracuda-kernels-bench/BENCH-sm89.md`](crates/baracuda-kernels-bench/BENCH-sm89.md)
for the baseline table format and methodology.

## Project documentation

- [`ARCHITECTURE.md`](ARCHITECTURE.md) — layered design, Plan-Descriptor-Args
  pattern, `KernelSku` taxonomy, dispatcher design, workspace contract,
  sibling-plan pattern, vendoring convention, phase roadmap.
- `OP-MATRIX.md` — full op × dtype × backend coverage matrix (planned).
- `LESSONS.md` — postmortems, ABI footguns, performance traps (planned).
- Per-crate `README.md` files under `crates/<name>/`.

## License

Dual-licensed under [MIT](LICENSE-MIT) **or** [Apache-2.0](LICENSE-APACHE).
Pick whichever fits your project. Contributions accepted under the same
terms.

NVIDIA's CUDA libraries (`libcuda`, `libcudart`, `libcublas`, `libcudnn`,
…) are **not** redistributed by this project. You obtain them from NVIDIA
separately — either through the CUDA Toolkit installer or through each
library's dedicated download page. baracuda's loader opens whatever the
host driver / toolkit has installed.

## Vendor attribution

A small number of bespoke kernels in `baracuda-kernels-sys` are vendored
from upstream open-source projects (huggingface/candle's CUDA kernel set
via `fuel-cuda-kernels`; llama.cpp's `ggml-cuda` GGUF block-format
quantization + MMVQ; `guoqingbao/attention.rs`'s fused MoE expert
kernels). Each adapted source carries an `SPDX-FileCopyrightText:` +
`SPDX-License-Identifier:` header; the consolidated provenance is in
[`crates/baracuda-kernels-sys/LICENSE-thirdparty.md`](crates/baracuda-kernels-sys/LICENSE-thirdparty.md).

[**FlashAttention v2**](https://github.com/Dao-AILab/flash-attention)
(Tri Dao, BSD-3-Clause, pinned at `v2.8.3` /
`060c9188beec3a8b62b33a3bfa6d5d2d44975fab`) is vendored at
[`crates/baracuda-kernels-sys/vendor/flash-attention/`](crates/baracuda-kernels-sys/vendor/flash-attention/)
with verbatim `LICENSE` + `AUTHORS` files and full vendor / scope notes
in [`VENDOR.md`](crates/baracuda-kernels-sys/vendor/flash-attention/VENDOR.md).
Gated behind the `fa2` cargo feature on `baracuda-kernels-sys` and
`baracuda-kernels`; exposed through a backend-choice path on
`FlashSdpaPlan` (Phase 42).

[**mHC.cu**](https://github.com/AndreSlavescu/mHC.cu) (Andre Slavescu,
MIT, pinned at `a426939c2dbc11c443db041bcff12b65d1b6482a`) — unofficial
CUDA implementation of DeepSeek-AI's
[*Manifold-Constrained Hyper-Connections*](https://arxiv.org/abs/2512.24880)
paper — is vendored at
[`crates/baracuda-kernels-sys/vendor/mhc/`](crates/baracuda-kernels-sys/vendor/mhc/)
with the verbatim upstream `LICENSE`, an `AUTHORS` file, and full
vendor / scope notes in
[`VENDOR.md`](crates/baracuda-kernels-sys/vendor/mhc/VENDOR.md). Gated
behind the `mhc` cargo feature on `baracuda-kernels-sys` and
`baracuda-kernels`; exposed through the new `HyperConnectionPlan`
(Phase 43, Tier 1: static-H FW, bf16 weights / f32 activations).

[**FlashInfer**](https://github.com/flashinfer-ai/flashinfer) (NVIDIA
+ FlashInfer community, Apache-2.0 with full patent grant, pinned at
`v0.6.12` / `eee0d75f91f64c520bfaed07e39a850ea4ddde23`) — a curated
~12 kLOC subset of the FlashInfer header tree is vendored at
[`crates/baracuda-kernels-sys/vendor/flashinfer/`](crates/baracuda-kernels-sys/vendor/flashinfer/)
with verbatim upstream `LICENSE` + `NOTICE` and full vendor / scope /
patch notes in
[`VENDOR.md`](crates/baracuda-kernels-sys/vendor/flashinfer/VENDOR.md).
Gated behind the `flashinfer` cargo feature on `baracuda-kernels-sys`
and `baracuda-kernels`; exposes three NEW plan families — paged-KV
decode + append (`BatchPagedDecodePlan` + `PagedKvAppendPlan` for
vLLM-style serving), sort-free sampling (`TopKTopPSamplingPlan` —
combined TopK/TopP/MinP via a single-kernel rejection sampler), and
cascade attention LSE merge (`CascadeAttentionPlan` for prefix-cache
sharing). Surgical cherry-pick (not a wholesale wrap) — Hopper /
Blackwell / NVSHMEM / Mamba / MLA / POD paths intentionally skipped
to keep the build cost contained. Phase 46.

The [`baracuda-forge`](crates/baracuda-forge) build-time kernel-compiler
crate is a vendored fork of [`cudaforge`](https://github.com/guoqingbao/cudaforge)
by **Guoqing Bao** — see [`crates/baracuda-forge/NOTICE`](crates/baracuda-forge/NOTICE)
for the upstream commit hash.

The [`baracuda-cutlass`](crates/baracuda-cutlass) safe wrapper for NVIDIA
CUTLASS — plan-based GEMM and grouped-GEMM with caller-supplied
workspace, MoE-friendly variable-M-per-group dispatch — was specified
by the **Fuel ML library team**. See
[`crates/baracuda-cutlass/NOTICE`](crates/baracuda-cutlass/NOTICE) for
the design lineage.

[`baracuda-kernels`]: crates/baracuda-kernels
[`baracuda-kernels-sys`]: crates/baracuda-kernels-sys