atomr-accel 0.10.0

Backend-agnostic compute-acceleration core. Defines the AccelBackend trait, AccelRef<T> typed pointers, AccelError enum, and CompletionStrategy — the abstraction layer that lets atomr-accel-cuda (NVIDIA) and future ROCm / Metal / oneAPI / Vulkan backends plug into the same actor surface.
# atomr-accel

An **actor-shaped face for compute acceleration**, built on top of the
[atomr](https://github.com/rustakka/atomr) actor runtime. NVIDIA CUDA
ships today through [`atomr-accel-cuda`](crates/atomr-accel-cuda); the
backend trait surface accommodates AMD ROCm, Apple Metal, Intel
oneAPI, and Vulkan compute when those crates land. Each backend
library ([cuBLAS][cublas], [cuDNN][cudnn], [cuFFT][cufft],
[cuRAND][curand], [cuSOLVER][cusolver], [cuSPARSE][cusparse],
[cuTENSOR][cutensor], [cuBLASLt][cublaslt], [NVRTC][nvrtc],
[NCCL][nccl]) becomes a typed atomr actor with stable supervision,
generation-validated buffers, and a single async surface. Drop GPU
work into a Rust service without juggling streams, contexts, or
hand-rolled retry loops.

```rust
use atomr_accel_cuda::prelude::*;

let device = system.actor_of(DeviceActor::props(DeviceConfig::new(0)), "gpu-0")?;
let a = ask_alloc::<f32>(&device, n * n).await?;
let b = ask_alloc::<f32>(&device, n * n).await?;
let c = ask_alloc::<f32>(&device, n * n).await?;

let (reply, rx) = tokio::sync::oneshot::channel();
device.tell(DeviceMsg::Sgemm(Box::new(SgemmRequest {
    a, b, c, m: n, n: n, k: n, alpha: 1.0, beta: 0.0, reply,
})));
// reply arrives once the kernel completes — no host blocking,
// no manual stream synchronization.
```

That's the whole shape. The same envelope wires up convolutions
([`cudnnConvolutionForward`][cudnn-conv]), tensor contractions
([`cutensorContract`][cutensor-contract]), JIT-compiled custom kernels
([`nvrtcCompileProgram`][nvrtc-compile]), and multi-GPU all-reduce
([`ncclAllReduce`][nccl-allreduce]).

## Why an actor-shaped face for compute, in Rust, now

Modern workloads no longer live entirely on the CPU. Inference,
embedding, scoring, simulation — they want a GPU. Coordination,
control flow, I/O, persistence — they want a CPU. Today's stacks
force you to glue the two with ad-hoc batching layers, queues, and
serialization boundaries.

The actor model already encodes the right boundary: a message **is**
the dispatch unit. atomr-accel is built so that the same
`actor_ref.tell(msg)` can target a CPU mailbox today and a CUDA-backed
dispatcher tomorrow — with the same supervision, the same
backpressure, the same observability. The runtime is explicit about
*where* work runs without forcing the developer to write two programs.

Writing CUDA from Rust today otherwise means owning a long list of
invariants yourself:

| You'd otherwise hand-roll                                                                      | atomr-accel gives you                                                                            |
| ---------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------ |
| One [`CUcontext`][cuda-ctx] per device, restarted on poisoning                                 | `DeviceActor ↔ ContextActor` two-tier supervision                                               |
| [Sticky-error][cuda-sticky] detection and graceful recovery                                    | `OneForOneStrategy` + `GpuError::ContextPoisoned` decider                                       |
| Buffer staleness across context rebuilds                                                       | `GpuRef<T>` with generation tokens                                                              |
| Pinning library handles ([cuBLAS][cublas-handle], [cuDNN][cudnn-handle]) to a single OS thread | `GpuDispatcher` + per-actor handle                                                              |
| [Stream-event][cuda-events] choreography for kernel completion                                 | `HostFnCompletion` (sub-µs `cuLaunchHostFunc`) / `SyncCompletion` / `PolledCompletion`          |
| [`cuMemcpyPeerAsync`][cuda-p2p] cross-stream synchronization                                   | `P2pTopology` with `last_write_stream` injection                                                |
| [Page-locked][cuda-pinned] host buffer pooling                                                 | `PinnedBufferPool` actor                                                                        |
| [CUDA Graph][cuda-graph] capture/replay                                                        | `GraphActor` (`Sgemm` / `Memcpy` / `RngFillUniform` / `FftR2C` record contracts)                |
| Multi-GPU [communicator rebuild][nccl-comm] on context loss                                    | `NcclWorldActor` subscribes to `WatchGeneration`, tears down + rebuilds collectives             |

Because every concern is an actor, you compose CUDA the same way you
compose any other Rust service: `tokio` runtime, structured
supervision, typed messages, async/await throughout.
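The supervision column in the table above boils down to a plain decider: classify a typed error, return a directive. A minimal, dependency-free sketch of that idea — the variant names and the `decide` function are illustrative stand-ins here, not the crate's exact API:

```rust
// Supervision directives, in the spirit of the OneForOneStrategy
// decider described above. All names are illustrative.
#[derive(Debug, PartialEq)]
enum Directive { Restart, Resume, Stop, Escalate }

#[derive(Debug)]
enum GpuError {
    ContextPoisoned(String), // sticky CUDA error: rebuild the context
    OutOfMemory(String),     // transient: keep the actor alive, fail the request
    Unrecoverable(String),   // bubble up to the parent supervisor
}

// Map a typed error to a supervision directive — no panic-string parsing.
fn decide(err: &GpuError) -> Directive {
    match err {
        GpuError::ContextPoisoned(_) => Directive::Restart,
        GpuError::OutOfMemory(_) => Directive::Resume,
        GpuError::Unrecoverable(_) => Directive::Escalate,
    }
}

fn main() {
    assert_eq!(decide(&GpuError::ContextPoisoned("sticky".into())), Directive::Restart);
    assert_eq!(decide(&GpuError::OutOfMemory("8 GiB".into())), Directive::Resume);
    println!("decider ok");
}
```

Matching on the error type rather than a string is exactly what the `DeviceSupervisor` integration below enables.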

## What's in the box

| Crate                       | What it does                                                                                                         |
| --------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| `atomr-accel`               | Backend-agnostic core — `AccelBackend` trait, `AccelRef<T>`, `AccelDtype`/`DType`, `AccelError`, `CompletionStrategy`. Each backend crate (e.g. `atomr-accel-cuda`) depends on this for its trait surface. |
| `atomr-accel-cuda`          | NVIDIA CUDA implementation — `DeviceActor`/`ContextActor`, kernel actors for cuBLAS/cuBLASLt/cuDNN/cuFFT/cuRAND/cuSOLVER/cuSPARSE/cuTENSOR/NVRTC/NCCL, P2P topology, CUDA graphs, pinned pools |
| `atomr-accel-patterns`      | Universal blueprints — `DynamicBatchingServer`, `InferenceCascade`, `ModelReplicaPool`, `FairShareScheduler`, `ModelHotSwapServer`, `SpeculativeDecoder`, `MoeRouter`, plus a CPU `GpuMockActor` |
| `atomr-accel-train`         | Distributed-training blueprints — `DataParallelTrainer`, `PipelineParallelTrainer`, `TensorParallelTrainer`, `AsyncParameterServer`, optimizer + loss enums |
| `atomr-accel-agents`        | LLM blueprints — `RagPipeline` (with `EmbeddingCache` LRU + `CpuVectorIndex`), `SharedGpuStateCoordinator`, `LangGraphGpuActor` (DAG executor with cycle detection) |
| `atomr-accel-cuda-realtime` | NVRTC-backed realtime sims — `ImageFilterPipeline`, `ParticleSystemActor`, `ClothSimulationActor`, `FluidSimulationActor`, `SpatialIndexActor`, `GpuHashMapActor`, `GpuSparseStructureActor`, `MultiPassAnalysisActor`, `VideoEffectsGraph` |
| `atomr-accel-cub`           | CUB device-wide primitives — `CubActor` with reduce / scan / sort / histogram / select / partition / segmented-reduce dispatchers, NVRTC-templated per `(op, dtype, length-class)` |
| `atomr-accel-cutlass`       | CUTLASS kernel-template instantiation — `CutlassActor` for GEMM, grouped-GEMM, implicit-GEMM convolution, EVT (epilogue visitor tree), via NVRTC against vendored headers |
| `atomr-accel-flashattn`     | FlashAttention v2 + v3 kernels — `FlashAttnActor` with forward/backward, paged KV-cache, chunked prefill, varlen, ALiBi, sliding window, sink tokens, MQA/GQA, fp8 (fa3 only) |
| `atomr-accel-tensorrt`      | TensorRT engine builder + runtime — `TrtActor`, `IBuilderConfig` (fp32/fp16/bf16/int8/fp8/best), ONNX import, INT8 calibration, FP8 PTQ, `IPluginV3` Rust trampolines |
| `atomr-accel-telemetry`     | Observability backends — `NvtxKernelTrace` for kernel-range markers, `NvmlActor` for power/temp/ECC/clocks, `CuptiSession` for activity tracing |
| `atomr-accel-py`            | Python bindings via PyO3 — `atomr_accel.{System, Device, GpuBufferF32/F64/I32/U32/U8, Blas, Cudnn, Fft, RngGenerator, Solver, Collective, NvrtcKernel}`, typed exceptions, GIL-released kernel paths. Tracks upstream atomr 0.5.x Python coverage. |

Plus a Python facade — `pip install atomr-accel` — that exposes the
same actor model. Phase 1 ships multi-dtype buffers, the BLAS handle
(`gemm_f32`/`gemm_f64`/`axpy_f32`), per-feature handles for cuDNN /
cuFFT / cuRAND, and structural anchors for cuSOLVER / NCCL / NVRTC.
See [`docs/python-bridge.md`](docs/python-bridge.md) for the full
matrix and the Phase 1.5+ tracking issues.

## At a glance

```
   ┌──────────── ActorSystem ────────────┐
   │                                     │
   │   ┌─────────── DeviceActor ─────────┴─── stable address (ActorRef<DeviceMsg>)
   │   │   (queues work while context rebuilds)
   │   │
   │   │   ┌─── ContextActor ──── owns Arc<CudaContext> ── restartable
   │   │   │
   │   │   │   ├── BlasActor        ── cuBLAS handle  ── pinned to one stream
   │   │   │   ├── CudnnActor       ── cuDNN handle   ── pinned to one stream
   │   │   │   ├── FftActor         ── cuFFT plans    ── plan cache
   │   │   │   ├── RngActor         ── cuRAND gen     ── seedable
   │   │   │   ├── SolverActor      ── cuSOLVER       ── QR/LU/Chol/SVD/Syevd
   │   │   │   ├── SparseActor      ── cuSPARSE       ── CSR SpMv/SpMm
   │   │   │   ├── TensorActor      ── cuTENSOR       ── Einstein Contract
   │   │   │   ├── BlasLtActor      ── cuBLASLt       ── fused matmul + ReLU/GELU
   │   │   │   ├── NvrtcActor       ── NVRTC          ── JIT compile + launch
   │   │   │   └── CollectiveActor  ── NCCL comm      ── per-rank
   │   │   │
   │   │   └── PinnedBufferPool / ManagedAllocator / GraphActor / P2pTopology
   │
   └── PlacementActor / ReplayHarness / NcclWorldActor (top-level)
```

Each box is an actor. Messages are typed enums. Replies are
`oneshot::Sender` channels. Failures panic with a tagged string
(`"ContextPoisoned: …"` / `"OutOfMemory: …"` / `"Unrecoverable: …"`)
and the supervisor decides Restart / Resume / Stop / Escalate.
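The message/reply contract can be sketched with std channels standing in for the tokio oneshot the crates actually use; `DeviceMsg::Alloc` and `run_alloc` below are hypothetical names for illustration only:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical message shape: each request variant carries its own
// typed reply channel (the crate uses tokio `oneshot::Sender`; std
// mpsc keeps this sketch dependency-free).
enum DeviceMsg {
    Alloc { len: usize, reply: mpsc::Sender<Result<u64, String>> },
}

// Spawn a stand-in "actor" (one mailbox, typed dispatch), send one
// Alloc request, and wait for its reply.
fn run_alloc(len: usize) -> u64 {
    let (tx, rx) = mpsc::channel::<DeviceMsg>();
    let actor = thread::spawn(move || {
        for msg in rx {
            match msg {
                DeviceMsg::Alloc { len, reply } => {
                    // A real backend would allocate device memory here and,
                    // on failure, panic with a tagged string such as
                    // "OutOfMemory: …" for the supervisor to classify.
                    reply.send(Ok(len as u64)).unwrap();
                }
            }
        }
    });
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(DeviceMsg::Alloc { len, reply: reply_tx }).unwrap();
    let handle = reply_rx.recv().unwrap().unwrap();
    drop(tx); // close the mailbox so the actor loop ends
    actor.join().unwrap();
    handle
}

fn main() {
    println!("allocated handle {}", run_alloc(1024));
}
```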

## Quick start (Rust)

The umbrella crate is published on crates.io as **`atomr-accel`**:

```toml
[dependencies]
atomr-accel = { version = "0.1", features = ["cuda"] }
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
```

Or pull in subsystem crates directly — `atomr-accel-cuda`,
`atomr-accel-patterns`, `atomr-accel-train`, `atomr-accel-agents`,
`atomr-accel-cuda-realtime` are all on crates.io.

```rust
use atomr_accel_cuda::prelude::*;
use atomr::prelude::*;

#[tokio::main(flavor = "multi_thread", worker_threads = 2)]
async fn main() -> anyhow::Result<()> {
    let system = ActorSystem::create("gpu-app", Config::empty()).await?;

    // Real-mode device. Use `DeviceConfig::mock(0)` for no-GPU CI.
    let device = system.actor_of(
        DeviceActor::props(DeviceConfig::new(0)),
        "device-0",
    )?;

    // Allocate, copy, dispatch — see docs/getting-started.md.

    system.terminate().await;
    Ok(())
}
```

cudarc loads CUDA dynamically, so the workspace **builds and
unit-tests on hosts without a GPU**. Real kernel paths are gated
behind `--features cuda-runtime-tests`.

```bash
# No GPU needed:
cargo check --workspace --no-default-features
cargo test  --workspace --no-default-features
cargo run   -p atomr-accel-cuda --example echo_no_gpu

# With GPU + CUDA toolkit:
cargo run   -p atomr-accel-cuda --example sgemm     --features cuda-runtime-tests
cargo run   -p atomr-accel-cuda --example fft_1d    --features cuda-runtime-tests,cufft
cargo run   -p atomr-accel-cuda --example jit_relu  --features cuda-runtime-tests,nvrtc
```

## Quick start (Python)

```bash
python -m venv .venv && source .venv/bin/activate
pip install atomr-accel
```

```python
import numpy as np
from atomr_accel import System, Unrecoverable

with System.open("gpu-app") as system:
    dev = system.spawn_device(device_id=0, mock=False)
    a = dev.allocate_f32(1024)
    b = dev.allocate_f32(1024)
    c = dev.allocate_f32(1024)
    dev.copy_from_numpy(a, np.random.randn(1024).astype(np.float32))
    dev.copy_from_numpy(b, np.random.randn(1024).astype(np.float32))

    blas = dev.blas()
    blas.gemm_f32(a, b, c, m=32, n=32, k=32)

    out = dev.copy_to_numpy(c)
```

See [`docs/python-bridge.md`](docs/python-bridge.md) for the full
binding surface — multi-dtype buffers, per-feature handle classes
(`Blas`, `Cudnn`, `Fft`, `RngGenerator`, …), typed exceptions, the
GIL-release contract, mock-mode tests, and the Phase 1.5+ tracking
issues that fill in the remaining method-level coverage.

## Library coverage

| Library                            | Actor              | NVIDIA reference                                  | Feature flag |
|------------------------------------|--------------------|---------------------------------------------------|--------------|
| [cuBLAS][cublas]                   | `BlasActor`        | [`cublasSgemm`][cublas-sgemm]                     | always-on    |
| [cuBLASLt][cublaslt]               | `BlasLtActor`      | [`cublasLtMatmul` + epilogue][cublaslt-matmul]    | `cublaslt`   |
| [cuDNN][cudnn]                     | `CudnnActor`       | [`cudnnConvolutionForward`][cudnn-conv]           | `cudnn`      |
| [cuFFT][cufft]                     | `FftActor`         | [`cufftPlan1d`][cufft-plan] / [`cufftExecR2C`][cufft-exec] | `cufft` |
| [cuRAND][curand]                   | `RngActor`         | [`curandGenerateUniform`][curand-uniform]         | `curand`     |
| [cuSOLVER][cusolver]               | `SolverActor`      | [`cusolverDnSgeqrf`][cusolver-qr] / `Sgetrf` / `Spotrf` / `Sgesvd` / `Ssyevd` | `cusolver` |
| [cuSPARSE][cusparse]               | `SparseActor`      | [`cusparseSpMV`][cusparse-spmv] / `SpMM` (CSR)    | `cusparse`   |
| [cuTENSOR][cutensor]               | `TensorActor`      | [`cutensorContract`][cutensor-contract]           | `cutensor`   |
| [NVRTC][nvrtc]                     | `NvrtcActor`       | [`nvrtcCompileProgram`][nvrtc-compile]            | `nvrtc`      |
| [NCCL][nccl]                       | `CollectiveActor` + `NcclWorldActor` | [`ncclAllReduce`][nccl-allreduce] | `nccl` |
| [Pinned host memory][cuda-pinned]  | `PinnedBufferPool` | [`cuMemHostAlloc`][cuda-pinned-api]               | always-on    |
| [Unified memory][cuda-um]          | `ManagedAllocatorActor` | [`cudaMallocManaged`][cuda-um-api]           | always-on    |
| [CUDA Graphs][cuda-graph]          | `GraphActor`       | [`cuGraphInstantiate` / `cuGraphLaunch`][cuda-graph-api] | always-on |
| [Peer-to-peer][cuda-p2p]           | `P2pTopology`      | [`cuMemcpyPeerAsync`][cuda-memcpy-peer]           | always-on    |

Aggregate features:
- `core-libs` = `cudnn` + `cufft` + `curand` + `cusparse` + `cutensor` + `cuda-managed`.
- `training-libs` = `core-libs` + `cusolver` + `cublaslt` + `nvrtc`.
- `full-cuda` = `training-libs` + `nccl` + `cuda-ipc` + `graphs-conditional`.
- `observability-full` = `telemetry` + `nvtx-trace` + `nvml` + `cupti`.

Sibling-crate gates (off by default; pull each in by enabling the
matching feature on `atomr-accel-cuda`):

- `cutlass` (+ `cutlass-evt`, `cutlass-grouped`, `cutlass-prebuilt`).
- `flashattn` (+ `flashattn-fp8`, `flashattn-paged`).
- `tensorrt` (+ `tensorrt-onnx`, `tensorrt-plugin`, `tensorrt-int8`, `tensorrt-fp8`).
- `nvtx-trace`, `nvml`, `cupti` — Phase 9 telemetry backends, layered on `telemetry`.

## atomr integrations

atomr-accel is feature-gated for each atomr subsystem so you only pay
for what you use:

- `replay` — persists replay-journal entries through any
  [`atomr_persistence::Journal`](https://docs.rs/atomr-persistence)
  (in-memory, SQL, Redis, MongoDB, Cassandra, Dynamo). Build a
  deterministic replay harness with one constructor:
  `ReplayHarness::with_journal(journal, "pid")`.
- `cluster` — `placement::sharded::PlacementShardingAdapter` exposes
  a typed `EntityRef<DeviceExtractor>` over
  [`atomr-cluster-sharding`](https://docs.rs/atomr-cluster-sharding),
  so device routing follows consistent-hash placement across a cluster.
- `streams` — `streams_pipeline::{source_from_unbounded, gpu_stage,
  run_collect}` build GPU pipelines with
  [`atomr-streams`](https://docs.rs/atomr-streams) Source / Sink
  alongside the actor-based `pipeline::PipelineExecutor`.
- `telemetry` — `observability::install(system, "node-1")` wires up a
  `TelemetryExtension` plus GPU-specific probes (allocations, OOM
  count, generation, VRAM, in-flight kernels). Visualize live in
  [`atomr-dashboard`](https://github.com/rustakka/atomr/tree/main/crates/atomr-dashboard).
- Typed supervision — `error::DeviceSupervisor` implements
  `SupervisorOf<C>` over `GpuError`. Pattern-match the error type
  instead of parsing panic strings.
- `#[derive(Actor)]` from
  [`atomr-macros`](https://docs.rs/atomr-macros) — eliminates
  async-trait boilerplate.

## What you don't have to think about

- **Stream allocation.** Three strategies (`PerActorAllocator`,
  `SingleStreamAllocator`, `PooledAllocator`) ship out of the box;
  inject one and forget about it.
- **Kernel completion.** `HostFnCompletion` registers a
  [`cuLaunchHostFunc`][cuda-launch-host] callback that wakes the reply
  future the moment the kernel finishes — no host syncs, no polling.
- **Cross-stream events.** `GpuRef<T>` records its
  `last_write_stream`; downstream readers automatically wait on the
  right [event][cuda-events] before launching.
- **Context loss.** `WatchGeneration` is a
  `tokio::sync::watch::Receiver<u64>` you can subscribe to from any
  observer; we use it internally to rebuild NCCL communicators and
  invalidate P2P caches.
- **OS-thread pinning.** `GpuDispatcher` keeps the cuBLAS/cuDNN
  handle on a stable OS thread for its lifetime — required by
  several library APIs and easy to get wrong in async Rust.
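The generation-token mechanism behind `GpuRef<T>` can be sketched in a few lines: a handle minted under one context generation is refused once a rebuild bumps the counter, so stale device pointers are never dereferenced. All names below are illustrative, not the crate's real types:

```rust
// Illustrative sketch of generation-validated buffer handles.
#[derive(Debug, PartialEq)]
enum AccelError {
    StaleRef { held: u64, current: u64 },
}

// Stand-in for the restartable context actor's state.
struct Context {
    generation: u64,
}

#[derive(Clone, Copy)]
struct BufRef {
    generation: u64, // plus device pointer, length, … in the real type
}

impl Context {
    fn alloc(&self) -> BufRef {
        BufRef { generation: self.generation }
    }
    // Context poisoned → supervisor restart → new generation.
    fn rebuild(&mut self) {
        self.generation += 1;
    }
    // Every dispatch validates the handle's generation first.
    fn validate(&self, r: BufRef) -> Result<(), AccelError> {
        if r.generation == self.generation {
            Ok(())
        } else {
            Err(AccelError::StaleRef { held: r.generation, current: self.generation })
        }
    }
}

fn main() {
    let mut ctx = Context { generation: 0 };
    let buf = ctx.alloc();
    assert!(ctx.validate(buf).is_ok());
    ctx.rebuild(); // simulate a sticky CUDA error + restart
    assert!(ctx.validate(buf).is_err()); // stale handle refused, not dereferenced
    println!("stale ref rejected after rebuild");
}
```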

## Building from source

You need a sibling clone of the
[atomr](https://github.com/rustakka/atomr) workspace next to this
repo (the workspace.dependencies in `Cargo.toml` reference
`../atomr`):

```
your-workspace/
├── atomr/         # the atomr actor runtime
└── atomr-accel/   # this repo
```

```bash
# Rust
cargo build --workspace
cargo test  --workspace --no-default-features

# The full release-pipeline gate (fmt + clippy + test + multi-feature check + doc)
cargo xtask verify

# Python bindings (requires maturin + a Python dev toolchain)
cd crates/atomr-accel-py
maturin develop --release
pytest tests/ -v
```

GPU-host integration tests are **opt-in** and **not part of CI**. On a
CUDA-equipped workstation:

```bash
cargo xtask gpu-probe           # report local CUDA + library availability
cargo xtask gpu-test            # run all suites
cargo xtask gpu-test cublas     # run one suite
cargo xtask gpu-bench           # criterion perf-regression benches
```

Tests skip gracefully when the local driver / library / GPU isn't
present, so the same commands are safe on a no-GPU laptop. See
[`docs/gpu-testing.md`](docs/gpu-testing.md) for the full suite list,
the gating model (cargo feature + `#[ignore]` + runtime probe), and
the rationale for keeping these tests out of CI.

## Build matrix

```bash
# No-GPU dev box:
cargo check --workspace --no-default-features
cargo check --workspace --features atomr-accel-cuda/core-libs
cargo check --workspace --features atomr-accel-cuda/training-libs
cargo check --workspace --features atomr-accel-cuda/full-cuda

# atomr subsystem integrations:
cargo check --workspace --features atomr-accel-cuda/replay
cargo check --workspace --features atomr-accel-cuda/cluster
cargo check --workspace --features atomr-accel-cuda/streams
cargo check --workspace --features atomr-accel-cuda/telemetry

cargo test  -p atomr-accel-cuda --features replay --test replay_persistence

# GPU host (requires CUDA toolkit):
cargo run   -p atomr-accel-cuda --example sgemm        --features cuda-runtime-tests
cargo run   -p atomr-accel-cuda --example rng_uniform  --features cuda-runtime-tests,curand
cargo run   -p atomr-accel-cuda --example fft_1d       --features cuda-runtime-tests,cufft
cargo run   -p atomr-accel-cuda --example jit_relu     --features cuda-runtime-tests,nvrtc

cargo bench -p atomr-accel-cuda --bench sgemm_overhead --features cuda-runtime-tests
cargo bench -p atomr-accel-cuda --bench rng_throughput --features cuda-runtime-tests,curand
```

## Picking the right deps

Each sub-crate path-depends only on `atomr-accel-cuda` (the foundation) —
no implicit pulls of the other blueprints. Add what you need:

```toml
# Just batching:
atomr-accel          = "0.1"
atomr-accel-patterns = "0.1"

# Training pipeline with NCCL + replay journal:
atomr-accel       = { version = "0.1", features = ["full-cuda", "replay"] }
atomr-accel-train = "0.1"

# Realtime sims with JIT kernels:
atomr-accel               = { version = "0.1", features = ["nvrtc"] }
atomr-accel-cuda-realtime = { version = "0.1", features = ["nvrtc"] }
```

[`docs/features-matrix.md`](docs/features-matrix.md) shows the full
pick-by-goal table plus the transitive-dependency view of every
feature.

Every sub-crate ships a `prelude` module:

```rust
use atomr_accel_cuda::prelude::*;            // foundation
use atomr_accel_patterns::prelude::*;        // batching, cascade, …
use atomr_accel_train::prelude::*;           // trainers, optimizers
use atomr_accel_agents::prelude::*;          // RAG, embedding cache
use atomr_accel_cuda_realtime::prelude::*;   // particles, cloth, sparse
```

If you're using an AI coding assistant (Claude Code, Cursor, etc.),
[`ai-skills/`](ai-skills/) ships ten `SKILL.md` files your tool can
pick up so the assistant gives you idiomatic atomr-accel guidance
instead of guessing.

## Layout

```
crates/                       Rust workspace
crates/atomr-accel/           Backend-agnostic core (umbrella)
crates/atomr-accel-cuda/      NVIDIA CUDA implementation
crates/atomr-accel-patterns/  Universal blueprints (batching / cascade / scheduler / …)
crates/atomr-accel-train/     Distributed-training blueprints
crates/atomr-accel-agents/    LLM blueprints (RAG / DAG)
crates/atomr-accel-cuda-realtime/  NVRTC-backed realtime sims
crates/atomr-accel-cub/       CUB device-wide primitives (Phase 5)
crates/atomr-accel-cutlass/   CUTLASS templates via NVRTC (Phase 6)
crates/atomr-accel-flashattn/ FlashAttention v2 + v3 kernels (Phase 7)
crates/atomr-accel-tensorrt/  TensorRT engine builder + runtime (Phase 8)
crates/atomr-accel-telemetry/ NVTX / NVML / CUPTI observability (Phase 9)
crates/atomr-accel-py/        PyO3 bridge (Python module: atomr_accel)
ai-skills/                    Vendor-neutral SKILL.md files for AI assistants
docs/                         Architecture, getting-started, concepts, features-matrix, gpu-testing
xtask/                        Cargo xtask (bump, verify, gpu-probe, gpu-test, gpu-bench)
```

## Status

Phases 0 – 9 of the CUDA-coverage roadmap are merged. The workspace
ships **twelve library crates** spanning the foundation actor surface
(`atomr-accel`, `atomr-accel-cuda`), the blueprint sub-crates
(`atomr-accel-patterns`, `atomr-accel-train`, `atomr-accel-agents`,
`atomr-accel-cuda-realtime`, `atomr-accel-py`), Phase 1 – 4 library
expansions (full cuBLAS / cuBLASLt / cuFFT / cuRAND / cuSOLVER dtype
matrix, cuDNN frontend graph, NCCL collective set, cuTENSOR
contraction + reduce + permute, cuSPARSE generic API + cuSPARSELt
2:4), Phase 5 foundations (NVRTC v2 + Hopper/Blackwell +
`atomr-accel-cub`), and Phase 6 – 9 sibling crates
(`atomr-accel-cutlass`, `atomr-accel-flashattn`,
`atomr-accel-tensorrt`, `atomr-accel-telemetry`).

The full feature matrix builds clean on a no-GPU host. ≈ 175 unit
tests pass with the headline feature combo
(`f16,cudnn,curand,cufft,nvrtc,cusolver,cusparse,cusparse-generic,cutensor,cublaslt,nccl,nvtx,cuda-ipc,cuda-managed,graphs-conditional`).
The opt-in GPU integration suite — invoked via `cargo xtask gpu-test`
— covers SGEMM, FFT, RNG, pinned memcpy, SpMV, tensor contraction,
SVD, the dispatch tables for FlashAttention / CUTLASS / CUB, and
real NVML probes against installed devices. See
[`docs/gpu-testing.md`](docs/gpu-testing.md) for the suite catalog
and the rationale for keeping it out of CI.

## Releasing

`v*.*.*` git tags trigger a single `release.yml` pipeline that runs
the verify gate, builds Python wheels (manylinux x86_64 + aarch64,
musllinux x86_64 + aarch64, macOS universal2, Windows x86_64) + an
sdist, creates a GitHub Release, publishes the workspace crates to
crates.io in topological order, and uploads wheels + sdist to PyPI
via trusted publishing. See
[`docs/release-process.md`](docs/release-process.md) for the
end-to-end operator flow and
[`docs/release-pipeline.md`](docs/release-pipeline.md) for the
workflow internals.

## Learn more

- [`docs/getting-started.md`](docs/getting-started.md) — the
  ten-minute tour: wiring atomr-accel into a project, picking
  features, no-GPU vs GPU paths.
- [`docs/concepts.md`](docs/concepts.md) — the five mental models
  (supervision, generation tokens, completion, streams, watch).
- [`docs/architecture.md`](docs/architecture.md) — the full design
  narrative.
- [`docs/backends.md`](docs/backends.md) — the multi-backend trait
  abstraction (and the ROCm / Metal / oneAPI roadmap).
- [`docs/features-matrix.md`](docs/features-matrix.md) — pick the
  smallest dep footprint that fits your goal.
- [`docs/python-bridge.md`](docs/python-bridge.md) — Python bindings
  surface and GIL strategy.
- [`docs/gpu-testing.md`](docs/gpu-testing.md) — opt-in GPU
  integration suite, the three-layer gating model, and why the suite
  is intentionally not part of CI.
- [`ai-skills/README.md`](ai-skills/README.md) — install the skill
  bundle into Claude Code, Cursor, Codex CLI, Gemini CLI, or any
  harness that reads `SKILL.md`. Covers the foundation actors plus
  per-crate skills for FlashAttention, CUTLASS, and TensorRT.
- [`docs/release-process.md`](docs/release-process.md) — operator
  guide: how to ship a release, conventional-commit rules, the
  trampoline architecture, and a troubleshooting cookbook.
- [`docs/release-pipeline.md`](docs/release-pipeline.md) — workflow
  internals: jobs, matrix entries, secrets, and dep-order publish
  list.

## License

Apache-2.0.

---

[cuda-ctx]: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html
[cuda-sticky]: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#error-checking
[cuda-events]: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#events
[cuda-pinned]: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#page-locked-host-memory
[cuda-pinned-api]: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g572ca4011bfcb25034888a14d4e035b9
[cuda-um]: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-memory-programming
[cuda-um-api]: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gd228014f19cc0975ebe3e0dd2af6dd1b
[cuda-graph]: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
[cuda-graph-api]: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__GRAPH.html
[cuda-p2p]: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#peer-to-peer-memory-access
[cuda-memcpy-peer]: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g0e6a92f5c0a8c9d8a1c3d9a7e72b7d6e
[cuda-launch-host]: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1g05841eaa5f90f27264c5d9eb96b16d2c
[cublas]: https://docs.nvidia.com/cuda/cublas/index.html
[cublas-handle]: https://docs.nvidia.com/cuda/cublas/index.html#cublas-context
[cublas-sgemm]: https://docs.nvidia.com/cuda/cublas/index.html#cublas-t-gemm
[cublaslt]: https://docs.nvidia.com/cuda/cublas/index.html#using-the-cublaslt-api
[cublaslt-matmul]: https://docs.nvidia.com/cuda/cublas/index.html#cublasltmatmul
[cudnn]: https://docs.nvidia.com/deeplearning/cudnn/api/index.html
[cudnn-handle]: https://docs.nvidia.com/deeplearning/cudnn/api/cudnn-ops-library.html#cudnncreate
[cudnn-conv]: https://docs.nvidia.com/deeplearning/cudnn/api/cudnn-cnn-library.html#cudnnconvolutionforward
[cufft]: https://docs.nvidia.com/cuda/cufft/index.html
[cufft-plan]: https://docs.nvidia.com/cuda/cufft/index.html#function-cufftplan1d
[cufft-exec]: https://docs.nvidia.com/cuda/cufft/index.html#function-cufftexecr2c
[curand]: https://docs.nvidia.com/cuda/curand/index.html
[curand-uniform]: https://docs.nvidia.com/cuda/curand/host-api-overview.html#generation-functions
[cusolver]: https://docs.nvidia.com/cuda/cusolver/index.html
[cusolver-qr]: https://docs.nvidia.com/cuda/cusolver/index.html#cuds-lt-t-gt-geqrf
[cusparse]: https://docs.nvidia.com/cuda/cusparse/index.html
[cusparse-spmv]: https://docs.nvidia.com/cuda/cusparse/index.html#cusparsespmv
[cutensor]: https://docs.nvidia.com/cuda/cutensor/latest/index.html
[cutensor-contract]: https://docs.nvidia.com/cuda/cutensor/latest/api/cutensor.html#cutensorcontract
[nvrtc]: https://docs.nvidia.com/cuda/nvrtc/index.html
[nvrtc-compile]: https://docs.nvidia.com/cuda/nvrtc/index.html#group__error_1ga0e0b48c4e6f7e69dbb5e1d8c6c58c1d8
[nccl]: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html
[nccl-comm]: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html
[nccl-allreduce]: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/colls.html#ncclallreduce