# atomr-accel
An **actor-shaped face for compute acceleration**, on top of the
[atomr](https://github.com/rustakka/atomr) actor runtime. NVIDIA CUDA
ships today through [`atomr-accel-cuda`](crates/atomr-accel-cuda); the
backend trait surface accommodates AMD ROCm, Apple Metal, Intel
oneAPI, and Vulkan compute when those crates land. Each backend
library ([cuBLAS][cublas], [cuDNN][cudnn], [cuFFT][cufft],
[cuRAND][curand], [cuSOLVER][cusolver], [cuSPARSE][cusparse],
[cuTENSOR][cutensor], [cuBLASLt][cublaslt], [NVRTC][nvrtc],
[NCCL][nccl]) becomes a typed atomr actor with stable supervision,
generation-validated buffers, and a single async surface. Drop GPU
work into a Rust service without juggling streams, contexts, or
hand-rolled retry loops.
```rust
use atomr_accel_cuda::prelude::*;
let device = system.actor_of(DeviceActor::props(DeviceConfig::new(0)), "gpu-0")?;

let a = ask_alloc::<f32>(&device, n * n).await?;
let b = ask_alloc::<f32>(&device, n * n).await?;
let c = ask_alloc::<f32>(&device, n * n).await?;

let (reply, rx) = tokio::sync::oneshot::channel();
device.tell(DeviceMsg::Sgemm(Box::new(SgemmRequest {
    a, b, c, m: n, n: n, k: n, alpha: 1.0, beta: 0.0, reply,
})));
// rx resolves once the kernel completes — no host blocking,
// no manual stream synchronization.
```
That's the whole shape. The same envelope wires up convolutions
([`cudnnConvolutionForward`][cudnn-conv]), tensor contractions
([`cutensorContract`][cutensor-contract]), JIT-compiled custom kernels
([`nvrtcCompileProgram`][nvrtc-compile]), and multi-GPU all-reduce
([`ncclAllReduce`][nccl-allreduce]).
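For instance, a JIT-compiled custom kernel rides the same tell-with-reply envelope. A sketch with placeholder names — `NvrtcMsg`, `CompileRequest`, and the fields below are illustrative, not the crate's exact message types:
```rust
// Hypothetical message shape for illustration only; the real NVRTC
// actor's enum and request structs live in atomr-accel-cuda.
nvrtc.tell(NvrtcMsg::Compile(Box::new(CompileRequest {
    name: "relu".into(),
    source: KERNEL_SRC.into(), // CUDA C source for a __global__ kernel
    reply,                     // oneshot::Sender, exactly as in the SGEMM teaser
})));
```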
## Why an actor-shaped face for compute, in Rust, now
Modern workloads no longer live entirely on the CPU. Inference,
embedding, scoring, simulation — they want a GPU. Coordination,
control flow, I/O, persistence — they want a CPU. Today's stacks
force you to glue the two with ad-hoc batching layers, queues, and
serialization boundaries.
The actor model already encodes the right boundary: a message **is**
the dispatch unit. atomr-accel is built so that the same
`actor_ref.tell(msg)` can target a CPU mailbox today and a CUDA-backed
dispatcher tomorrow — with the same supervision, the same
backpressure, the same observability. The runtime is explicit about
*where* work runs without forcing the developer to write two programs.
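A sketch of what that buys you in practice: one binary serves both a no-GPU CI run and a real device, switching only the config (`gpu_available` is an assumed runtime flag here):
```rust
// Same actor type, same message protocol; only the config differs.
let device = if gpu_available {
    system.actor_of(DeviceActor::props(DeviceConfig::new(0)), "gpu-0")?
} else {
    // CPU mock mode, used for no-GPU CI.
    system.actor_of(DeviceActor::props(DeviceConfig::mock(0)), "gpu-0")?
};
```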
Writing CUDA from Rust today otherwise means owning a long list of
invariants yourself:
| CUDA invariant you'd own by hand | atomr-accel's answer |
| --- | --- |
| One [`CUcontext`][cuda-ctx] per device, restarted on poisoning | `DeviceActor ↔ ContextActor` two-tier supervision |
| [Sticky-error][cuda-sticky] detection and graceful recovery | `OneForOneStrategy` + `GpuError::ContextPoisoned` decider |
| Buffer staleness across context rebuilds | `GpuRef<T>` with generation tokens |
| Pinning library handles ([cuBLAS][cublas-handle], [cuDNN][cudnn-handle]) to a single OS thread | `GpuDispatcher` + per-actor handle |
| [Stream-event][cuda-events] choreography for kernel completion | `HostFnCompletion` (sub-µs `cuLaunchHostFunc`) / `SyncCompletion` / `PolledCompletion` |
| [`cuMemcpyPeerAsync`][cuda-p2p] cross-stream synchronization | `P2pTopology` with `last_write_stream` injection |
| [Page-locked][cuda-pinned] host buffer pooling | `PinnedBufferPool` actor |
| [CUDA Graph][cuda-graph] capture/replay | `GraphActor` (`Sgemm` / `Memcpy` / `RngFillUniform` / `FftR2C` record contracts) |
| Multi-GPU [communicator rebuild][nccl-comm] on context loss | `NcclWorldActor` subscribes to `WatchGeneration`, tears down + rebuilds collectives |
Because every concern is an actor, you compose CUDA the same way you
compose any other Rust service: `tokio` runtime, structured
supervision, typed messages, async/await throughout.
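To make one of those rows concrete: a generation token is a counter captured at allocation time and checked at use time, so a buffer that outlived a context rebuild is rejected instead of dereferenced. A minimal sketch of the idea, not the crate's actual `GpuRef<T>` definition:
```rust
use std::marker::PhantomData;

// Sketch of the generation-token idea behind GpuRef<T>; the field names
// are illustrative, not atomr-accel-cuda's actual layout.
struct GpuRefSketch<T> {
    device_ptr: u64,  // raw device allocation
    generation: u64,  // context generation observed at allocation time
    _marker: PhantomData<T>,
}

impl<T> GpuRefSketch<T> {
    /// Stale if the context has been rebuilt since this buffer was allocated.
    fn is_valid(&self, current_generation: u64) -> bool {
        self.generation == current_generation
    }
}
```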
## What's in the box
| Crate | What it provides |
| --- | --- |
| `atomr-accel` | Backend-agnostic core — `AccelBackend` trait, `AccelRef<T>`, `AccelDtype`/`DType`, `AccelError`, `CompletionStrategy`. Each backend crate (e.g. `atomr-accel-cuda`) depends on this for its trait surface. |
| `atomr-accel-cuda` | NVIDIA CUDA implementation — `DeviceActor`/`ContextActor`, kernel actors for cuBLAS/cuBLASLt/cuDNN/cuFFT/cuRAND/cuSOLVER/cuSPARSE/cuTENSOR/NVRTC/NCCL, P2P topology, CUDA graphs, pinned pools |
| `atomr-accel-patterns` | Universal blueprints — `DynamicBatchingServer`, `InferenceCascade`, `ModelReplicaPool`, `FairShareScheduler`, `ModelHotSwapServer`, `SpeculativeDecoder`, `MoeRouter`, plus a CPU `GpuMockActor` |
| `atomr-accel-train` | Distributed-training blueprints — `DataParallelTrainer`, `PipelineParallelTrainer`, `TensorParallelTrainer`, `AsyncParameterServer`, optimizer + loss enums |
| `atomr-accel-agents` | LLM blueprints — `RagPipeline` (with `EmbeddingCache` LRU + `CpuVectorIndex`), `SharedGpuStateCoordinator`, `LangGraphGpuActor` (DAG executor with cycle detection) |
| `atomr-accel-cuda-realtime` | NVRTC-backed realtime sims — `ImageFilterPipeline`, `ParticleSystemActor`, `ClothSimulationActor`, `FluidSimulationActor`, `SpatialIndexActor`, `GpuHashMapActor`, `GpuSparseStructureActor`, `MultiPassAnalysisActor`, `VideoEffectsGraph` |
| `atomr-accel-cub` | CUB device-wide primitives — `CubActor` with reduce / scan / sort / histogram / select / partition / segmented-reduce dispatchers, NVRTC-templated per `(op, dtype, length-class)` |
| `atomr-accel-cutlass` | CUTLASS kernel-template instantiation — `CutlassActor` for GEMM, grouped-GEMM, implicit-GEMM convolution, EVT (epilogue visitor tree), via NVRTC against vendored headers |
| `atomr-accel-flashattn` | FlashAttention v2 + v3 kernels — `FlashAttnActor` with forward/backward, paged KV-cache, chunked prefill, varlen, ALiBi, sliding window, sink tokens, MQA/GQA, fp8 (fa3 only) |
| `atomr-accel-tensorrt` | TensorRT engine builder + runtime — `TrtActor`, `IBuilderConfig` (fp32/fp16/bf16/int8/fp8/best), ONNX import, INT8 calibration, FP8 PTQ, `IPluginV3` Rust trampolines |
| `atomr-accel-telemetry` | Observability backends — `NvtxKernelTrace` for kernel-range markers, `NvmlActor` for power/temp/ECC/clocks, `CuptiSession` for activity tracing |
| `atomr-accel-py` | Python bindings via PyO3 — `atomr_accel.{System, Device, GpuBufferF32/F64/I32/U32/U8, Blas, Cudnn, Fft, RngGenerator, Solver, Collective, NvrtcKernel}`, typed exceptions, GIL-released kernel paths. Tracks upstream atomr 0.5.x Python coverage. |
Plus a Python facade — `pip install atomr-accel` — that exposes the
same actor model. Phase 1 ships multi-dtype buffers, the BLAS handle
(`gemm_f32`/`gemm_f64`/`axpy_f32`), per-feature handles for cuDNN /
cuFFT / cuRAND, and structural anchors for cuSOLVER / NCCL / NVRTC.
See [`docs/python-bridge.md`](docs/python-bridge.md) for the full
matrix and the Phase 1.5+ tracking issues.
## At a glance
```
┌──────────── ActorSystem ────────────┐
│ │
│ ┌─────────── DeviceActor ─────────┴─── stable address (ActorRef<DeviceMsg>)
│ │ (queues work while context rebuilds)
│ │
│ │ ┌─── ContextActor ──── owns Arc<CudaContext> ── restartable
│ │ │
│ │ │ ├── BlasActor ── cuBLAS handle ── pinned to one stream
│ │ │ ├── CudnnActor ── cuDNN handle ── pinned to one stream
│ │ │ ├── FftActor ── cuFFT plans ── plan cache
│ │ │ ├── RngActor ── cuRAND gen ── seedable
│ │ │ ├── SolverActor ── cuSOLVER ── QR/LU/Chol/SVD/Syevd
│ │ │ ├── SparseActor ── cuSPARSE ── CSR SpMv/SpMm
│ │ │ ├── TensorActor ── cuTENSOR ── Einstein Contract
│ │ │ ├── BlasLtActor ── cuBLASLt ── fused matmul + ReLU/GELU
│ │ │ ├── NvrtcActor ── NVRTC ── JIT compile + launch
│ │ │ └── CollectiveActor ── NCCL comm ── per-rank
│ │ │
│ │ └── PinnedBufferPool / ManagedAllocator / GraphActor / P2pTopology
│
└── PlacementActor / ReplayHarness / NcclWorldActor (top-level)
```
Each box is an actor. Messages are typed enums. Replies are
`oneshot::Sender` channels. Failures panic with a tagged string
(`"ContextPoisoned: …"` / `"OutOfMemory: …"` / `"Unrecoverable: …"`)
and the supervisor decides Restart / Resume / Stop / Escalate.
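A sketch of that decision, matching the tagged failure classes above; the `Directive` enum and the variant payloads are assumptions standing in for atomr's actual supervision API:
```rust
// Illustrative decider in the spirit of the OneForOneStrategy +
// GpuError decider named earlier; names are a sketch, not the real API.
fn decide(error: &GpuError) -> Directive {
    match error {
        GpuError::ContextPoisoned { .. } => Directive::Restart, // rebuild the CUcontext
        GpuError::OutOfMemory { .. }     => Directive::Resume,  // caller sees the error; actor lives
        GpuError::Unrecoverable { .. }   => Directive::Stop,
        _                                => Directive::Escalate,
    }
}
```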
## Quick start (Rust)
The umbrella crate is published on crates.io as **`atomr-accel`**:
```toml
[dependencies]
atomr-accel = { version = "0.1", features = ["cuda"] }
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
```
Or pull in subsystem crates directly — `atomr-accel-cuda`,
`atomr-accel-patterns`, `atomr-accel-train`, `atomr-accel-agents`,
`atomr-accel-cuda-realtime` are all on crates.io.
```rust
use atomr_accel_cuda::prelude::*;
use atomr::prelude::*;

#[tokio::main(flavor = "multi_thread", worker_threads = 2)]
async fn main() -> anyhow::Result<()> {
    let system = ActorSystem::create("gpu-app", Config::empty()).await?;

    // Real-mode device. Use `DeviceConfig::mock(0)` for no-GPU CI.
    let device = system.actor_of(
        DeviceActor::props(DeviceConfig::new(0)),
        "device-0",
    )?;

    // Allocate, copy, dispatch with `device` — see docs/getting-started.md.

    system.terminate().await;
    Ok(())
}
```
cudarc loads CUDA dynamically, so the workspace **builds and
unit-tests on hosts without a GPU**. Real kernel paths are gated
behind `--features cuda-runtime-tests`.
```bash
# No GPU needed:
cargo check --workspace --no-default-features
cargo test --workspace --no-default-features
cargo run -p atomr-accel-cuda --example echo_no_gpu
# With GPU + CUDA toolkit:
cargo run -p atomr-accel-cuda --example sgemm --features cuda-runtime-tests
cargo run -p atomr-accel-cuda --example fft_1d --features cuda-runtime-tests,cufft
cargo run -p atomr-accel-cuda --example jit_relu --features cuda-runtime-tests,nvrtc
```
## Quick start (Python)
```bash
python -m venv .venv && source .venv/bin/activate
pip install atomr-accel
```
```python
import numpy as np
from atomr_accel import System, Unrecoverable
with System.open("gpu-app") as system:
    dev = system.spawn_device(device_id=0, mock=False)

    a = dev.allocate_f32(1024)
    b = dev.allocate_f32(1024)
    c = dev.allocate_f32(1024)

    dev.copy_from_numpy(a, np.random.randn(1024).astype(np.float32))
    dev.copy_from_numpy(b, np.random.randn(1024).astype(np.float32))

    blas = dev.blas()
    blas.gemm_f32(a, b, c, m=32, n=32, k=32)  # 32 × 32 matrices fit the 1024-element buffers
    out = dev.copy_to_numpy(c)
```
See [`docs/python-bridge.md`](docs/python-bridge.md) for the full
binding surface — multi-dtype buffers, per-feature handle classes
(`Blas`, `Cudnn`, `Fft`, `RngGenerator`, …), typed exceptions, the
GIL-release contract, mock-mode tests, and the Phase 1.5+ tracking
issues that fill in the remaining method-level coverage.
## Library coverage
| Library | Actor | Representative entry point | Feature gate |
| --- | --- | --- | --- |
| [cuBLAS][cublas] | `BlasActor` | [`cublasSgemm`][cublas-sgemm] | always-on |
| [cuBLASLt][cublaslt] | `BlasLtActor` | [`cublasLtMatmul` + epilogue][cublaslt-matmul] | `cublaslt` |
| [cuDNN][cudnn] | `CudnnActor` | [`cudnnConvolutionForward`][cudnn-conv] | `cudnn` |
| [cuFFT][cufft] | `FftActor` | [`cufftPlan1d`][cufft-plan] / [`cufftExecR2C`][cufft-exec] | `cufft` |
| [cuRAND][curand] | `RngActor` | [`curandGenerateUniform`][curand-uniform] | `curand` |
| [cuSOLVER][cusolver] | `SolverActor` | [`cusolverDnSgeqrf`][cusolver-qr] / `Sgetrf` / `Spotrf` / `Sgesvd` / `Ssyevd` | `cusolver` |
| [cuSPARSE][cusparse] | `SparseActor` | [`cusparseSpMV`][cusparse-spmv] / `SpMM` (CSR) | `cusparse` |
| [cuTENSOR][cutensor] | `TensorActor` | [`cutensorContract`][cutensor-contract] | `cutensor` |
| [NVRTC][nvrtc] | `NvrtcActor` | [`nvrtcCompileProgram`][nvrtc-compile] | `nvrtc` |
| [NCCL][nccl] | `CollectiveActor` + `NcclWorldActor` | [`ncclAllReduce`][nccl-allreduce] | `nccl` |
| [Pinned host memory][cuda-pinned] | `PinnedBufferPool` | [`cuMemHostAlloc`][cuda-pinned-api] | always-on |
| [Unified memory][cuda-um] | `ManagedAllocatorActor` | [`cudaMallocManaged`][cuda-um-api] | always-on |
| [CUDA Graphs][cuda-graph] | `GraphActor` | [`cuGraphInstantiate` / `cuGraphLaunch`][cuda-graph-api] | always-on |
| [Peer-to-peer][cuda-p2p] | `P2pTopology` | [`cuMemcpyPeerAsync`][cuda-memcpy-peer] | always-on |
Aggregate features:
- `core-libs` = `cudnn` + `cufft` + `curand` + `cusparse` + `cutensor` + `cuda-managed`.
- `training-libs` = `core-libs` + `cusolver` + `cublaslt` + `nvrtc`.
- `full-cuda` = `training-libs` + `nccl` + `cuda-ipc` + `graphs-conditional`.
- `observability-full` = `telemetry` + `nvtx-trace` + `nvml` + `cupti`.
Sibling-crate gates (off by default; pull each in by enabling the
matching feature on `atomr-accel-cuda`):
- `cutlass` (+ `cutlass-evt`, `cutlass-grouped`, `cutlass-prebuilt`).
- `flashattn` (+ `flashattn-fp8`, `flashattn-paged`).
- `tensorrt` (+ `tensorrt-onnx`, `tensorrt-plugin`, `tensorrt-int8`, `tensorrt-fp8`).
- `nvtx-trace`, `nvml`, `cupti` — Phase 9 telemetry backends, layered on `telemetry`.
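For example, a `Cargo.toml` sketch combining an aggregate feature with one sibling-crate gate (version numbers illustrative):
```toml
[dependencies]
# core-libs pulls in cuDNN + cuFFT + cuRAND + cuSPARSE + cuTENSOR + managed memory;
# flashattn / flashattn-paged additionally wire in the FlashAttention sibling crate.
atomr-accel-cuda = { version = "0.1", features = ["core-libs", "flashattn", "flashattn-paged"] }
```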
## atomr integrations
atomr-accel is feature-gated for each atomr subsystem so you only pay
for what you use:
- `replay` — persists replay-journal entries through any
[`atomr_persistence::Journal`](https://docs.rs/atomr-persistence)
(in-memory, SQL, Redis, MongoDB, Cassandra, Dynamo). Build a
deterministic replay harness with one constructor:
  `ReplayHarness::with_journal(journal, "pid")` (see the sketch after
  this list).
- `cluster` — `placement::sharded::PlacementShardingAdapter` exposes
a typed `EntityRef<DeviceExtractor>` over
[`atomr-cluster-sharding`](https://docs.rs/atomr-cluster-sharding),
so device routing follows consistent-hash placement across a cluster.
- `streams` — `streams_pipeline::{source_from_unbounded, gpu_stage,
run_collect}` build GPU pipelines with
[`atomr-streams`](https://docs.rs/atomr-streams) Source / Sink
alongside the actor-based `pipeline::PipelineExecutor`.
- `telemetry` — `observability::install(system, "node-1")` wires up a
`TelemetryExtension` plus GPU-specific probes (allocations, OOM
count, generation, VRAM, in-flight kernels). Visualize live in
[`atomr-dashboard`](https://github.com/rustakka/atomr/tree/main/crates/atomr-dashboard).
- Typed supervision — `error::DeviceSupervisor` implements
`SupervisorOf<C>` over `GpuError`. Pattern-match the error type
instead of parsing panic strings.
- `#[derive(Actor)]` from
[`atomr-macros`](https://docs.rs/atomr-macros) — eliminates
async-trait boilerplate.
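A minimal sketch of the `replay` and `telemetry` hooks from the list above; `InMemoryJournal::new()` is an assumed stand-in for whichever `atomr_persistence::Journal` backend you actually use:
```rust
use atomr_accel_cuda::prelude::*;

// `telemetry`: install the TelemetryExtension plus the GPU-specific probes.
observability::install(system, "node-1");

// `replay`: one constructor builds a deterministic replay harness over
// any Journal backend. The journal type here is illustrative only.
let journal = InMemoryJournal::new();
let harness = ReplayHarness::with_journal(journal, "pid");
```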
## What you don't have to think about
- **Stream allocation.** Three strategies (`PerActorAllocator`,
`SingleStreamAllocator`, `PooledAllocator`) ship out of the box;
inject one and forget about it.
- **Kernel completion.** `HostFnCompletion` registers a
[`cuLaunchHostFunc`][cuda-launch-host] callback that wakes the reply
future the moment the kernel finishes — no host syncs, no polling.
- **Cross-stream events.** `GpuRef<T>` records its
`last_write_stream`; downstream readers automatically wait on the
right [event][cuda-events] before launching.
- **Context loss.** `WatchGeneration` is a
  `tokio::sync::watch::Receiver<u64>` you can subscribe to from any
  observer (sketch after this list); we use it internally to rebuild
  NCCL communicators and invalidate P2P caches.
- **OS-thread pinning.** `GpuDispatcher` keeps the cuBLAS/cuDNN
handle on a stable OS thread for its lifetime — required by
several library APIs and easy to get wrong in async Rust.
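A minimal sketch of an external observer reacting to generation bumps; how you obtain the receiver from the device is left abstract here, since only its type is specified above:
```rust
use tokio::sync::watch;

// `rx` is the watch::Receiver<u64> described above (the WatchGeneration
// subscription); obtain it from your DeviceActor however the crate exposes it.
async fn on_context_rebuild(mut rx: watch::Receiver<u64>) {
    // changed() resolves each time the context generation is bumped,
    // i.e. after a poisoned CUDA context is torn down and rebuilt.
    while rx.changed().await.is_ok() {
        let generation = *rx.borrow();
        // Invalidate anything keyed to the old generation here:
        // caches, NCCL communicators, recorded graphs.
        println!("context rebuilt; new generation = {generation}");
    }
}
```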
## Building from source
You need a sibling clone of the
[atomr](https://github.com/rustakka/atomr) workspace next to this
repo (the `[workspace.dependencies]` entries in `Cargo.toml` reference
`../atomr`):
```
your-workspace/
├── atomr/ # the atomr actor runtime
└── atomr-accel/ # this repo
```
```bash
# Rust
cargo build --workspace
cargo test --workspace --no-default-features
# The full release-pipeline gate (fmt + clippy + test + multi-feature check + doc)
cargo xtask verify
# Python bindings (requires maturin + a Python dev toolchain)
cd crates/atomr-accel-py
maturin develop --release
pytest tests/ -v
```
GPU-host integration tests are **opt-in** and **not part of CI**. On a
CUDA-equipped workstation:
```bash
cargo xtask gpu-probe # report local CUDA + library availability
cargo xtask gpu-test # run all suites
cargo xtask gpu-test cublas # run one suite
cargo xtask gpu-bench # criterion perf-regression benches
```
Tests skip gracefully when the local driver / library / GPU isn't
present, so the same commands are safe on a no-GPU laptop. See
[`docs/gpu-testing.md`](docs/gpu-testing.md) for the full suite list,
the gating model (cargo feature + `#[ignore]` + runtime probe), and
the rationale for keeping these tests out of CI.
## Build matrix
```bash
# No-GPU dev box:
cargo check --workspace --no-default-features
cargo check --workspace --features atomr-accel-cuda/core-libs
cargo check --workspace --features atomr-accel-cuda/training-libs
cargo check --workspace --features atomr-accel-cuda/full-cuda
# atomr subsystem integrations:
cargo check --workspace --features atomr-accel-cuda/replay
cargo check --workspace --features atomr-accel-cuda/cluster
cargo check --workspace --features atomr-accel-cuda/streams
cargo check --workspace --features atomr-accel-cuda/telemetry
cargo test -p atomr-accel-cuda --features replay --test replay_persistence
# GPU host (requires CUDA toolkit):
cargo run -p atomr-accel-cuda --example sgemm --features cuda-runtime-tests
cargo run -p atomr-accel-cuda --example rng_uniform --features cuda-runtime-tests,curand
cargo run -p atomr-accel-cuda --example fft_1d --features cuda-runtime-tests,cufft
cargo run -p atomr-accel-cuda --example jit_relu --features cuda-runtime-tests,nvrtc
cargo bench -p atomr-accel-cuda --bench sgemm_overhead --features cuda-runtime-tests
cargo bench -p atomr-accel-cuda --bench rng_throughput --features cuda-runtime-tests,curand
```
## Picking the right deps
Each sub-crate path-depends only on `atomr-accel-cuda` (the foundation) —
no implicit pulls of the other blueprints. Add what you need:
```toml
# Just batching:
atomr-accel = "0.1"
atomr-accel-patterns = "0.1"
# Training pipeline with NCCL + replay journal:
atomr-accel = { version = "0.1", features = ["full-cuda", "replay"] }
atomr-accel-train = "0.1"
# Realtime sims with JIT kernels:
atomr-accel = { version = "0.1", features = ["nvrtc"] }
atomr-accel-cuda-realtime = { version = "0.1", features = ["nvrtc"] }
```
[`docs/features-matrix.md`](docs/features-matrix.md) shows the full
pick-by-goal table plus the transitive-dependency view of every
feature.
Every sub-crate ships a `prelude` module:
```rust
use atomr_accel_cuda::prelude::*; // foundation
use atomr_accel_patterns::prelude::*; // batching, cascade, …
use atomr_accel_train::prelude::*; // trainers, optimizers
use atomr_accel_agents::prelude::*; // RAG, embedding cache
use atomr_accel_cuda_realtime::prelude::*; // particles, cloth, sparse
```
If you're using an AI coding assistant (Claude Code, Cursor, etc.),
[`ai-skills/`](ai-skills/) ships ten `SKILL.md` files your tool can
pick up so the assistant gives you idiomatic atomr-accel guidance
instead of guessing.
## Layout
```
crates/ Rust workspace
crates/atomr-accel/ Backend-agnostic core (umbrella)
crates/atomr-accel-cuda/ NVIDIA CUDA implementation
crates/atomr-accel-patterns/ Universal blueprints (batching / cascade / scheduler / …)
crates/atomr-accel-train/ Distributed-training blueprints
crates/atomr-accel-agents/ LLM blueprints (RAG / DAG)
crates/atomr-accel-cuda-realtime/ NVRTC-backed realtime sims
crates/atomr-accel-cub/ CUB device-wide primitives (Phase 5)
crates/atomr-accel-cutlass/ CUTLASS templates via NVRTC (Phase 6)
crates/atomr-accel-flashattn/ FlashAttention v2 + v3 kernels (Phase 7)
crates/atomr-accel-tensorrt/ TensorRT engine builder + runtime (Phase 8)
crates/atomr-accel-telemetry/ NVTX / NVML / CUPTI observability (Phase 9)
crates/atomr-accel-py/ PyO3 bridge (Python module: atomr_accel)
ai-skills/ Vendor-neutral SKILL.md files for AI assistants
docs/ Architecture, getting-started, concepts, features-matrix, gpu-testing
xtask/ Cargo xtask (bump, verify, gpu-probe, gpu-test, gpu-bench)
```
## Status
Phases 0 – 9 of the CUDA-coverage roadmap are merged. The workspace
ships **twelve library crates** spanning the foundation actor surface
(`atomr-accel`, `atomr-accel-cuda`), the blueprint sub-crates
(`atomr-accel-patterns`, `atomr-accel-train`, `atomr-accel-agents`,
`atomr-accel-cuda-realtime`, `atomr-accel-py`), Phase 1 – 4 library
expansions (full cuBLAS / cuBLASLt / cuFFT / cuRAND / cuSOLVER dtype
matrix, cuDNN frontend graph, NCCL collective set, cuTENSOR
contraction + reduce + permute, cuSPARSE generic API + cuSPARSELt
2:4), Phase 5 foundations (NVRTC v2 + Hopper/Blackwell +
`atomr-accel-cub`), and Phase 6 – 9 sibling crates
(`atomr-accel-cutlass`, `atomr-accel-flashattn`,
`atomr-accel-tensorrt`, `atomr-accel-telemetry`).
The full feature matrix builds clean on a no-GPU host. ≈ 175 unit
tests pass with the headline feature combo
(`f16,cudnn,curand,cufft,nvrtc,cusolver,cusparse,cusparse-generic,cutensor,cublaslt,nccl,nvtx,cuda-ipc,cuda-managed,graphs-conditional`).
The opt-in GPU integration suite — invoked via `cargo xtask gpu-test`
— covers SGEMM, FFT, RNG, pinned memcpy, SpMV, tensor contraction,
SVD, the dispatch tables for FlashAttention / CUTLASS / CUB, and
real NVML probes against installed devices. See
[`docs/gpu-testing.md`](docs/gpu-testing.md) for the suite catalog
and the rationale for keeping it out of CI.
## Releasing
`v*.*.*` git tags trigger a single `release.yml` pipeline that runs
the verify gate, builds Python wheels (manylinux x86_64 + aarch64,
musllinux x86_64 + aarch64, macOS universal2, Windows x86_64) + an
sdist, creates a GitHub Release, publishes the workspace crates to
crates.io in topological order, and uploads wheels + sdist to PyPI
via trusted publishing. See
[`docs/release-process.md`](docs/release-process.md) for the
end-to-end operator flow and
[`docs/release-pipeline.md`](docs/release-pipeline.md) for the
workflow internals.
## Learn more
- [`docs/getting-started.md`](docs/getting-started.md) — the
ten-minute tour: wiring atomr-accel into a project, picking
features, no-GPU vs GPU paths.
- [`docs/concepts.md`](docs/concepts.md) — the five mental models
(supervision, generation tokens, completion, streams, watch).
- [`docs/architecture.md`](docs/architecture.md) — the full design
narrative.
- [`docs/backends.md`](docs/backends.md) — the multi-backend trait
abstraction (and the ROCm / Metal / oneAPI roadmap).
- [`docs/features-matrix.md`](docs/features-matrix.md) — pick the
smallest dep footprint that fits your goal.
- [`docs/python-bridge.md`](docs/python-bridge.md) — Python bindings
surface and GIL strategy.
- [`docs/gpu-testing.md`](docs/gpu-testing.md) — opt-in GPU
integration suite, the three-layer gating model, and why the suite
is intentionally not part of CI.
- [`ai-skills/README.md`](ai-skills/README.md) — install the skill
bundle into Claude Code, Cursor, Codex CLI, Gemini CLI, or any
harness that reads `SKILL.md`. Covers the foundation actors plus
per-crate skills for FlashAttention, CUTLASS, and TensorRT.
- [`docs/release-process.md`](docs/release-process.md) — operator
guide: how to ship a release, conventional-commit rules, the
trampoline architecture, and a troubleshooting cookbook.
- [`docs/release-pipeline.md`](docs/release-pipeline.md) — workflow
internals: jobs, matrix entries, secrets, and dep-order publish
list.
## License
Apache-2.0.
---
[cuda-ctx]: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__CTX.html
[cuda-sticky]: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#error-checking
[cuda-events]: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#events
[cuda-pinned]: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#page-locked-host-memory
[cuda-pinned-api]: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g572ca4011bfcb25034888a14d4e035b9
[cuda-um]: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-memory-programming
[cuda-um-api]: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gd228014f19cc0975ebe3e0dd2af6dd1b
[cuda-graph]: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
[cuda-graph-api]: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__GRAPH.html
[cuda-p2p]: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#peer-to-peer-memory-access
[cuda-memcpy-peer]: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MEM.html#group__CUDA__MEM_1g0e6a92f5c0a8c9d8a1c3d9a7e72b7d6e
[cuda-launch-host]: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__EXEC.html#group__CUDA__EXEC_1g05841eaa5f90f27264c5d9eb96b16d2c
[cublas]: https://docs.nvidia.com/cuda/cublas/index.html
[cublas-handle]: https://docs.nvidia.com/cuda/cublas/index.html#cublas-context
[cublas-sgemm]: https://docs.nvidia.com/cuda/cublas/index.html#cublas-t-gemm
[cublaslt]: https://docs.nvidia.com/cuda/cublas/index.html#using-the-cublaslt-api
[cublaslt-matmul]: https://docs.nvidia.com/cuda/cublas/index.html#cublasltmatmul
[cudnn]: https://docs.nvidia.com/deeplearning/cudnn/api/index.html
[cudnn-handle]: https://docs.nvidia.com/deeplearning/cudnn/api/cudnn-ops-library.html#cudnncreate
[cudnn-conv]: https://docs.nvidia.com/deeplearning/cudnn/api/cudnn-cnn-library.html#cudnnconvolutionforward
[cufft]: https://docs.nvidia.com/cuda/cufft/index.html
[cufft-plan]: https://docs.nvidia.com/cuda/cufft/index.html#function-cufftplan1d
[cufft-exec]: https://docs.nvidia.com/cuda/cufft/index.html#function-cufftexecr2c
[curand]: https://docs.nvidia.com/cuda/curand/index.html
[curand-uniform]: https://docs.nvidia.com/cuda/curand/host-api-overview.html#generation-functions
[cusolver]: https://docs.nvidia.com/cuda/cusolver/index.html
[cusolver-qr]: https://docs.nvidia.com/cuda/cusolver/index.html#cuds-lt-t-gt-geqrf
[cusparse]: https://docs.nvidia.com/cuda/cusparse/index.html
[cusparse-spmv]: https://docs.nvidia.com/cuda/cusparse/index.html#cusparsespmv
[cutensor]: https://docs.nvidia.com/cuda/cutensor/latest/index.html
[cutensor-contract]: https://docs.nvidia.com/cuda/cutensor/latest/api/cutensor.html#cutensorcontract
[nvrtc]: https://docs.nvidia.com/cuda/nvrtc/index.html
[nvrtc-compile]: https://docs.nvidia.com/cuda/nvrtc/index.html#group__error_1ga0e0b48c4e6f7e69dbb5e1d8c6c58c1d8
[nccl]: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html
[nccl-comm]: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/communicators.html
[nccl-allreduce]: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/api/colls.html#ncclallreduce