# atomr-accel
An actor-shaped face for compute acceleration, on top of the
atomr actor runtime. NVIDIA CUDA
ships today through atomr-accel-cuda; the
backend trait surface accommodates AMD ROCm, Apple Metal, Intel
oneAPI, and Vulkan compute when those crates land. Each backend
library (cuBLAS, cuDNN, cuFFT,
cuRAND, cuSOLVER, cuSPARSE,
cuTENSOR, cuBLASLt, NVRTC,
NCCL) becomes a typed atomr actor with stable supervision,
generation-validated buffers, and a single async surface. Drop GPU
work into a Rust service without juggling streams, contexts, or
hand-rolled retry loops.
```rust
use atomr_accel_cuda::prelude::*;

// Method and message names below are illustrative; see docs/getting-started.md
// for the exact surface. `host_a`, `host_b`, `m`, `n` are your own data.
let device = system.actor_of::<DeviceActor>()?;
let a = device.upload(&host_a).await?;
let b = device.upload(&host_b).await?;
let c = device.alloc::<f32>(m * n).await?;
device.tell(Sgemm { a, b, c });
// reply arrives once the kernel completes — no host blocking,
// no manual stream synchronization.
```
That's the whole shape. The same envelope wires up convolutions
(`cudnnConvolutionForward`), tensor contractions
(`cutensorContract`), JIT-compiled custom kernels
(`nvrtcCompileProgram`), and multi-GPU all-reduce
(`ncclAllReduce`).
## Why an actor-shaped face for compute, in Rust, now
Modern workloads no longer live entirely on the CPU. Inference, embedding, scoring, simulation — they want a GPU. Coordination, control flow, I/O, persistence — they want a CPU. Today's stacks force you to glue the two with ad-hoc batching layers, queues, and serialization boundaries.
The actor model already encodes the right boundary: a message is
the dispatch unit. atomr-accel is built so that the same
`actor_ref.tell(msg)` can target a CPU mailbox today and a CUDA-backed
dispatcher tomorrow — with the same supervision, the same
backpressure, the same observability. The runtime is explicit about
where work runs without forcing the developer to write two programs.
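A minimal sketch of that boundary, under assumed names — `Backend`, `CpuBackend`, and `Msg` are illustrative stand-ins, not the real `AccelBackend` surface: the caller sends the same typed message whether the implementation is a CPU stand-in or a CUDA-backed dispatcher.

```rust
// Sketch of "a message is the dispatch unit". `Backend` and `CpuBackend`
// are illustrative names, not the actual AccelBackend trait surface.

enum Msg {
    Axpy { alpha: f32, x: Vec<f32>, y: Vec<f32> },
}

trait Backend {
    fn tell(&self, msg: Msg) -> Vec<f32>;
}

/// CPU stand-in; a CUDA-backed dispatcher would implement the same trait,
/// so call sites never change when the target does.
struct CpuBackend;

impl Backend for CpuBackend {
    fn tell(&self, msg: Msg) -> Vec<f32> {
        match msg {
            Msg::Axpy { alpha, x, y } => {
                x.iter().zip(&y).map(|(xi, yi)| alpha * xi + yi).collect()
            }
        }
    }
}

fn main() {
    let backend: Box<dyn Backend> = Box::new(CpuBackend);
    let out = backend.tell(Msg::Axpy {
        alpha: 2.0,
        x: vec![1.0, 2.0],
        y: vec![10.0, 20.0],
    });
    assert_eq!(out, vec![12.0, 24.0]); // 2*1+10, 2*2+20
}
```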
Writing CUDA from Rust today otherwise means owning a long list of invariants yourself:
| You'd otherwise hand-roll | atomr-accel gives you |
|---|---|
| One `CUcontext` per device, restarted on poisoning | `DeviceActor` ↔ `ContextActor` two-tier supervision |
| Sticky-error detection and graceful recovery | `OneForOneStrategy` + `GpuError::ContextPoisoned` decider |
| Buffer staleness across context rebuilds | `GpuRef<T>` with generation tokens |
| Pinning library handles (cuBLAS, cuDNN) to a single OS thread | `GpuDispatcher` + per-actor handle |
| Stream-event choreography for kernel completion | `HostFnCompletion` (sub-µs `cuLaunchHostFunc`) / `SyncCompletion` / `PolledCompletion` |
| `cuMemcpyPeerAsync` cross-stream synchronization | `P2pTopology` with `last_write_stream` injection |
| Page-locked host buffer pooling | `PinnedBufferPool` actor |
| CUDA Graph capture/replay | `GraphActor` (`Sgemm` / `Memcpy` / `RngFillUniform` / `FftR2C` record contracts) |
| Multi-GPU communicator rebuild on context loss | `NcclWorldActor` subscribes to `WatchGeneration`, tears down + rebuilds collectives |
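The generation-token row deserves a concrete picture. A minimal sketch, with `Ctx` and `BufRef` as illustrative stand-ins for `ContextActor` and `GpuRef<T>`: a handle carries the generation it was allocated under, and a rebuilt context rejects stale handles instead of dereferencing a dead pointer.

```rust
// Sketch of generation-validated buffer handles. `Ctx` and `BufRef` are
// illustrative stand-ins for ContextActor / GpuRef<T>.

struct Ctx {
    generation: u64,
}

struct BufRef {
    generation: u64,
    // device pointer, length, dtype would live here
}

#[derive(Debug, PartialEq)]
enum GpuError {
    StaleBuffer { held: u64, current: u64 },
}

impl Ctx {
    fn alloc(&self) -> BufRef {
        BufRef { generation: self.generation }
    }

    /// Simulate context loss + restart: every outstanding BufRef is now stale.
    fn rebuild(&mut self) {
        self.generation += 1;
    }

    fn validate(&self, buf: &BufRef) -> Result<(), GpuError> {
        if buf.generation == self.generation {
            Ok(())
        } else {
            Err(GpuError::StaleBuffer { held: buf.generation, current: self.generation })
        }
    }
}

fn main() {
    let mut ctx = Ctx { generation: 0 };
    let buf = ctx.alloc();
    assert!(ctx.validate(&buf).is_ok());

    ctx.rebuild(); // context poisoned and restarted
    assert_eq!(
        ctx.validate(&buf),
        Err(GpuError::StaleBuffer { held: 0, current: 1 })
    );
}
```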
Because every concern is an actor, you compose CUDA the same way you
compose any other Rust service: tokio runtime, structured
supervision, typed messages, async/await throughout.
## What's in the box
| Crate | What it does |
|---|---|
| `atomr-accel` | Backend-agnostic core — `AccelBackend` trait, `AccelRef<T>`, `AccelDtype`/`DType`, `AccelError`, `CompletionStrategy`. Each backend crate (e.g. `atomr-accel-cuda`) depends on this for its trait surface. |
| `atomr-accel-cuda` | NVIDIA CUDA implementation — `DeviceActor`/`ContextActor`, kernel actors for cuBLAS/cuBLASLt/cuDNN/cuFFT/cuRAND/cuSOLVER/cuSPARSE/cuTENSOR/NVRTC/NCCL, P2P topology, CUDA graphs, pinned pools |
| `atomr-accel-patterns` | Universal blueprints — `DynamicBatchingServer`, `InferenceCascade`, `ModelReplicaPool`, `FairShareScheduler`, `ModelHotSwapServer`, `SpeculativeDecoder`, `MoeRouter`, plus a CPU `GpuMockActor` |
| `atomr-accel-train` | Distributed-training blueprints — `DataParallelTrainer`, `PipelineParallelTrainer`, `TensorParallelTrainer`, `AsyncParameterServer`, optimizer + loss enums |
| `atomr-accel-agents` | LLM blueprints — `RagPipeline` (with `EmbeddingCache` LRU + `CpuVectorIndex`), `SharedGpuStateCoordinator`, `LangGraphGpuActor` (DAG executor with cycle detection) |
| `atomr-accel-cuda-realtime` | NVRTC-backed realtime sims — `ImageFilterPipeline`, `ParticleSystemActor`, `ClothSimulationActor`, `FluidSimulationActor`, `SpatialIndexActor`, `GpuHashMapActor`, `GpuSparseStructureActor`, `MultiPassAnalysisActor`, `VideoEffectsGraph` |
| `atomr-accel-cub` | CUB device-wide primitives — `CubActor` with reduce / scan / sort / histogram / select / partition / segmented-reduce dispatchers, NVRTC-templated per (op, dtype, length-class) |
| `atomr-accel-cutlass` | CUTLASS kernel-template instantiation — `CutlassActor` for GEMM, grouped-GEMM, implicit-GEMM convolution, EVT (epilogue visitor tree), via NVRTC against vendored headers |
| `atomr-accel-flashattn` | FlashAttention v2 + v3 kernels — `FlashAttnActor` with forward/backward, paged KV-cache, chunked prefill, varlen, ALiBi, sliding window, sink tokens, MQA/GQA, fp8 (fa3 only) |
| `atomr-accel-tensorrt` | TensorRT engine builder + runtime — `TrtActor`, `IBuilderConfig` (fp32/fp16/bf16/int8/fp8/best), ONNX import, INT8 calibration, FP8 PTQ, `IPluginV3` Rust trampolines |
| `atomr-accel-telemetry` | Observability backends — `NvtxKernelTrace` for kernel-range markers, `NvmlActor` for power/temp/ECC/clocks, `CuptiSession` for activity tracing |
| `atomr-accel-py` | Python bindings via PyO3 — `atomr_accel.{System, Device, GpuBufferF32/F64/I32/U32/U8, Blas, Cudnn, Fft, RngGenerator, Solver, Collective, NvrtcKernel}`, typed exceptions, GIL-released kernel paths. Tracks upstream atomr 0.5.x Python coverage. |
Plus a Python facade — `pip install atomr-accel` — that exposes the
same actor model. Phase 1 ships multi-dtype buffers, the BLAS handle
(`gemm_f32`/`gemm_f64`/`axpy_f32`), per-feature handles for cuDNN /
cuFFT / cuRAND, and structural anchors for cuSOLVER / NCCL / NVRTC.
See `docs/python-bridge.md` for the full
matrix and the Phase 1.5+ tracking issues.
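As a flavor of the blueprint crates, here is a CPU-only sketch of the size-or-timeout flush policy at the heart of a dynamic batching server. The function name and the exact policy are illustrative assumptions, not the `DynamicBatchingServer` API.

```rust
use std::sync::mpsc;
use std::time::Duration;

// CPU sketch of size-or-timeout batching (illustrative; not the
// DynamicBatchingServer API). Flush when the batch fills or when
// no new item arrives within `max_wait`.
fn run_batcher(rx: mpsc::Receiver<u32>, max_batch: usize, max_wait: Duration) -> Vec<Vec<u32>> {
    let mut batches = Vec::new();
    let mut batch = Vec::new();
    loop {
        match rx.recv_timeout(max_wait) {
            Ok(item) => {
                batch.push(item);
                if batch.len() == max_batch {
                    batches.push(std::mem::take(&mut batch)); // size-triggered flush
                }
            }
            Err(mpsc::RecvTimeoutError::Timeout) => {
                if !batch.is_empty() {
                    batches.push(std::mem::take(&mut batch)); // deadline-triggered flush
                }
            }
            Err(mpsc::RecvTimeoutError::Disconnected) => {
                if !batch.is_empty() {
                    batches.push(batch); // drain the tail on shutdown
                }
                return batches;
            }
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    for i in 0..5 {
        tx.send(i).unwrap();
    }
    drop(tx); // close the queue so the batcher drains and exits
    let batches = run_batcher(rx, 2, Duration::from_millis(10));
    assert_eq!(batches, vec![vec![0, 1], vec![2, 3], vec![4]]);
}
```

The real blueprint would hand each flushed batch to a GPU actor instead of collecting it; the flush logic is the part worth internalizing.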
## At a glance
```text
┌──────────── ActorSystem ────────────┐
│                                     │
│  ┌─────────── DeviceActor ─────────┴─── stable address (ActorRef<DeviceMsg>)
│  │            (queues work while context rebuilds)
│  │
│  │  ┌─── ContextActor ──── owns Arc<CudaContext> ── restartable
│  │  │
│  │  │  ├── BlasActor       ── cuBLAS handle ── pinned to one stream
│  │  │  ├── CudnnActor      ── cuDNN handle  ── pinned to one stream
│  │  │  ├── FftActor        ── cuFFT plans   ── plan cache
│  │  │  ├── RngActor        ── cuRAND gen    ── seedable
│  │  │  ├── SolverActor     ── cuSOLVER      ── QR/LU/Chol/SVD/Syevd
│  │  │  ├── SparseActor     ── cuSPARSE      ── CSR SpMV/SpMM
│  │  │  ├── TensorActor     ── cuTENSOR      ── Einstein contraction
│  │  │  ├── BlasLtActor     ── cuBLASLt      ── fused matmul + ReLU/GELU
│  │  │  ├── NvrtcActor      ── NVRTC         ── JIT compile + launch
│  │  │  └── CollectiveActor ── NCCL comm     ── per-rank
│  │  │
│  │  └── PinnedBufferPool / ManagedAllocator / GraphActor / P2pTopology
│
└── PlacementActor / ReplayHarness / NcclWorldActor (top-level)
```
Each box is an actor. Messages are typed enums. Replies are
`oneshot::Sender` channels. Failures panic with a tagged string
(`"ContextPoisoned: …"` / `"OutOfMemory: …"` / `"Unrecoverable: …"`)
and the supervisor decides Restart / Resume / Stop / Escalate.
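That panic contract can be sketched as a decider function. `Directive` is an illustrative stand-in for the supervisor's directive type, and the mapping from prefix to directive is an assumption for illustration; the typed `DeviceSupervisor` covered under "atomr integrations" avoids string parsing entirely.

```rust
// Sketch of a supervisor decider over the tagged panic strings
// ("ContextPoisoned: …" / "OutOfMemory: …" / "Unrecoverable: …").
// `Directive` and the prefix→directive mapping are illustrative.

#[derive(Debug, PartialEq)]
enum Directive {
    Restart,
    Resume,
    Stop,
    Escalate,
}

fn decide(panic_msg: &str) -> Directive {
    if panic_msg.starts_with("ContextPoisoned:") {
        Directive::Restart // rebuild the context, bump the generation
    } else if panic_msg.starts_with("OutOfMemory:") {
        Directive::Resume // transient; the caller can retry smaller
    } else if panic_msg.starts_with("Unrecoverable:") {
        Directive::Stop
    } else {
        Directive::Escalate // unknown failure: let the parent decide
    }
}

fn main() {
    assert_eq!(decide("ContextPoisoned: sticky CUDA error"), Directive::Restart);
    assert_eq!(decide("OutOfMemory: 2 GiB requested"), Directive::Resume);
    assert_eq!(decide("Unrecoverable: driver mismatch"), Directive::Stop);
    assert_eq!(decide("something else entirely"), Directive::Escalate);
}
```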
## Quick start (Rust)
The umbrella crate is published on crates.io as atomr-accel:
```toml
[dependencies]
atomr-accel = { version = "0.1", features = ["cuda"] }
tokio = { version = "1", features = ["rt-multi-thread", "macros"] }
```
Or pull in subsystem crates directly — atomr-accel-cuda,
atomr-accel-patterns, atomr-accel-train, atomr-accel-agents,
atomr-accel-cuda-realtime are all on crates.io.
```rust
use atomr_accel::prelude::*;
use atomr_accel_cuda::prelude::*;

// Names below are illustrative; see docs/getting-started.md for the
// exact quick-start surface.
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let system = ActorSystem::new();
    let device = system.actor_of::<DeviceActor>()?;
    // … upload buffers, tell kernels, await replies …
    Ok(())
}
```
cudarc loads CUDA dynamically, so the workspace builds and
unit-tests on hosts without a GPU. Real kernel paths are gated
behind `--features cuda-runtime-tests`:

```sh
# No GPU needed:
cargo test --workspace

# With GPU + CUDA toolkit:
cargo test --workspace --features cuda-runtime-tests
```
## Quick start (Python)
```sh
pip install atomr-accel
```

```python
# Method names below are illustrative — see docs/python-bridge.md for
# the exact Phase 1 binding surface. host_a / host_b are your own data.
import atomr_accel

system = atomr_accel.System()
device = system.device(0)
a = device.upload_f32(host_a)
b = device.upload_f32(host_b)
blas = device.blas()
c = blas.gemm_f32(a, b)
```
See `docs/python-bridge.md` for the full
binding surface — multi-dtype buffers, per-feature handle classes
(`Blas`, `Cudnn`, `Fft`, `RngGenerator`, …), typed exceptions, the
GIL-release contract, mock-mode tests, and the Phase 1.5+ tracking
issues that fill in the remaining method-level coverage.
## Library coverage
| Library | Actor | NVIDIA reference | Feature flag |
|---|---|---|---|
| cuBLAS | `BlasActor` | `cublasSgemm` | always-on |
| cuBLASLt | `BlasLtActor` | `cublasLtMatmul` + epilogue | `cublaslt` |
| cuDNN | `CudnnActor` | `cudnnConvolutionForward` | `cudnn` |
| cuFFT | `FftActor` | `cufftPlan1d` / `cufftExecR2C` | `cufft` |
| cuRAND | `RngActor` | `curandGenerateUniform` | `curand` |
| cuSOLVER | `SolverActor` | `cusolverDnSgeqrf` / `Sgetrf` / `Spotrf` / `Sgesvd` / `Ssyevd` | `cusolver` |
| cuSPARSE | `SparseActor` | `cusparseSpMV` / `SpMM` (CSR) | `cusparse` |
| cuTENSOR | `TensorActor` | `cutensorContract` | `cutensor` |
| NVRTC | `NvrtcActor` | `nvrtcCompileProgram` | `nvrtc` |
| NCCL | `CollectiveActor` + `NcclWorldActor` | `ncclAllReduce` | `nccl` |
| Pinned host memory | `PinnedBufferPool` | `cuMemHostAlloc` | always-on |
| Unified memory | `ManagedAllocatorActor` | `cudaMallocManaged` | always-on |
| CUDA Graphs | `GraphActor` | `cuGraphInstantiate` / `cuGraphLaunch` | always-on |
| Peer-to-peer | `P2pTopology` | `cuMemcpyPeerAsync` | always-on |
Aggregate features:

- `core-libs` = `cudnn` + `cufft` + `curand` + `cusparse` + `cutensor` + `cuda-managed`
- `training-libs` = `core-libs` + `cusolver` + `cublaslt` + `nvrtc`
- `full-cuda` = `training-libs` + `nccl` + `cuda-ipc` + `graphs-conditional`
- `observability-full` = `telemetry` + `nvtx-trace` + `nvml` + `cupti`
Sibling-crate gates (off by default; pull each in by enabling the
matching feature on `atomr-accel-cuda`):

- `cutlass` (+ `cutlass-evt`, `cutlass-grouped`, `cutlass-prebuilt`)
- `flashattn` (+ `flashattn-fp8`, `flashattn-paged`)
- `tensorrt` (+ `tensorrt-onnx`, `tensorrt-plugin`, `tensorrt-int8`, `tensorrt-fp8`)
- `nvtx-trace`, `nvml`, `cupti` — Phase 9 telemetry backends, layered on `telemetry`
## atomr integrations
atomr-accel is feature-gated for each atomr subsystem so you only pay for what you use:
- `replay` — persists replay-journal entries through any `atomr_persistence::Journal` (in-memory, SQL, Redis, MongoDB, Cassandra, Dynamo). Build a deterministic replay harness with one constructor: `ReplayHarness::with_journal(journal, "pid")`.
- `cluster` — `placement::sharded::PlacementShardingAdapter` exposes a typed `EntityRef<DeviceExtractor>` over `atomr-cluster-sharding`, so device routing follows consistent-hash placement across a cluster.
- `streams` — `streams_pipeline::{source_from_unbounded, gpu_stage, run_collect}` build GPU pipelines with `atomr-streams` Source / Sink alongside the actor-based `pipeline::PipelineExecutor`.
- `telemetry` — `observability::install(system, "node-1")` wires up a `TelemetryExtension` plus GPU-specific probes (allocations, OOM count, generation, VRAM, in-flight kernels). Visualize live in `atomr-dashboard`.
- Typed supervision — `error::DeviceSupervisor` implements `SupervisorOf<C>` over `GpuError`. Pattern-match the error type instead of parsing panic strings.
- `#[derive(Actor)]` from `atomr-macros` — eliminates async-trait boilerplate.
## What you don't have to think about
- Stream allocation. Three strategies (`PerActorAllocator`, `SingleStreamAllocator`, `PooledAllocator`) ship out of the box; inject one and forget about it.
- Kernel completion. `HostFnCompletion` registers a `cuLaunchHostFunc` callback that wakes the reply future the moment the kernel finishes — no host syncs, no polling.
- Cross-stream events. `GpuRef<T>` records its `last_write_stream`; downstream readers automatically wait on the right event before launching.
- Context loss. `WatchGeneration` is a `tokio::sync::watch::Receiver<u64>` you can subscribe to from any observer; we use it internally to rebuild NCCL communicators and invalidate P2P caches.
- OS-thread pinning. `GpuDispatcher` keeps the cuBLAS/cuDNN handle on a stable OS thread for its lifetime — required by several library APIs and easy to get wrong in async Rust.
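The pinning point is worth a concrete sketch: one worker thread owns the handle for its whole lifetime, and callers submit work as closures over a channel. `Handle` is a mock and `Dispatcher` an illustrative stand-in for `GpuDispatcher`, assumed names only.

```rust
use std::sync::mpsc;
use std::thread;

// Mock library handle that must only be touched from the thread that
// created it (the constraint cuBLAS/cuDNN handles impose in practice).
struct Handle {
    calls: u32,
}

type Job = Box<dyn FnOnce(&mut Handle) -> f32 + Send>;

/// Sketch of GpuDispatcher-style pinning: one OS thread owns the handle
/// for its lifetime; all work reaches it as closures over a channel.
struct Dispatcher {
    tx: mpsc::Sender<(Job, mpsc::Sender<f32>)>,
}

impl Dispatcher {
    fn new() -> Self {
        let (tx, rx) = mpsc::channel::<(Job, mpsc::Sender<f32>)>();
        thread::spawn(move || {
            let mut handle = Handle { calls: 0 }; // created and used on this thread only
            for (job, reply) in rx {
                let out = job(&mut handle);
                let _ = reply.send(out);
            }
        });
        Dispatcher { tx }
    }

    fn call(&self, job: Job) -> f32 {
        let (reply_tx, reply_rx) = mpsc::channel();
        self.tx.send((job, reply_tx)).unwrap();
        reply_rx.recv().unwrap() // synchronous here; the real thing replies via a future
    }
}

fn main() {
    let d = Dispatcher::new();
    let x = d.call(Box::new(|h: &mut Handle| {
        h.calls += 1;
        2.0 * 3.0
    }));
    assert_eq!(x, 6.0);
}
```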
## Building from source
You need a sibling clone of the atomr workspace next to this
repo (the `workspace.dependencies` in `Cargo.toml` reference
`../atomr`):
```text
your-workspace/
├── atomr/        # the atomr actor runtime
└── atomr-accel/  # this repo
```
```sh
# Rust
cargo build --workspace

# The full release-pipeline gate (fmt + clippy + test + multi-feature check + doc)
cargo xtask verify

# Python bindings (requires maturin + a Python dev toolchain)
maturin develop -m crates/atomr-accel-py/Cargo.toml
```
GPU-host integration tests are opt-in and not part of CI. On a CUDA-equipped workstation:

```sh
cargo xtask gpu-test
```
Tests skip gracefully when the local driver / library / GPU isn't
present, so the same commands are safe on a no-GPU laptop. See
`docs/gpu-testing.md` for the full suite list,
the gating model (cargo feature + `#[ignore]` + runtime probe), and
the rationale for keeping these tests out of CI.
### Build matrix
```sh
# No-GPU dev box:
cargo check --workspace --all-features

# atomr subsystem integrations:
cargo check -p atomr-accel-cuda --features replay,cluster,streams,telemetry

# GPU host (requires CUDA toolkit):
cargo test --workspace --features cuda-runtime-tests
```
## Picking the right deps
Each sub-crate path-depends only on atomr-accel-cuda (the foundation) —
no implicit pulls of the other blueprints. Add what you need:
```toml
# Just batching:
atomr-accel-cuda = "0.1"
atomr-accel-patterns = "0.1"

# Training pipeline with NCCL + replay journal:
atomr-accel-cuda = { version = "0.1", features = ["full-cuda", "replay"] }
atomr-accel-train = "0.1"

# Realtime sims with JIT kernels:
atomr-accel-cuda = { version = "0.1", features = ["nvrtc"] }
atomr-accel-cuda-realtime = { version = "0.1", features = ["nvrtc"] }
```
docs/features-matrix.md shows the full
pick-by-goal table plus the transitive-dependency view of every
feature.
Every sub-crate ships a prelude module:

```rust
use atomr_accel_cuda::prelude::*;          // foundation
use atomr_accel_patterns::prelude::*;      // batching, cascade, …
use atomr_accel_train::prelude::*;         // trainers, optimizers
use atomr_accel_agents::prelude::*;        // RAG, embedding cache
use atomr_accel_cuda_realtime::prelude::*; // particles, cloth, sparse
```
If you're using an AI coding assistant (Claude Code, Cursor, etc.),
`ai-skills/` ships ten `SKILL.md` files your tool can
pick up so the assistant gives you idiomatic atomr-accel guidance
instead of guessing.
## Layout
```text
crates/                            Rust workspace
crates/atomr-accel/                Backend-agnostic core (umbrella)
crates/atomr-accel-cuda/           NVIDIA CUDA implementation
crates/atomr-accel-patterns/       Universal blueprints (batching / cascade / scheduler / …)
crates/atomr-accel-train/          Distributed-training blueprints
crates/atomr-accel-agents/         LLM blueprints (RAG / DAG)
crates/atomr-accel-cuda-realtime/  NVRTC-backed realtime sims
crates/atomr-accel-cub/            CUB device-wide primitives (Phase 5)
crates/atomr-accel-cutlass/        CUTLASS templates via NVRTC (Phase 6)
crates/atomr-accel-flashattn/      FlashAttention v2 + v3 kernels (Phase 7)
crates/atomr-accel-tensorrt/       TensorRT engine builder + runtime (Phase 8)
crates/atomr-accel-telemetry/      NVTX / NVML / CUPTI observability (Phase 9)
crates/atomr-accel-py/             PyO3 bridge (Python module: atomr_accel)
ai-skills/                         Vendor-neutral SKILL.md files for AI assistants
docs/                              Architecture, getting-started, concepts, features-matrix, gpu-testing
xtask/                             Cargo xtask (bump, verify, gpu-probe, gpu-test, gpu-bench)
```
## Status
Phases 0 – 9 of the CUDA-coverage roadmap are merged. The workspace
ships twelve library crates spanning the foundation actor surface
(atomr-accel, atomr-accel-cuda), the blueprint sub-crates
(atomr-accel-patterns, atomr-accel-train, atomr-accel-agents,
atomr-accel-cuda-realtime, atomr-accel-py), Phase 1 – 4 library
expansions (full cuBLAS / cuBLASLt / cuFFT / cuRAND / cuSOLVER dtype
matrix, cuDNN frontend graph, NCCL collective set, cuTENSOR
contraction + reduce + permute, cuSPARSE generic API + cuSPARSELt
2:4), Phase 5 foundations (NVRTC v2 + Hopper/Blackwell +
atomr-accel-cub), and Phase 6 – 9 sibling crates
(atomr-accel-cutlass, atomr-accel-flashattn,
atomr-accel-tensorrt, atomr-accel-telemetry).
The full feature matrix builds clean on a no-GPU host. ≈ 175 unit
tests pass with the headline feature combo
(`f16,cudnn,curand,cufft,nvrtc,cusolver,cusparse,cusparse-generic,cutensor,cublaslt,nccl,nvtx,cuda-ipc,cuda-managed,graphs-conditional`).
The opt-in GPU integration suite — invoked via cargo xtask gpu-test
— covers SGEMM, FFT, RNG, pinned memcpy, SpMV, tensor contraction,
SVD, the dispatch tables for FlashAttention / CUTLASS / CUB, and
real NVML probes against installed devices. See
docs/gpu-testing.md for the suite catalog
and the rationale for keeping it out of CI.
## Releasing
`v*.*.*` git tags trigger a single `release.yml` pipeline that runs
the verify gate, builds Python wheels (manylinux x86_64 + aarch64,
musllinux x86_64 + aarch64, macOS universal2, Windows x86_64) + an
sdist, creates a GitHub Release, publishes the workspace crates to
crates.io in topological order, and uploads wheels + sdist to PyPI
via trusted publishing. See `RELEASING.md` for the
end-to-end flow.
## Learn more
- `docs/getting-started.md` — the ten-minute tour: wiring atomr-accel into a project, picking features, no-GPU vs GPU paths.
- `docs/concepts.md` — the five mental models (supervision, generation tokens, completion, streams, watch).
- `docs/architecture.md` — the full design narrative.
- `docs/backends.md` — the multi-backend trait abstraction (and the ROCm / Metal / oneAPI roadmap).
- `docs/features-matrix.md` — pick the smallest dep footprint that fits your goal.
- `docs/python-bridge.md` — Python bindings surface and GIL strategy.
- `docs/gpu-testing.md` — opt-in GPU integration suite, the three-layer gating model, and why the suite is intentionally not part of CI.
- `ai-skills/README.md` — install the skill bundle into Claude Code, Cursor, Codex CLI, Gemini CLI, or any harness that reads `SKILL.md`. Covers the foundation actors plus per-crate skills for FlashAttention, CUTLASS, and TensorRT.
- `RELEASING.md` — release pipeline, secrets, yanking, post-release verification.
## License
Apache-2.0.