# rakka-accel
An actor-shaped face for compute acceleration. NVIDIA CUDA ships
today through rakka-accel-cuda; the
backend trait surface accommodates AMD ROCm, Apple Metal, Intel
oneAPI, and Vulkan compute when those crates land. Each backend
library (cuBLAS, cuDNN, cuFFT,
cuRAND, cuSOLVER, cuSPARSE,
cuTENSOR, cuBLASLt, NVRTC,
NCCL) becomes a typed rakka actor with stable
supervision, generation-validated buffers, and a single async
surface. Drop GPU work into a Rust service without juggling streams,
contexts, or hand-rolled retry loops.
```toml
[dependencies]
rakka-accel = { version = "0.0", features = ["cuda"] }
```
```rust
use rakka_accel::prelude::*; // active backend re-exported here

// Illustrative names — see docs/getting-started.md for the exact
// message types. The shape: spawn a device actor, upload two buffers,
// ask for a GEMM, fire-and-forget the follow-up.
let device = system.actor_of::<DeviceActor>("gpu0")?;
let a = device.ask(DeviceMsg::Upload(host_a)).await?;
let b = device.ask(DeviceMsg::Upload(host_b)).await?;
let c = device.ask(DeviceMsg::Sgemm { a, b }).await?;
device.tell(DeviceMsg::Download(c));
// reply arrives once the kernel completes — no host blocking,
// no manual stream synchronization.
```
That's the whole shape. The same envelope wires up convolutions
(`cudnnConvolutionForward`), tensor contractions
(`cutensorContract`), JIT-compiled custom kernels
(`nvrtcCompileProgram`), and multi-GPU all-reduce
(`ncclAllReduce`).
## Why
Writing CUDA from Rust today means owning a long list of invariants yourself:
| You'd otherwise hand-roll | rakka-accel gives you |
|---|---|
| One `CUcontext` per device, restarted on poisoning | `DeviceActor` ↔ `ContextActor` two-tier supervision |
| Sticky-error detection and graceful recovery | `OneForOneStrategy` + `GpuError::ContextPoisoned` decider |
| Buffer staleness across context rebuilds | `GpuRef<T>` with generation tokens |
| Pinning library handles (cuBLAS, cuDNN) to a single OS thread | `GpuDispatcher` + per-actor handle |
| Stream-event choreography for kernel completion | `HostFnCompletion` (sub-µs `cuLaunchHostFunc`) / `SyncCompletion` / `PolledCompletion` |
| `cuMemcpyPeerAsync` cross-stream synchronization | `P2pTopology` with `last_write_stream` injection |
| Page-locked host buffer pooling | `PinnedBufferPool` actor |
| CUDA Graph capture/replay | `GraphActor` (`Sgemm` / `Memcpy` / `RngFillUniform` / `FftR2C` record contracts) |
| Multi-GPU communicator rebuild on context loss | `NcclWorldActor` subscribes to `WatchGeneration`, tears down + rebuilds collectives |
Because every concern is an actor, you compose CUDA the same way you
compose any other Rust service: tokio runtime, structured
supervision, typed messages, async/await throughout.
## At a glance
```
┌──────────── ActorSystem ────────────┐
│                                     │
│ ┌─────────── DeviceActor ─────────┴─── stable address (ActorRef<DeviceMsg>)
│ │   (queues work while context rebuilds)
│ │
│ │ ┌─── ContextActor ──── owns Arc<CudaContext> ── restartable
│ │ │
│ │ │ ├── BlasActor ── cuBLAS handle ── pinned to one stream
│ │ │ ├── CudnnActor ── cuDNN handle ── pinned to one stream
│ │ │ ├── FftActor ── cuFFT plans ── plan cache
│ │ │ ├── RngActor ── cuRAND gen ── seedable
│ │ │ ├── SolverActor ── cuSOLVER ── QR/LU/Chol/SVD/Syevd
│ │ │ ├── SparseActor ── cuSPARSE ── CSR SpMv/SpMm
│ │ │ ├── TensorActor ── cuTENSOR ── Einstein Contract
│ │ │ ├── BlasLtActor ── cuBLASLt ── fused matmul + ReLU/GELU
│ │ │ ├── NvrtcActor ── NVRTC ── JIT compile + launch
│ │ │ └── CollectiveActor ── NCCL comm ── per-rank
│ │ │
│ │ └── PinnedBufferPool / ManagedAllocator / GraphActor / P2pTopology
│
└── PlacementActor / ReplayHarness / NcclWorldActor (top-level)
```
Each box is an actor. Messages are typed enums. Replies are
`oneshot::Sender` channels. Failures panic with a tagged string
(`"ContextPoisoned: …"` / `"OutOfMemory: …"` / `"Unrecoverable: …"`)
and the supervisor decides Restart / Resume / Stop / Escalate.
## Quick start
You need a sibling clone of the rakka workspace:

```
your-workspace/
├── rakka/        # the rakka actor runtime (v0.2.x)
└── rakka-accel/  # this repo
```
`cudarc` loads CUDA dynamically, so the workspace builds and
unit-tests on hosts without a GPU. Real kernel paths are gated
behind `--features cuda-runtime-tests`.
```sh
# No GPU needed:
cargo test --workspace

# With GPU + CUDA toolkit:
cargo test --workspace --features cuda-runtime-tests
```
Read the docs for more depth:

- `docs/getting-started.md` — a ten-minute tour
- `docs/concepts.md` — the supervision / completion / generation model
- `docs/architecture.md` — the full design
- `docs/backends.md` — the multi-backend trait abstraction (and the ROCm / Metal / oneAPI roadmap)
- `docs/python-bridge.md` — the Python bindings
- `docs/features-matrix.md` — the by-goal dependency picker
If you're using an AI coding assistant (Claude Code, Cursor, etc.),
`ai-skills/` ships seven `SKILL.md` files your tool
can pick up, so the assistant gives you idiomatic rakka-accel
guidance instead of guessing.
## Library coverage
| Library | Actor | NVIDIA reference | Feature flag |
|---|---|---|---|
| cuBLAS | `BlasActor` | `cublasSgemm` | always-on |
| cuBLASLt | `BlasLtActor` | `cublasLtMatmul` + epilogue | `cublaslt` |
| cuDNN | `CudnnActor` | `cudnnConvolutionForward` | `cudnn` |
| cuFFT | `FftActor` | `cufftPlan1d` / `cufftExecR2C` | `cufft` |
| cuRAND | `RngActor` | `curandGenerateUniform` | `curand` |
| cuSOLVER | `SolverActor` | `cusolverDnSgeqrf` / `Sgetrf` / `Spotrf` / `Sgesvd` / `Ssyevd` | `cusolver` |
| cuSPARSE | `SparseActor` | `cusparseSpMV` / `SpMM` (CSR) | `cusparse` |
| cuTENSOR | `TensorActor` | `cutensorContract` | `cutensor` |
| NVRTC | `NvrtcActor` | `nvrtcCompileProgram` | `nvrtc` |
| NCCL | `CollectiveActor` + `NcclWorldActor` | `ncclAllReduce` | `nccl` |
| Pinned host memory | `PinnedBufferPool` | `cuMemHostAlloc` | always-on |
| Unified memory | `ManagedAllocatorActor` | `cudaMallocManaged` | always-on |
| CUDA Graphs | `GraphActor` | `cuGraphInstantiate` / `cuGraphLaunch` | always-on |
| Peer-to-peer | `P2pTopology` | `cuMemcpyPeerAsync` | always-on |
Aggregate features:

- `core-libs` = `cudnn` + `cufft` + `curand` + `cusparse`
- `training-libs` = `core-libs` + `cusolver` + `cublaslt` + `nvrtc` + `cutensor`
- `full-cuda` = `training-libs` + `nccl`
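For instance, a training workload can pull one aggregate feature instead of listing libraries individually. The crate name follows the dependency examples later in this README; the version number is a placeholder:

```toml
[dependencies]
# training-libs expands to core-libs + cusolver + cublaslt + nvrtc + cutensor
rakka-accel-cuda = { version = "0.0", features = ["training-libs"] }
```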
## rakka 0.2 integrations
rakka-accel is feature-gated for each rakka subsystem so you only pay for what you use:
- `replay` — persists replay-journal entries through any `rakka_persistence::Journal` (in-memory, SQL, Redis, MongoDB, Cassandra, Dynamo). Build a deterministic replay harness with one constructor: `ReplayHarness::with_journal(journal, "pid")`.
- `cluster` — `placement::sharded::PlacementShardingAdapter` exposes a typed `EntityRef<DeviceExtractor>` over rakka-cluster-sharding, so device routing follows consistent-hash placement across a cluster.
- `streams` — `streams_pipeline::{source_from_unbounded, gpu_stage, run_collect}` build GPU pipelines with rakka-streams Source / Sink alongside the actor-based `pipeline::PipelineExecutor`.
- `telemetry` — `observability::install(system, "node-1")` wires up a `TelemetryExtension` plus GPU-specific probes (allocations, OOM count, generation, VRAM, in-flight kernels). Visualize live in rakka-dashboard.
- Typed supervision — `error::DeviceSupervisor` implements `SupervisorOf<C>` over `GpuError`. Pattern-match the error type instead of parsing panic strings.
- `#[derive(Actor)]` from `rakka-macros` — eliminates async-trait boilerplate. Used by `BlasActor`, `EmbeddingCache`, `GpuMockActor`, `GpuHashMapActor`; opt-in for the rest.
## Blueprint sub-crates
These ride on top of the foundation and demonstrate concrete patterns:
- `rakka-accel-patterns` — `DynamicBatchingServer`, `InferenceCascade`, `ModelReplicaPool`, `FairShareScheduler` (WFQ), `ModelHotSwapServer`, `SpeculativeDecoder`, `MoeRouter`, plus a CPU `GpuMockActor` for tests.
- `rakka-accel-train` — `DataParallelTrainer`, `PipelineParallelTrainer`, `TensorParallelTrainer`, `AsyncParameterServer`, optimizer + loss enums.
- `rakka-accel-agents` — `RagPipeline` (with `EmbeddingCache` LRU and `CpuVectorIndex`), `SharedGpuStateCoordinator`, `LangGraphGpuActor` (DAG executor with cycle detection).
- `rakka-accel-py` — Python bindings via PyO3. `pip install maturin && maturin develop` from `crates/rakka-accel-py/`. Exposes `rakka_accel.{System, Device, GpuBuffer}` plus typed exceptions; see `docs/python-bridge.md`.
- `rakka-accel-cuda-realtime` — `ImageFilterPipeline`, `ParticleSystemActor`, `ClothSimulationActor`, `FluidSimulationActor`, `SpatialIndexActor`, `GpuHashMapActor`, `GpuSparseStructureActor`, `MultiPassAnalysisActor`, `VideoEffectsGraph`. Real CUDA-C kernel sources for these actors live under `crates/rakka-accel-cuda-realtime/kernels/`.
Every pattern ships a `*_no_gpu` example you can run today.
## What you don't have to think about
- Stream allocation. Three strategies (`PerActorAllocator`, `SingleStreamAllocator`, `PooledAllocator`) ship out of the box; inject one and forget about it.
- Kernel completion. `HostFnCompletion` registers a `cuLaunchHostFunc` callback that wakes the reply future the moment the kernel finishes — no host syncs, no polling.
- Cross-stream events. `GpuRef<T>` records its `last_write_stream`; downstream readers automatically wait on the right event before launching.
- Context loss. `WatchGeneration` is a `tokio::sync::watch::Receiver<u64>` you can subscribe to from any observer; we use it internally to rebuild NCCL communicators and invalidate P2P caches.
- OS-thread pinning. `GpuDispatcher` keeps the cuBLAS/cuDNN handle on a stable OS thread for its lifetime — required by several library APIs and easy to get wrong in async Rust.
## Build matrix
```sh
# No-GPU dev box:
cargo check --workspace

# rakka 0.2 subsystem integrations:
cargo check --features "replay cluster streams telemetry"

# GPU host (requires CUDA toolkit):
cargo test --features cuda-runtime-tests
```
## Picking the right deps
Each sub-crate path-depends only on `rakka-accel-cuda` (the foundation) —
no implicit pulls of the other blueprints. Add what you need:
```toml
# Just batching:
rakka-accel-cuda = "0.0"
rakka-accel-patterns = "0.0"

# Training pipeline with NCCL + replay journal:
rakka-accel-cuda = { version = "0.0", features = ["full-cuda", "replay"] }
rakka-accel-train = "0.0"

# Realtime sims with JIT kernels:
rakka-accel-cuda = { version = "0.0", features = ["nvrtc"] }
rakka-accel-cuda-realtime = { version = "0.0", features = ["nvrtc"] }
```
`docs/features-matrix.md` shows the full
pick-by-goal table plus the transitive-dependency view of every
feature.
Every sub-crate ships a prelude module:
```rust
use rakka_accel_cuda::prelude::*;          // foundation
use rakka_accel_patterns::prelude::*;      // batching, cascade, …
use rakka_accel_train::prelude::*;         // trainers, optimizers
use rakka_accel_agents::prelude::*;        // RAG, embedding cache
use rakka_accel_cuda_realtime::prelude::*; // particles, cloth, sparse
```
## Status
F2–F9 implemented and rakka 0.2 adoption complete. The full feature
matrix builds clean; 60+ tests pass on a no-GPU CI; the GPU-runtime
suite covers SGEMM, FFT, RNG, pinned memcpy, SpMV, tensor contraction,
SVD, and the multi-actor end-to-end smoke test.
## Releasing
`v*.*.*` git tags trigger two CI pipelines: `release-crates.yml`
publishes the five Rust crates to crates.io in topological order, and
`release-pypi.yml` builds wheels (manylinux + macOS + Windows) and
uploads them to PyPI. See `RELEASING.md` for the
end-to-end flow.
See `docs/architecture.md` for the design
narrative and `CHANGELOG.md` (forthcoming) for
release-by-release surface changes.
## License
Apache-2.0.