rakka-accel 0.2.8

Backend-agnostic compute-acceleration core. Defines the AccelBackend trait, AccelRef<T> typed pointers, the AccelError enum, and CompletionStrategy — the abstraction layer that lets rakka-accel-cuda (NVIDIA) and future ROCm / Metal / oneAPI / Vulkan backends plug into the same actor surface.

rakka-accel

An actor-shaped face for compute acceleration. NVIDIA CUDA ships today through rakka-accel-cuda; the backend trait surface accommodates AMD ROCm, Apple Metal, Intel oneAPI, and Vulkan compute when those crates land. Each backend library (cuBLAS, cuDNN, cuFFT, cuRAND, cuSOLVER, cuSPARSE, cuTENSOR, cuBLASLt, NVRTC, NCCL) becomes a typed rakka actor with stable supervision, generation-validated buffers, and a single async surface. Drop GPU work into a Rust service without juggling streams, contexts, or hand-rolled retry loops.

[dependencies]
rakka-accel = { version = "0.0", features = ["cuda"] }

use rakka_accel::cuda::prelude::*;   // active backend re-exported here

let device = system.actor_of(DeviceActor::props(DeviceConfig::new(0)), "gpu-0")?;
let a = ask_alloc::<f32>(&device, n * n).await?;
let b = ask_alloc::<f32>(&device, n * n).await?;
let c = ask_alloc::<f32>(&device, n * n).await?;

// `reply` is a oneshot::Sender the caller created beforehand.
device.tell(DeviceMsg::Sgemm(Box::new(SgemmRequest {
    a, b, c, m: n, n: n, k: n, alpha: 1.0, beta: 0.0, reply,
})));
// reply arrives once the kernel completes — no host blocking,
// no manual stream synchronization.

That's the whole shape. The same envelope wires up convolutions (cudnnConvolutionForward), tensor contractions (cutensorContract), JIT-compiled custom kernels (nvrtcCompileProgram), and multi-GPU all-reduce (ncclAllReduce).
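The envelope itself is plain Rust. A minimal, self-contained sketch of the shape (names mirror the snippet above; std::sync::mpsc stands in for the oneshot reply channel, and the kernel launch is elided):

```rust
use std::sync::mpsc;

// Hypothetical request shape mirroring SgemmRequest: payload plus a reply sender.
struct SgemmRequest {
    m: usize, n: usize, k: usize,
    alpha: f32, beta: f32,
    reply: mpsc::Sender<Result<(), String>>, // stand-in for oneshot::Sender
}

enum DeviceMsg {
    Sgemm(Box<SgemmRequest>), // boxed so the enum stays small on the mailbox
}

fn handle(msg: DeviceMsg) {
    match msg {
        DeviceMsg::Sgemm(req) => {
            // ...enqueue the kernel; on completion, fire the reply...
            let _ = req.reply.send(Ok(()));
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    handle(DeviceMsg::Sgemm(Box::new(SgemmRequest {
        m: 4, n: 4, k: 4, alpha: 1.0, beta: 0.0, reply: tx,
    })));
    assert!(rx.recv().unwrap().is_ok());
}
```

Every library call in the coverage table below reuses this one pattern: a typed enum variant, a boxed request, a reply channel.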

Why

Writing CUDA from Rust today means owning a long list of invariants yourself:

You'd otherwise hand-roll | rakka-accel gives you
One CUcontext per device, restarted on poisoning | DeviceActor ↔ ContextActor two-tier supervision
Sticky-error detection and graceful recovery | OneForOneStrategy + GpuError::ContextPoisoned decider
Buffer staleness across context rebuilds | GpuRef<T> with generation tokens
Pinning library handles (cuBLAS, cuDNN) to a single OS thread | GpuDispatcher + per-actor handle
Stream-event choreography for kernel completion | HostFnCompletion (sub-µs cuLaunchHostFunc) / SyncCompletion / PolledCompletion
cuMemcpyPeerAsync cross-stream synchronization | P2pTopology with last_write_stream injection
Page-locked host buffer pooling | PinnedBufferPool actor
CUDA Graph capture/replay | GraphActor (Sgemm / Memcpy / RngFillUniform / FftR2C record contracts)
Multi-GPU communicator rebuild on context loss | NcclWorldActor subscribes to WatchGeneration, tears down + rebuilds collectives

Because every concern is an actor, you compose CUDA the same way you compose any other Rust service: tokio runtime, structured supervision, typed messages, async/await throughout.

At a glance

   ┌──────────── ActorSystem ────────────┐
   │                                     │
   │   ┌─────────── DeviceActor ─────────┴─── stable address (ActorRef<DeviceMsg>)
   │   │   (queues work while context rebuilds)
   │   │
   │   │   ┌─── ContextActor ──── owns Arc<CudaContext> ── restartable
   │   │   │
   │   │   │   ├── BlasActor        ── cuBLAS handle  ── pinned to one stream
   │   │   │   ├── CudnnActor       ── cuDNN handle   ── pinned to one stream
   │   │   │   ├── FftActor         ── cuFFT plans    ── plan cache
   │   │   │   ├── RngActor         ── cuRAND gen     ── seedable
   │   │   │   ├── SolverActor      ── cuSOLVER       ── QR/LU/Chol/SVD/Syevd
   │   │   │   ├── SparseActor      ── cuSPARSE       ── CSR SpMv/SpMm
   │   │   │   ├── TensorActor      ── cuTENSOR       ── Einstein Contract
   │   │   │   ├── BlasLtActor      ── cuBLASLt       ── fused matmul + ReLU/GELU
   │   │   │   ├── NvrtcActor       ── NVRTC          ── JIT compile + launch
   │   │   │   └── CollectiveActor  ── NCCL comm      ── per-rank
   │   │   │
   │   │   └── PinnedBufferPool / ManagedAllocator / GraphActor / P2pTopology
   │
   └── PlacementActor / ReplayHarness / NcclWorldActor (top-level)

Each box is an actor. Messages are typed enums. Replies are oneshot::Sender channels. Failures panic with a tagged string ("ContextPoisoned: …" / "OutOfMemory: …" / "Unrecoverable: …") and the supervisor decides Restart / Resume / Stop / Escalate.
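The tag-to-directive mapping can be matched mechanically on the string prefix. A hedged sketch — the Directive names come from the paragraph above, but the specific tag-to-directive policy shown here is illustrative, not a confirmed rakka decider:

```rust
#[derive(Debug, PartialEq)]
enum Directive { Restart, Resume, Stop, Escalate }

// Map a tagged panic message to a supervision directive by its prefix.
// The policy below is a plausible default, not the crate's actual decider.
fn decide(panic_msg: &str) -> Directive {
    if panic_msg.starts_with("ContextPoisoned:") {
        Directive::Restart      // rebuild the context, bump the generation
    } else if panic_msg.starts_with("OutOfMemory:") {
        Directive::Resume       // transient; the caller can retry smaller
    } else if panic_msg.starts_with("Unrecoverable:") {
        Directive::Stop
    } else {
        Directive::Escalate     // unknown tag: let the parent decide
    }
}

fn main() {
    assert_eq!(decide("ContextPoisoned: sticky ECC error"), Directive::Restart);
    assert_eq!(decide("OutOfMemory: 2 GiB request"), Directive::Resume);
    assert_eq!(decide("Unrecoverable: device lost"), Directive::Stop);
}
```

With the telemetry feature's typed DeviceSupervisor (see "rakka 0.2 integrations"), you pattern-match GpuError variants directly instead of string prefixes.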

Quick start

You need a sibling clone of the rakka workspace:

your-workspace/
├── rakka/         # the rakka actor runtime (v0.2.x)
└── rakka-accel/   # this repo

cudarc loads CUDA dynamically, so the workspace builds and unit-tests on hosts without a GPU. Real kernel paths are gated behind --features cuda-runtime-tests.

# No GPU needed:
cargo check --workspace --no-default-features
cargo test  --workspace --no-default-features
cargo run   -p rakka-accel --example echo_no_gpu

# With GPU + CUDA toolkit:
cargo run   -p rakka-accel --example sgemm     --features cuda-runtime-tests
cargo run   -p rakka-accel --example fft_1d    --features cuda-runtime-tests,cufft
cargo run   -p rakka-accel --example jit_relu  --features cuda-runtime-tests,nvrtc

Read the docs in order of depth:

  • docs/getting-started.md — a ten-minute tour
  • docs/concepts.md — the supervision / completion / generation model
  • docs/architecture.md — the full design
  • docs/backends.md — the multi-backend trait abstraction (and the ROCm / Metal / oneAPI roadmap)
  • docs/python-bridge.md — the Python bindings
  • docs/features-matrix.md — the by-goal dependency picker

If you're using an AI coding assistant (Claude Code, Cursor, etc.), ai-skills/ ships seven SKILL.md files your tool can pick up so the assistant gives you idiomatic rakka-accel guidance instead of guessing.

Library coverage

Library | Actor | NVIDIA reference | Feature flag
cuBLAS | BlasActor | cublasSgemm | always-on
cuBLASLt | BlasLtActor | cublasLtMatmul + epilogue | cublaslt
cuDNN | CudnnActor | cudnnConvolutionForward | cudnn
cuFFT | FftActor | cufftPlan1d / cufftExecR2C | cufft
cuRAND | RngActor | curandGenerateUniform | curand
cuSOLVER | SolverActor | cusolverDnSgeqrf / Sgetrf / Spotrf / Sgesvd / Ssyevd | cusolver
cuSPARSE | SparseActor | cusparseSpMV / SpMM (CSR) | cusparse
cuTENSOR | TensorActor | cutensorContract | cutensor
NVRTC | NvrtcActor | nvrtcCompileProgram | nvrtc
NCCL | CollectiveActor + NcclWorldActor | ncclAllReduce | nccl
Pinned host memory | PinnedBufferPool | cuMemHostAlloc | always-on
Unified memory | ManagedAllocatorActor | cudaMallocManaged | always-on
CUDA Graphs | GraphActor | cuGraphInstantiate / cuGraphLaunch | always-on
Peer-to-peer | P2pTopology | cuMemcpyPeerAsync | always-on

Aggregate features:

  • core-libs = cudnn + cufft + curand + cusparse
  • training-libs = core-libs + cusolver + cublaslt + nvrtc + cutensor
  • full-cuda = training-libs + nccl

rakka 0.2 integrations

rakka-accel is feature-gated for each rakka subsystem so you only pay for what you use:

  • replay — persists replay-journal entries through any rakka_persistence::Journal (in-memory, SQL, Redis, MongoDB, Cassandra, Dynamo). Build a deterministic replay harness with one constructor: ReplayHarness::with_journal(journal, "pid").
  • cluster — placement::sharded::PlacementShardingAdapter exposes a typed EntityRef<DeviceExtractor> over rakka-cluster-sharding, so device routing follows consistent-hash placement across a cluster.
  • streams — streams_pipeline::{source_from_unbounded, gpu_stage, run_collect} build GPU pipelines with rakka-streams Source / Sink alongside the actor-based pipeline::PipelineExecutor.
  • telemetry — observability::install(system, "node-1") wires up a TelemetryExtension plus GPU-specific probes (allocations, OOM count, generation, VRAM, in-flight kernels). Visualize live in rakka-dashboard.
  • Typed supervision — error::DeviceSupervisor implements SupervisorOf<C> over GpuError. Pattern-match the error type instead of parsing panic strings.
  • #[derive(Actor)] from rakka-macros — eliminates async-trait boilerplate. Used by BlasActor, EmbeddingCache, GpuMockActor, GpuHashMapActor; opt-in for the rest.
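The replay integration boils down to one idea: actor state is a pure fold over a journal of typed messages, so replaying the same journal always reconstructs the same state. A std-only sketch of that idea (illustrative names, not the rakka_persistence::Journal API):

```rust
// Deterministic replay in miniature: state = fold(apply, initial, journal).
#[derive(Clone, Debug, PartialEq)]
enum GpuMsg { Alloc(usize), Free(usize) }

#[derive(Default)]
struct DeviceState { live_bytes: usize }

// The same handler runs in live operation and during replay.
fn apply(state: &mut DeviceState, msg: &GpuMsg) {
    match msg {
        GpuMsg::Alloc(n) => state.live_bytes += n,
        GpuMsg::Free(n)  => state.live_bytes -= n,
    }
}

fn replay(journal: &[GpuMsg]) -> DeviceState {
    let mut s = DeviceState::default();
    for m in journal { apply(&mut s, m); }
    s
}

fn main() {
    let journal = vec![GpuMsg::Alloc(1024), GpuMsg::Alloc(512), GpuMsg::Free(1024)];
    // Replaying the same journal always yields the same state.
    assert_eq!(replay(&journal).live_bytes, 512);
}
```

ReplayHarness::with_journal(journal, "pid") wraps this loop around any backing store the rakka_persistence::Journal trait supports.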

Blueprint sub-crates

These ride on top of the foundation and demonstrate concrete patterns:

  • rakka-accel-patterns — DynamicBatchingServer, InferenceCascade, ModelReplicaPool, FairShareScheduler (WFQ), ModelHotSwapServer, SpeculativeDecoder, MoeRouter, plus a CPU GpuMockActor for tests.
  • rakka-accel-train — DataParallelTrainer, PipelineParallelTrainer, TensorParallelTrainer, AsyncParameterServer, optimizer + loss enums.
  • rakka-accel-agents — RagPipeline (with EmbeddingCache LRU + CpuVectorIndex), SharedGpuStateCoordinator, LangGraphGpuActor (DAG executor with cycle detection).
  • rakka-accel-py — Python bindings via PyO3. pip install maturin && maturin develop from crates/rakka-accel-py/. Exposes rakka_accel.{System, Device, GpuBuffer} plus typed exceptions; see docs/python-bridge.md.
  • rakka-accel-cuda-realtime — ImageFilterPipeline, ParticleSystemActor, ClothSimulationActor, FluidSimulationActor, SpatialIndexActor, GpuHashMapActor, GpuSparseStructureActor, MultiPassAnalysisActor, VideoEffectsGraph. Real CUDA-C kernel sources for these actors live under crates/rakka-accel-cuda-realtime/kernels/.
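The core trade behind DynamicBatchingServer is easy to state in a few lines: buffer incoming requests until a size threshold (or, in the real actor, a deadline) trips, then launch one fused kernel. A std-only sketch of the size-threshold half — hypothetical shapes, not the actor's actual API:

```rust
// Accumulate requests; emit a full batch when max_batch is reached.
struct Batcher { pending: Vec<u32>, max_batch: usize }

impl Batcher {
    fn new(max_batch: usize) -> Self {
        Self { pending: Vec::new(), max_batch }
    }

    // Returns Some(batch) when the batch is full and ready to launch.
    fn push(&mut self, req: u32) -> Option<Vec<u32>> {
        self.pending.push(req);
        if self.pending.len() >= self.max_batch {
            Some(std::mem::take(&mut self.pending)) // drain, reset buffer
        } else {
            None
        }
    }
}

fn main() {
    let mut b = Batcher::new(3);
    assert!(b.push(1).is_none());
    assert!(b.push(2).is_none());
    let batch = b.push(3).expect("full batch");
    assert_eq!(batch, vec![1, 2, 3]);
    assert!(b.pending.is_empty()); // buffer drained for the next batch
}
```

The actor version adds the deadline timer so a half-full batch still launches within a latency budget.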

Every pattern ships a *_no_gpu example you can run today:

cargo run -p rakka-accel-patterns --example batching_no_gpu
cargo run -p rakka-accel-patterns --example cascade_no_gpu
cargo run -p rakka-accel-patterns --example fair_share_no_gpu
cargo run -p rakka-accel-patterns --example moe_no_gpu
cargo run -p rakka-accel-patterns --example speculative_no_gpu

What you don't have to think about

  • Stream allocation. Three strategies (PerActorAllocator, SingleStreamAllocator, PooledAllocator) ship out of the box; inject one and forget about it.
  • Kernel completion. HostFnCompletion registers a cuLaunchHostFunc callback that wakes the reply future the moment the kernel finishes — no host syncs, no polling.
  • Cross-stream events. GpuRef<T> records its last_write_stream; downstream readers automatically wait on the right event before launching.
  • Context loss. WatchGeneration is a tokio::sync::watch::Receiver<u64> you can subscribe to from any observer; we use it internally to rebuild NCCL communicators and invalidate P2P caches.
  • OS-thread pinning. GpuDispatcher keeps the cuBLAS/cuDNN handle on a stable OS thread for its lifetime — required by several library APIs and easy to get wrong in async Rust.
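The generation mechanics behind the last two bullets fit in a few lines: a buffer handle minted under generation g is rejected once the context has been rebuilt and the generation bumped. An illustrative stand-in for GpuRef<T> validation, not the real type:

```rust
// Context generation bumps on every rebuild after poisoning.
struct Context { generation: u64 }

// A buffer handle remembers the generation it was allocated under.
struct BufRef { ptr: usize, generation: u64 }

#[derive(Debug, PartialEq)]
enum RefError { StaleGeneration }

// A handle is only dereferenceable while its generation matches the context's.
fn validate(ctx: &Context, r: &BufRef) -> Result<usize, RefError> {
    if r.generation == ctx.generation { Ok(r.ptr) } else { Err(RefError::StaleGeneration) }
}

fn main() {
    let mut ctx = Context { generation: 1 };
    let r = BufRef { ptr: 0xdead, generation: ctx.generation };
    assert!(validate(&ctx, &r).is_ok());

    ctx.generation += 1; // context poisoned and rebuilt
    assert_eq!(validate(&ctx, &r), Err(RefError::StaleGeneration));
}
```

WatchGeneration broadcasts exactly this counter, which is how NcclWorldActor knows to tear down and rebuild its communicators.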

Build matrix

# No-GPU dev box:
cargo check --workspace --no-default-features
cargo check --workspace --features rakka-accel-cuda/core-libs
cargo check --workspace --features rakka-accel-cuda/training-libs
cargo check --workspace --features rakka-accel-cuda/full-cuda

# rakka 0.2 subsystem integrations:
cargo check --workspace --features rakka-accel-cuda/replay
cargo check --workspace --features rakka-accel-cuda/cluster
cargo check --workspace --features rakka-accel-cuda/streams
cargo check --workspace --features rakka-accel-cuda/telemetry

cargo test  -p rakka-accel --features replay --test replay_persistence

# GPU host (requires CUDA toolkit):
cargo run   -p rakka-accel --example sgemm        --features cuda-runtime-tests
cargo run   -p rakka-accel --example rng_uniform  --features cuda-runtime-tests,curand
cargo run   -p rakka-accel --example fft_1d       --features cuda-runtime-tests,cufft
cargo run   -p rakka-accel --example jit_relu     --features cuda-runtime-tests,nvrtc

cargo bench -p rakka-accel --bench sgemm_overhead --features cuda-runtime-tests
cargo bench -p rakka-accel --bench rng_throughput --features cuda-runtime-tests,curand

cargo test  -p rakka-accel --test sgemm_e2e          --features cuda-runtime-tests
cargo test  -p rakka-accel --test pinned_memcpy_e2e  --features cuda-runtime-tests
cargo test  -p rakka-accel --test end_to_end_e2e     --features cuda-runtime-tests
cargo test  -p rakka-accel --test spmv_e2e           --features cuda-runtime-tests,cusparse
cargo test  -p rakka-accel --test contract_e2e       --features cuda-runtime-tests,cutensor
cargo test  -p rakka-accel --test svd_e2e            --features cuda-runtime-tests,cusolver

Picking the right deps

Each sub-crate path-depends only on rakka-accel-cuda (the foundation) — no implicit pulls of the other blueprints. Add what you need:

# Just batching:
rakka-accel          = "0.0"
rakka-accel-patterns = "0.0"

# Training pipeline with NCCL + replay journal:
rakka-accel       = { version = "0.0", features = ["full-cuda", "replay"] }
rakka-accel-train = "0.0"

# Realtime sims with JIT kernels:
rakka-accel          = { version = "0.0", features = ["nvrtc"] }
rakka-accel-cuda-realtime = { version = "0.0", features = ["nvrtc"] }

docs/features-matrix.md shows the full pick-by-goal table plus the transitive-dependency view of every feature.

Every sub-crate ships a prelude module:

use rakka_accel_cuda::prelude::*;            // foundation
use rakka_accel_patterns::prelude::*;        // batching, cascade, …
use rakka_accel_train::prelude::*;           // trainers, optimizers
use rakka_accel_agents::prelude::*;          // RAG, embedding cache
use rakka_accel_cuda_realtime::prelude::*;   // particles, cloth, sparse

Status

F2 – F9 implemented + rakka 0.2 adoption complete. The full feature matrix builds clean; 60+ tests pass on a no-GPU CI; the GPU-runtime suite covers SGEMM, FFT, RNG, pinned memcpy, SpMV, tensor contraction, SVD, and the multi-actor end-to-end smoke.

Releasing

v*.*.* git tags trigger two CI pipelines: release-crates.yml publishes the five Rust crates to crates.io in topological order; release-pypi.yml builds wheels (manylinux + macOS + Windows) and uploads to PyPI. See RELEASING.md for the end-to-end flow.

See docs/architecture.md for the design narrative and CHANGELOG.md (forthcoming) for release-by-release surface changes.

License

Apache-2.0.