Crate atomr_accel_train

Expand description

Distributed training blueprints on atomr-accel-cuda.

use atomr_accel_train::prelude::*;

data_parallel::DataParallelTrainer — N-replica trainer wired to NCCL all-reduce.
pipeline_parallel::PipelineParallelTrainer — staged forward/backward across pipeline ranks.
tensor_parallel::TensorParallelTrainer — sharded matmul coordinator.
parameter_server::AsyncParameterServer — async PS protocol.
optimizer / loss — typed enums for the common choices.

Modules§

data_parallel: DataParallelTrainer — replicates a model across N replicas, splits a mini-batch evenly, runs forward+backward per replica, aggregates loss/grad-norm, and applies an optimizer step.
loss: Loss kinds.
optimizer: Optimizer kinds. F4 ships SGD and AdamW configs; the actual parameter-update kernels live in F4.x once the gradient buffers are flowing through NCCL.
parameter_server: AsyncParameterServer — central parameter store with async gradient pushes and async weight pulls.
pipeline_parallel: PipelineParallelTrainer — stage-pipelined model across N GPUs/actors.
prelude: Canonical re-exports. use atomr_accel_train::prelude::*;.
tensor_parallel: TensorParallelTrainer — weight-sharded matmul: each replica owns a slice of the weight matrix; activations are split, each shard runs a partial matmul, then results are summed via AllReduce.