Expand description
Distributed training blueprints on atomr-accel-cuda.
ⓘ
use atomr_accel_train::prelude::*;data_parallel::DataParallelTrainer— N-replica trainer wired to NCCL all-reduce.pipeline_parallel::PipelineParallelTrainer— staged forward/backward across pipeline ranks.tensor_parallel::TensorParallelTrainer— sharded matmul coordinator.parameter_server::AsyncParameterServer— async PS protocol.optimizer/loss— typed enums for the common choices.
Modules§
- data_
parallel DataParallelTrainer— replicates a model across N replicas, splits a mini-batch evenly, runs forward+backward per replica, aggregates loss/grad-norm, and applies an optimizer step.- loss
- Loss kinds.
- optimizer
- Optimizer kinds. F4 ships SGD and AdamW configs; the actual parameter-update kernels live in F4.x once the gradient buffers are flowing through NCCL.
- parameter_
server AsyncParameterServer— central parameter store with async gradient pushes and async weight pulls.- pipeline_
parallel PipelineParallelTrainer— stage-pipelined model across N GPUs/actors.- prelude
- Canonical re-exports.
use atomr_accel_train::prelude::*;. - tensor_
parallel TensorParallelTrainer— weight-sharded matmul: each replica owns a slice of the weight matrix; activations are split, each shard runs a partial matmul, then results are summed viaAllReduce.