rumus-distributed 0.1.0

3D parallelism for RUMUS: Tensor Parallelism, Pipeline Parallelism, async collectives
Documentation
  • Coverage
  • 48.84%
    21 out of 43 items documented0 out of 19 items with examples
  • Size
  • Source code size: 63.89 kB This is the summed size of all the files inside the crates.io package for this release.
  • Documentation size: 5.82 MB This is the summed size of all files generated by rustdoc for all configured targets
  • Ø build duration
  • this release: 3m 8s Average build duration of successful builds.
  • all releases: 3m 8s Average build duration of successful builds in releases after 2024-10-23.
  • Links
  • SUM-INNOVATION/RUMUS
    200 40 0
  • crates.io
  • Dependencies
  • Versions
  • Owners
  • sunhaoxiangwang

rumus-distributed

3D parallelism engine for the RUMUS deep learning framework — Tensor Parallelism (TP), Pipeline Parallelism (PP), and async collective operations.

Architecture

Async Communication

Compute Thread                    Comm Thread (owns Arc<Device> + Arc<Queue>)
──────────────                    ──────────────────────────────────────────
1. Dispatch compute kernel        (idle)
2. Encode copy → staging          (idle)
3. queue.submit()                 (idle)
4. Send CommRequest ──────────►   5. staging.map_async(Read)
5. Start NEXT micro-batch         6. device.poll(Wait)  ← blocks HERE only
   (non-blocking!)                7. Read mapped range → Vec<f32>
                                  8. CollectiveBarrier: push + wait
                                  9. Read reduced result
                                 10. queue.write_buffer(result → dst_buf)
N. handle.wait() ◄───────────── 11. Send completion signal

The compute thread never calls poll(Wait) — all blocking GPU-to-CPU transfers happen on the dedicated comm thread.

Tensor Parallelism

ColumnParallelLinear: Weight [K, N/T] sharded along columns.

  • Forward: Y_t = X @ W_t (no collective)
  • Backward: grad_X = AllReduce(Σ_t grad_X_t) via CollectiveBarrier

RowParallelLinear: Weight [K/T, N] sharded along rows.

  • Forward: Y = AllReduce(Σ_t X_t @ W_t) (partial sums reduced)
  • Backward: grad_X_t = grad_Y @ W_t^T (no collective)

Only 2 AllReduces per Transformer block (1 forward + 1 backward).

Pipeline Parallelism (1F1B Schedule)

GPU 0: F0  F1  F2  F3  B3  B2  B1  B0
GPU 1:     F0  F1  F2  B2  B1  B0  ...
GPU 2:         F0  F1  B1  B0  ...
GPU 3:             F0  B0  ...

Per-micro-batch isolated tapes:

  1. context::install_tape(Tape::new()) — fresh tape per micro-batch
  2. Forward pass records to the isolated tape
  3. context::take_tape() — saves the tape for later backward

Cross-stage gradient injection:

  1. Incoming tensor: set_requires_grad(true) → assigns known GradId
  2. Forward with local layers
  3. backward_with_grad(output, grad_from_next_stage) — injected seed
  4. grads.remove(saved_grad_id) — extracts grad_input deterministically
  5. Send grad_input to previous stage via channel

Quick Start

use rumus_distributed::{PipelineExecutor, PipelineStage, CollectiveBarrier};

let stages = vec![
    PipelineStage { device_index: 0, forward_fn: Box::new(|x| layer0.forward(x)) },
    PipelineStage { device_index: 1, forward_fn: Box::new(|x| layer1.forward(x)) },
];

let executor = PipelineExecutor::new(stages, 4); // 4 micro-batches
let grad_stores = executor.run(&input_batch, &|output| loss_fn(output));

API

Collectives

Struct Description
CollectiveBarrier Mutex + Condvar AllReduce: last arrival sums + averages, notify_all
CommThread Background thread with Arc<Device/Queue> for non-blocking GPU staging
AllReduceHandle Oneshot completion handle — .wait() only when result needed

Tensor Parallelism

Struct Weight Split Forward Collective Backward Collective
ColumnParallelLinear Along N (columns) None AllReduce on grad_X
RowParallelLinear Along K (rows) AllReduce on output None

Pipeline Parallelism

Struct Description
PipelineStage Per-device stage with user's forward_fn
PipelineExecutor 1F1B scheduler: micro-batch tapes, gradient injection, merge_from

Core Engine Extensions

M24 adds the following to the rumus core crate:

Addition Location Purpose
Arc<Device> + Arc<Queue> GpuContext Enables comm thread device sharing
install_tape(tape) autograd::context Per-micro-batch tape isolation
backward_with_grad(tensor, grad) autograd::backward Cross-stage gradient injection
merge_from(&mut other) GradientStore Accumulate parameter gradients across micro-batches

Dependencies

  • rumus (v0.3.0) — core framework with gpu + multi_gpu features
  • wgpu — WebGPU device/buffer types
  • bytemuck — safe byte casts

License

Licensed under either of

at your option.