pub struct ElChe { /* private fields */ }
El Che: heterogeneous DDP cadence strategy.
The column marches at the slowest one’s pace. The slow device
anchors the cadence (anchor batches per sync step), the fast
ones range ahead doing more work, and everyone rejoins at AllReduce.
No one waits, no one idles.
After each sync step, call report_timing
with measured wall times and AllReduce overhead. El Che refines
batch ratios and auto-tunes the anchor count to keep AllReduce overhead
below a configurable target (default 10%).
§Example

```rust
let ddp = Ddp::wrap(&[&model0, &model1], &devices)?;
let mut cadence = ElChe::new(2, 10);
loop {
    let start_events = record_start_events(&devices)?;
    for rank in 0..2 {
        for _ in 0..cadence.batches(rank) {
            forward_backward(rank)?;
        }
    }
    let wall_ms = measure_elapsed(&start_events)?;
    let sync_start = Instant::now();
    ddp.weighted_all_reduce_gradients(cadence.batch_counts())?;
    let sync_ms = sync_start.elapsed().as_secs_f64() * 1000.0;
    cadence.report_timing(&wall_ms, cadence.batch_counts(), sync_ms);
}
```

§Implementations
impl ElChe

pub fn new(world_size: usize, anchor: usize) -> Self
Create a new sync cadence.
world_size: number of devices (must be >= 2).
anchor: initial batch count for the slow device per sync step.
The first step uses equal counts (anchor for every device).
After report_timing, ratios adapt
to measured throughput.
pub fn with_overhead_target(self, target: f64) -> Self
Set the target AllReduce overhead as a fraction of compute time.
Default: 0.10 (10%). The anchor auto-tunes upward to keep overhead below this target. Lower targets force larger anchors, hence fewer syncs and more gradient staleness.
pub fn with_max_anchor(self, max: usize) -> Self
Set the maximum anchor count (gradient staleness limit).
Default: 200. Higher values allow fewer syncs but accumulate more batches of gradient before averaging. Set to 1 to sync after every slow-device batch (minimal accumulation, traditional DDP cadence).
pub fn with_max_batch_diff(self, max: usize) -> Self
Set the maximum batch difference between fastest and slowest worker.
When the fastest worker leads the slowest by more than this many batches, it is throttled (paused) until the gap closes. This prevents catastrophic divergence with large batches or extreme speed ratios.
None (default): no limit; workers run freely. Some(0): strict lockstep, equivalent to synchronous DDP. Some(n): fast workers may lead by at most n batches.
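The throttle decision implied by these settings can be sketched as follows. This is an assumed illustration, not the crate's internals: `completed` holds per-rank batch counts since the last sync, and a rank pauses when its lead over the slowest rank exceeds the cap.

```rust
// Sketch (assumed, not the crate source) of the max_batch_diff throttle.
fn should_throttle(completed: &[usize], rank: usize, max_diff: Option<usize>) -> bool {
    match max_diff {
        None => false, // no limit: workers run freely
        Some(cap) => {
            // Lead over the slowest worker, saturating at zero.
            let slowest = *completed.iter().min().unwrap_or(&0);
            completed[rank].saturating_sub(slowest) > cap
        }
    }
}

fn main() {
    // Some(0): strict lockstep -- any lead at all throttles.
    assert!(should_throttle(&[1, 0], 0, Some(0)));
    // Some(4): a lead of 4 is allowed, 5 is not.
    assert!(!should_throttle(&[4, 0], 0, Some(4)));
    assert!(should_throttle(&[5, 0], 0, Some(4)));
    // None: never throttle, however large the gap.
    assert!(!should_throttle(&[100, 0], 0, None));
}
```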
pub fn max_batch_diff(&self) -> Option<usize>
Current max batch diff setting.
pub fn with_speed_ratio(self, slow_rank: usize, ratio: f64) -> Self
Set initial speed estimate before the first timing measurement.
slow_rank: which device is slowest (receives anchor batches).
ratio: how many times faster the fastest device is (e.g., 3.0
means the fast GPU processes ~3x more batches per unit time).
Default (without this call): all devices start equal (anchor
batches each). After the first report_timing,
actual measurements replace this estimate, so even a wrong guess
self-corrects in one step.
```rust
// RTX 5060 Ti (rank 0) is ~2.3x faster than GTX 1060 (rank 1)
let che = ElChe::new(2, 10).with_speed_ratio(1, 2.3);
// → rank 0: 23 batches, rank 1: 10 batches
```

pub fn batches(&self, rank: usize) -> usize
Batch count for the given device rank in the current cadence step.
pub fn batch_counts(&self) -> &[usize]
Per-device batch counts (for Ddp::weighted_all_reduce_gradients).
pub fn total_batches(&self) -> usize
Total batches across all devices for this cadence step.
pub fn anchor_wall_ms(&self) -> f64
Target wall time (ms) for one sync interval.
Returns anchor * slowest_ms_per_batch: the intended wall-clock
duration between AllReduce events. Every device should accumulate
this much compute time before syncing. Returns 0 if not yet
calibrated (no timing data).
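A minimal sketch of this quantity, assuming the documented formula (anchor times the slowest device's ms/batch, 0 before calibration):

```rust
// Sketch of anchor_wall_ms per the documented formula; a free function
// here since the struct's fields are private.
fn anchor_wall_ms(anchor: usize, ms_per_batch: &[f64]) -> f64 {
    // The slowest device has the largest ms/batch.
    let slowest = ms_per_batch.iter().cloned().fold(0.0_f64, f64::max);
    if slowest == 0.0 { 0.0 } else { anchor as f64 * slowest }
}

fn main() {
    // anchor = 10, slow device at 42 ms/batch -> 420 ms sync interval.
    assert_eq!(anchor_wall_ms(10, &[18.0, 42.0]), 420.0);
    // No timing data yet -> not calibrated -> 0.
    assert_eq!(anchor_wall_ms(10, &[0.0, 0.0]), 0.0);
}
```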
pub fn nudge_anchor_down(&mut self, factor: f64)
Reduce the anchor by factor (e.g. 0.5 = halve).
One-directional correction for parameter divergence: tightens sync cadence when replicas drift apart. Does NOT loosen; ElChe’s overhead auto-tune handles upward adjustment.
Bypasses min_anchor (clamped to 1) because divergence is a stronger
signal than the overhead floor. The overhead auto-tune will recover
the anchor upward once divergence subsides.
pub fn is_calibrated(&self) -> bool
Whether at least one timing measurement has been reported.
pub fn has_speed_hint(&self) -> bool
Whether a speed hint was applied (batch_counts are non-uniform).
Used by the coordinator to decide if epoch 0 should use throughput-proportional partitions before calibration.
pub fn ms_per_batch(&self) -> &[f64]
Per-device milliseconds per batch from last measurement.
pub fn report_timing(
    &mut self,
    wall_ms: &[f64],
    actual_batches: &[usize],
    sync_ms: f64,
)
Report timing after a cadence step completes.
wall_ms[rank]: wall-clock time for all batches on that device (ms).
actual_batches[rank]: number of batches each rank actually processed
since the last sync (i.e., steps_since_avg). In Cadence mode the fast
GPU may process more batches than its intended batch_counts while
waiting for the slow GPU to reach the trigger threshold. Using the
intended count as divisor would inflate the fast GPU’s ms_per_batch,
inverting the throughput ratio.
sync_ms: AllReduce overhead for this step (ms).
Updates batch ratios based on measured throughput. If AllReduce overhead exceeds the target, anchor auto-tunes upward.
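The re-tuning step this describes can be sketched as below. This is an assumed reconstruction, not the crate's actual update rule: throughput comes from measured wall time over actual batches, faster devices get proportionally more batches with the slowest pinned at `anchor`, and the anchor doubles (one possible growth policy) when sync overhead exceeds the target fraction.

```rust
// Sketch (assumed) of the report_timing update: returns new per-rank
// batch counts and a possibly enlarged anchor.
fn retune(
    anchor: usize,
    wall_ms: &[f64],
    actual_batches: &[usize],
    sync_ms: f64,
    overhead_target: f64,
) -> (Vec<usize>, usize) {
    // Throughput from actual batches, so idle-time inflation on the
    // fast device does not invert the ratio.
    let ms_per_batch: Vec<f64> = wall_ms
        .iter()
        .zip(actual_batches)
        .map(|(w, &b)| w / b as f64)
        .collect();
    let slowest = ms_per_batch.iter().cloned().fold(0.0_f64, f64::max);
    // Slowest device keeps `anchor` batches; faster devices scale up.
    let counts: Vec<usize> = ms_per_batch
        .iter()
        .map(|&ms| ((anchor as f64) * slowest / ms).round() as usize)
        .collect();
    // Grow the sync interval when AllReduce eats too large a fraction.
    let compute_ms = anchor as f64 * slowest;
    let new_anchor = if sync_ms / compute_ms > overhead_target {
        anchor * 2
    } else {
        anchor
    };
    (counts, new_anchor)
}

fn main() {
    // Rank 1 is the slow device: 42 ms/batch vs 18 ms/batch.
    let (counts, anchor) = retune(10, &[180.0, 420.0], &[10, 10], 30.0, 0.10);
    assert_eq!(counts, vec![23, 10]); // 10 * 42/18 ≈ 23.3 -> 23
    assert_eq!(anchor, 10);           // 30/420 ≈ 7% overhead: under target
    // 60 ms of sync -> ~14% overhead: anchor grows.
    let (_, anchor) = retune(10, &[180.0, 420.0], &[10, 10], 60.0, 0.10);
    assert_eq!(anchor, 20);
}
```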
pub fn clamp_total(&self, max_total: usize) -> Vec<usize>
Clamp batch counts to a maximum total, preserving proportions.
Returns a new batch-count vector. Use near epoch boundaries to avoid consuming more batches than remain.
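The proportional clamping can be sketched as follows; the rounding policy (floor, with a 1-batch minimum per device) is an assumption, not confirmed crate behavior:

```rust
// Sketch (assumed) of clamp_total: scale every count by
// max_total / total when the planned total exceeds what remains.
fn clamp_total(counts: &[usize], max_total: usize) -> Vec<usize> {
    let total: usize = counts.iter().sum();
    if total <= max_total {
        return counts.to_vec(); // nothing to clamp
    }
    let scale = max_total as f64 / total as f64;
    counts
        .iter()
        .map(|&c| ((c as f64 * scale).floor() as usize).max(1))
        .collect()
}

fn main() {
    // 23 + 10 = 33 batches planned, only 16 remain in the epoch.
    // scale = 16/33: 23 -> 11, 10 -> 4; the ~2.3:1 proportion survives.
    assert_eq!(clamp_total(&[23, 10], 16), vec![11, 4]);
    // Already under the cap: returned unchanged.
    assert_eq!(clamp_total(&[23, 10], 40), vec![23, 10]);
}
```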