
Struct ElChe 

pub struct ElChe { /* private fields */ }

El Che: heterogeneous DDP cadence strategy.

The column marches at the slowest one’s pace. The slow device anchors the cadence (anchor batches per sync step), the fast ones range ahead doing more work, and everyone rejoins at AllReduce. No one waits, no one idles.

After each sync step, call report_timing with measured wall times and AllReduce overhead. El Che refines batch ratios and auto-tunes the anchor count to keep AllReduce overhead below a configurable target (default 10%).

Example

let ddp = Ddp::wrap(&[&model0, &model1], &devices)?;
let mut cadence = ElChe::new(2, 10);

loop {
    let start_events = record_start_events(&devices)?;
    for rank in 0..2 {
        for _ in 0..cadence.batches(rank) {
            forward_backward(rank)?;
        }
    }
    let wall_ms = measure_elapsed(&start_events)?;

    let sync_start = Instant::now();
    ddp.weighted_all_reduce_gradients(cadence.batch_counts())?;
    let sync_ms = sync_start.elapsed().as_secs_f64() * 1000.0;

    // Copy the counts out first: `report_timing` takes `&mut self`, which
    // cannot overlap with the shared borrow returned by `batch_counts`.
    let counts = cadence.batch_counts().to_vec();
    cadence.report_timing(&wall_ms, &counts, sync_ms);
}

Implementations

impl ElChe

pub fn new(world_size: usize, anchor: usize) -> Self

Create a new sync cadence.

world_size: number of devices (must be >= 2).
anchor: initial batch count for the slow device per sync step.

The first step uses equal counts (anchor for every device). After report_timing, ratios adapt to measured throughput.


pub fn with_overhead_target(self, target: f64) -> Self

Set the target AllReduce overhead as a fraction of compute time.

Default: 0.10 (10%). The anchor auto-tunes upward to keep overhead below this target. Lower values = fewer syncs = more gradient staleness.
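To make the interaction between target and anchor concrete, here is a minimal sketch of one plausible update rule. The helper name and arithmetic are illustrative assumptions, not the crate's actual implementation:

```rust
/// Sketch only (assumed arithmetic, not the crate's code): grow the
/// anchor when measured AllReduce overhead exceeds the target.
fn retune_anchor(anchor: usize, sync_ms: f64, compute_ms: f64, target: f64, max_anchor: usize) -> usize {
    let overhead = sync_ms / compute_ms;
    if overhead <= target {
        return anchor; // already under target: leave the cadence alone
    }
    // Spreading one AllReduce over proportionally more batches drives
    // the relative overhead back down toward the target.
    let scaled = (anchor as f64 * overhead / target).ceil() as usize;
    scaled.min(max_anchor)
}

fn main() {
    // 20% measured overhead against a 10% target doubles the anchor.
    assert_eq!(retune_anchor(10, 20.0, 100.0, 0.10, 200), 20);
    // Growth is capped by max_anchor.
    assert_eq!(retune_anchor(150, 50.0, 100.0, 0.10, 200), 200);
}
```

This also shows why lower targets mean more staleness: halving the target roughly doubles the anchor the tuner settles on.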


pub fn with_max_anchor(self, max: usize) -> Self

Set the maximum anchor count (gradient staleness limit).

Default: 200. Higher values allow fewer syncs but accumulate more batches of gradient before averaging. Set to 1 to sync after every slow-device batch (minimal accumulation, traditional DDP cadence).


pub fn with_max_batch_diff(self, max: usize) -> Self

Set the maximum batch difference between fastest and slowest worker.

When the fastest worker leads the slowest by more than this many batches, it is throttled (paused) until the gap closes. This prevents catastrophic divergence with large batches or extreme speed ratios.

  • None (default): no limit, workers run freely.
  • Some(0): strict lockstep, equivalent to synchronous DDP.
  • Some(n): fast workers may lead by at most n batches.
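The rule the bullets describe can be sketched as a standalone predicate (hypothetical helper, not part of this API):

```rust
/// Hypothetical helper illustrating the throttle rule implied by
/// `max_batch_diff` (not the crate's actual code).
fn should_throttle(fast_done: usize, slow_done: usize, max_diff: Option<usize>) -> bool {
    match max_diff {
        None => false,                                      // no limit: run freely
        Some(n) => fast_done.saturating_sub(slow_done) > n, // lead capped at n
    }
}

fn main() {
    assert!(!should_throttle(100, 0, None));  // runs freely
    assert!(should_throttle(6, 5, Some(0)));  // strict lockstep
    assert!(!should_throttle(7, 5, Some(2))); // lead of 2 allowed
    assert!(should_throttle(8, 5, Some(2)));  // lead of 3 throttled
}
```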

pub fn max_batch_diff(&self) -> Option<usize>

Current max batch diff setting.


pub fn with_speed_ratio(self, slow_rank: usize, ratio: f64) -> Self

Set initial speed estimate before the first timing measurement.

slow_rank: which device is slowest (receives anchor batches).
ratio: how many times faster the fastest device is (e.g., 3.0 means the fast GPU processes ~3x more batches per unit time).

Default (without this call): all devices start equal (anchor batches each). After the first report_timing, actual measurements replace this estimate, so even a wrong guess self-corrects in one step.

// RTX 5060 Ti (rank 0) is ~2.3x faster than GTX 1060 (rank 1)
let che = ElChe::new(2, 10).with_speed_ratio(1, 2.3);
// → rank 0: 23 batches, rank 1: 10 batches
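How a hint translates into counts can be sketched as follows (the rounding scheme is an assumption; the crate's exact behavior may differ):

```rust
/// Sketch (assumed rounding, not the crate's code): derive per-rank
/// batch counts from an anchor and a speed-ratio hint.
fn counts_from_ratio(world_size: usize, anchor: usize, slow_rank: usize, ratio: f64) -> Vec<usize> {
    (0..world_size)
        .map(|rank| {
            if rank == slow_rank {
                anchor // the slow device sets the pace
            } else {
                (anchor as f64 * ratio).round() as usize
            }
        })
        .collect()
}

fn main() {
    // Matches the example above: 2.3x ratio, anchor 10 → [23, 10].
    assert_eq!(counts_from_ratio(2, 10, 1, 2.3), vec![23, 10]);
}
```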

pub fn batches(&self, rank: usize) -> usize

Batch count for the given device rank in the current cadence step.


pub fn batch_counts(&self) -> &[usize]

Per-device batch counts (for Ddp::weighted_all_reduce_gradients).


pub fn total_batches(&self) -> usize

Total batches across all devices for this cadence step.


pub fn anchor(&self) -> usize

Current anchor batch count (slow device batches per step).


pub fn anchor_wall_ms(&self) -> f64

Target wall time (ms) for one sync interval.

Returns anchor * slowest_ms_per_batch, the intended wall-clock duration between AllReduce events. Every device should accumulate this much compute time before syncing. Returns 0 if not yet calibrated (no timing data).


pub fn nudge_anchor_down(&mut self, factor: f64)

Reduce the anchor by factor (e.g. 0.5 = halve).

One-directional correction for parameter divergence: tightens sync cadence when replicas drift apart. Does NOT loosen; ElChe’s overhead auto-tune handles upward adjustment.

Bypasses min_anchor (clamped to 1) because divergence is a stronger signal than the overhead floor. The overhead auto-tune will recover the anchor upward once divergence subsides.
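A minimal sketch of the assumed arithmetic:

```rust
/// Assumed arithmetic for the nudge (sketch, not the crate's code):
/// scale the anchor down by `factor` and clamp at a floor of 1.
fn nudge_down(anchor: usize, factor: f64) -> usize {
    ((anchor as f64 * factor).floor() as usize).max(1)
}

fn main() {
    assert_eq!(nudge_down(40, 0.5), 20); // halve the sync interval
    assert_eq!(nudge_down(1, 0.5), 1);   // never drops below 1
}
```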


pub fn is_calibrated(&self) -> bool

Whether at least one timing measurement has been reported.


pub fn has_speed_hint(&self) -> bool

Whether a speed hint was applied (batch_counts are non-uniform).

Used by the coordinator to decide if epoch 0 should use throughput-proportional partitions before calibration.


pub fn ms_per_batch(&self) -> &[f64]

Per-device milliseconds per batch from last measurement.


pub fn report_timing(&mut self, wall_ms: &[f64], actual_batches: &[usize], sync_ms: f64)

Report timing after a cadence step completes.

wall_ms[rank]: wall-clock time for all batches on that device (ms).
actual_batches[rank]: number of batches each rank actually processed since the last sync (i.e., steps_since_avg). In Cadence mode the fast GPU may process more batches than its intended batch_counts while waiting for the slow GPU to reach the trigger threshold; using the intended count as the divisor would inflate the fast GPU’s ms_per_batch and invert the throughput ratio.
sync_ms: AllReduce overhead for this step (ms).

Updates batch ratios based on measured throughput. If AllReduce overhead exceeds the target, anchor auto-tunes upward.
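The ratio update can be sketched as a standalone function (illustrative arithmetic, not the crate's code): measure ms/batch per device, then assign batches in proportion to each device's speed relative to the slowest.

```rust
/// Sketch of the ratio update (assumed rounding, not the crate's
/// code): faster devices get proportionally more batches per anchor.
fn rebalance(wall_ms: &[f64], actual_batches: &[usize], anchor: usize) -> Vec<usize> {
    let ms_per_batch: Vec<f64> = wall_ms
        .iter()
        .zip(actual_batches)
        .map(|(w, &b)| w / b as f64)
        .collect();
    let slowest = ms_per_batch.iter().cloned().fold(f64::MIN, f64::max);
    ms_per_batch
        .iter()
        .map(|&m| (anchor as f64 * slowest / m).round() as usize)
        .collect()
}

fn main() {
    // Rank 0 at 5 ms/batch, rank 1 at 20 ms/batch: rank 0 is 4x
    // faster, so it gets 4x the anchor of 10.
    assert_eq!(rebalance(&[100.0, 200.0], &[20, 10], 10), vec![40, 10]);
}
```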


pub fn clamp_total(&self, max_total: usize) -> Vec<usize>

Clamp batch counts to a maximum total, preserving proportions.

Returns a new batch-count vector. Use near epoch boundaries to avoid consuming more batches than remain.
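One plausible proportional-clamp scheme, sketched as a free function; the remainder handling here is an assumption, not the crate's actual implementation:

```rust
/// Sketch (assumed remainder handling): scale batch counts down so
/// their sum fits `max_total`, preserving proportions between devices.
fn clamp_total(counts: &[usize], max_total: usize) -> Vec<usize> {
    let total: usize = counts.iter().sum();
    if total <= max_total {
        return counts.to_vec(); // nothing to clamp
    }
    let scale = max_total as f64 / total as f64;
    let mut clamped: Vec<usize> = counts
        .iter()
        .map(|&c| (c as f64 * scale).floor() as usize)
        .collect();
    // Hand any rounding leftover to the device with the largest
    // original count so proportions stay close.
    let mut leftover = max_total - clamped.iter().sum::<usize>();
    let biggest = (0..counts.len()).max_by_key(|&i| counts[i]).unwrap();
    while leftover > 0 {
        clamped[biggest] += 1;
        leftover -= 1;
    }
    clamped
}

fn main() {
    // 11 batches left in the epoch: [23, 10] scales to [8, 3].
    assert_eq!(clamp_total(&[23, 10], 11), vec![8, 3]);
    // Under the limit: unchanged.
    assert_eq!(clamp_total(&[23, 10], 100), vec![23, 10]);
}
```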

Auto Trait Implementations

impl Freeze for ElChe
impl RefUnwindSafe for ElChe
impl Send for ElChe
impl Sync for ElChe
impl Unpin for ElChe
impl UnsafeUnpin for ElChe
impl UnwindSafe for ElChe
