pub struct ElChe { /* private fields */ }
El Che: heterogeneous DDP cadence strategy.
The column marches at the slowest one’s pace. The slow device
anchors the cadence (anchor batches per sync step), the fast
ones range ahead doing more work, and everyone rejoins at AllReduce.
No one waits, no one idles.
After each sync step, call report_timing
with measured wall times and AllReduce overhead. El Che refines
batch ratios and auto-tunes the anchor count to keep AllReduce overhead
below a configurable target (default 10%).
§Example

```rust
let ddp = Ddp::wrap(&[&model0, &model1], &devices)?;
let mut cadence = ElChe::new(2, 10);
loop {
    let start_events = record_start_events(&devices)?;
    for rank in 0..2 {
        for _ in 0..cadence.batches(rank) {
            forward_backward(rank)?;
        }
    }
    let wall_ms = measure_elapsed(&start_events)?;
    let sync_start = Instant::now();
    ddp.weighted_all_reduce_gradients(cadence.batch_counts())?;
    let sync_ms = sync_start.elapsed().as_secs_f64() * 1000.0;
    cadence.report_timing(&wall_ms, cadence.batch_counts(), sync_ms);
}
```

§Implementations
impl ElChe

pub fn new(world_size: usize, anchor: usize) -> Self
Create a new sync cadence.
world_size: number of devices (must be >= 2).
anchor: initial batch count for the slow device per sync step.
The first step uses equal counts (anchor for every device).
After report_timing, ratios adapt
to measured throughput.
pub fn with_overhead_target(self, target: f64) -> Self
Set the target AllReduce overhead as a fraction of compute time.
Default: 0.10 (10%). The anchor auto-tunes upward to keep overhead below this target. Lower targets force larger anchors, hence fewer syncs and more gradient staleness.
pub fn with_max_anchor(self, max: usize) -> Self
Set the maximum anchor count (gradient staleness limit).
Default: 200. Higher values allow fewer syncs but accumulate more batches of gradient before averaging. Set to 1 to sync after every slow-device batch (minimal accumulation, traditional DDP cadence).
pub fn with_max_batch_diff(self, max: usize) -> Self
Set the maximum batch difference between fastest and slowest worker.
When the fastest worker leads the slowest by more than this many batches, it is throttled (paused) until the gap closes. This prevents catastrophic divergence with large batches or extreme speed ratios.
None (default): no limit; workers run freely. Some(0): strict lockstep, equivalent to synchronous DDP. Some(n): fast workers may lead by at most n batches.
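The throttle decision implied by these settings can be sketched as follows. This is an assumed illustration, not the crate's internals: `completed` holds per-rank batch counts since the last sync, and a rank pauses when its lead over the slowest rank exceeds the cap.

```rust
// Sketch (assumed, not the crate source) of the max_batch_diff throttle.
fn should_throttle(completed: &[usize], rank: usize, max_diff: Option<usize>) -> bool {
    match max_diff {
        None => false, // no limit: workers run freely
        Some(cap) => {
            // Lead over the slowest worker, saturating at zero.
            let slowest = *completed.iter().min().unwrap_or(&0);
            completed[rank].saturating_sub(slowest) > cap
        }
    }
}

fn main() {
    // Some(0): strict lockstep -- any lead at all throttles.
    assert!(should_throttle(&[1, 0], 0, Some(0)));
    // Some(4): a lead of 4 is allowed, 5 is not.
    assert!(!should_throttle(&[4, 0], 0, Some(4)));
    assert!(should_throttle(&[5, 0], 0, Some(4)));
    // None: never throttle, however large the gap.
    assert!(!should_throttle(&[100, 0], 0, None));
}
```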
pub fn max_batch_diff(&self) -> Option<usize>
Current max batch diff setting.
pub fn with_speed_ratio(self, slow_rank: usize, ratio: f64) -> Self
Set initial speed estimate before the first timing measurement.
slow_rank: which device is slowest (receives anchor batches).
ratio: how many times faster the fastest device is (e.g., 3.0
means the fast GPU processes ~3x more batches per unit time).
Default (without this call): all devices start equal (anchor
batches each). After the first report_timing,
actual measurements replace this estimate, so even a wrong guess
self-corrects in one step.
```rust
// RTX 5060 Ti (rank 0) is ~2.3x faster than GTX 1060 (rank 1)
let che = ElChe::new(2, 10).with_speed_ratio(1, 2.3);
// → rank 0: 23 batches, rank 1: 10 batches
```

pub fn batches(&self, rank: usize) -> usize
Batch count for the given device rank in the current cadence step.
pub fn batch_counts(&self) -> &[usize]
Per-device batch counts (for Ddp::weighted_all_reduce_gradients).
pub fn total_batches(&self) -> usize
Total batches across all devices for this cadence step.
pub fn anchor_wall_ms(&self) -> f64
Target wall time (ms) for one sync interval.
Returns anchor * slowest_ms_per_batch: the intended wall-clock
duration between AllReduce events. Every device should accumulate
this much compute time before syncing. Returns 0 if not yet
calibrated (no timing data).
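A minimal sketch of this quantity, assuming the documented formula (anchor times the slowest device's ms/batch, 0 before calibration):

```rust
// Sketch of anchor_wall_ms per the documented formula; a free function
// here since the struct's fields are private.
fn anchor_wall_ms(anchor: usize, ms_per_batch: &[f64]) -> f64 {
    // The slowest device has the largest ms/batch.
    let slowest = ms_per_batch.iter().cloned().fold(0.0_f64, f64::max);
    if slowest == 0.0 { 0.0 } else { anchor as f64 * slowest }
}

fn main() {
    // anchor = 10, slow device at 42 ms/batch -> 420 ms sync interval.
    assert_eq!(anchor_wall_ms(10, &[18.0, 42.0]), 420.0);
    // No timing data yet -> not calibrated -> 0.
    assert_eq!(anchor_wall_ms(10, &[0.0, 0.0]), 0.0);
}
```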
pub fn nudge_anchor_down(&mut self, factor: f64)
Reduce the anchor by factor (e.g. 0.5 = halve).
One-directional correction for parameter divergence: tightens sync cadence when replicas drift apart. Does NOT loosen; ElChe’s overhead auto-tune handles upward adjustment.
Bypasses min_anchor (clamped to 1) because divergence is a stronger
signal than the overhead floor. The overhead auto-tune will recover
the anchor upward once divergence subsides.
pub fn is_calibrated(&self) -> bool
Whether at least one timing measurement has been reported.
pub fn has_speed_hint(&self) -> bool
Whether a speed hint was applied (batch_counts are non-uniform).
Used by the coordinator to decide if epoch 0 should use throughput-proportional partitions before calibration.
pub fn ms_per_batch(&self) -> &[f64]
Per-device milliseconds per batch from last measurement.
pub fn report_timing(
    &mut self,
    wall_ms: &[f64],
    actual_batches: &[usize],
    sync_ms: f64,
)
Report timing after a cadence step completes.
wall_ms[rank]: wall-clock time for all batches on that device (ms).
actual_batches[rank]: number of batches each rank actually processed
since the last sync (i.e., steps_since_avg). In Cadence mode the fast
GPU may process more batches than its intended batch_counts while
waiting for the slow GPU to reach the trigger threshold. Using the
intended count as divisor would inflate the fast GPU’s ms_per_batch,
inverting the throughput ratio.
sync_ms: AllReduce overhead for this step (ms).
Updates batch ratios based on measured throughput. If AllReduce overhead exceeds the target, anchor auto-tunes upward.
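The re-tuning step this describes can be sketched as below. This is an assumed reconstruction, not the crate's actual update rule: throughput comes from measured wall time over actual batches, faster devices get proportionally more batches with the slowest pinned at `anchor`, and the anchor doubles (one possible growth policy) when sync overhead exceeds the target fraction.

```rust
// Sketch (assumed) of the report_timing update: returns new per-rank
// batch counts and a possibly enlarged anchor.
fn retune(
    anchor: usize,
    wall_ms: &[f64],
    actual_batches: &[usize],
    sync_ms: f64,
    overhead_target: f64,
) -> (Vec<usize>, usize) {
    // Throughput from actual batches, so idle-time inflation on the
    // fast device does not invert the ratio.
    let ms_per_batch: Vec<f64> = wall_ms
        .iter()
        .zip(actual_batches)
        .map(|(w, &b)| w / b as f64)
        .collect();
    let slowest = ms_per_batch.iter().cloned().fold(0.0_f64, f64::max);
    // Slowest device keeps `anchor` batches; faster devices scale up.
    let counts: Vec<usize> = ms_per_batch
        .iter()
        .map(|&ms| ((anchor as f64) * slowest / ms).round() as usize)
        .collect();
    // Grow the sync interval when AllReduce eats too large a fraction.
    let compute_ms = anchor as f64 * slowest;
    let new_anchor = if sync_ms / compute_ms > overhead_target {
        anchor * 2
    } else {
        anchor
    };
    (counts, new_anchor)
}

fn main() {
    // Rank 1 is the slow device: 42 ms/batch vs 18 ms/batch.
    let (counts, anchor) = retune(10, &[180.0, 420.0], &[10, 10], 30.0, 0.10);
    assert_eq!(counts, vec![23, 10]); // 10 * 42/18 ≈ 23.3 -> 23
    assert_eq!(anchor, 10);           // 30/420 ≈ 7% overhead: under target
    // 60 ms of sync -> ~14% overhead: anchor grows.
    let (_, anchor) = retune(10, &[180.0, 420.0], &[10, 10], 60.0, 0.10);
    assert_eq!(anchor, 20);
}
```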
pub fn clamp_total(&self, max_total: usize) -> Vec<usize>
Clamp batch counts to a maximum total, preserving proportions.
Returns a new batch-count vector. Use near epoch boundaries to avoid consuming more batches than remain.
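The proportional clamping can be sketched as follows; the rounding policy (floor, with a 1-batch minimum per device) is an assumption, not confirmed crate behavior:

```rust
// Sketch (assumed) of clamp_total: scale every count by
// max_total / total when the planned total exceeds what remains.
fn clamp_total(counts: &[usize], max_total: usize) -> Vec<usize> {
    let total: usize = counts.iter().sum();
    if total <= max_total {
        return counts.to_vec(); // nothing to clamp
    }
    let scale = max_total as f64 / total as f64;
    counts
        .iter()
        .map(|&c| ((c as f64 * scale).floor() as usize).max(1))
        .collect()
}

fn main() {
    // 23 + 10 = 33 batches planned, only 16 remain in the epoch.
    // scale = 16/33: 23 -> 11, 10 -> 4; the ~2.3:1 proportion survives.
    assert_eq!(clamp_total(&[23, 10], 16), vec![11, 4]);
    // Already under the cap: returned unchanged.
    assert_eq!(clamp_total(&[23, 10], 40), vec![23, 10]);
}
```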