Module distributed


Distributed execution infrastructure for multi-device and multi-node computation.

This module provides distributed training and inference capabilities:

  • DistributedExecutor: Multi-device execution coordination
  • DataParallelism: Data-parallel training across devices
  • ModelParallelism: Model-parallel execution with tensor sharding
  • CommunicationBackend: Abstract interface for device communication
  • TlDistributedExecutor: Trait for executors that support distributed execution

§Parallelism Strategies

§Data Parallelism

  • Each device processes a different subset of the batch
  • Gradients are averaged across devices
  • Suitable for models that fit on a single device
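The gradient-averaging step behind data parallelism can be sketched in plain Rust. `average_gradients` is an illustrative helper operating on `f64` vectors, not part of this crate's API (the real `DataParallelCoordinator` works on tensors via a `CommunicationBackend`):

```rust
/// Average per-device gradients, the reduction at the heart of
/// data-parallel training. Assumes all devices report gradients
/// of the same length and at least one device is present.
fn average_gradients(per_device: &[Vec<f64>]) -> Vec<f64> {
    let n = per_device.len() as f64;
    let mut avg = vec![0.0; per_device[0].len()];
    for grads in per_device {
        for (a, g) in avg.iter_mut().zip(grads) {
            *a += g / n;
        }
    }
    avg
}

fn main() {
    // Two devices, each holding gradients from its own batch shard.
    let g0 = vec![1.0, 2.0];
    let g1 = vec![3.0, 4.0];
    let avg = average_gradients(&[g0, g1]);
    assert_eq!(avg, vec![2.0, 3.0]);
}
```

In a real backend this averaging is performed collectively (e.g. an all-reduce followed by a divide) rather than on one host.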

§Model Parallelism

  • Model is split across multiple devices
  • Each device processes different parts of the model
  • Suitable for large models that don’t fit on a single device
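Tensor sharding, the mechanism `ShardingSpec` describes, can be illustrated with a row-wise split of a weight matrix: each device holds a slice of rows, computes its slice of the output, and the per-device outputs concatenate into the full result. The helpers below are a sketch under those assumptions, not the crate's actual sharding code:

```rust
/// Split a matrix's rows into roughly equal contiguous shards,
/// one per device (row-wise sharding).
fn shard_rows(rows: Vec<Vec<f64>>, num_devices: usize) -> Vec<Vec<Vec<f64>>> {
    let per = (rows.len() + num_devices - 1) / num_devices;
    rows.chunks(per).map(|c| c.to_vec()).collect()
}

/// Dense matrix-vector product over a (possibly sharded) set of rows.
fn matvec(rows: &[Vec<f64>], x: &[f64]) -> Vec<f64> {
    rows.iter()
        .map(|r| r.iter().zip(x).map(|(a, b)| a * b).sum())
        .collect()
}

fn main() {
    let w = vec![
        vec![1.0, 0.0],
        vec![0.0, 1.0],
        vec![1.0, 1.0],
        vec![2.0, 0.0],
    ];
    let x = vec![3.0, 4.0];

    // Reference: the full product on a single device.
    let full = matvec(&w, &x);

    // Sharded across 2 devices: each computes its rows; the
    // concatenated outputs match the single-device result.
    let sharded: Vec<f64> = shard_rows(w, 2)
        .iter()
        .flat_map(|s| matvec(s, &x))
        .collect();
    assert_eq!(full, sharded);
}
```

Row-wise sharding needs only an all-gather of outputs; column-wise sharding would instead require a reduction, which is why a sharding specification must record the split axis.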

§Hybrid Parallelism

  • Combines data and model parallelism
  • Model is split across devices, each replica processes different data
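One common way to realize hybrid parallelism is a device grid: with `num_devices = replicas × shards_per_replica`, each device index maps to a (data-parallel replica, model shard) pair. The mapping below is an illustrative convention, not necessarily the one `DistributedPlacementPlan` uses:

```rust
/// Map a flat device index onto a (replica, shard) coordinate in a
/// hybrid data-/model-parallel grid. Devices in the same replica
/// hold different model shards; devices with the same shard index
/// across replicas exchange gradients.
fn placement(device: usize, shards_per_replica: usize) -> (usize, usize) {
    (device / shards_per_replica, device % shards_per_replica)
}

fn main() {
    // 4 devices = 2 data-parallel replicas × 2 model shards each.
    assert_eq!(placement(0, 2), (0, 0));
    assert_eq!(placement(1, 2), (0, 1));
    assert_eq!(placement(2, 2), (1, 0));
    assert_eq!(placement(3, 2), (1, 1));
}
```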

§Example

use tensorlogic_infer::distributed::{DistributedConfig, ParallelismStrategy};

let config = DistributedConfig {
    parallelism: ParallelismStrategy::DataParallel,
    num_devices: 4,
    ..Default::default()
};

Structs§

DataParallelCoordinator
Data parallelism coordinator.
DistributedConfig
Configuration for distributed execution.
DistributedExecutor
Distributed executor that coordinates multi-device execution.
DistributedPlacementPlan
Placement plan for distributed execution.
DistributedStats
Statistics for distributed execution.
DummyCommunicationBackend
Dummy communication backend for testing.
ModelParallelCoordinator
Model parallelism coordinator.
PipelineParallelCoordinator
Pipeline parallelism coordinator.
ShardingSpec
Tensor sharding specification for model parallelism.

Enums§

CommunicationOp
Communication operation for distributed execution.
ParallelismStrategy
Parallelism strategy for distributed execution.
ReductionOp
Reduction operation for communication.

Traits§

CommunicationBackend
Abstract communication backend for device-to-device communication.
TlDistributedExecutor
Trait for executors that support distributed execution.
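The shape of a communication backend can be sketched with a stand-in trait. The real `CommunicationBackend` signatures are not shown on this page, so the trait, method name, and in-process implementation below are all illustrative (playing a role similar to `DummyCommunicationBackend` in tests):

```rust
/// Illustrative stand-in for an abstract communication backend:
/// a single collective, all-reduce with sum, over f64 buffers.
trait Backend {
    /// After the call, every buffer holds the element-wise sum
    /// of all buffers (the classic all-reduce semantics).
    fn all_reduce_sum(&self, buffers: &mut [Vec<f64>]);
}

/// In-process backend: all "devices" live in one address space,
/// so the collective is a plain loop rather than network traffic.
struct InProcessBackend;

impl Backend for InProcessBackend {
    fn all_reduce_sum(&self, buffers: &mut [Vec<f64>]) {
        let mut sum = vec![0.0; buffers[0].len()];
        for b in buffers.iter() {
            for (s, v) in sum.iter_mut().zip(b) {
                *s += v;
            }
        }
        for b in buffers.iter_mut() {
            b.copy_from_slice(&sum);
        }
    }
}

fn main() {
    let mut bufs = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    InProcessBackend.all_reduce_sum(&mut bufs);
    assert_eq!(bufs[0], vec![4.0, 6.0]);
    assert_eq!(bufs[1], vec![4.0, 6.0]);
}
```

Keeping the backend behind a trait is what lets the same coordinators run against a dummy in-process backend in tests and a real multi-device transport in production.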