Distributed execution infrastructure for multi-device and multi-node computation.
This module provides distributed training and inference capabilities:
- DistributedExecutor: Multi-device execution coordination
- DataParallelism: Data-parallel training across devices
- ModelParallelism: Model-parallel execution with tensor sharding
- CommunicationBackend: Abstract interface for device communication
- TlDistributedExecutor: Trait for executors that support distributed execution
§Parallelism Strategies
§Data Parallelism
- Each device processes a different subset of the batch
- Gradients are averaged across devices
- Suitable for models that fit on a single device
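The gradient-averaging step above can be sketched as follows. This is a minimal illustration, not the crate's API: the function `average_gradients` is hypothetical, standing in for the all-reduce-mean that a data-parallel coordinator performs after each backward pass.

```rust
// Hypothetical sketch: average per-device gradients (all-reduce mean).
// Each inner Vec holds one device's gradients for the same parameters.
fn average_gradients(per_device: &[Vec<f32>]) -> Vec<f32> {
    let n = per_device.len() as f32;
    let mut avg = vec![0.0f32; per_device[0].len()];
    for grads in per_device {
        for (a, g) in avg.iter_mut().zip(grads) {
            *a += g / n;
        }
    }
    avg
}

fn main() {
    // Two devices, each holding gradients for the same two parameters.
    let device_grads = vec![vec![1.0, 3.0], vec![3.0, 5.0]];
    let avg = average_gradients(&device_grads);
    assert_eq!(avg, vec![2.0, 4.0]);
    println!("averaged gradients: {:?}", avg);
}
```

After averaging, every replica applies the same update, keeping the model copies in sync.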
§Model Parallelism
- Model is split across multiple devices
- Each device processes different parts of the model
- Suitable for large models that don’t fit on a single device
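One common way to split a model, and roughly what a sharding specification describes, is column-wise partitioning of a weight matrix across devices. The sketch below is a hypothetical illustration (`shard_columns` is not part of the crate), assuming a row-major matrix whose column count divides evenly by the device count.

```rust
// Hypothetical sketch: column-wise sharding of a row-major weight matrix.
// Device d receives columns [d * chunk, (d + 1) * chunk) of every row.
fn shard_columns(matrix: &[Vec<f32>], num_devices: usize) -> Vec<Vec<Vec<f32>>> {
    let chunk = matrix[0].len() / num_devices; // assumes even divisibility
    (0..num_devices)
        .map(|d| {
            matrix
                .iter()
                .map(|row| row[d * chunk..(d + 1) * chunk].to_vec())
                .collect()
        })
        .collect()
}

fn main() {
    let w = vec![vec![1.0, 2.0, 3.0, 4.0], vec![5.0, 6.0, 7.0, 8.0]];
    let shards = shard_columns(&w, 2);
    // Device 0 holds the left half of every row, device 1 the right half.
    assert_eq!(shards[0], vec![vec![1.0, 2.0], vec![5.0, 6.0]]);
    assert_eq!(shards[1], vec![vec![3.0, 4.0], vec![7.0, 8.0]]);
}
```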
§Hybrid Parallelism
- Combines data and model parallelism
- Model is split across devices, each replica processes different data
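A common way to reason about hybrid parallelism is a 2D device grid: one axis indexes data-parallel replicas, the other indexes model shards within a replica. The mapping below is a hypothetical sketch (`grid_coords` is not a crate function) showing how a flat device id decomposes into those coordinates.

```rust
// Hypothetical sketch: map a global device id onto a (replica, shard) grid.
// Devices with the same replica index hold different model shards;
// devices with the same shard index hold replicas of the same shard.
fn grid_coords(device_id: usize, shards_per_replica: usize) -> (usize, usize) {
    (device_id / shards_per_replica, device_id % shards_per_replica)
}

fn main() {
    // 4 devices with 2 model shards per replica => 2 data-parallel replicas.
    assert_eq!(grid_coords(0, 2), (0, 0)); // replica 0, shard 0
    assert_eq!(grid_coords(3, 2), (1, 1)); // replica 1, shard 1
}
```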
§Example
```rust
use tensorlogic_infer::distributed::{DistributedConfig, ParallelismStrategy};

let config = DistributedConfig {
    parallelism: ParallelismStrategy::DataParallel,
    num_devices: 4,
    ..Default::default()
};
```
Structs§
- DataParallelCoordinator - Data parallelism coordinator.
- DistributedConfig - Configuration for distributed execution.
- DistributedExecutor - Distributed executor that coordinates multi-device execution.
- DistributedPlacementPlan - Placement plan for distributed execution.
- DistributedStats - Statistics for distributed execution.
- DummyCommunicationBackend - Dummy communication backend for testing.
- ModelParallelCoordinator - Model parallelism coordinator.
- PipelineParallelCoordinator - Pipeline parallelism coordinator.
- ShardingSpec - Tensor sharding specification for model parallelism.
Enums§
- CommunicationOp - Communication operation for distributed execution.
- ParallelismStrategy - Parallelism strategy for distributed execution.
- ReductionOp - Reduction operation for communication.
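To make the reduction-operation idea concrete, here is a hypothetical stand-in enum (named `Reduction` to avoid implying the crate's `ReductionOp` has this exact shape) showing how values gathered from all devices are combined under different reductions.

```rust
// Hypothetical illustration of a reduction-op enum for collectives.
#[derive(Clone, Copy)]
enum Reduction {
    Sum,
    Max,
    Mean,
}

// Combine one value per device with the chosen reduction.
fn reduce(values: &[f32], op: Reduction) -> f32 {
    match op {
        Reduction::Sum => values.iter().sum(),
        Reduction::Max => values.iter().cloned().fold(f32::MIN, f32::max),
        Reduction::Mean => values.iter().sum::<f32>() / values.len() as f32,
    }
}

fn main() {
    let per_device = [1.0, 2.0, 3.0, 4.0];
    assert_eq!(reduce(&per_device, Reduction::Sum), 10.0);
    assert_eq!(reduce(&per_device, Reduction::Mean), 2.5);
}
```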
Traits§
- CommunicationBackend - Abstract communication backend for device-to-device communication.
- TlDistributedExecutor - Trait for executors that support distributed execution.