pub struct ExpertParallelismConfig {
pub num_experts: usize,
pub num_experts_per_token: usize,
pub capacity_factor: f32,
pub load_balance_loss_coeff: f32,
pub router_z_loss_coeff: f32,
pub expert_dropout: f32,
pub enable_load_balancing: bool,
pub sharding_strategy: ExpertShardingStrategy,
pub max_expert_batch_size: Option<usize>,
pub enable_gradient_accumulation: bool,
pub gradient_accumulation_steps: usize,
pub initialization_strategy: ExpertInitStrategy,
pub enable_expert_sync: bool,
pub sync_frequency: usize,
pub gate_network: Option<GateNetworkConfig>,
pub load_balancing: Option<LoadBalancingConfig>,
pub migration: Option<ExpertMigrationConfig>,
pub enable_expert_migration: bool,
pub migration_threshold: f32,
pub memory_per_expert_mb: usize,
pub communication_overlap: bool,
pub gradient_compression: bool,
}
Expert parallelism configuration
This structure contains all the configuration parameters needed to set up and run a Mixture of Experts (MoE) model with distributed expert parallelism.
Examples
use torsh_distributed::expert_parallelism::config::{ExpertParallelismConfig, ExpertShardingStrategy};
let config = ExpertParallelismConfig {
num_experts: 16,
num_experts_per_token: 2,
capacity_factor: 1.5,
sharding_strategy: ExpertShardingStrategy::ModelParallel,
..Default::default()
};
Fields
num_experts: usize
Number of experts in the MoE layer
This determines the total number of expert networks available for routing. Typical values range from 8 to 1024 depending on model size and requirements.
num_experts_per_token: usize
Number of experts to activate per token (top-k)
Each token is routed to the top-k experts based on router scores. Common values are 1, 2, or 4. Higher values increase computational cost but may improve model quality.
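The top-k routing described above can be sketched as follows. This is an illustrative standalone function, not part of the crate's API:

```rust
// Illustrative sketch (not the crate's API): select the top-k expert indices
// for one token from its router scores.
fn top_k_experts(scores: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    // Sort indices by descending score (assumes no NaN scores).
    idx.sort_by(|&a, &b| scores[b].partial_cmp(&scores[a]).unwrap());
    idx.truncate(k);
    idx
}

fn main() {
    // Scores for 4 experts; top-2 routing picks experts 1 and 3.
    println!("{:?}", top_k_experts(&[0.1, 0.7, 0.2, 0.5], 2));
}
```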
capacity_factor: f32
Expert capacity factor (capacity = tokens_per_expert * capacity_factor)
This factor determines how many tokens each expert can process. Values > 1.0 provide buffer capacity to handle load imbalance. Typical range: 1.0 to 2.0.
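The capacity formula above can be sketched as a standalone computation (the function name is an assumption for illustration, not the crate's API):

```rust
// Illustrative sketch of the capacity formula:
// capacity = (total_tokens / num_experts) * capacity_factor
fn expert_capacity(total_tokens: usize, num_experts: usize, capacity_factor: f32) -> usize {
    let tokens_per_expert = total_tokens as f32 / num_experts as f32;
    // Round up so buffer capacity is never truncated away.
    (tokens_per_expert * capacity_factor).ceil() as usize
}

fn main() {
    // 4096 tokens across 16 experts with a 1.5 buffer: 256 * 1.5 = 384.
    println!("{}", expert_capacity(4096, 16, 1.5));
}
```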
load_balance_loss_coeff: f32
Load balancing loss coefficient
Weight for the auxiliary loss that encourages balanced expert utilization. Higher values enforce stronger load balancing but may hurt model quality. Typical range: 0.001 to 0.1.
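One common form of this auxiliary loss is the Switch-Transformer-style term loss = coeff * num_experts * sum_i(f_i * p_i), where f_i is the fraction of tokens routed to expert i and p_i the mean router probability for expert i; it is minimized when both are uniform. The crate's exact formulation may differ, so the sketch below is only illustrative:

```rust
// Hedged sketch of a Switch-Transformer-style auxiliary load-balancing loss.
// The crate's exact formulation may differ.
fn load_balance_loss(token_fractions: &[f32], mean_probs: &[f32], coeff: f32) -> f32 {
    let num_experts = token_fractions.len() as f32;
    let dot: f32 = token_fractions
        .iter()
        .zip(mean_probs)
        .map(|(f, p)| f * p)
        .sum();
    coeff * num_experts * dot
}

fn main() {
    // Perfectly balanced 4-expert routing: the loss reduces to the coefficient.
    println!("{}", load_balance_loss(&[0.25; 4], &[0.25; 4], 0.01));
}
```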
router_z_loss_coeff: f32
Router z-loss coefficient (for numerical stability)
Weight for the z-loss that encourages router logits to stay close to zero, improving numerical stability. Typical range: 0.0001 to 0.01.
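A common form of the z-loss is coeff times the mean over tokens of the squared log-sum-exp of the router logits, which penalizes large-magnitude logits. The crate's exact form may differ; this is a hedged sketch:

```rust
// Hedged sketch of a router z-loss:
// coeff * mean over tokens of (log-sum-exp of the router logits)^2.
fn router_z_loss(logits_per_token: &[Vec<f32>], coeff: f32) -> f32 {
    let sum_sq: f32 = logits_per_token
        .iter()
        .map(|logits| {
            // Numerically stable log-sum-exp.
            let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let lse = max + logits.iter().map(|x| (x - max).exp()).sum::<f32>().ln();
            lse * lse
        })
        .sum();
    coeff * sum_sq / logits_per_token.len() as f32
}

fn main() {
    // One token, two equal logits of 0.0: lse = ln(2), loss = coeff * ln(2)^2.
    println!("{}", router_z_loss(&[vec![0.0, 0.0]], 1.0));
}
```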
expert_dropout: f32
Expert dropout probability during training
Probability of randomly dropping experts during training to improve robustness and prevent overfitting. Range: 0.0 to 1.0.
enable_load_balancing: bool
Enable load balancing across devices
When true, the system actively monitors and rebalances expert utilization across different devices to optimize resource usage.
sharding_strategy: ExpertShardingStrategy
Expert sharding strategy
Determines how experts are distributed across devices and processes.
max_expert_batch_size: Option<usize>
Maximum batch size for expert processing
Limits the number of tokens that can be processed by a single expert in one forward pass. Helps control memory usage.
enable_gradient_accumulation: bool
Enable gradient accumulation across experts
When true, gradients are accumulated across multiple expert invocations before updating parameters, which can improve training stability.
gradient_accumulation_steps: usize
Number of gradient accumulation steps
Only relevant when gradient accumulation is enabled.
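Gradient accumulation can be sketched as averaging gradients over several micro-batches before applying one parameter update. This standalone example is illustrative only, not the crate's implementation:

```rust
// Illustrative sketch (not the crate's API): average gradients over several
// micro-batches before applying a single parameter update.
fn accumulated_gradient(micro_grads: &[Vec<f32>]) -> Vec<f32> {
    let steps = micro_grads.len() as f32;
    let dim = micro_grads[0].len();
    let mut acc = vec![0.0f32; dim];
    // Sum gradients element-wise across the micro-batches.
    for g in micro_grads {
        for (a, x) in acc.iter_mut().zip(g) {
            *a += x;
        }
    }
    // Scale by the number of accumulation steps.
    acc.iter().map(|a| a / steps).collect()
}

fn main() {
    // Two accumulation steps over a 2-parameter gradient.
    println!("{:?}", accumulated_gradient(&[vec![1.0, 2.0], vec![3.0, 4.0]]));
}
```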
initialization_strategy: ExpertInitStrategy
Expert initialization strategy
Method used to initialize expert parameters.
enable_expert_sync: bool
Enable expert synchronization
When true, experts synchronize their parameters periodically during training.
sync_frequency: usize
Synchronization frequency (in steps)
How often to synchronize expert parameters when synchronization is enabled.
gate_network: Option<GateNetworkConfig>
Gate network configuration
Optional configuration for hierarchical or advanced gate networks.
load_balancing: Option<LoadBalancingConfig>
Load balancing configuration
Configuration for expert load balancing and migration.
migration: Option<ExpertMigrationConfig>
Migration configuration
Configuration for expert migration strategies and triggers.
enable_expert_migration: bool
Enable expert migration (simplified flag)
migration_threshold: f32
Migration threshold for triggering migrations
memory_per_expert_mb: usize
Memory allocated per expert (in MB)
communication_overlap: bool
Enable communication overlap
gradient_compression: bool
Enable gradient compression
Implementations
impl ExpertParallelismConfig
pub fn small_scale() -> Self
Create a configuration optimized for small-scale deployment
Returns
A configuration suitable for models with 8-16 experts
pub fn large_scale() -> Self
Create a configuration optimized for large-scale deployment
Returns
A configuration suitable for models with 64+ experts
pub fn inference() -> Self
Create a configuration optimized for inference
Returns
A configuration with settings optimized for inference workloads
pub fn calculate_expert_capacity(&self, total_tokens: usize) -> usize
Calculate the per-expert token capacity for the given total token count, based on num_experts and capacity_factor.
pub fn recommended_num_devices(&self) -> usize
Get the recommended number of devices for this configuration
Returns
Recommended number of devices based on the sharding strategy
Trait Implementations
impl Clone for ExpertParallelismConfig
fn clone(&self) -> ExpertParallelismConfig
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Debug for ExpertParallelismConfig
impl Default for ExpertParallelismConfig
impl<'de> Deserialize<'de> for ExpertParallelismConfig
fn deserialize<__D>(__deserializer: __D) -> Result<Self, __D::Error>
where
    __D: Deserializer<'de>,
Auto Trait Implementations
impl Freeze for ExpertParallelismConfig
impl RefUnwindSafe for ExpertParallelismConfig
impl Send for ExpertParallelismConfig
impl Sync for ExpertParallelismConfig
impl Unpin for ExpertParallelismConfig
impl UnsafeUnpin for ExpertParallelismConfig
impl UnwindSafe for ExpertParallelismConfig
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true; otherwise converts self into a Right variant.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true; otherwise converts self into a Right variant.