Mixture of Experts (MoE) layer
Provides a sparse MoE layer where each input token is routed to a subset of expert networks via a learned gating mechanism. This enables scaling model capacity without proportionally increasing computation.
§Architecture
- Router: Linear gating network with softmax that selects top-k experts per token
- Experts: Independent feed-forward networks, each with its own weights and biases
- MoeLayer: Combines router and experts into a single forward pass
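The routing step described above can be sketched in a few lines of plain Rust. This is only an illustration of softmax followed by top-k selection, not this crate's `TopKRouter` API: the function names `softmax` and `top_k_gate` are hypothetical, and renormalizing the selected weights so they sum to 1 is an assumption that the source does not confirm.

```rust
/// Numerically stable softmax over one token's router logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Hypothetical helper: returns `(expert_index, weight)` pairs for the `k`
/// most probable experts, ordered from most to least probable.
/// Assumption: weights are renormalized over the selected experts.
fn top_k_gate(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let probs = softmax(logits);
    let mut order: Vec<usize> = (0..probs.len()).collect();
    // Sort expert indices by descending probability.
    order.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));
    order.truncate(k);
    // Probability mass held by the selected experts.
    let mass: f32 = order.iter().map(|&i| probs[i]).sum();
    order.into_iter().map(|i| (i, probs[i] / mass)).collect()
}
```

Calling `top_k_gate(&[1.0, 3.0, 2.0, 0.0], 2)` selects experts 1 and 2, in that order. Only those two experts would then be evaluated for the token, which is where the compute savings come from.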
§Load Balancing
The balance_loss() method computes a Switch Transformer-style auxiliary loss
that penalizes uneven expert utilization, encouraging the router to distribute
tokens uniformly across experts.
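For intuition, the Switch Transformer auxiliary loss multiplies, per expert, the fraction of tokens dispatched to it by the mean router probability it receives, sums over experts, and scales by the number of experts. The sketch below follows that formula under stated assumptions: the name `switch_balance_loss` and its signature are hypothetical and may not match `balance_loss()`, top-1 assignments are assumed, and the scaling coefficient from the paper is left to the caller.

```rust
/// Hypothetical sketch of a Switch Transformer-style balance loss.
/// `probs`: one softmax row per token, each of length `num_experts`.
/// `assignments`: the top-1 expert index chosen for each token.
fn switch_balance_loss(probs: &[Vec<f32>], assignments: &[usize], num_experts: usize) -> f32 {
    assert_eq!(probs.len(), assignments.len());
    if probs.is_empty() {
        return 0.0;
    }
    let tokens = probs.len() as f32;
    // frac[i]: fraction of tokens dispatched to expert i.
    let mut frac = vec![0.0f32; num_experts];
    // mean_prob[i]: mean router probability assigned to expert i.
    let mut mean_prob = vec![0.0f32; num_experts];
    for (row, &e) in probs.iter().zip(assignments) {
        frac[e] += 1.0 / tokens;
        for (m, &p) in mean_prob.iter_mut().zip(row) {
            *m += p / tokens;
        }
    }
    let dot: f32 = frac.iter().zip(&mean_prob).map(|(f, p)| f * p).sum();
    num_experts as f32 * dot
}
```

With perfectly uniform routing the value is 1.0, and it grows as tokens and probability mass concentrate on fewer experts, so minimizing it pushes the router toward an even split.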
§References
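- Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR.
- Lepikhin, D., et al. (2021). GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. ICLR.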
Re-exports§
pub use router::NoisyTopKRouter;
pub use router::RoutingResult;
pub use router::TopKRouter;
Modules§
- router: Gating/routing mechanisms for Mixture of Experts
Structs§
- Expert: A single expert network: a two-layer feed-forward with ReLU activation.
- MoeConfig: Configuration for a Mixture of Experts layer.
- MoeLayer: Mixture of Experts layer combining a router with a set of expert networks.
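To make the expert description concrete, here is a minimal sketch of a two-layer feed-forward network with a ReLU between the layers. `TinyExpert` and `affine` are hypothetical stand-ins written for illustration. The real `Expert` struct may store and name its parameters differently.

```rust
/// Hypothetical stand-in for an expert; not the crate's `Expert` type.
struct TinyExpert {
    w1: Vec<Vec<f32>>, // hidden_dim rows, each of length input_dim
    b1: Vec<f32>,      // hidden_dim
    w2: Vec<Vec<f32>>, // output_dim rows, each of length hidden_dim
    b2: Vec<f32>,      // output_dim
}

/// Computes `w * x + b` for a row-major weight matrix.
fn affine(w: &[Vec<f32>], b: &[f32], x: &[f32]) -> Vec<f32> {
    w.iter()
        .zip(b)
        .map(|(row, &bias)| row.iter().zip(x).map(|(wi, xi)| wi * xi).sum::<f32>() + bias)
        .collect()
}

impl TinyExpert {
    /// Forward pass: first linear layer, ReLU, second linear layer.
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        let hidden: Vec<f32> = affine(&self.w1, &self.b1, x)
            .into_iter()
            .map(|h| h.max(0.0)) // ReLU
            .collect();
        affine(&self.w2, &self.b2, &hidden)
    }
}
```

In an MoE layer, each selected expert's output for a token would be scaled by its gate weight and summed with the outputs of the other selected experts.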
Enums§
- Router: Router variant, either deterministic or noisy.