Module moe

Mixture of Experts (MoE) layer

Provides a sparse MoE layer where each input token is routed to a subset of expert networks via a learned gating mechanism. This enables scaling model capacity without proportionally increasing computation.

§Architecture

  • Router: Linear gating network with softmax that selects top-k experts per token
  • Experts: Independent two-layer feed-forward networks with ReLU activation (weights and biases)
  • MoeLayer: Combines router and experts into a single forward pass
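
The routing step described above can be sketched in plain Rust as follows. This is a minimal illustration, not this crate's implementation: `softmax` and `top_k_route` are hypothetical helper names, the router's linear projection is assumed to have already produced the logits, and renormalizing the selected weights so they sum to 1 is an assumption that the real `TopKRouter` may or may not share.

```rust
/// Numerically stable softmax over one token's router logits.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Hypothetical helper: pick the `k` most probable experts for one token.
/// Returns `(expert_index, weight)` pairs sorted by descending weight.
/// Assumption: the selected weights are renormalized to sum to 1.
fn top_k_route(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let probs = softmax(logits);
    let mut indexed: Vec<(usize, f32)> = probs.into_iter().enumerate().collect();
    // Sort by probability, highest first.
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    indexed.truncate(k);
    let total: f32 = indexed.iter().map(|&(_, p)| p).sum();
    indexed.into_iter().map(|(i, p)| (i, p / total)).collect()
}
```

With logits `[1.0, 3.0, 2.0, 0.0]` and `k = 2`, this selects experts 1 and 2, in that order. Only those two experts would then run for the token, which is where the compute saving comes from.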

§Load Balancing

The balance_loss() method computes a Switch Transformer-style auxiliary loss that penalizes uneven expert utilization, encouraging the router to distribute tokens uniformly across experts.
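
For intuition, the Switch Transformer auxiliary loss is N * sum_i(f_i * P_i), where N is the number of experts, f_i is the fraction of tokens whose top-1 expert is i, and P_i is the mean router probability assigned to expert i. The sketch below computes that quantity from a batch of router probabilities. It is an illustration only: the free function `balance_loss_sketch` and its signature are hypothetical, the scaling coefficient from the paper is omitted, and the actual `balance_loss()` method may differ in its inputs and details.

```rust
/// Hypothetical sketch of a Switch Transformer-style balance loss.
/// `probs[t][e]` is the router probability of expert `e` for token `t`.
/// Assumption: every row has exactly `num_experts` entries.
fn balance_loss_sketch(probs: &[Vec<f32>], num_experts: usize) -> f32 {
    if probs.is_empty() || num_experts == 0 {
        return 0.0;
    }
    let num_tokens = probs.len() as f32;
    let mut fraction = vec![0.0f32; num_experts]; // f_i: share of tokens routed to i
    let mut mean_prob = vec![0.0f32; num_experts]; // P_i: mean probability of i
    for token in probs {
        let mut best = 0;
        for (e, &p) in token.iter().enumerate() {
            mean_prob[e] += p / num_tokens;
            if p > token[best] {
                best = e; // top-1 expert for this token (first one wins ties)
            }
        }
        fraction[best] += 1.0 / num_tokens;
    }
    let dot: f32 = fraction.iter().zip(&mean_prob).map(|(f, p)| f * p).sum();
    num_experts as f32 * dot
}
```

With two experts, a balanced batch such as `[[0.9, 0.1], [0.1, 0.9]]` gives a loss of about 1.0, while a collapsed batch such as `[[0.9, 0.1], [0.9, 0.1]]` gives about 1.8, so routing everything to one expert is penalized.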

§References

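  • Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR.
  • Lepikhin, D., et al. (2021). GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. ICLR.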

Re-exports§

pub use router::NoisyTopKRouter;
pub use router::RoutingResult;
pub use router::TopKRouter;

Modules§

router
Gating/routing mechanisms for Mixture of Experts

Structs§

Expert
A single expert network: a two-layer feed-forward network with ReLU activation.
MoeConfig
Configuration for a Mixture of Experts layer.
MoeLayer
Mixture of Experts layer combining a router with a set of expert networks.
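
As a rough picture of how these structs fit together, the sketch below implements a two-layer ReLU feed-forward network and a weighted combination of expert outputs. `ExpertSketch`, `affine`, and `combine` are hypothetical names introduced for illustration. The field layout, the presence of a bias on both layers, and the combination rule are assumptions, not the actual definitions of `Expert` or `MoeLayer`.

```rust
/// Hypothetical stand-in for an expert: y = W2 * relu(W1 * x + b1) + b2.
struct ExpertSketch {
    w1: Vec<Vec<f32>>, // hidden_dim rows, each of length input_dim
    b1: Vec<f32>,
    w2: Vec<Vec<f32>>, // output_dim rows, each of length hidden_dim
    b2: Vec<f32>,
}

/// Dense affine map: one dot product plus bias per weight row.
fn affine(w: &[Vec<f32>], b: &[f32], x: &[f32]) -> Vec<f32> {
    w.iter()
        .zip(b)
        .map(|(row, &bias)| row.iter().zip(x).map(|(wi, xi)| wi * xi).sum::<f32>() + bias)
        .collect()
}

impl ExpertSketch {
    fn forward(&self, x: &[f32]) -> Vec<f32> {
        // First layer followed by ReLU, then the second layer.
        let hidden: Vec<f32> = affine(&self.w1, &self.b1, x)
            .into_iter()
            .map(|h| h.max(0.0))
            .collect();
        affine(&self.w2, &self.b2, &hidden)
    }
}

/// Weighted sum of the selected experts' outputs for one token.
/// `routes` holds `(expert_index, weight)` pairs chosen by a router.
fn combine(experts: &[ExpertSketch], routes: &[(usize, f32)], x: &[f32]) -> Vec<f32> {
    let mut out: Vec<f32> = Vec::new();
    for &(idx, weight) in routes {
        let y = experts[idx].forward(x);
        if out.is_empty() {
            out = vec![0.0; y.len()];
        }
        for (o, yi) in out.iter_mut().zip(&y) {
            *o += weight * yi;
        }
    }
    out
}
```

Only the experts listed in `routes` are evaluated, so the cost per token depends on k rather than on the total number of experts.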

Enums§

Router
Router variant: either deterministic or noisy.