Module moe

Mixture of Experts (MoE) layer

Provides a sparse MoE layer in which a learned gating mechanism routes each input token to a subset of expert networks. Because only the selected experts run for a given token, model capacity can grow without a proportional increase in computation.

§Architecture

Router: Linear gating network with softmax that selects top-k experts per token
Experts: Independent two-layer feed-forward networks with ReLU activation, each with its own weights and biases
MoeLayer: Combines router and experts into a single forward pass
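
The routing step described above can be illustrated with a minimal, dependency-free sketch. It does not use this module's types: `softmax` and `top_k_route` are hypothetical helper names, and renormalizing the selected gate weights so they sum to 1 is an assumption that the actual TopKRouter may not share.

```rust
/// Numerically stable softmax over one token's gate logits.
/// (Hypothetical helper, not part of this module's API.)
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Pick the `k` experts with the highest gate probability for one token.
/// Returns `(expert_index, weight)` pairs, highest probability first.
/// Assumption: the selected weights are renormalized to sum to 1.
fn top_k_route(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let probs = softmax(logits);
    let mut indexed: Vec<(usize, f32)> = probs.iter().copied().enumerate().collect();
    // Sort by probability, descending.
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    indexed.truncate(k.min(indexed.len()));
    let total: f32 = indexed.iter().map(|&(_, p)| p).sum();
    indexed.into_iter().map(|(i, p)| (i, p / total)).collect()
}
```

With four experts and `k = 2`, calling `top_k_route(&[0.0, 2.0, 1.0, -1.0], 2)` selects experts 1 and 2. A layer built this way would then compute the token's output as the weighted sum of those two experts' outputs.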

§Load Balancing

The balance_loss() method computes a Switch Transformer-style auxiliary loss that penalizes uneven expert utilization, encouraging the router to distribute tokens uniformly across experts.
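
For intuition, the auxiliary loss from the Switch Transformer paper can be sketched as follows. This is an illustration of the published formula (number of experts times the sum over experts of dispatch fraction times mean router probability), not the body of balance_loss(): the function name `switch_balance_loss`, its signature, and the use of top-1 dispatch are assumptions, and the scaling coefficient applied during training is omitted.

```rust
/// Switch Transformer-style load-balancing loss (illustrative sketch).
/// `probs[t][e]` is the router probability of expert `e` for token `t`;
/// each row is assumed to have `num_experts` entries that sum to 1.
fn switch_balance_loss(probs: &[Vec<f32>], num_experts: usize) -> f32 {
    if probs.is_empty() || num_experts == 0 {
        return 0.0;
    }
    let num_tokens = probs.len() as f32;
    // f_e: fraction of tokens whose top-1 expert is e.
    let mut dispatch_frac = vec![0.0f32; num_experts];
    // P_e: mean router probability assigned to expert e.
    let mut mean_prob = vec![0.0f32; num_experts];
    for row in probs {
        let mut best = 0;
        for (e, &p) in row.iter().enumerate().take(num_experts) {
            if p > row[best] {
                best = e;
            }
            mean_prob[e] += p / num_tokens;
        }
        dispatch_frac[best] += 1.0 / num_tokens;
    }
    let dot: f32 = dispatch_frac
        .iter()
        .zip(&mean_prob)
        .map(|(f, p)| f * p)
        .sum();
    // Scaling by the expert count makes perfectly uniform routing score 1.0.
    num_experts as f32 * dot
}
```

Under this formulation the loss is smallest when tokens are spread evenly and grows as routing collapses onto a few experts, which is the behavior the description above attributes to balance_loss().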

§References

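Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. JMLR.
Lepikhin, D., et al. (2021). GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. ICLR.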

Re-exports§

pub use router::NoisyTopKRouter;
pub use router::RoutingResult;
pub use router::TopKRouter;

Modules§

router: Gating/routing mechanisms for Mixture of Experts

Structs§

Expert: A single expert network: a two-layer feed-forward with ReLU activation.
MoeConfig: Configuration for a Mixture of Experts layer.
MoeLayer: Mixture of Experts layer combining a router with a set of expert networks.

Enums§

Router: Router variant: either deterministic or noisy.
