pub struct IMoELayer { /* private fields */ }
┌──────────────┐ ┌────────────────────────┐ ┌────────────────────────┐
│ hiddenStates │ │selectedExpertsForTokens│ │scoresForSelectedExperts│
└──────┬───────┘ └───────────┬────────────┘ └───────────┬────────────┘
┌──────▼─────────────────────▼──────────────────────────▼────────────┐
│                                MOE                                 │
│  ┌────────────────────────┐            ┌────────────────────────┐  │
│  │        Expert 0        │            │        Expert i        │  │
│  │ ┌────────┐  ┌────────┐ │            │ ┌────────┐  ┌────────┐ │  │
│  │ │ fcGate │  │  fcUp  │ │            │ │ fcGate │  │  fcUp  │ │  │
│  │ └───┬────┘  └───┬────┘ │            │ └───┬────┘  └───┬────┘ │  │
│  │┌────▼─────┐     │      │            │┌────▼─────┐     │      │  │
│  ││activation│     │      │  .......   ││activation│     │      │  │
│  │└────┬─────┘     │      │            │└────┬─────┘     │      │  │
│  │     └─────┬─────┘      │            │     └─────┬─────┘      │  │
│  │       ┌───▼────┐       │            │       ┌───▼────┐       │  │
│  │       │  mul   │       │            │       │  mul   │       │  │
│  │       └───┬────┘       │            │       └───┬────┘       │  │
│  │       ┌───▼────┐       │            │       ┌───▼────┐       │  │
│  │       │ fcDown │       │            │       │ fcDown │       │  │
│  │       └───┬────┘       │            │       └───┬────┘       │  │
│  │       ┌───▼────┐       │            │       ┌───▼────┐       │  │
│  │       │output 0│       │            │       │output i│       │  │
│  │       └───┬────┘       │            │       └───┬────┘       │  │
│  └───────────┼────────────┘            └───────────┼────────────┘  │
│              └──────────────────┬──────────────────┘               │
│                         ┌───────▼───────┐                          │
│                         │  weightedSum  │                          │
│                         └───────┬───────┘                          │
└─────────────────────────────────┼──────────────────────────────────┘
                          ┌───────▼───────┐
                          │   moeOutput   │
                          └───────────────┘
IMoELayer
A MoE layer in a network definition. Mixture of Experts (MoE) is a collection of experts with each expert specializing in processing different subsets of input data. The key innovation lies in using a Router that selectively activates only the specific experts needed for a given input, rather than engaging the entire neural network for every task.
Definitions in the MoE layer:
- fcDown, fcGate, and fcUp are three linear layers: fc(x) = x * w + b, where x is the input, w is the weight, b is the bias, and * is matrix multiplication.
- activation is the activation function.
- mul is the element-wise multiplication between the output of fcUp and the activated output of fcGate.
- weightedSum is the weighted sum of the outputs of the experts.
- moeOutput is the output of the MoE layer.
MoE is a collection of experts. Each expert is a GLU (gated linear unit), which consists of fcGate, fcUp, fcDown, activation, and mul.
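As a plain-Rust reference for the definitions above, the following sketch computes one GLU expert for a single token on the CPU. It is not part of this crate's API: the function names, the row-major weight layout, and the use of SiLU as the activation are assumptions made for illustration, and biases are omitted.

```rust
/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// fc(x) = x * w + b for one token.
/// Assumed layout: `w` is row-major with shape [in_dim, out_dim];
/// `b` is optional with shape [out_dim].
fn fc(x: &[f32], w: &[f32], b: Option<&[f32]>, out_dim: usize) -> Vec<f32> {
    let in_dim = x.len();
    assert_eq!(w.len(), in_dim * out_dim);
    let mut y = vec![0.0f32; out_dim];
    for (j, yj) in y.iter_mut().enumerate() {
        // Start from the bias (or zero) and accumulate the dot product.
        let mut acc = b.map_or(0.0, |b| b[j]);
        for i in 0..in_dim {
            acc += x[i] * w[i * out_dim + j];
        }
        *yj = acc;
    }
    y
}

/// One GLU expert without biases:
/// fcDown(fcUp(x) * activation(fcGate(x))), where the inner * is element-wise.
fn glu_expert(
    x: &[f32],
    w_gate: &[f32],
    w_up: &[f32],
    w_down: &[f32],
    moe_inter_size: usize,
) -> Vec<f32> {
    let gate = fc(x, w_gate, None, moe_inter_size);
    let up = fc(x, w_up, None, moe_inter_size);
    let mul: Vec<f32> = up.iter().zip(&gate).map(|(u, g)| u * silu(*g)).collect();
    fc(&mul, w_down, None, x.len())
}
```

With identity matrices for all three weights and the token [1.0, 0.0], the first output element is silu(1.0) and the second is 0.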
Definitions and abbreviations:
- batchSize: batch size
- seqLen: sequence length
- hiddenSize: the size of the hidden states
- numExperts: the number of experts in the MoE layer
- moeInterSize: the intermediate size of the MoE layer
- topK: the number of experts to select for each token
This layer takes several activation inputs:
- hiddenStates: the hidden states of the layer, with shape [batchSize, seqLen, hiddenSize]
- selectedExpertsForTokens: the top K experts selected for each token, with shape [batchSize, seqLen, topK]
- scoresForSelectedExperts: the scores for the selected experts per token, with shape [batchSize, seqLen, topK]
The MoE layer uses the selected experts and their corresponding scores to compute the output.
The weights in the MoE layer:
- fcGateWeights with shape [numExperts, hiddenSize, moeInterSize]: the weight matrix for fcGate
- fcUpWeights with shape [numExperts, hiddenSize, moeInterSize]: the weight matrix for fcUp
- fcDownWeights with shape [numExperts, moeInterSize, hiddenSize]: the weight matrix for fcDown
Several optional inputs are supported:
- fcGateBias: the bias for fcGate, with shape [numExperts, moeInterSize]
- fcUpBias: the bias for fcUp, with shape [numExperts, moeInterSize]
- fcDownBias: the bias for fcDown, with shape [numExperts, hiddenSize]
- activation: the activation type for the MoE layer; currently only SILU is supported.
All biases are unset by default. You must set either all of the biases or none of them.
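The weight and bias shapes and the all-or-none bias rule can be checked before building the layer. The sketch below is a hypothetical helper, not part of this crate: `MoeDims`, its methods, and `biases_consistent` are names invented for illustration.

```rust
/// Hypothetical container for the dimensions named in the text.
struct MoeDims {
    num_experts: i64,
    hidden_size: i64,
    moe_inter_size: i64,
}

impl MoeDims {
    /// Shape shared by fcGateWeights and fcUpWeights.
    fn gate_or_up_weights(&self) -> [i64; 3] {
        [self.num_experts, self.hidden_size, self.moe_inter_size]
    }

    /// Shape of fcDownWeights.
    fn down_weights(&self) -> [i64; 3] {
        [self.num_experts, self.moe_inter_size, self.hidden_size]
    }

    /// Shape shared by fcGateBias and fcUpBias.
    fn gate_or_up_bias(&self) -> [i64; 2] {
        [self.num_experts, self.moe_inter_size]
    }

    /// Shape of fcDownBias.
    fn down_bias(&self) -> [i64; 2] {
        [self.num_experts, self.hidden_size]
    }
}

/// All-or-none rule: either every bias is provided or none is.
fn biases_consistent(has_gate: bool, has_up: bool, has_down: bool) -> bool {
    has_gate == has_up && has_up == has_down
}
```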
MoE computation process description: For each token, the MoE layer computation process is as follows:
- Input processing:
  - Receive hiddenStates
  - Receive selectedExpertsForTokens
  - Receive scoresForSelectedExperts
- Expert computation for each token:
  - output_i = fcDown(fcUp(hiddenStates) * activation(fcGate(hiddenStates)))
- Expert output aggregation: for each token, first select all the experts that need to be activated for the computation.
  - Compute the output of each selected expert according to the expert IDs in selectedExpertsForTokens
  - Compute the weighted sum of the expert outputs according to the weights in scoresForSelectedExperts
- Final output for the token: moeOutput = Σ(score_i * output_i)
The output of the MoE layer has the same shape as the input hiddenStates.
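The aggregation step for a single token can be sketched in plain Rust as follows. This is a CPU reference under stated assumptions, not the layer's implementation: `moe_token` is a hypothetical name, and the `expert_output` closure stands in for the per-expert computation fcDown(fcUp(x) * activation(fcGate(x))).

```rust
/// Weighted sum over the experts selected for one token:
/// moeOutput = Σ(score_i * output_i).
/// `selected` and `scores` both hold topK entries for this token.
fn moe_token<F>(
    selected: &[usize],
    scores: &[f32],
    hidden_size: usize,
    expert_output: F,
) -> Vec<f32>
where
    F: Fn(usize) -> Vec<f32>,
{
    assert_eq!(selected.len(), scores.len());
    let mut out = vec![0.0f32; hidden_size];
    for (&expert_id, &score) in selected.iter().zip(scores) {
        // Only the selected experts are evaluated.
        let y = expert_output(expert_id);
        assert_eq!(y.len(), hidden_size);
        for (o, v) in out.iter_mut().zip(&y) {
            *o += score * v;
        }
    }
    out
}
```

The output has `hidden_size` elements, matching the statement that the MoE output has the same shape as hiddenStates.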
MoE requires the Blackwell or Thor GPU architecture (SM 10.x or SM 11.x). SM 12.x is not currently supported. Performance is also limited when seqLen > 16.
Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
Implementations
impl IMoELayer
pub fn setGatedWeights(
    self: Pin<&mut IMoELayer>,
    fcGateWeights: Pin<&mut ITensor>,
    fcUpWeights: Pin<&mut ITensor>,
    fcDownWeights: Pin<&mut ITensor>,
    activationType: MoEActType,
)
Set the weights of the experts when each expert is a GLU (gated linear unit). Each GLU has 3 linear layers and 1 activation function, so this function requires 3 weight tensors and 1 activation type.
- fcGateWeights: the weights for the gate-projection layer of all experts in the MoE layer. Shape: [numExperts, hiddenSize, moeInterSize].
- fcUpWeights: the weights for the up-projection layer of all experts in the MoE layer. Shape: [numExperts, hiddenSize, moeInterSize].
- fcDownWeights: the weights for the down-projection layer of all experts in the MoE layer. Shape: [numExperts, moeInterSize, hiddenSize].
- activationType: the activation function to use for the MoE layer. Currently only kSILU is supported.
See [setActivationType()]
See [getActivationType()]
pub fn setGatedBiases(
    self: Pin<&mut IMoELayer>,
    fcGateBiases: Pin<&mut ITensor>,
    fcUpBiases: Pin<&mut ITensor>,
    fcDownBiases: Pin<&mut ITensor>,
)
Set the biases of the experts when each expert is a GLU (gated linear unit). Each GLU has 3 linear layers, so this function requires 3 bias tensors.
- fcGateBiases: the biases for the gate-projection layer of all experts in the MoE layer. Shape: [numExperts, moeInterSize].
- fcUpBiases: the biases for the up-projection layer of all experts in the MoE layer. Shape: [numExperts, moeInterSize].
- fcDownBiases: the biases for the down-projection layer of all experts in the MoE layer. Shape: [numExperts, hiddenSize].
pub fn setActivationType(self: Pin<&mut IMoELayer>, activationType: MoEActType)
Set the activation type for the MoE layer.
- activationType: the activation type for the MoE layer.
See [getActivationType()]
pub fn getActivationType(self: &IMoELayer) -> MoEActType
Get the activation type for the MoE layer.
Returns the activation type for the MoE layer.
See [setActivationType()]
pub fn setQuantizationStatic(
    self: Pin<&mut IMoELayer>,
    fcDownActivationScale: Pin<&mut ITensor>,
    dataType: DataType,
)
Configure static quantization after the mul op.
                ┌── fcGate ── activation ───┐
                │                           │
hiddenStates ───┤                           ├── mul ── {Q ── DQ} ── fcDown ── output
                │                           │
                └── fcUp ───────────────────┘
When using static quantization of the mul output, the user must provide:
- fcDownActivationScale: the scale tensor.
- dataType: the type that the activation is quantized to.
In addition, the user should insert Q/DQ before the hiddenStates input of the MoE layer. The quantization method must be the same as the one used here.
If setQuantizationDynamicDblQ is called, previous calls to this function are overridden. If setQuantizationToType is called, previous parameters set by this function are overridden.
See [setQuantizationToType()]
See [getQuantizationToType()]
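To make the {Q ── DQ} pair concrete, here is a minimal numeric sketch of static quantization with a fixed scale. It is only an illustration: the function names are hypothetical, the target type is assumed to be FP8 E4M3 (largest finite magnitude 448), and rounding to the FP8 grid is deliberately not modelled, so only the scaling and saturation behaviour is shown.

```rust
/// Largest finite magnitude representable in FP8 E4M3 (assumed target type).
const FP8_E4M3_MAX: f32 = 448.0;

/// Simplified Q step: divide by the static scale and saturate to the FP8 range.
/// Rounding to representable FP8 values is not modelled here.
fn quantize_static(x: f32, scale: f32) -> f32 {
    (x / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX)
}

/// DQ step: multiply the quantized value by the same scale.
fn dequantize(q: f32, scale: f32) -> f32 {
    q * scale
}
```

With a scale of 0.5, an input of 10.0 survives the round trip, while 1000.0 saturates at 448 and dequantizes to 224.0.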
pub fn setQuantizationDynamicDblQ(
    self: Pin<&mut IMoELayer>,
    fcDownActivationDblQScale: Pin<&mut ITensor>,
    dataType: DataType,
    blockShape: &Dims64,
    dynQOutputScaleType: DataType,
)
Configure dynamic quantization (with double quantization) after the mul op.
                ┌── fcGate ── activation ───┐             ┌──── DQ
                │                           │             │     │
hiddenStates ───┤                           ├── mul ── {DynQ ── DQ} ── fcDown ── output
                │                           │
                └── fcUp ───────────────────┘
When using dynamic quantization (with double quantization) of the mul output, the user must provide:
- fcDownActivationDblQScale: the double quantization scale tensor.
- dataType: the type that the activation is quantized to.
- blockShape: the block shape used in quantization.
- dynQOutputScaleType: the data type of the scale tensor.
In addition, the user should insert DynQ/DQ/DQ before the hiddenStates input of the MoE layer. The quantization method must be the same as the one used here.
If setQuantizationStatic is called, previous calls to this function are overridden. If setQuantizationToType, setQuantizationBlockShape, or setDynQOutputScaleType is called, previous parameters set by this function are overridden.
See [setQuantizationToType()]
See [getQuantizationToType()]
See [setQuantizationBlockShape()]
See [getQuantizationBlockShape()]
See [setDynQOutputScaleType()]
See [getDynQOutputScaleType()]
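The source does not spell out the arithmetic behind dynamic quantization with double quantization, so the following is only a sketch of one common scheme. It assumes that each block gets a scale derived from its maximum absolute value, that the data type is FP4 E2M1 (largest magnitude 6.0), and that the per-block scales are themselves divided by a second-level scale before storage. All function names are hypothetical, and rounding to the storage types is not modelled.

```rust
/// Largest magnitude representable in FP4 E2M1 (assumed data type).
const FP4_E2M1_MAX: f32 = 6.0;

/// First level: one dynamic scale per block of `block` consecutive elements,
/// chosen so that the block's largest magnitude maps to FP4_E2M1_MAX.
fn block_scales(x: &[f32], block: usize) -> Vec<f32> {
    x.chunks(block)
        .map(|c| c.iter().fold(0.0f32, |m, v| m.max(v.abs())) / FP4_E2M1_MAX)
        .collect()
}

/// Second level ("double quantization"): the block scales are divided by a
/// user-provided scale before being stored in the scale data type.
fn double_quantize_scales(scales: &[f32], dbl_q_scale: f32) -> Vec<f32> {
    scales.iter().map(|s| s / dbl_q_scale).collect()
}
```

For the values [3.0, -6.0, 1.5, 12.0] with a block size of 2, the block scales are 1.0 and 2.0; with a second-level scale of 0.5 they are stored as 2.0 and 4.0.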
pub fn setQuantizationToType(self: Pin<&mut IMoELayer>, type_: DataType)
Set the data type the mul output is quantized to.
- type_: the data type the mul output is quantized to. The type must be one of DataType::kFP8, DataType::kFP4.
Default: DataType::kFLOAT, which means the MoE layer is not quantized.
See [getQuantizationToType()]
pub fn getQuantizationToType(self: &IMoELayer) -> DataType
Get the data type the mul output in the MoE layer is quantized to.
Returns the data type the mul output in the MoE layer is quantized to.
See [setQuantizationToType()]
pub fn setQuantizationBlockShape(self: Pin<&mut IMoELayer>, blockShape: &Dims64)
Set the block shape for the quantization of the mul output.
- blockShape: the block shape for the quantization of the mul output.
The shape must have rank 4, with the dimensions giving the block sizes for the mul output dimensions (batchSize, seqLen, topK, moeInterSize) respectively. For example, a shape of [1, 1, 1, 16] means block quantization on the last (moeInterSize) axis. -1 means a fully blocked dimension.
See [getQuantizationBlockShape()]
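As an illustration of how a block shape partitions the mul output, the sketch below counts the blocks along each of the four dimensions. It is a hypothetical helper, not part of this crate, and it assumes that -1 means a single block spanning the whole dimension.

```rust
/// Number of blocks per dimension for a mul output of shape
/// (batchSize, seqLen, topK, moeInterSize) and a rank-4 block shape.
/// Assumption: a block size of -1 means one block covering the full dimension.
fn blocks_per_dim(shape: &[i64; 4], block_shape: &[i64; 4]) -> [i64; 4] {
    let mut out = [0i64; 4];
    for i in 0..4 {
        out[i] = if block_shape[i] == -1 {
            1
        } else {
            // Ceiling division so that a partial trailing block is counted.
            (shape[i] + block_shape[i] - 1) / block_shape[i]
        };
    }
    out
}
```

For a mul output of shape [2, 8, 4, 64], the block shape [1, 1, 1, 16] yields [2, 8, 4, 4] blocks, that is, four blocks of 16 elements along moeInterSize for every (batch, token, expert) position.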
pub fn getQuantizationBlockShape(self: &IMoELayer) -> Dims64
Get the block shape for the quantization of the mul output.
Returns the block shape for the quantization of the mul output.
See [setQuantizationBlockShape()]
pub fn setDynQOutputScaleType(self: Pin<&mut IMoELayer>, type_: DataType)
Set the dynamic quantization output scale type.
- type_: the dynamic quantization output scale type.
See [getDynQOutputScaleType()]
pub fn getDynQOutputScaleType(self: &IMoELayer) -> DataType
Get the dynamic quantization output scale type.
Returns the dynamic quantization output scale type.
See [setDynQOutputScaleType()]
pub fn setSwigluParams(
    self: Pin<&mut IMoELayer>,
    limit: f32,
    alpha: f32,
    beta: f32,
)
Set the SwiGLU parameters.
- limit: the SwiGLU parameter limit.
- alpha: the SwiGLU parameter alpha.
- beta: the SwiGLU parameter beta.
Default: limit = +inf, alpha = 1.0, beta = 0.0
See [setSwigluParamLimit()]
See [getSwigluParamLimit()]
See [setSwigluParamAlpha()]
See [getSwigluParamAlpha()]
See [setSwigluParamBeta()]
See [getSwigluParamBeta()]
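The source does not state how limit, alpha, and beta enter the computation. The sketch below assumes the clamped SwiGLU form used by some recent MoE models, in which the gate is clamped from above by limit, the up projection is clamped to [-limit, limit], alpha scales the sigmoid argument, and beta is added to the up projection. Treat both the formula and the function name as assumptions. Under this form the documented defaults (+inf, 1.0, 0.0) reduce to up * SiLU(gate).

```rust
/// Assumed clamped SwiGLU: min(gate, limit) * sigmoid(alpha * min(gate, limit))
/// * (clamp(up, -limit, limit) + beta). This formula is an assumption,
/// not taken from the documented API.
fn swiglu(gate: f32, up: f32, limit: f32, alpha: f32, beta: f32) -> f32 {
    let g = gate.min(limit);
    let u = up.clamp(-limit, limit);
    let sigmoid = 1.0 / (1.0 + (-alpha * g).exp());
    g * sigmoid * (u + beta)
}
```

With the defaults, `swiglu(1.0, 2.0, f32::INFINITY, 1.0, 0.0)` equals 2.0 * SiLU(1.0); a finite limit only changes the result when gate or up exceeds it.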
pub fn setSwigluParamLimit(self: Pin<&mut IMoELayer>, limit: f32)
Set the SwiGLU parameter limit.
- limit: the SwiGLU parameter limit.
Default: +inf
See [getSwigluParamLimit()]
pub fn getSwigluParamLimit(self: &IMoELayer) -> f32
Get the SwiGLU parameter limit.
Returns the SwiGLU parameter limit.
See [setSwigluParamLimit()]
pub fn setSwigluParamAlpha(self: Pin<&mut IMoELayer>, alpha: f32)
Set the SwiGLU parameter alpha.
- alpha: the SwiGLU parameter alpha.
Default: 1.0
See [getSwigluParamAlpha()]
pub fn getSwigluParamAlpha(self: &IMoELayer) -> f32
Get the SwiGLU parameter alpha.
Returns the SwiGLU parameter alpha.
See [setSwigluParamAlpha()]
pub fn setSwigluParamBeta(self: Pin<&mut IMoELayer>, beta: f32)
Set the SwiGLU parameter beta.
- beta: the SwiGLU parameter beta.
Default: 0.0
See [getSwigluParamBeta()]
pub fn getSwigluParamBeta(self: &IMoELayer) -> f32
Get the SwiGLU parameter beta.
Returns the SwiGLU parameter beta.
See [setSwigluParamBeta()]
pub fn setInput(
    self: Pin<&mut IMoELayer>,
    index: i32,
    tensor: Pin<&mut ITensor>,
)
Set the input of the MoE layer.
- index: the index of the input to modify.
- tensor: the new input tensor.
The indices are as follows:
- Input 0: hiddenStates: the input activations, with shape [batchSize, seqLen, hiddenSize]
- Input 1: selectedExpertsForTokens: the selected experts for tokens, with shape [batchSize, seqLen, topK]
- Input 2: scoresForSelectedExperts: the scores for selected experts, with shape [batchSize, seqLen, topK]
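One way to avoid passing bare integers as the index is a small enum that mirrors the index table above. `MoeInput` and `index` are hypothetical names and are not provided by this crate; the sketch only shows the mapping.

```rust
/// Hypothetical names for the three input indices accepted by setInput.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(i32)]
enum MoeInput {
    HiddenStates = 0,
    SelectedExpertsForTokens = 1,
    ScoresForSelectedExperts = 2,
}

impl MoeInput {
    /// The i32 value to pass as the `index` argument.
    fn index(self) -> i32 {
        self as i32
    }
}
```

A call site could then pass `MoeInput::ScoresForSelectedExperts.index()` as the `index` argument instead of the literal 2.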
Trait Implementations
impl AsLayerTyped for IMoELayer
Available on crate feature v_1_4 only.