Struct IMoELayer 

Source
pub struct IMoELayer { /* private fields */ }
┌──────────────┐┌────────────────────────┐┌────────────────────────┐
│ hiddenStates ││selectedExpertsForTokens││scoresForSelectedExperts│
└──────────────┘└────────────────────────┘└────────────────────────┘
       │                    │                    │
       │                    │                    │
┌───────────────────────────────────────────────────────────────────────────────────┐
│                                                                                   │
│  ┌──────────────────────────┐                        ┌──────────────────────────┐ │
│  │      │  Expert 0   │     │         MOE            │      │  Expert i   │     │ │
│  │      │             │     │                        │      │             │     │ │
│  │  ┌────────┐    ┌────────┐│                        │  ┌────────┐    ┌────────┐│ │
│  │  │ fcGate │    │  fcUp  ││                        │  │ fcGate │    │  fcUp  ││ │
│  │  │        │    │        ││                        │  │        │    │        ││ │
│  │  └───┬────┘    └────┬───┘│                        │  └───┬────┘    └────┬───┘│ │
│  │      │              │    │                        │      │              │    │ │
│  │ ┌──────────┐        │    │                        │ ┌──────────┐        │    │ │
│  │ │activation│        │    │                        │ │activation│        │    │ │
│  │ └────┬─────┘        │    │                        │ └────┬─────┘        │    │ │
│  │      │              │    │       .......          │      │              │    │ │
│  │      └──────┬───────┘    │                        │      └──────┬───────┘    │ │
│  │             │            │                        │             │            │ │
│  │         ┌────────┐       │                        │         ┌────────┐       │ │
│  │         │  mul   │       │                        │         │  mul   │       │ │
│  │         └───┬────┘       │                        │         └───┬────┘       │ │
│  │             │            │                        │             │            │ │
│  │         ┌───▼────┐       │                        │         ┌───▼────┐       │ │
│  │         │ fcDown │       │                        │         │ fcDown │       │ │
│  │         └───┬────┘       │                        │         └───┬────┘       │ │
│  │             │            │                        │             │            │ │
│  │         ┌───▼────┐       │                        │         ┌───▼────┐       │ │
│  │         │output 0│       │                        │         │output i│       │ │
│  │         └───┬────┘       │                        │         └───┬────┘       │ │
│  └─────────────┼────────────┘                        └─────────────┼────────────┘ │
│                │                                                   │              │
│                └───────────────────┬───────────────────────────────┘              │
│                                    │                                              │
│                                    ▼                                              │
│                            ┌───────────────┐                                      │
│                            │  weightedSum  │                                      │
│                            └───────┬───────┘                                      │
└────────────────────────────────────│──────────────────────────────────────────────┘
                                     ▼
                             ┌───────────────┐
                             │   moeOutput   │
                             └───────────────┘

IMoELayer

A MoE layer in a network definition. Mixture of Experts (MoE) is a collection of experts, each specializing in a different subset of the input data. A router selects which experts to activate for a given input, so only those experts run instead of the entire network.

Definitions in the MoE layer: fcGate, fcUp and fcDown are three linear layers, fc(x) = x * w + b, where x is the input, w is the weight, b is the bias, and * is matrix multiplication. activation is the activation function, applied to the output of fcGate. mul is the element-wise multiplication of the output of fcUp and the activated output of fcGate. weightedSum is the weighted sum of the expert outputs. moeOutput is the output of the MoE layer.

MoE is a collection of experts. Each expert is a GLU (gated linear unit), which consists of fcGate, fcUp, fcDown, activation and mul.
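The structure of one expert can be sketched in plain Rust, independent of this crate. This is a minimal reference sketch, assuming SiLU as the activation, row-major weights, and no biases; `ExpertWeights`, `fc` and `glu_expert` are hypothetical names used only for illustration and are not part of the API.

```rust
/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// fc(x) = x * w + b for one token.
/// `w` is row-major with shape [in_dim, out_dim]; `b`, if present, has shape [out_dim].
fn fc(x: &[f32], w: &[f32], b: Option<&[f32]>, out_dim: usize) -> Vec<f32> {
    assert_eq!(w.len(), x.len() * out_dim, "weight shape mismatch");
    // Start from the bias (or zeros when no bias is set).
    let mut y = match b {
        Some(bias) => {
            assert_eq!(bias.len(), out_dim, "bias shape mismatch");
            bias.to_vec()
        }
        None => vec![0.0; out_dim],
    };
    for (i, xi) in x.iter().enumerate() {
        for j in 0..out_dim {
            y[j] += xi * w[i * out_dim + j];
        }
    }
    y
}

/// Hypothetical container for one expert's weights (not a type from this crate).
struct ExpertWeights<'a> {
    gate: &'a [f32], // [hiddenSize, moeInterSize]
    up: &'a [f32],   // [hiddenSize, moeInterSize]
    down: &'a [f32], // [moeInterSize, hiddenSize]
}

/// One GLU expert: fcDown(fcUp(x) * activation(fcGate(x))), biases omitted.
fn glu_expert(x: &[f32], w: &ExpertWeights, inter: usize) -> Vec<f32> {
    let gate = fc(x, w.gate, None, inter);
    let up = fc(x, w.up, None, inter);
    // mul: element-wise product of the activated gate branch and the up branch.
    let mul: Vec<f32> = gate.iter().zip(&up).map(|(g, u)| silu(*g) * u).collect();
    fc(&mul, w.down, None, x.len())
}
```

Calling `glu_expert(&token, &weights, moe_inter_size)` returns a vector of length hiddenSize, matching the shape of the token it was given.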

Definitions and Abbreviations:

  • batchSize: batch size
  • seqLen: sequence length
  • hiddenSize: the size of the hidden states
  • numExperts: the number of experts in the MoE layer
  • moeInterSize: the intermediate size of the MoE layer
  • topK: the number of experts to select for each token

This layer takes several activation inputs:

  1. hiddenStates: the hidden states of the layer, with shape [batchSize, seqLen, hiddenSize]
  2. selectedExpertsForTokens: the top K experts selected for each token, with shape [batchSize, seqLen, topK]
  3. scoresForSelectedExperts: the scores for the selected experts per token, with shape [batchSize, seqLen, topK]

The MoE layer uses the selected experts and their corresponding scores to compute the output.
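The shape relationships between the three inputs can be checked up front. The following is a sketch under assumptions: `check_moe_inputs` is a hypothetical helper, shapes are passed as plain arrays rather than crate types, and expert ids are assumed to be `i32` values in the range [0, numExperts).

```rust
/// Validate the three activation inputs of the MoE layer.
/// Returns (batchSize, seqLen, hiddenSize, topK) on success.
fn check_moe_inputs(
    hidden_states: [usize; 3],
    selected_experts: [usize; 3],
    scores: [usize; 3],
    expert_ids: &[i32], // flattened selectedExpertsForTokens (assumed i32)
    num_experts: usize,
) -> Result<(usize, usize, usize, usize), String> {
    let [batch, seq_len, hidden] = hidden_states;
    let [_, _, top_k] = selected_experts;

    // selectedExpertsForTokens must be [batchSize, seqLen, topK].
    if selected_experts != [batch, seq_len, top_k] {
        return Err("selectedExpertsForTokens must share batchSize and seqLen".into());
    }
    // scoresForSelectedExperts must have exactly the same shape.
    if scores != selected_experts {
        return Err("scoresForSelectedExperts must match selectedExpertsForTokens".into());
    }
    if expert_ids.len() != batch * seq_len * top_k {
        return Err("expert id buffer does not match the declared shape".into());
    }
    // Every selected id has to name an existing expert.
    if let Some(bad) = expert_ids
        .iter()
        .find(|&&e| e < 0 || e as usize >= num_experts)
    {
        return Err(format!("expert id {bad} is out of range"));
    }
    Ok((batch, seq_len, hidden, top_k))
}
```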

The weights in the MoE layer:

  1. fcGateWeights with shape [numExperts, hiddenSize, moeInterSize]: the weight matrix for fcGate
  2. fcUpWeights with shape [numExperts, hiddenSize, moeInterSize]: the weight matrix for fcUp
  3. fcDownWeights with shape [numExperts, moeInterSize, hiddenSize]: the weight matrix for fcDown

Several optional inputs are supported:

  1. fcGateBias: the bias for fcGate, with shape [numExperts, moeInterSize]

  2. fcUpBias: the bias for fcUp, with shape [numExperts, moeInterSize]

  3. fcDownBias: the bias for fcDown, with shape [numExperts, hiddenSize]

  4. activation: the activation type for the MoE layer. Currently only SILU is supported.

All biases are unset by default. You must set either all of the biases or none of them.
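The weight and bias shapes listed above follow directly from three sizes. A small sketch, assuming a hypothetical `MoeDims` helper (not a crate type), derives them and checks the all-or-none rule for biases:

```rust
/// Hypothetical helper holding the sizes that determine all weight shapes.
#[derive(Clone, Copy)]
struct MoeDims {
    num_experts: usize,
    hidden_size: usize,
    moe_inter_size: usize,
}

impl MoeDims {
    /// Shape shared by fcGateWeights and fcUpWeights.
    fn gate_up_weight_shape(&self) -> [usize; 3] {
        [self.num_experts, self.hidden_size, self.moe_inter_size]
    }
    /// Shape of fcDownWeights.
    fn down_weight_shape(&self) -> [usize; 3] {
        [self.num_experts, self.moe_inter_size, self.hidden_size]
    }
    /// Shape shared by fcGateBias and fcUpBias.
    fn gate_up_bias_shape(&self) -> [usize; 2] {
        [self.num_experts, self.moe_inter_size]
    }
    /// Shape of fcDownBias.
    fn down_bias_shape(&self) -> [usize; 2] {
        [self.num_experts, self.hidden_size]
    }
}

/// Biases are valid only when all three are set or none of them is.
fn biases_consistent(
    gate: Option<&[f32]>,
    up: Option<&[f32]>,
    down: Option<&[f32]>,
) -> bool {
    matches!(
        (gate, up, down),
        (None, None, None) | (Some(_), Some(_), Some(_))
    )
}
```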

MoE computation process: for each token, the MoE layer computes its output as follows:

  1. Input processing:
  • Receive hiddenStates
  • Receive selectedExpertsForTokens
  • Receive scoresForSelectedExperts
  2. Expert computation for each token:
  • output_i = fcDown(fcUp(hiddenStates) * activation(fcGate(hiddenStates)))
  3. Expert output aggregation: for each token, first select the experts that need to be activated.
  • Compute each selected expert’s output, using the expert ids in selectedExpertsForTokens for the token
  • Weight each expert’s output by the corresponding entry in scoresForSelectedExperts for the token
  • Final output for the token: moeOutput = Σ(score_i * output_i)

The output of the MoE layer has the same shape as the input hiddenStates.
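The per-token computation described above can be written as a CPU reference in plain Rust. This is only a sketch of the arithmetic, not of how the layer is implemented: it assumes SiLU activation, no biases, row-major weights stored as [numExperts, rows, cols], and a hypothetical `MoeWeights` container.

```rust
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Row vector times row-major matrix: y[j] = sum_i x[i] * w[i * cols + j].
fn matvec(x: &[f32], w: &[f32], cols: usize) -> Vec<f32> {
    let mut y = vec![0.0; cols];
    for (i, xi) in x.iter().enumerate() {
        for j in 0..cols {
            y[j] += xi * w[i * cols + j];
        }
    }
    y
}

/// Hypothetical container for all expert weights (not a crate type).
struct MoeWeights {
    hidden: usize,
    inter: usize,
    gate: Vec<f32>, // [numExperts, hiddenSize, moeInterSize]
    up: Vec<f32>,   // [numExperts, hiddenSize, moeInterSize]
    down: Vec<f32>, // [numExperts, moeInterSize, hiddenSize]
}

/// output_i = fcDown(fcUp(x) * activation(fcGate(x))) for expert `e`.
fn expert_output(w: &MoeWeights, e: usize, x: &[f32]) -> Vec<f32> {
    let n = w.hidden * w.inter; // elements per expert in each weight tensor
    let gate = matvec(x, &w.gate[e * n..(e + 1) * n], w.inter);
    let up = matvec(x, &w.up[e * n..(e + 1) * n], w.inter);
    let mul: Vec<f32> = gate.iter().zip(&up).map(|(g, u)| silu(*g) * u).collect();
    matvec(&mul, &w.down[e * n..(e + 1) * n], w.hidden)
}

/// moeOutput = sum over selected experts of score_i * output_i, for one token.
fn moe_token(w: &MoeWeights, x: &[f32], selected: &[usize], scores: &[f32]) -> Vec<f32> {
    assert_eq!(selected.len(), scores.len(), "topK mismatch");
    let mut out = vec![0.0; w.hidden];
    for (&e, &s) in selected.iter().zip(scores) {
        let y = expert_output(w, e, x);
        for (o, yi) in out.iter_mut().zip(&y) {
            *o += s * yi;
        }
    }
    out
}
```

Only the experts listed in `selected` are evaluated, which is the point of the routing: the cost per token scales with topK rather than with numExperts.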

MoE requires the Blackwell or Thor GPU architecture (SM 10.x or SM 11.x). SM 12.x is not currently supported. Performance is limited when seqLen > 16.
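A caller may want to gate use of the layer on these constraints before building a network. The helpers below are hypothetical and simply encode the statements above; how the SM major version is obtained is outside the scope of this sketch.

```rust
/// True when the SM major version is one the MoE layer is documented to support
/// (SM 10.x or SM 11.x). SM 12.x and everything else returns false.
fn moe_supported_on(sm_major: u32) -> bool {
    matches!(sm_major, 10 | 11)
}

/// True when seqLen is within the range that is not flagged as performance-limited.
fn seq_len_in_preferred_range(seq_len: usize) -> bool {
    seq_len <= 16
}
```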

Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.

Implementations§

Source§

impl IMoELayer

Source

pub fn setGatedWeights( self: Pin<&mut IMoELayer>, fcGateWeights: Pin<&mut ITensor>, fcUpWeights: Pin<&mut ITensor>, fcDownWeights: Pin<&mut ITensor>, activationType: MoEActType, )

Set the weights of the experts when each expert is a GLU (gated linear unit). In each GLU, there are 3 linear layers and 1 activation function, so this function requires 3 weight tensors and 1 activation type.

  • fcGateWeights The weights for the gate-projection layer of all experts in MoE. Shape: [numExperts, hiddenSize, moeInterSize].
  • fcUpWeights The weights for the up-projection layer of all experts in MoE. Shape: [numExperts, hiddenSize, moeInterSize].
  • fcDownWeights The weights for the down-projection layer of all experts in MoE. Shape: [numExperts, moeInterSize, hiddenSize].
  • activationType The activation function to use for the MoE layer. Currently only kSILU is supported.

See [setActivationType()] See [getActivationType()]

Source

pub fn setGatedBiases( self: Pin<&mut IMoELayer>, fcGateBiases: Pin<&mut ITensor>, fcUpBiases: Pin<&mut ITensor>, fcDownBiases: Pin<&mut ITensor>, )

Set the biases of the experts when each expert is a GLU (gated linear unit). In each GLU, there are 3 linear layers, so this function requires 3 bias tensors.

  • fcGateBiases The biases for the gate-projection layer of all experts in MoE. Shape: [numExperts, moeInterSize].
  • fcUpBiases The biases for the up-projection layer of all experts in MoE. Shape: [numExperts, moeInterSize].
  • fcDownBiases The biases for the down-projection layer of all experts in MoE. Shape: [numExperts, hiddenSize].
Source

pub fn setActivationType(self: Pin<&mut IMoELayer>, activationType: MoEActType)

Set the activation type for the MoE layer.

  • activationType: the activation type for the MoE layer.

See [getActivationType()]

Source

pub fn getActivationType(self: &IMoELayer) -> MoEActType

Get the activation type for the MoE layer.

See [setActivationType()]

Returns the activation type for the MoE layer.

Source

pub fn setQuantizationStatic( self: Pin<&mut IMoELayer>, fcDownActivationScale: Pin<&mut ITensor>, dataType: DataType, )

Configure static quantization after the mul op.

                ┌── fcGate ── activation ───┐
                │                           │
hiddenStates ───┤                           ├── mul ── {Q ── DQ} ── fcDown ── output
                │                           │
                └── fcUp ───────────────────┘

When using mul output static quantization, the user must provide:

  • fcDownActivationScale: the scale tensor.
  • dataType: the type that the activation is quantized to.

In addition, the user should also insert Q/DQ before the hiddenStates input of the MoE layer. The quantization method must be the same as the quantization method here.

If setQuantizationDynamicDblQ is called, then previous calls to this function are overridden. If setQuantizationToType is called, previous parameters set by this function are overridden.
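The numerical effect of a static Q/DQ pair on the mul output can be emulated on the CPU. This sketch makes several assumptions that are not stated on this page: a single per-tensor scale, FP8 in the E4M3 format (3 mantissa bits, largest finite value 448), and simplified tie handling (round half away from zero). `round_to_e4m3` and `quantize_dequantize` are hypothetical names.

```rust
/// Largest finite magnitude representable in FP8 E4M3.
const E4M3_MAX: f32 = 448.0;

/// Round an f32 to the nearest value on the E4M3 grid, saturating at +/-448.
/// Tie handling is simplified: halves round away from zero.
fn round_to_e4m3(v: f32) -> f32 {
    if v == 0.0 || v.is_nan() {
        return v;
    }
    let a = v.abs().min(E4M3_MAX);
    // Exponent of the binade containing `a`, clamped to the smallest normal exponent.
    let e = (a.log2().floor() as i32).max(-6);
    // Three mantissa bits give eight steps per binade.
    let step = 2f32.powi(e - 3);
    ((a / step).round() * step).min(E4M3_MAX).copysign(v)
}

/// Q followed by DQ with one static scale: dq = round(x / scale) * scale.
fn quantize_dequantize(x: &[f32], scale: f32) -> Vec<f32> {
    x.iter().map(|&v| round_to_e4m3(v / scale) * scale).collect()
}
```

Values larger than `448 * scale` saturate, which is why the static scale has to be calibrated to the expected range of the mul output.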

See [setQuantizationToType()] See [getQuantizationToType()]

Source

pub fn setQuantizationDynamicDblQ( self: Pin<&mut IMoELayer>, fcDownActivationDblQScale: Pin<&mut ITensor>, dataType: DataType, blockShape: &Dims64, dynQOutputScaleType: DataType, )

Configure dynamic quantization (with double quantization) after the mul op.

                ┌── fcGate ── activation ───┐             ┌──── DQ
                │                           │             │      │
hiddenStates ───┤                           ├── mul ── {DynQ ── DQ} ── fcDown ── output
                │                           │
                └── fcUp ───────────────────┘

When using mul output dynamic quantization (with double quantization), the user must provide:

  • fcDownActivationDblQScale: the double quantization scale tensor.
  • dataType: the type that the activation is quantized to.
  • blockShape: the blockShape used in quantization.
  • dynQOutputScaleType: the data type of the scale tensor.

In addition, the user should also insert DynQ/DQ/DQ before the hiddenStates input of the MoE layer. The quantization method must be the same as the quantization method here.

If setQuantizationStatic is called, then previous calls to this function are overridden. If setQuantizationToType, setQuantizationBlockShape or setDynQOutputScaleType is called, previous parameters set by this function are overridden.

See [setQuantizationToType()] See [getQuantizationToType()] See [setQuantizationBlockShape()] See [getQuantizationBlockShape()] See [setDynQOutputScaleType()] See [getDynQOutputScaleType()]

Source

pub fn setQuantizationToType(self: Pin<&mut IMoELayer>, type_: DataType)

Set the data type the mul output is quantized to.

  • type: the data type the mul output is quantized to. The type must be one of DataType::kFP8, DataType::kFP4.

Default: DataType::kFLOAT which means the MoE layer is not quantized.

See [getQuantizationToType()]

Source

pub fn getQuantizationToType(self: &IMoELayer) -> DataType

Get the data type the mul output in the MoE layer is quantized to.

See [setQuantizationToType()]

Returns the data type the mul output in the MoE layer is quantized to.

Source

pub fn setQuantizationBlockShape(self: Pin<&mut IMoELayer>, blockShape: &Dims64)

Set the block shape for the quantization of the Mul output.

  • blockShape: the block shape for the quantization of the Mul output.

The shape must have rank 4, with its dimensions giving the block sizes for the Mul output dimensions (batchSize, seqLen, topK, moeInterSize) respectively. For example, a shape of [1, 1, 1, 16] means block quantization on the last (moeInterSize) axis. -1 means a fully blocked dimension.
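To make the block shape concrete, the sketch below computes how many blocks each axis of the Mul output is split into. It rests on two assumptions: that -1 means a single block spanning the whole axis, and that a partial trailing block is allowed (ceiling division). `blocks_per_axis` is a hypothetical helper operating on plain arrays rather than `Dims64`.

```rust
/// Number of quantization blocks along each axis of the Mul output
/// (batchSize, seqLen, topK, moeInterSize).
fn blocks_per_axis(mul_shape: [i64; 4], block_shape: [i64; 4]) -> Result<[i64; 4], String> {
    let mut blocks = [0i64; 4];
    for axis in 0..4 {
        let dim = mul_shape[axis];
        if dim <= 0 {
            return Err(format!("axis {axis} has non-positive extent {dim}"));
        }
        let block = match block_shape[axis] {
            -1 => dim,      // assumed: one block covers the whole axis
            b if b > 0 => b,
            b => return Err(format!("invalid block size {b} on axis {axis}")),
        };
        // Ceiling division; a partial trailing block counts as a block (assumption).
        blocks[axis] = (dim + block - 1) / block;
    }
    Ok(blocks)
}
```

With a Mul output of shape [2, 8, 4, 64] and a block shape of [1, 1, 1, 16], every group of 16 consecutive moeInterSize elements gets its own scale, giving 4 blocks on the last axis.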

See [getQuantizationBlockShape()]

Source

pub fn getQuantizationBlockShape(self: &IMoELayer) -> Dims64

Get the block shape for the quantization of the Mul output.

See [setQuantizationBlockShape()]

Returns the block shape for the quantization of the Mul output.

Source

pub fn setDynQOutputScaleType(self: Pin<&mut IMoELayer>, type_: DataType)

Set the dynamic quantization output scale type.

  • type: the dynamic quantization output scale type.

See [getDynQOutputScaleType()]

Source

pub fn getDynQOutputScaleType(self: &IMoELayer) -> DataType

Get the dynamic quantization output scale type.

See [setDynQOutputScaleType()]

Returns the dynamic quantization output scale type.

Source

pub fn setSwigluParams( self: Pin<&mut IMoELayer>, limit: f32, alpha: f32, beta: f32, )

Set the SwiGLU parameters.

  • limit the SwiGLU parameter limit.
  • alpha the SwiGLU parameter alpha.
  • beta the SwiGLU parameter beta.

Default: +inf, 1.0, 0.0

See [setSwigluParamLimit()] See [getSwigluParamLimit()] See [setSwigluParamAlpha()] See [getSwigluParamAlpha()] See [setSwigluParamBeta()] See [getSwigluParamBeta()]
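This page does not state how limit, alpha and beta enter the computation. The sketch below shows one parameterization seen in some open MoE models, purely as an assumption: the gate branch is clamped from above by limit, the up branch is clamped to [-limit, limit], alpha scales the sigmoid argument, and beta is added to the up branch. At the documented defaults (+inf, 1.0, 0.0) it reduces to the plain SiLU-gated product, up * silu(gate). `SwigluParams` and `swiglu` are hypothetical names.

```rust
/// Hypothetical bundle of the three SwiGLU parameters.
#[derive(Clone, Copy, Debug, PartialEq)]
struct SwigluParams {
    limit: f32,
    alpha: f32,
    beta: f32,
}

impl Default for SwigluParams {
    /// Documented defaults: +inf, 1.0, 0.0.
    fn default() -> Self {
        SwigluParams { limit: f32::INFINITY, alpha: 1.0, beta: 0.0 }
    }
}

/// ASSUMED formula, not confirmed by this page:
///   g = min(gate, limit), u = clamp(up, -limit, limit)
///   out = g * sigmoid(alpha * g) * (u + beta)
/// `limit` must be non-negative, otherwise `clamp` panics.
fn swiglu(gate: f32, up: f32, p: SwigluParams) -> f32 {
    let g = gate.min(p.limit);
    let u = up.clamp(-p.limit, p.limit);
    let sigmoid = 1.0 / (1.0 + (-p.alpha * g).exp());
    g * sigmoid * (u + p.beta)
}
```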

Source

pub fn setSwigluParamLimit(self: Pin<&mut IMoELayer>, limit: f32)

Set the SwiGLU parameter limit.

  • limit the SwiGLU parameter limit.

Default: +inf

See [getSwigluParamLimit()]

Source

pub fn getSwigluParamLimit(self: &IMoELayer) -> f32

Get the SwiGLU parameter limit.

See [setSwigluParamLimit()]

Returns the SwiGLU parameter limit.

Source

pub fn setSwigluParamAlpha(self: Pin<&mut IMoELayer>, alpha: f32)

Set the SwiGLU parameter alpha.

  • alpha the SwiGLU parameter alpha.

Default: 1.0

See [getSwigluParamAlpha()]

Source

pub fn getSwigluParamAlpha(self: &IMoELayer) -> f32

Get the SwiGLU parameter alpha.

See [setSwigluParamAlpha()]

Returns the SwiGLU parameter alpha.

Source

pub fn setSwigluParamBeta(self: Pin<&mut IMoELayer>, beta: f32)

Set the SwiGLU parameter beta.

  • beta the SwiGLU parameter beta.

Default: 0.0

See [getSwigluParamBeta()]

Source

pub fn getSwigluParamBeta(self: &IMoELayer) -> f32

Get the SwiGLU parameter beta.

See [setSwigluParamBeta()]

Returns the SwiGLU parameter beta.

Source

pub fn setInput( self: Pin<&mut IMoELayer>, index: i32, tensor: Pin<&mut ITensor>, )

Set the input of the MoE layer.

  • index the index of the input to modify.
  • tensor the new input tensor

The indices are as follows:

  • Input 0: hiddenStates: the input activations, with shape [batchSize, seqLen, hiddenSize]
  • Input 1: selectedExpertsForTokens: the selected experts for tokens, with shape [batchSize, seqLen, topK]
  • Input 2: scoresForSelectedExperts: the scores for selected experts, with shape [batchSize, seqLen, topK]
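Because setInput takes a raw `i32` index, a small named mapping can make call sites easier to read. `MoeInput` below is a hypothetical enum for illustration; it is not defined by this crate.

```rust
/// Hypothetical names for the three input slots of the MoE layer.
#[repr(i32)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum MoeInput {
    HiddenStates = 0,
    SelectedExpertsForTokens = 1,
    ScoresForSelectedExperts = 2,
}

impl MoeInput {
    /// The index to pass as the `index` argument of setInput.
    fn index(self) -> i32 {
        self as i32
    }

    /// Map a raw index back to a named slot, or None if it is out of range.
    fn from_index(index: i32) -> Option<MoeInput> {
        match index {
            0 => Some(MoeInput::HiddenStates),
            1 => Some(MoeInput::SelectedExpertsForTokens),
            2 => Some(MoeInput::ScoresForSelectedExperts),
            _ => None,
        }
    }
}
```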

Trait Implementations§

Source§

impl AsLayer for IMoELayer

Available on crate feature v_1_4 only.
Source§

fn as_layer(&self) -> &ILayer

Source§

fn as_layer_pin_mut(&mut self) -> Pin<&mut ILayer>

Source§

impl AsLayerTyped for IMoELayer

Available on crate feature v_1_4 only.
Source§

const TYPE: LayerType = LayerType::kMOE

Source§

impl AsRef<ILayer> for IMoELayer

Source§

fn as_ref(self: &IMoELayer) -> &ILayer

Converts this type into a shared reference of the (usually inferred) input type.
Source§

impl ExternType for IMoELayer

Source§

type Id = (n, v, i, n, f, e, r, _1, (), I, M, o, E, L, a, y, e, r)

A type-level representation of the type’s C++ namespace and type name.
Source§

type Kind = Opaque

Source§

impl MakeCppStorage for IMoELayer

Source§

unsafe fn allocate_uninitialized_cpp_storage() -> *mut IMoELayer

Allocates heap space for this type in C++ and returns a pointer to that space, but does not initialize that space (i.e. does not yet call a constructor).
Source§

unsafe fn free_uninitialized_cpp_storage(arg0: *mut IMoELayer)

Frees a C++ allocation which has not yet had a constructor called.

Auto Trait Implementations§

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self.
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value.
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.