pub struct TransformerDecoderConfig {
pub d_model: usize,
pub d_ff: usize,
pub n_heads: usize,
pub n_layers: usize,
pub dropout: f64,
pub norm_first: bool,
pub quiet_softmax: bool,
pub initializer: Initializer,
pub activation: ActivationConfig,
pub layer_norm_eps: f64,
}
Configuration to create a Transformer Decoder layer using the init function.
Fields
d_model: usize
The size of the model.

d_ff: usize
The size of the position-wise feed-forward network.

n_heads: usize
The number of attention heads.

n_layers: usize
The number of layers.

dropout: f64
The dropout rate. Default: 0.1

norm_first: bool
Layer norm will be applied first instead of after the other modules.

quiet_softmax: bool
Use "quiet softmax" instead of regular softmax (a small numerical sketch follows this list).
- Usage may improve performance by allowing attention heads to deposit no information (if the sequence contains no information relevant to that head).
- Usage may reduce the entropy of weights in the model, enhancing quantization and compression.
Reference: https://www.evanmiller.org/attention-is-off-by-one.html

initializer: Initializer
The type of function used to initialize neural network parameters.

activation: ActivationConfig
The activation function used in the position-wise feed-forward network. Default: Gelu

layer_norm_eps: f64
The epsilon value for layer normalization. Default: 1e-5
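For intuition, "quiet softmax" adds 1 to the softmax denominator, so when every logit is strongly negative all attention weights can shrink toward zero rather than being forced to sum to 1. The standalone functions below are a hypothetical scalar sketch to illustrate the difference; they are not part of this crate and are not how the option is implemented internally:

// Hypothetical illustration of "quiet softmax" vs. regular softmax.
fn softmax(logits: &[f64]) -> Vec<f64> {
    let denom: f64 = logits.iter().map(|x| x.exp()).sum();
    logits.iter().map(|x| x.exp() / denom).collect()
}

fn quiet_softmax(logits: &[f64]) -> Vec<f64> {
    // The extra 1.0 in the denominator lets every weight approach 0
    // when all logits are strongly negative.
    let denom: f64 = 1.0 + logits.iter().map(|x| x.exp()).sum::<f64>();
    logits.iter().map(|x| x.exp() / denom).collect()
}

fn main() {
    let logits = [-10.0_f64, -12.0, -11.0];
    println!("{:?}", softmax(&logits));       // still sums to 1, even if nothing is relevant
    println!("{:?}", quiet_softmax(&logits)); // every weight is near 0
}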
Implementations
impl TransformerDecoderConfig

pub fn new(
    d_model: usize,
    d_ff: usize,
    n_heads: usize,
    n_layers: usize,
) -> TransformerDecoderConfig
Create a new instance of the config.
Arguments

Required Arguments

d_model
The size of the model.

d_ff
The size of the position-wise feed-forward network.

n_heads
The number of attention heads.

n_layers
The number of layers.

Default Arguments

dropout
The dropout rate.
- Defaults to 0.1

norm_first
Layer norm will be applied first instead of after the other modules.
- Defaults to false

quiet_softmax
Use "quiet softmax" instead of regular softmax (see the field documentation above).
- Defaults to false

initializer
The type of function used to initialize neural network parameters.
- Defaults to Initializer::KaimingUniform { gain: 1.0 / num_traits::Float::sqrt(3.0), fan_out_only: false }

activation
The activation function used in the position-wise feed-forward network.
- Defaults to ActivationConfig::Gelu

layer_norm_eps
The epsilon value for layer normalization.
- Defaults to 1e-5
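For example, a config with illustrative sizes (d_model of 512, d_ff of 2048, 8 heads, 6 layers); the defaults listed above apply to everything else:

let config = TransformerDecoderConfig::new(512, 2048, 8, 6);
// config.dropout == 0.1, config.norm_first == false,
// config.quiet_softmax == false, config.layer_norm_eps == 1e-5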
impl TransformerDecoderConfig

pub fn with_dropout(self, dropout: f64) -> TransformerDecoderConfig
Sets the value for the field dropout.
- Defaults to 0.1

pub fn with_norm_first(self, norm_first: bool) -> TransformerDecoderConfig
Sets the value for the field norm_first.
Layer norm will be applied first instead of after the other modules.
- Defaults to false

pub fn with_quiet_softmax(self, quiet_softmax: bool) -> TransformerDecoderConfig
Sets the value for the field quiet_softmax.
Use "quiet softmax" instead of regular softmax (see the field documentation above).
- Defaults to false

pub fn with_initializer(self, initializer: Initializer) -> TransformerDecoderConfig
Sets the value for the field initializer.
The type of function used to initialize neural network parameters.
- Defaults to Initializer::KaimingUniform { gain: 1.0 / num_traits::Float::sqrt(3.0), fan_out_only: false }

pub fn with_activation(self, activation: ActivationConfig) -> TransformerDecoderConfig
Sets the value for the field activation.
The activation function used in the position-wise feed-forward network.
- Defaults to ActivationConfig::Gelu

pub fn with_layer_norm_eps(self, layer_norm_eps: f64) -> TransformerDecoderConfig
Sets the value for the field layer_norm_eps.
The epsilon value for layer normalization.
- Defaults to 1e-5
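Because each with_* setter takes self by value and returns the updated config, the calls chain; the sizes and overrides below are illustrative:

let config = TransformerDecoderConfig::new(512, 2048, 8, 6)
    .with_dropout(0.2)
    .with_norm_first(true) // pre-norm instead of post-norm
    .with_quiet_softmax(true)
    .with_layer_norm_eps(1e-6);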
impl TransformerDecoderConfig

pub fn init<B>(
    &self,
    device: &<B as BackendTypes>::Device,
) -> TransformerDecoder<B>
where
    B: Backend,

Initialize a new Transformer Decoder module.
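A minimal sketch of initialization, assuming this type lives at burn::nn::transformer and the NdArray backend is available; any type implementing Backend works the same way:

use burn::backend::NdArray;
use burn::nn::transformer::{TransformerDecoder, TransformerDecoderConfig};

fn build() -> TransformerDecoder<NdArray> {
    // The device type is inferred from the chosen backend;
    // NdArray's device implements Default.
    let device = Default::default();
    TransformerDecoderConfig::new(512, 2048, 8, 6).init::<NdArray>(&device)
}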
Trait Implementations
impl Clone for TransformerDecoderConfig

fn clone(&self) -> TransformerDecoderConfig
Returns a copy of the value.

fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.