Module attention

§Attention

This module implements several attention mechanisms commonly used in the Transformer architecture.

§Features

  • Scaled Dot-Product Attention: the basic attention mechanism. It computes attention scores as the dot product of the query and key vectors, scales them by the square root of the key dimension, and applies a softmax to obtain the attention weights (see the first sketch after this list).
  • Multi-Head Attention: extends scaled dot-product attention by letting the model jointly attend to information from different representation subspaces at different positions. It projects the queries, keys, and values several times with different learned linear projections, runs attention on each projection, and concatenates the results (see the second sketch after this list).
  • FFT Attention: uses the Fast Fourier Transform (FFT) to mix information across the sequence more cheaply than standard attention. It is particularly useful for long sequences, where standard attention's cost grows quadratically with sequence length (see the third sketch after this list).
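
The scaled dot-product step computes softmax(QKᵀ / √d_k) · V. The following is a minimal, dependency-free sketch of that computation; the function names and the plain `Vec<Vec<f32>>` matrix representation are illustrative assumptions, not this crate's actual API.

```rust
/// Row-wise softmax with max-subtraction for numerical stability.
fn softmax(row: &mut [f32]) {
    let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let mut sum = 0.0;
    for x in row.iter_mut() {
        *x = (*x - max).exp();
        sum += *x;
    }
    for x in row.iter_mut() {
        *x /= sum;
    }
}

/// Scaled dot-product attention over row-major matrices.
/// Shapes: q is (n, d_k), k is (m, d_k), v is (m, d_v); returns (n, d_v).
fn scaled_dot_product_attention(
    q: &[Vec<f32>],
    k: &[Vec<f32>],
    v: &[Vec<f32>],
) -> Vec<Vec<f32>> {
    let scale = (k[0].len() as f32).sqrt();
    let mut output = Vec::with_capacity(q.len());
    for q_row in q {
        // Score of this query against every key: (q . k) / sqrt(d_k).
        let mut scores: Vec<f32> = k
            .iter()
            .map(|k_row| {
                q_row.iter().zip(k_row).map(|(a, b)| a * b).sum::<f32>() / scale
            })
            .collect();
        // Softmax turns the scores into attention weights that sum to 1.
        softmax(&mut scores);
        // The output row is the weighted sum of the value rows.
        let mut out_row = vec![0.0f32; v[0].len()];
        for (w, v_row) in scores.iter().zip(v) {
            for (o, val) in out_row.iter_mut().zip(v_row) {
                *o += w * val;
            }
        }
        output.push(out_row);
    }
    output
}
```

The scaling by √d_k keeps the dot products from growing with dimension, which would otherwise push the softmax into regions with vanishing gradients.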
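Multi-head attention can then be expressed as running the function above once per head, each in its own learned projection of the input. A sketch assuming per-head weight matrices (hypothetical stand-ins for learned parameters); a real layer would also apply a final output projection, omitted here for brevity:

```rust
/// Naive matrix multiply: x is (n, d_in), w is (d_in, d_out); returns (n, d_out).
fn matmul(x: &[Vec<f32>], w: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let d_out = w[0].len();
    x.iter()
        .map(|row| {
            (0..d_out)
                .map(|j| row.iter().zip(w).map(|(xi, w_row)| xi * w_row[j]).sum())
                .collect()
        })
        .collect()
}

/// Multi-head self-attention over an input sequence x of shape (n, d_model).
/// w_q, w_k, w_v hold one (d_model, d_head) projection matrix per head.
fn multi_head_attention(
    x: &[Vec<f32>],
    w_q: &[Vec<Vec<f32>>],
    w_k: &[Vec<Vec<f32>>],
    w_v: &[Vec<Vec<f32>>],
) -> Vec<Vec<f32>> {
    let mut output: Vec<Vec<f32>> = vec![Vec::new(); x.len()];
    // Each head attends in its own projected subspace; the per-head
    // outputs are concatenated along the feature dimension.
    for h in 0..w_q.len() {
        let q = matmul(x, &w_q[h]);
        let k = matmul(x, &w_k[h]);
        let v = matmul(x, &w_v[h]);
        let head_out = scaled_dot_product_attention(&q, &k, &v);
        for (row, head_row) in output.iter_mut().zip(head_out) {
            row.extend(head_row);
        }
    }
    output
}
```

Splitting d_model across heads keeps the total cost comparable to single-head attention while giving each head a distinct subspace to attend in.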
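"FFT attention" is not a single standard algorithm, and this page does not specify which formulation the fft module uses. One well-known variant is FNet-style mixing, which replaces the attention matrix with Fourier transforms along the hidden and sequence dimensions and keeps the real part, reducing the cost from O(n²) to O(n log n). A sketch of that idea using the rustfft crate (an assumption; the module's actual scheme may differ):

```rust
use rustfft::{num_complex::Complex, FftPlanner};

/// FNet-style token mixing: y = Re(FFT_seq(FFT_hidden(x))).
/// x is a (seq_len, d_model) matrix.
fn fnet_mixing(x: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let seq_len = x.len();
    let d_model = x[0].len();
    let mut planner = FftPlanner::<f32>::new();
    let fft_hidden = planner.plan_fft_forward(d_model);
    let fft_seq = planner.plan_fft_forward(seq_len);

    // Promote to complex and FFT each row (the hidden dimension).
    let mut rows: Vec<Vec<Complex<f32>>> = x
        .iter()
        .map(|row| row.iter().map(|&v| Complex::new(v, 0.0)).collect())
        .collect();
    for row in rows.iter_mut() {
        fft_hidden.process(row);
    }

    // FFT each column (the sequence dimension), keeping only the real part.
    let mut out = vec![vec![0.0f32; d_model]; seq_len];
    for j in 0..d_model {
        let mut col: Vec<Complex<f32>> = rows.iter().map(|r| r[j]).collect();
        fft_seq.process(&mut col);
        for i in 0..seq_len {
            out[i][j] = col[i].re;
        }
    }
    out
}
```

Unlike the two mechanisms above, this transform has no learned attention weights; it trades expressiveness for speed on long sequences.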

Modules§

fft
multi_head
qkv
scaled

Structs§

MultiHeadAttention
Multi-head attention extends the scaled dot-product attention mechanism, allowing the model to jointly attend to information from different representation subspaces at different positions.
QkvParamsBase
This object is designed to store the parameters of the QKV (Query, Key, Value) projections.
ScaledDotProductAttention
The scaled dot-product attention mechanism is the core of the Transformer architecture. It computes attention scores as the dot product of the query and key vectors, scales them by the square root of the key dimension, and applies a softmax to obtain the attention weights.

Traits§

Attention

Type Aliases§

Qkv