pub struct IAttention { /* private fields */ }
IAttention
Helper for constructing an attention that consumes query, key and value tensors.
An attention subgraph implicitly includes three main components: two MatrixMultiply layers, known as BMM1 and BMM2, and one normalization operation, which defaults to Softmax. By default, IAttention is not decomposable, and TensorRT will try to use a single fused kernel, which may be more efficient than expressing the subgraph without IAttention. Setting decomposable to true (see setDecomposable) allows IAttention to be decomposed into multiple kernels if no fused kernel is available.
  Query    Key     Value   Mask (optional)  NormalizationQuantizeScale (optional)
    |       |        |       |                 |
    |   Transpose    |       |                 |
    |       |        |       |                 |
    ----BMM1----     |       |                 |
         |           |       |                 |
         *--------------------                 |
         |           |                         |
   Normalization     |                         |
         |           |                         |
         *--------------------------------------
         |           |
         -----BMM2------
               |
             Output
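The data flow in the diagram can be sketched in plain Rust for a single head. This is only an illustrative reference, not how TensorRT implements the fused kernel: the names `Matrix`, `softmax`, and `attention_reference` are hypothetical, the mask and quantization scale are omitted, and no scaling factor is applied because the description above does not mention one.

```rust
/// Row-major matrix: one inner Vec per token (hypothetical helper type).
type Matrix = Vec<Vec<f32>>;

/// Numerically stable softmax over one row of scores.
fn softmax(row: &[f32]) -> Vec<f32> {
    let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = row.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Single-head attention: BMM1 = Q * K^T, normalization = softmax, BMM2 = P * V.
fn attention_reference(q: &Matrix, k: &Matrix, v: &Matrix) -> Matrix {
    q.iter()
        .map(|q_row| {
            // BMM1: dot product of this query row with every key row.
            let scores: Vec<f32> = k
                .iter()
                .map(|k_row| q_row.iter().zip(k_row).map(|(a, b)| a * b).sum::<f32>())
                .collect();
            // Normalization (Softmax by default).
            let probs = softmax(&scores);
            // BMM2: probability-weighted sum of the value rows.
            let dim = v[0].len();
            (0..dim)
                .map(|d| probs.iter().zip(v).map(|(p, v_row)| p * v_row[d]).sum::<f32>())
                .collect::<Vec<f32>>()
        })
        .collect()
}
```

When every key is identical, the softmax weights are uniform and each output row is the mean of the value rows.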
The attention has the following inputs, in order of input index:
- Query contains the input query. It is a tensor of type kFLOAT, kHALF or kBF16, its shape depends on the query form.
- For query form kPADDED_BHND, shape is [batchSize, numHeadsQuery, numTokens, dimHead]
- For query form kPACKED_NHD, shape is [totalTokens, numHeadsQuery, dimHead]
- Key contains the input key. It is a tensor of type kFLOAT, kHALF or kBF16, its shape depends on the key value form.
- For key value form kPADDED_BHND, shape is [batchSize, numHeadsKeyValue, numTokens, dimHead]
- For key value form kPACKED_NHD, shape is [totalTokens, numHeadsKeyValue, dimHead]
- Value contains the input value. It is a tensor of type kFLOAT, kHALF or kBF16, its shape depends on the key value form.
- For key value form kPADDED_BHND, shape is [batchSize, numHeadsKeyValue, numTokens, dimHead]
- For key value form kPACKED_NHD, shape is [totalTokens, numHeadsKeyValue, dimHead]
- Mask (optional) contains the mask value. It is a tensor of type kBOOL or of the same data type as the BMM1 output. Shape is [batchSize, numHeadsQuery, numTokensQuery, numTokensKeyValue] with batchSize and numHeadsQuery broadcastable. TensorRT uses stride-based indexing to load the mask data.
- For a kBOOL mask, a True value indicates that the corresponding position is allowed to attend.
- For other data types, the mask values will be added to the BMM1 output, known as an add mask.
- NormalizationQuantizeScale (optional) contains the quantization scale for the attention normalization output. It is a tensor of type kFLOAT, kHALF or kBF16 with dimension 0 or 1.
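The two mask flavours can be illustrated on one row of BMM1 scores. This is a minimal sketch with hypothetical names (`Mask`, `apply_mask`). Representing a disallowed position as negative infinity before the softmax is an assumption here; the text only states that True means the position may attend.

```rust
/// The two mask flavours described in the input list (hypothetical type).
enum Mask<'a> {
    /// kBOOL mask: true means the position is allowed to attend.
    Bool(&'a [bool]),
    /// Add mask: values are added to the BMM1 output.
    Add(&'a [f32]),
}

/// Apply a mask to one row of BMM1 scores, before normalization.
fn apply_mask(scores: &[f32], mask: &Mask<'_>) -> Vec<f32> {
    match mask {
        // Assumption: blocked positions become -inf so softmax gives them weight 0.
        Mask::Bool(allowed) => scores
            .iter()
            .zip(allowed.iter())
            .map(|(&s, &ok)| if ok { s } else { f32::NEG_INFINITY })
            .collect(),
        // Add mask: element-wise sum with the scores.
        Mask::Add(bias) => scores
            .iter()
            .zip(bias.iter())
            .map(|(&s, &b)| s + b)
            .collect(),
    }
}
```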
The attention has one output:
- Output has the same shape, form, and data type as the query input.
- For query form kPADDED_BHND, output shape is [batchSize, numHeadsQuery, numTokens, dimHead]
- For query form kPACKED_NHD, output shape is [totalTokens, numHeadsQuery, dimHead]
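The difference between the two forms can be made concrete by computing the expected shape from per-sequence token counts. This is a sketch with hypothetical names (`IoForm`, `expected_shape`). It assumes that numTokens in the padded form is the longest sequence in the batch and that totalTokens in the packed form is the sum of all sequence lengths.

```rust
/// Mirrors the two forms named above (hypothetical stand-in for AttentionIOForm).
#[derive(Clone, Copy, Debug, PartialEq)]
enum IoForm {
    PaddedBhnd,
    PackedNhd,
}

/// Expected tensor shape for a batch whose sequences have the given token counts.
fn expected_shape(
    form: IoForm,
    seq_lens: &[usize],
    num_heads: usize,
    dim_head: usize,
) -> Vec<usize> {
    match form {
        // [batchSize, numHeads, numTokens, dimHead]; assumed padded to the longest sequence.
        IoForm::PaddedBhnd => {
            let num_tokens = seq_lens.iter().copied().max().unwrap_or(0);
            vec![seq_lens.len(), num_heads, num_tokens, dim_head]
        }
        // [totalTokens, numHeads, dimHead]; sequences are concatenated without padding.
        IoForm::PackedNhd => {
            let total_tokens: usize = seq_lens.iter().sum();
            vec![total_tokens, num_heads, dim_head]
        }
    }
}
```

For sequences of 3 and 5 tokens with 8 heads and a head dimension of 64, the padded shape is [2, 8, 5, 64] and the packed shape is [8, 8, 64].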
See https://docs.nvidia.com/deeplearning/tensorrt/latest/performance/best-practices.html#multi-head-attention-fusion for the complete matrix of fused kernel support.
Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.
Implementations
impl IAttention
pub fn setNormalizationOperation(self: Pin<&mut IAttention>, op: AttentionNormalizationOp) -> bool
Set the normalization operation for the attention.
See [getNormalizationOperation()], AttentionNormalizationOp
True if the normalization operation is set successfully, false otherwise.
pub fn getNormalizationOperation(self: &IAttention) -> AttentionNormalizationOp
Get the normalization operation for the attention.
See [setNormalizationOperation()], AttentionNormalizationOp
The normalization operation for the attention. Default is kSOFTMAX.
pub fn setMask(self: Pin<&mut IAttention>, mask: Pin<&mut ITensor>) -> bool
Set whether a mask will be used for the normalization operation.
mask: The mask tensor, of type kBOOL or of the same data type as the BMM1 output, with a 4-D shape broadcastable to [batchSize, numHeadsQuery, sequenceLengthQuery, sequenceLengthKeyValue]. For a kBOOL mask, a True value indicates that the corresponding position is allowed to attend. For other data types, the mask values will be added to the BMM1 output, known as an add mask.
See [getMask]
True if the mask is set successfully, false otherwise.
pub fn getMask(self: Pin<&mut IAttention>) -> *mut ITensor
Get the optional mask in attention.
See [setMask]
The optional mask in attention, nullptr if no mask is set.
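Because getters such as getMask return a raw pointer that may be null, a caller may want to turn the "not set" case into an Option before using it. The helper below is a hypothetical sketch built on the standard library only and is not part of the bindings.

```rust
use std::ptr::NonNull;

/// Convert a possibly-null raw pointer into an Option, so that
/// "no mask set" (nullptr) becomes None instead of a dangling dereference.
fn nullable<T>(ptr: *mut T) -> Option<NonNull<T>> {
    NonNull::new(ptr)
}
```

The same wrapper applies to any getter on this page that documents a nullptr return.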
pub fn setCausal(self: Pin<&mut IAttention>, isCausal: bool) -> bool
Set whether the attention will run a causal inference. Cannot be used together with setMask().
isCausal: True to enable causal masking with kUPPER_LEFT alignment, false to disable causal masking.
See [getCausal()], setCausalKind()
Deprecated in TensorRT 10.16. Superseded by setCausalKind.
True if the causal inference is set successfully, false otherwise.
pub fn getCausal(self: &IAttention) -> bool
Get whether the attention will run a causal inference.
See [setCausal()], getCausalKind()
Deprecated in TensorRT 10.16. Superseded by getCausalKind.
True if the attention will run a causal inference, false otherwise. Default is false.
pub fn setCausalKind(self: Pin<&mut IAttention>, kind: CausalMaskKind) -> bool
Set the causal mask alignment orientation for the attention.
When set to kUPPER_LEFT or kLOWER_RIGHT, an implicit causal mask is applied. When set to kNONE, no causal masking is applied.
Cannot be used together with setMask(). Building with both a mask tensor and a causal orientation other than kNONE will fail validation.
kind: The causal mask alignment to apply.
See [getCausalKind()], CausalMaskKind
True if the causal mask kind is set successfully, false otherwise.
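The two alignments differ only when the query and key-value sequence lengths are not equal. The sketch below uses the common convention for these names: upper-left anchors the causal triangle at position (0, 0), and lower-right anchors it at the last query and last key. This convention is an assumption, since the text above does not spell out the exact rule, and `CausalKind` and `may_attend` are hypothetical names.

```rust
/// Hypothetical stand-in for CausalMaskKind.
#[derive(Clone, Copy, Debug, PartialEq)]
enum CausalKind {
    NoMask,
    UpperLeft,
    LowerRight,
}

/// Whether query position q_idx may attend to key position k_idx
/// under the assumed alignment convention.
fn may_attend(kind: CausalKind, q_idx: usize, k_idx: usize, q_len: usize, kv_len: usize) -> bool {
    match kind {
        CausalKind::NoMask => true,
        // Triangle anchored at the top-left corner of the score matrix.
        CausalKind::UpperLeft => k_idx <= q_idx,
        // Triangle anchored at the bottom-right corner:
        // k_idx <= q_idx + (kv_len - q_len), rearranged to avoid unsigned underflow.
        CausalKind::LowerRight => k_idx + q_len <= q_idx + kv_len,
    }
}
```

With 2 queries and 4 keys, the first query sees only key 0 under upper-left alignment but keys 0 through 2 under lower-right alignment.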
pub fn getCausalKind(self: &IAttention) -> CausalMaskKind
Get the causal mask alignment orientation for the attention.
See [setCausalKind()], CausalMaskKind
The causal mask alignment orientation. Default is kNONE.
pub fn setDecomposable(self: Pin<&mut IAttention>, decomposable: bool) -> bool
Set whether the attention can be decomposed to use multiple kernels if no fused kernel is available.
See [getDecomposable]
True if the decomposable attention is set successfully, false otherwise.
pub fn getDecomposable(self: &IAttention) -> bool
Get whether the attention can be decomposed to use multiple kernels if no fused kernel is available.
True if the attention can be decomposed to use multiple kernels by the compiler, false otherwise. Default is false.
See [setDecomposable]
pub fn setInput(self: Pin<&mut IAttention>, index: i32, input: Pin<&mut ITensor>) -> bool
Append or replace an input of this layer with a specific tensor.
index: The index of the input to modify.
input: The new input tensor.
The indices are as follows:
Input 0 is the input query tensor. Input 1 is the input key tensor. Input 2 is the input value tensor.
True if the input tensor is set successfully, false otherwise.
pub fn getNbInputs(self: &IAttention) -> i32
Get the number of inputs of IAttention. IAttention has three inputs.
The number of inputs of IAttention.
pub fn getInput(self: &IAttention, index: i32) -> *mut ITensor
Get the IAttention input corresponding to the given index.
index: The index of the input tensor.
The input tensor, or nullptr if the index is out of range.
pub fn getNbOutputs(self: &IAttention) -> i32
Get the number of outputs of a layer. IAttention has one output.
pub fn getOutput(self: &IAttention, index: i32) -> *mut ITensor
Get the IAttention output corresponding to the given index. IAttention has only one output.
index: The index of the output tensor.
The indexed output tensor, or nullptr if the index is out of range.
pub unsafe fn setName(self: Pin<&mut IAttention>, name: *const c_char) -> bool
Set the name of the attention.
The name is used in error diagnostics. This method copies the name string.
The string name must be null-terminated, and be at most 4096 bytes including the terminator.
See [getName()]
True if the name is set successfully, false otherwise.
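Since setName takes a raw `*const c_char`, the caller is responsible for the null terminator and the 4096-byte limit. One way to check both before crossing the FFI boundary is sketched below. The helper `to_layer_name` and the constant `MAX_NAME_BYTES` are hypothetical and use only the standard library.

```rust
use std::ffi::CString;

/// Maximum size stated above, including the terminating NUL byte.
const MAX_NAME_BYTES: usize = 4096;

/// Build a null-terminated name, rejecting interior NUL bytes
/// and names that exceed the documented size limit.
fn to_layer_name(name: &str) -> Option<CString> {
    // CString::new fails if the input contains an interior NUL byte.
    let c_name = CString::new(name).ok()?;
    if c_name.as_bytes_with_nul().len() <= MAX_NAME_BYTES {
        Some(c_name)
    } else {
        None
    }
}
```

The resulting `CString` must outlive the call that receives `c_name.as_ptr()`. Because the method copies the string, it can be dropped afterwards.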
pub fn getName(self: &IAttention) -> *const c_char
Return the name of the attention.
See [setName()]
The name of the attention.
pub fn setNormalizationQuantizeScale(self: Pin<&mut IAttention>, tensor: Pin<&mut ITensor>) -> bool
Set the quantization scale for the attention normalization output.
tensor: The quantization scale tensor. Data type must be DataType::kFLOAT, DataType::kHALF or DataType::kBF16. Must be 0-d or 1-d.
True if the quantization scale is set successfully, false otherwise.
Must be used together with setNormalizationQuantizeToType to set normalization output datatype to DataType::kFP8 or DataType::kINT8.
pub fn getNormalizationQuantizeScale(self: &IAttention) -> *mut ITensor
Get the quantization scale for the attention normalization output.
The quantization scale for the attention normalization output or nullptr if no quantization scale is set.
pub fn setNormalizationQuantizeToType(self: Pin<&mut IAttention>, type_: DataType) -> bool
Set the datatype the attention normalization is quantized to.
type_: The datatype the attention normalization is quantized to. Must be one of DataType::kFP8, DataType::kINT8.
True if the quantization to type is set successfully, false otherwise.
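The constraints on the quantization scale and the target type can be collected into one pre-flight check. This is a sketch with hypothetical names (`DataType` here is a local stand-in, and `quantize_config_ok` is not part of the bindings). Treating a target type without a scale as invalid is an assumption; the text only states that the scale must be used together with the target type.

```rust
/// Local stand-in for the data types mentioned above (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq)]
enum DataType {
    Float,
    Half,
    Bf16,
    Fp8,
    Int8,
}

/// Check a normalization-quantization setup.
/// `scale` is (data type, rank) of the scale tensor, if one is set.
/// `to_type` is the quantize-to type, if one is set.
fn quantize_config_ok(scale: Option<(DataType, usize)>, to_type: Option<DataType>) -> bool {
    match (scale, to_type) {
        // No quantization requested at all.
        (None, None) => true,
        // Both set: scale must be FLOAT/HALF/BF16 and 0-d or 1-d; target must be FP8 or INT8.
        (Some((dtype, rank)), Some(target)) => {
            matches!(dtype, DataType::Float | DataType::Half | DataType::Bf16)
                && rank <= 1
                && matches!(target, DataType::Fp8 | DataType::Int8)
        }
        // Assumption: setting only one of the two is treated as an incomplete configuration.
        _ => false,
    }
}
```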
pub fn getNormalizationQuantizeToType(self: &IAttention) -> DataType
Get the datatype the attention normalization is quantized to.
The datatype the attention normalization is quantized to. The default value is DataType::kFLOAT.
Must be used after normalization quantization to type is set by setNormalizationQuantizeToType.
pub unsafe fn setMetadata(self: Pin<&mut IAttention>, metadata: *const c_char) -> bool
Set the metadata for IAttention.
The metadata is emitted in the JSON returned by IEngineInspector with ProfilingVerbosity set to kDETAILED.
metadata: The per-layer metadata.
The metadata string must be null-terminated and be at most 4096 bytes including the terminator.
See [getMetadata()]
See [getLayerInformation()]
True if the metadata is set successfully, false otherwise.
pub fn getMetadata(self: &IAttention) -> *const c_char
Get the metadata of IAttention.
The metadata as a null-terminated C-style string. If setMetadata() has not been called, an empty string “” will be returned as a default value.
See [setMetadata()]
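The returned `*const c_char` can be copied into an owned Rust `String` with the standard library. The helper below is a hypothetical sketch. The null check is purely defensive, since the text above says an empty string is returned when no metadata has been set.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

/// Copy a null-terminated C string into an owned String.
///
/// # Safety
/// `ptr` must be null or point to a valid null-terminated string
/// that stays alive for the duration of this call.
unsafe fn metadata_to_string(ptr: *const c_char) -> String {
    if ptr.is_null() {
        // Defensive fallback; mirrors the documented empty-string default.
        return String::new();
    }
    // Invalid UTF-8 bytes are replaced rather than causing a panic.
    unsafe { CStr::from_ptr(ptr) }.to_string_lossy().into_owned()
}
```

Copying immediately avoids holding a pointer whose lifetime is tied to the layer object.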
pub fn setNbRanks(self: Pin<&mut IAttention>, nbRanks: i32) -> bool
Set the number of ranks for multi-device attention execution.
When nbRanks > 1, this is a hint that the attention should be executed across multiple devices.
nbRanks: The number of ranks. Must be >= 1.
True if successful, false otherwise.
See [getNbRanks()]
pub fn getNbRanks(self: &IAttention) -> i32
Get the number of ranks for multi-device execution.
The number of ranks configured for multi-device attention. Default is 1.
See [setNbRanks()]
pub fn setQueryForm(self: Pin<&mut IAttention>, form: AttentionIOForm) -> bool
Set the query form.
Default is kPADDED_BHND.
form: The query form.
True if the query form is set successfully, false otherwise.
See [getQueryForm()]
See AttentionIOForm
pub fn getQueryForm(self: &IAttention) -> AttentionIOForm
Get the query form.
The query form. Default is kPADDED_BHND.
See [setQueryForm()]
See AttentionIOForm
pub fn setKeyValueForm(self: Pin<&mut IAttention>, form: AttentionIOForm) -> bool
Set the key-value form.
Default is kPADDED_BHND.
form: The key-value form.
True if the key-value form is set successfully, false otherwise.
See [getKeyValueForm()]
See AttentionIOForm
pub fn getKeyValueForm(self: &IAttention) -> AttentionIOForm
Get the key-value form.
The key-value form. Default is kPADDED_BHND.
See [setKeyValueForm()]
See AttentionIOForm
pub unsafe fn setQueryLengths(self: Pin<&mut IAttention>, lengths: *mut ITensor) -> bool
Set the query lengths tensor.
An optional tensor to specify the cumulative number of tokens per batch element. Must be set when query form is kPACKED_NHD. Ignored when query form is kPADDED_BHND. When set, contains cumulative token counts with shape [batchSize + 1]. The first element should be 0 and the last element equals totalTokens. The number of tokens for batch i is lengths[i + 1] - lengths[i]. The totalTokens dimension of the query tensor must be >= the last element of this tensor.
Providing a first element that is not 0 results in undefined behavior.
lengths: A 1D tensor of type kINT32 with shape [batchSize + 1]. If nullptr, clears a previously set query lengths tensor.
True if the query lengths tensor is set or cleared successfully, false otherwise.
See [getQueryLengths()]
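The cumulative layout described for the query lengths tensor can be built from per-batch token counts as follows. This is a host-side sketch with hypothetical helper names. It produces the values that would populate the kINT32 tensor, not the tensor itself.

```rust
/// Build cumulative token counts of length batchSize + 1, starting at 0.
fn cumulative_lengths(per_batch: &[i32]) -> Vec<i32> {
    let mut out = Vec::with_capacity(per_batch.len() + 1);
    let mut total = 0i32;
    out.push(total); // the first element must be 0
    for &n in per_batch {
        total += n;
        out.push(total); // running total after each batch element
    }
    out
}

/// Number of tokens for batch element i: lengths[i + 1] - lengths[i].
fn tokens_in_batch(lengths: &[i32], i: usize) -> i32 {
    lengths[i + 1] - lengths[i]
}
```

For per-batch counts of 3, 5, and 2 tokens, the cumulative tensor holds [0, 3, 8, 10], so totalTokens is 10.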
pub fn getQueryLengths(self: &IAttention) -> *mut ITensor
Get the query lengths tensor.
The query lengths tensor, or nullptr if not set.
See [setQueryLengths()]
pub unsafe fn setKeyValueLengths(self: Pin<&mut IAttention>, lengths: *mut ITensor) -> bool
Set the key-value lengths tensor.
An optional tensor to specify per-batch key-value lengths. The semantics depend on the key-value form:
- When key-value form is kPADDED_BHND: contains per-batch lengths with shape [batchSize]. Each element must be <= the sequence length dimension of the KV tensor. If not set, the sequence length dimension of the KV tensor is used for all batches.
- When key-value form is kPACKED_NHD: contains cumulative token counts with shape [batchSize + 1]. The first element should be 0 and the last element equals totalTokens. The totalTokens dimension of the KV tensor must be >= the last element of this tensor. Must be set when key-value form is kPACKED_NHD.
When key-value form is kPACKED_NHD, providing a first element that is not 0 results in undefined behavior.
lengths: A 1D tensor of type kINT32. If nullptr, clears a previously set key-value lengths tensor.
True if the key-value lengths tensor is set or cleared successfully, false otherwise.
See [getKeyValueLengths()]
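The two sets of constraints can be checked on the host before the values are placed in the lengths tensor. Both helpers below are hypothetical sketches. The non-negative and non-decreasing checks are assumptions that follow from the values being token counts; the text does not state them explicitly.

```rust
/// kPADDED_BHND: per-batch lengths with shape [batchSize], each <= the KV sequence length.
fn padded_kv_lengths_ok(lengths: &[i32], batch_size: usize, seq_len: i32) -> bool {
    lengths.len() == batch_size && lengths.iter().all(|&n| n >= 0 && n <= seq_len)
}

/// kPACKED_NHD: cumulative counts with shape [batchSize + 1], starting at 0,
/// whose last element does not exceed the totalTokens dimension of the KV tensor.
fn packed_kv_lengths_ok(lengths: &[i32], batch_size: usize, total_tokens_dim: i32) -> bool {
    lengths.len() == batch_size + 1
        && lengths[0] == 0 // a non-zero first element is undefined behavior
        && lengths.windows(2).all(|w| w[0] <= w[1]) // assumed: counts never decrease
        && lengths[lengths.len() - 1] <= total_tokens_dim
}
```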
pub fn getKeyValueLengths(self: &IAttention) -> *mut ITensor
Get the key-value lengths tensor.
The key-value lengths tensor, or nullptr if not set.
See [setKeyValueLengths()]