Struct IAttention

Source

pub struct IAttention { /* private fields */ }


IAttention

Helper for constructing an attention that consumes query, key and value tensors.

An attention subgraph implicitly includes three main components: two MatrixMultiply layers, known as BMM1 and BMM2, and one normalization operation, which defaults to Softmax. By default, IAttention is not decomposable, and TensorRT will try to use a single fused kernel, which may be more efficient than expressing the subgraph without IAttention. Setting decomposable to true allows IAttention to be decomposed into multiple kernels if no fused kernel support is found.

   Query    Key       Value    Mask (optional)    NormalizationQuantizeScale (optional)
     |       |          |             |                             |
     |   Transpose      |             |                             |
     |       |          |             |                             |
     ---BMM1--          |             |                             |
         |              |             |                             |
         *-----------------------------                             |
         |              |                                           |
   Normalization        |                                           |
         |              |                                           |
         *-----------------------------------------------------------
         |              |
         ------BMM2------
                |
              Output
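The data flow in the diagram can be sketched for a single head in plain Rust. This is an illustrative reference computation only, not TensorRT code: `softmax_rows` and `reference_attention` are hypothetical names, the normalization is assumed to be the default Softmax, and no scaling factor is applied because the description above does not mention one.

```rust
/// Row-wise softmax over a matrix of scores (the default normalization).
fn softmax_rows(scores: &[Vec<f32>]) -> Vec<Vec<f32>> {
    scores
        .iter()
        .map(|row| {
            // Subtract the row maximum for numerical stability.
            let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let exps: Vec<f32> = row.iter().map(|s| (s - max).exp()).collect();
            let sum: f32 = exps.iter().sum();
            exps.iter().map(|e| e / sum).collect()
        })
        .collect()
}

/// Hypothetical single-head reference: `q` is [seq_q][dim_head],
/// `k` and `v` are [seq_kv][dim_head]. Returns [seq_q][dim_head].
fn reference_attention(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    // BMM1: query x transpose(key) -> [seq_q][seq_kv]
    let scores: Vec<Vec<f32>> = q
        .iter()
        .map(|qi| {
            k.iter()
                .map(|kj| qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>())
                .collect()
        })
        .collect();

    // Normalization (Softmax assumed).
    let probs = softmax_rows(&scores);

    // BMM2: probabilities x value -> [seq_q][dim_head]
    let dim_head = v.first().map_or(0, |row| row.len());
    probs
        .iter()
        .map(|p| {
            (0..dim_head)
                .map(|c| p.iter().zip(v).map(|(w, row)| w * row[c]).sum::<f32>())
                .collect()
        })
        .collect()
}
```

With one query row that scores two keys equally, the output is the average of the two value rows, which is a quick sanity check for the BMM1, normalization, BMM2 ordering.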

The attention has the following inputs, in order of input index:

Query contains the input query. It is a tensor of type kFLOAT, kHALF or kBF16 with shape [batchSize, numHeadsQuery, sequenceLengthQuery, dimHead]
Key contains the input key. It is a tensor of type kFLOAT, kHALF or kBF16 with shape [batchSize, numHeadsKeyValue, sequenceLengthKeyValue, dimHead]
Value contains the input value. It is a tensor of type kFLOAT, kHALF or kBF16 with shape [batchSize, numHeadsKeyValue, sequenceLengthKeyValue, dimHead]
Mask (optional) contains the mask value. It is a tensor of type kBOOL or of the same data type as the BMM1 output, with shape [batchSize, numHeadsQuery, sequenceLengthQuery, sequenceLengthKeyValue], where batchSize and numHeadsQuery are broadcastable. For a kBOOL mask, a True value indicates that the corresponding position is allowed to attend. For other data types, the mask values are added to the BMM1 output; this is known as an add mask.
NormalizationQuantizeScale (optional) contains the quantization scale for the attention normalization output. It is a 0-D or 1-D tensor of type kFLOAT, kHALF or kBF16.
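The shape relationships in the list above can be sketched as a small pre-flight check. The helpers `check_qkv_shapes` and `check_mask_shape` are hypothetical and not part of this crate; the sketch assumes static dimensions (no dynamic -1 entries) and assumes that "broadcastable" means the mask dimension is either equal to the expected size or 1.

```rust
/// Hypothetical helper: checks query/key/value shapes against the layout above.
/// Shapes are [batchSize, numHeads, sequenceLength, dimHead].
fn check_qkv_shapes(q: [i64; 4], k: [i64; 4], v: [i64; 4]) -> Result<(), String> {
    // Key and value share [batchSize, numHeadsKeyValue, sequenceLengthKeyValue, dimHead].
    if k != v {
        return Err(format!("key shape {k:?} differs from value shape {v:?}"));
    }
    if q[0] != k[0] {
        return Err(format!("batchSize mismatch: {} vs {}", q[0], k[0]));
    }
    if q[3] != k[3] {
        return Err(format!("dimHead mismatch: {} vs {}", q[3], k[3]));
    }
    // numHeadsQuery and numHeadsKeyValue are separate symbols, so they are not compared.
    Ok(())
}

/// Hypothetical helper: checks a mask shape against
/// [batchSize, numHeadsQuery, sequenceLengthQuery, sequenceLengthKeyValue].
fn check_mask_shape(mask: [i64; 4], q: [i64; 4], k: [i64; 4]) -> Result<(), String> {
    let expected = [q[0], q[1], q[2], k[2]];
    for axis in 0..4 {
        // Assumption: only batchSize (axis 0) and numHeadsQuery (axis 1) may be 1.
        let broadcast_ok = axis < 2 && mask[axis] == 1;
        if mask[axis] != expected[axis] && !broadcast_ok {
            return Err(format!(
                "mask axis {axis}: got {}, expected {}",
                mask[axis], expected[axis]
            ));
        }
    }
    Ok(())
}
```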

See https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-with-transformers.html#multi-head-attention-fusion for the complete matrix of fused kernel support.

Do not inherit from this class, as doing so will break forward-compatibility of the API and ABI.

Implementations§

Source §

impl IAttention

Source

pub fn setNormalizationOperation( self: Pin<&mut IAttention>, op: AttentionNormalizationOp, ) -> bool

Set the normalization operation for the attention.

See [getNormalizationOperation()], AttentionNormalizationOp

True if the normalization operation is set successfully, false otherwise.

Source

pub fn getNormalizationOperation(self: &IAttention) -> AttentionNormalizationOp

Get the normalization operation for the attention.

See [setNormalizationOperation()], AttentionNormalizationOp

The normalization operation for the attention. Default is kSOFTMAX.

Source

pub fn setMask(self: Pin<&mut IAttention>, mask: Pin<&mut ITensor>) -> bool

Set the mask to use for the normalization operation.

mask the mask tensor of type kBOOL or of the same data type as the BMM1 output, with a 4-D shape broadcastable to [batchSize, numHeadsQuery, sequenceLengthQuery, sequenceLengthKeyValue]. For a kBOOL mask, a True value indicates that the corresponding position is allowed to attend. For other data types, the mask values are added to the BMM1 output; this is known as an add mask.

See [getMask]

True if the mask is set successfully, false otherwise.
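The difference between the two mask kinds can be illustrated on one row of BMM1 scores. This is a conceptual sketch, not the TensorRT implementation: the `Mask` enum and `apply_mask` are hypothetical, and modeling a disallowed position as negative infinity before the normalization is an assumption (the description only states that True means the position may attend).

```rust
/// Hypothetical model of the two mask kinds described above.
enum Mask<'a> {
    /// kBOOL mask: `true` means the position is allowed to attend.
    Bool(&'a [bool]),
    /// Add mask: values are added to the BMM1 output.
    Add(&'a [f32]),
}

/// Applies a mask to one row of BMM1 scores, returning the masked scores.
fn apply_mask(scores: &[f32], mask: &Mask<'_>) -> Vec<f32> {
    match mask {
        // Assumption: a disallowed position is modeled as -inf so that a
        // following softmax assigns it zero weight.
        Mask::Bool(allowed) => scores
            .iter()
            .zip(allowed.iter())
            .map(|(s, &ok)| if ok { *s } else { f32::NEG_INFINITY })
            .collect(),
        // The add mask is summed element-wise with the scores.
        Mask::Add(bias) => scores
            .iter()
            .zip(bias.iter())
            .map(|(s, b)| s + b)
            .collect(),
    }
}
```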

Source

pub fn getMask(self: Pin<&mut IAttention>) -> *mut ITensor

Get the optional mask in attention.

See [setMask]

The optional mask in attention, nullptr if no mask is set.

Source

pub fn setCausal(self: Pin<&mut IAttention>, isCausal: bool) -> bool

Set whether the attention will run a causal inference. Cannot be used together with setMask().

See [getCausal]

True if the causal inference is set successfully, false otherwise.
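For intuition, causal attention is conventionally equivalent to a lower-triangular boolean mask. The sketch below assumes that convention (query position i may attend only to key positions j <= i, aligned at the top-left when the sequence lengths differ); the description above does not spell out the exact rule TensorRT applies, and `causal_mask` is a hypothetical helper.

```rust
/// Hypothetical helper: builds a boolean mask in which `true` means
/// "allowed to attend", assuming query position i sees key positions j <= i.
fn causal_mask(seq_q: usize, seq_kv: usize) -> Vec<Vec<bool>> {
    (0..seq_q)
        .map(|i| (0..seq_kv).map(|j| j <= i).collect())
        .collect()
}
```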

Source

pub fn getCausal(self: &IAttention) -> bool

Get whether the attention will run a causal inference.

See [setCausal]

True if the attention will run a causal inference, false otherwise. Default is false.

Source

pub fn setDecomposable(self: Pin<&mut IAttention>, decomposable: bool) -> bool

Set whether the attention can be decomposed to use multiple kernels if no fused kernel support is found.

See [getDecomposable]

True if the decomposable attention is set successfully, false otherwise.

Source

pub fn getDecomposable(self: &IAttention) -> bool

Get whether the attention can be decomposed to use multiple kernels if no fused kernel support is found.

True if the attention can be decomposed to use multiple kernels by the compiler, false otherwise. Default is false.

See [setDecomposable]

Source

pub fn setInput( self: Pin<&mut IAttention>, index: i32, input: Pin<&mut ITensor>, ) -> bool

Append or replace an input of this layer with a specific tensor.

index the index of the input to modify.
input the new input tensor.

The indices are as follows:

Input 0 is the input query tensor.
Input 1 is the input key tensor.
Input 2 is the input value tensor.

True if the input tensor is set successfully, false otherwise.

Source

pub fn getNbInputs(self: &IAttention) -> i32

Get the number of inputs of IAttention. IAttention has three inputs.

The number of inputs of IAttention.

Source

pub fn getInput(self: &IAttention, index: i32) -> *mut ITensor

Get the IAttention input corresponding to the given index.

index The index of the input tensor.

The input tensor, or nullptr if the index is out of range.

Source

pub fn getNbOutputs(self: &IAttention) -> i32

Get the number of outputs of a layer. IAttention has one output.

Source

pub fn getOutput(self: &IAttention, index: i32) -> *mut ITensor

Get the IAttention output corresponding to the given index. IAttention has only one output.

index The index of the output tensor.

The indexed output tensor, or nullptr if the index is out of range.

Source

pub unsafe fn setName(self: Pin<&mut IAttention>, name: *const c_char) -> bool

Set the name of the attention.

The name is used in error diagnostics. This method copies the name string.

The string name must be null-terminated, and be at most 4096 bytes including the terminator.

See [getName()]

True if the name is set successfully, false otherwise.
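Because setName takes a raw `*const c_char`, the caller has to produce a null-terminated string that respects the size limit. A minimal sketch using only the standard library follows; `make_layer_name` and `MAX_NAME_BYTES` are hypothetical names introduced for illustration.

```rust
use std::ffi::CString;

/// Limit from the description above: at most 4096 bytes including the terminator.
const MAX_NAME_BYTES: usize = 4096;

/// Hypothetical helper: builds a null-terminated name and checks the size limit.
fn make_layer_name(name: &str) -> Result<CString, String> {
    // CString::new fails if the input contains an interior NUL byte.
    let c_name = CString::new(name).map_err(|e| format!("invalid name: {e}"))?;
    let len = c_name.as_bytes_with_nul().len();
    if len > MAX_NAME_BYTES {
        return Err(format!("name is {len} bytes, limit is {MAX_NAME_BYTES}"));
    }
    Ok(c_name)
}
```

The pointer returned by `c_name.as_ptr()` could then be passed to setName inside an `unsafe` block. Since the method copies the string, the `CString` only needs to stay alive for the duration of that call.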

Source

pub fn getName(self: &IAttention) -> *const c_char

Return the name of the attention.

See [setName()]

The name of the attention.

Source

pub fn setNormalizationQuantizeScale( self: Pin<&mut IAttention>, tensor: Pin<&mut ITensor>, ) -> bool

Set the quantization scale for the attention normalization output.

tensor the quantization scale tensor. Its data type must be DataType::kFLOAT, DataType::kHALF or DataType::kBF16, and it must be a 0-D or 1-D tensor.

True if the quantization scale is set successfully, false otherwise.

Must be used together with setNormalizationQuantizeToType to set normalization output datatype to DataType::kFP8 or DataType::kINT8.
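The constraints on the scale tensor and the quantize-to type can be collected into one check. The sketch below uses a local stand-in enum `DType` rather than the crate's DataType, and `check_normalization_quantization` is a hypothetical helper; it only restates the rules given in this section.

```rust
/// Local stand-in for the data types named in this section (not the crate's DataType).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DType {
    Float,
    Half,
    Bf16,
    Fp8,
    Int8,
}

/// Hypothetical helper: validates a scale tensor description together with
/// the type the normalization output is quantized to.
fn check_normalization_quantization(
    scale_dtype: DType,
    scale_rank: usize,
    to_type: DType,
) -> Result<(), &'static str> {
    if !matches!(scale_dtype, DType::Float | DType::Half | DType::Bf16) {
        return Err("scale must be kFLOAT, kHALF or kBF16");
    }
    if scale_rank > 1 {
        return Err("scale must be a 0-D or 1-D tensor");
    }
    if !matches!(to_type, DType::Fp8 | DType::Int8) {
        return Err("quantize-to type must be kFP8 or kINT8");
    }
    Ok(())
}
```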

Source

pub fn getNormalizationQuantizeScale(self: &IAttention) -> *mut ITensor

Get the quantization scale for the attention normalization output.

The quantization scale for the attention normalization output or nullptr if no quantization scale is set.

Source

pub fn setNormalizationQuantizeToType( self: Pin<&mut IAttention>, type_: DataType, ) -> bool

Set the datatype the attention normalization is quantized to.

type_ the datatype the attention normalization is quantized to. Must be one of DataType::kFP8, DataType::kINT8.

True if the quantization to type is set successfully, false otherwise.

Source

pub fn getNormalizationQuantizeToType(self: &IAttention) -> DataType

Get the datatype the attention normalization is quantized to.

The datatype the attention normalization is quantized to. The default value is DataType::kFLOAT.

Must be used after normalization quantization to type is set by setNormalizationQuantizeToType.

Source

pub unsafe fn setMetadata( self: Pin<&mut IAttention>, metadata: *const c_char, ) -> bool

Set the metadata for IAttention.

The metadata is emitted in the JSON returned by IEngineInspector with ProfilingVerbosity set to kDETAILED.

metadata The per-layer metadata.

The metadata string must be null-terminated and be at most 4096 bytes including the terminator.

See [getMetadata()]

See [getLayerInformation()]

True if the metadata is set successfully, false otherwise.

Source

pub fn getMetadata(self: &IAttention) -> *const c_char

Get the metadata of IAttention.

The metadata as a null-terminated C-style string. If setMetadata() has not been called, an empty string “” will be returned as a default value.

See [setMetadata()]
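Getters such as getMetadata and getName return a raw `*const c_char`. One way to turn such a pointer into an owned Rust string is sketched below with the standard library only; `c_str_to_string` is a hypothetical helper, and the null check is a defensive assumption rather than documented behavior of these getters.

```rust
use std::ffi::{c_char, CStr};

/// Hypothetical helper: copies a null-terminated C string into an owned String.
/// Returns None for a null pointer.
///
/// # Safety
/// `ptr` must be null or point to a valid null-terminated string that stays
/// alive for the duration of the call.
unsafe fn c_str_to_string(ptr: *const c_char) -> Option<String> {
    if ptr.is_null() {
        return None;
    }
    // SAFETY: the caller guarantees `ptr` is a valid null-terminated string.
    let c_str = unsafe { CStr::from_ptr(ptr) };
    // Invalid UTF-8 is replaced rather than treated as an error.
    Some(c_str.to_string_lossy().into_owned())
}
```

Copying into a `String` right away avoids holding a pointer into memory owned by the C++ object.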

Source

pub fn setNbRanks(self: Pin<&mut IAttention>, nbRanks: i32) -> bool

Set the number of ranks for multi-device attention execution.

When nbRanks > 1, this hints attention to perform multi-device attention.

nbRanks The number of ranks. Must be >= 1.

True if successful, false otherwise.

See [getNbRanks()]

Source

pub fn getNbRanks(self: &IAttention) -> i32

Get the number of ranks for multi-device execution.

The number of ranks configured for multi-device attention. Default is 1.

See [setNbRanks()]

Trait Implementations§

Source §

impl ExternType for IAttention

Source §

type Id = (n, v, i, n, f, e, r, _1, (), I, A, t, t, e, n, t, i, o, n)

A type-level representation of the type’s C++ namespace and type name.

Source §

type Kind = Opaque

Either cxx::kind::Opaque or cxx::kind::Trivial.

Source §

impl MakeCppStorage for IAttention

Source §

unsafe fn allocate_uninitialized_cpp_storage() -> *mut IAttention

Allocates heap space for this type in C++ and returns a pointer to that space, but does not initialize that space (i.e. does not yet call a constructor).
