
MultiHeadSelfAttention

Struct MultiHeadSelfAttention 

Source
pub struct MultiHeadSelfAttention<B: Backend> { /* private fields */ }

Scaled dot-product multi-head self-attention with optional chunked computation.

When chunk_size > 0, the query sequence is processed in windows of chunk_size rows. This keeps the forward-pass peak attention memory at O(B · H · chunk_size · N) instead of O(B · H · N²), where B is the batch size, H the number of heads, and N the sequence length. It also keeps each individual WGPU dispatch small enough to avoid OS watchdog (TDR) timeouts.

§⚠ Training memory — chunking reduces dispatch size but NOT total tape

Burn’s forward pass builds an autodiff tape for every transformer layer before loss.backward() runs. At the forward→backward boundary all depth layers’ chunk tensors are simultaneously in GPU memory:

peak = depth × 2 × ceil(N/chunk) × B × H × chunk × N × 4 bytes
     = 12 × 2 × 39 × B × 12 × 64 × 2448 × 4   (ViT-B defaults)
     ≈ 6.56 GB × B

A small chunk_size keeps each GPU dispatch small, which prevents OS watchdog (TDR) timeouts, but the cumulative tape size is essentially the same as for full attention. Reducing training memory at a given batch size would require gradient checkpointing (recomputing attention during the backward pass instead of storing it), which is not yet implemented in this codebase.

Safe configurations (24 GB GPU, ViT-B):

  • batch_size = 2 → all-layers peak ≈ 13 GB ✓
  • batch_size = 4 → all-layers peak ≈ 26 GB ✗ OOM

The crate::training::learner::train function guards against unsafe configurations using --vram-gb to derive the correct limit.
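The tape estimate above, and the kind of check such a guard might perform, can be sketched in a few lines. This is only an illustration: `attn_tape_bytes` and `fits_in_vram` are hypothetical names, not functions from this crate, and the check counts attention tensors only (weights, optimizer state, and other activations are ignored).

```rust
/// Bytes per fp32 element.
const BYTES_F32: u64 = 4;

/// Estimated attention tape size in bytes at the forward→backward boundary,
/// following the formula in this section:
/// depth × 2 × ceil(N / chunk) × B × H × chunk × N × 4.
/// (Hypothetical helper, not part of the crate.)
fn attn_tape_bytes(depth: u64, batch: u64, heads: u64, seq_len: u64, chunk: u64) -> u64 {
    // chunk == 0 means full attention: a single pass over all N query rows.
    let chunk = if chunk == 0 { seq_len } else { chunk };
    // Integer ceil(N / chunk).
    let num_chunks = (seq_len + chunk - 1) / chunk;
    depth * 2 * num_chunks * batch * heads * chunk * seq_len * BYTES_F32
}

/// Hypothetical guard: does `bytes` fit within `vram_gib` GiB?
/// Assumption: a real guard would also reserve headroom for everything
/// that is not an attention tensor.
fn fits_in_vram(vram_gib: u64, bytes: u64) -> bool {
    bytes <= vram_gib * (1u64 << 30)
}
```

With the ViT-B values used above (depth = 12, H = 12, N = 2448, chunk = 64), `attn_tape_bytes` gives 7,038,959,616 bytes per sample, and `fits_in_vram(24, ...)` accepts a batch size of 2 but rejects a batch size of 4.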

§Forward memory comparison (N = 2 448, H = 12, B = 8, fp32)

mode             peak fwd attn tensor     size
full (chunk=0)   (8, 12, 2448, 2448)      ~2.1 GB
chunk=256        (8, 12, 256, 2448)       ~230 MB
chunk=128        (8, 12, 128, 2448)       ~115 MB
chunk=64         (8, 12, 64, 2448)        ~57 MB
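The size of a single fp32 attention tensor follows directly from its shape, so any row of this comparison can be recomputed by hand. A minimal sketch, assuming 4 bytes per element; `chunk_shape` and `attn_tensor_bytes` are hypothetical helpers, not part of this crate:

```rust
/// Shape (B, H, rows, N) of the attention tensor allocated per pass.
/// chunk == 0 means full attention, so rows == N.
/// (Hypothetical helper for illustration.)
fn chunk_shape(batch: u64, heads: u64, seq_len: u64, chunk: u64) -> [u64; 4] {
    let rows = if chunk == 0 { seq_len } else { chunk.min(seq_len) };
    [batch, heads, rows, seq_len]
}

/// Size in bytes of an fp32 tensor with the given shape:
/// product of all dimensions times 4 bytes per element.
fn attn_tensor_bytes(shape: [u64; 4]) -> u64 {
    shape.iter().product::<u64>() * 4
}
```

Because only the `rows` dimension changes between modes, the forward peak scales linearly with chunk_size: halving the chunk halves the tensor.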

Implementations§

Source§

impl<B: Backend> MultiHeadSelfAttention<B>

Source

pub fn new( d_model: usize, num_heads: usize, dropout: f64, chunk_size: usize, device: &B::Device, ) -> Self

Constructs a multi-head self-attention (MHSA) module.

  • chunk_size – query chunk window; 0 disables chunking.
Source

pub fn forward(&self, x: Tensor<B, 3>) -> Tensor<B, 3>

Self-attention: (B, N, D) → (B, N, D).

When chunk_size > 0 the computation is split into ceil(N / chunk_size) passes, each allocating an attention matrix of shape (B, H, chunk_size, N) rather than (B, H, N, N).
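The idea of query chunking can be illustrated without any tensor library. The sketch below is a deliberately simplified stand-in, not this module's implementation: it assumes a single batch element and a single head, omits the Q/K/V projections and dropout, and uses plain `Vec<Vec<f32>>` instead of Burn tensors. `attend` and `chunked_attend` are hypothetical names.

```rust
/// softmax(q · kᵀ / sqrt(d)) · v for the query rows in `q`.
/// `q` is (rows, d); `k` and `v` are (n, d). Returns (rows, d).
fn attend(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let d = k[0].len();
    let scale = 1.0 / (d as f32).sqrt();
    q.iter()
        .map(|qi| {
            // One row of the attention matrix: scaled scores against every key.
            let scores: Vec<f32> = k
                .iter()
                .map(|kj| qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>() * scale)
                .collect();
            // Numerically stable softmax over the row.
            let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
            let sum: f32 = exps.iter().sum();
            // Weighted sum of the value rows.
            let mut out = vec![0.0f32; v[0].len()];
            for (w, vj) in exps.iter().zip(v) {
                for (o, x) in out.iter_mut().zip(vj) {
                    *o += (w / sum) * x;
                }
            }
            out
        })
        .collect()
}

/// Processes the query rows in windows of `chunk_size`, so only a
/// (chunk_size, n) block of scores is live at a time.
/// `chunk_size == 0` disables chunking.
fn chunked_attend(
    q: &[Vec<f32>],
    k: &[Vec<f32>],
    v: &[Vec<f32>],
    chunk_size: usize,
) -> Vec<Vec<f32>> {
    if chunk_size == 0 {
        return attend(q, k, v);
    }
    // ceil(N / chunk_size) passes; every pass sees all keys and values.
    q.chunks(chunk_size).flat_map(|qc| attend(qc, k, v)).collect()
}
```

Each output row depends only on its own query row and on the full set of keys and values, which is why splitting along the query dimension leaves the result unchanged while shrinking the per-pass score block.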

Trait Implementations§

Source§

impl<B> AutodiffModule<B> for MultiHeadSelfAttention<B>

Source§

type InnerModule = MultiHeadSelfAttention<<B as AutodiffBackend>::InnerBackend>

Inner module without auto-differentiation.
Source§

fn valid(&self) -> Self::InnerModule

Get the same module, but on the inner backend without auto-differentiation.
Source§

impl<B: Backend> Clone for MultiHeadSelfAttention<B>

Source§

fn clone(&self) -> Self

Returns a duplicate of the value. Read more
1.0.0 · Source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
Source§

impl<B: Debug + Backend> Debug for MultiHeadSelfAttention<B>

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
Source§

impl<B: Backend> Display for MultiHeadSelfAttention<B>

Source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more
Source§

impl<B: Backend> Module<B> for MultiHeadSelfAttention<B>

Source§

type Record = MultiHeadSelfAttentionRecord<B>

Type to save and load the module.
Source§

fn load_record(self, record: Self::Record) -> Self

Load the module state from a record.
Source§

fn into_record(self) -> Self::Record

Convert the module into a record containing the state.
Source§

fn num_params(&self) -> usize

Get the number of parameters the module has, including all of its sub-modules.
Source§

fn visit<Visitor: ModuleVisitor<B>>(&self, visitor: &mut Visitor)

Visit each tensor parameter in the module with a visitor.
Source§

fn map<Mapper: ModuleMapper<B>>(self, mapper: &mut Mapper) -> Self

Map each tensor parameter in the module with a mapper.
Source§

fn collect_devices(&self, devices: Devices<B>) -> Devices<B>

Return all the devices found in the underneath module tree added to the given vector without duplicates.
Source§

fn to_device(self, device: &B::Device) -> Self

Move the module and all of its sub-modules to the given device. Read more
Source§

fn fork(self, device: &B::Device) -> Self

Fork the module and all of its sub-modules to the given device. Read more
Source§

fn devices(&self) -> Vec<<B as Backend>::Device>

Return all the devices found in the underneath module tree without duplicates.
Source§

fn no_grad(self) -> Self

Each tensor in the module tree will not require grad. Read more
Source§

fn save_file<FR, PB>( self, file_path: PB, recorder: &FR, ) -> Result<(), RecorderError>
where FR: FileRecorder<B>, PB: Into<PathBuf>,

Save the module to a file using the provided file recorder. Read more
Source§

fn load_file<FR, PB>( self, file_path: PB, recorder: &FR, device: &<B as Backend>::Device, ) -> Result<Self, RecorderError>
where FR: FileRecorder<B>, PB: Into<PathBuf>,

Load the module from a file using the provided file recorder. Read more
Source§

fn quantize_weights<C>(self, quantizer: &mut Quantizer<C>) -> Self
where C: Calibration,

Quantize the weights of the module.
Source§

impl<B: Backend> ModuleDisplay for MultiHeadSelfAttention<B>

Source§

fn format(&self, passed_settings: DisplaySettings) -> String

Formats the module with provided display settings. Read more
Source§

fn custom_settings(&self) -> Option<DisplaySettings>

Custom display settings for the module. Read more
Source§

fn custom_content(&self, _content: Content) -> Option<Content>

Custom attributes for the module. Read more
Source§

impl<B: Backend> ModuleDisplayDefault for MultiHeadSelfAttention<B>

Source§

fn content(&self, content: Content) -> Option<Content>

Attributes of the module used for display purposes. Read more
Source§

fn num_params(&self) -> usize

Gets the number of the parameters of the module.

Auto Trait Implementations§

Blanket Implementations§

Source§

impl<T> Any for T
where T: 'static + ?Sized,

Source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
Source§

impl<T> Borrow<T> for T
where T: ?Sized,

Source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
Source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

Source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
Source§

impl<T> CloneToUninit for T
where T: Clone,

Source§

unsafe fn clone_to_uninit(&self, dest: *mut u8)

🔬This is a nightly-only experimental API. (clone_to_uninit)
Performs copy-assignment from self to dest. Read more
Source§

impl<T> From<T> for T

Source§

fn from(t: T) -> T

Returns the argument unchanged.

Source§

impl<T> Instrument for T

Source§

fn instrument(self, span: Span) -> Instrumented<Self>

Instruments this type with the provided Span, returning an Instrumented wrapper. Read more
Source§

fn in_current_span(self) -> Instrumented<Self>

Instruments this type with the current Span, returning an Instrumented wrapper. Read more
Source§

impl<T, U> Into<U> for T
where U: From<T>,

Source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

Source§

impl<T> IntoEither for T

Source§

fn into_either(self, into_left: bool) -> Either<Self, Self>

Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise. Read more
Source§

impl<T> Pointable for T

Source§

const ALIGN: usize

The alignment of pointer.
Source§

type Init = T

The type for initializers.
Source§

unsafe fn init(init: <T as Pointable>::Init) -> usize

Initializes a with the given initializer. Read more
Source§

unsafe fn deref<'a>(ptr: usize) -> &'a T

Dereferences the given pointer. Read more
Source§

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

Mutably dereferences the given pointer. Read more
Source§

unsafe fn drop(ptr: usize)

Drops the object pointed to by the given pointer. Read more
Source§

impl<T> PolicyExt for T
where T: ?Sized,

Source§

fn and<P, B, E>(self, other: P) -> And<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow only if self and other return Action::Follow. Read more
Source§

fn or<P, B, E>(self, other: P) -> Or<T, P>
where T: Policy<B, E>, P: Policy<B, E>,

Create a new Policy that returns Action::Follow if either self or other returns Action::Follow. Read more
Source§

impl<T> Same for T

Source§

type Output = T

Should always be Self
Source§

impl<T> ToOwned for T
where T: Clone,

Source§

type Owned = T

The resulting type after obtaining ownership.
Source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
Source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
Source§

impl<T> ToString for T
where T: Display + ?Sized,

Source§

fn to_string(&self) -> String

Converts the given value to a String. Read more
Source§

impl<T> ToStringFallible for T
where T: Display,

Source§

fn try_to_string(&self) -> Result<String, TryReserveError>

ToString::to_string, but without panic on OOM.

Source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

Source§

type Error = Infallible

The type returned in the event of a conversion error.
Source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
Source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

Source§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
Source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.
Source§

impl<V, T> VZip<V> for T
where V: MultiLane<T>,

Source§

fn vzip(self) -> V

Source§

impl<T> WithSubscriber for T

Source§

fn with_subscriber<S>(self, subscriber: S) -> WithDispatch<Self>
where S: Into<Dispatch>,

Attaches the provided Subscriber to this type, returning a WithDispatch wrapper. Read more
Source§

fn with_current_subscriber(self) -> WithDispatch<Self>

Attaches the current default Subscriber to this type, returning a WithDispatch wrapper. Read more