pub struct FuseAttentionBlock;
Fuses matmul(QKV) → narrow(Q,K,V) → [rope] → attention → matmul(out)
into a single FusedAttentionBlock when batch*seq is small.
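As a rough illustration of the pattern described above, the sketch below matches the op sequence on a linearized list, with the rope step optional. Everything in it is hypothetical: the `Op` enum, its variant names, and `match_attention_block` are not part of this crate, the three `narrow` ops are collapsed into one variant, and a real optimizer would most likely walk a graph rather than a flat slice.

```rust
/// Hypothetical op kinds, simplified for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Op {
    MatMulQkv,
    Narrow, // stands in for the three narrow(Q), narrow(K), narrow(V) ops
    Rope,
    Attention,
    MatMulOut,
    Other,
}

/// Returns how many leading ops form the fusable attention pattern,
/// or `None` if `ops` does not start with that pattern.
/// The rope step is optional, mirroring the `[rope]` notation.
fn match_attention_block(ops: &[Op]) -> Option<usize> {
    use Op::*;
    match ops {
        [MatMulQkv, Narrow, Rope, Attention, MatMulOut, ..] => Some(5),
        [MatMulQkv, Narrow, Attention, MatMulOut, ..] => Some(4),
        _ => None,
    }
}
```

A pass built along these lines would replace the matched ops with a single fused node only when the size check on batch*seq also succeeds.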
The optimizer auto-detects batch size from graph input shapes. For small inputs (batch*seq ≤ 64), intermediate tensors fit in L1 cache, making a monolithic kernel faster than separate BLAS calls.
Threshold is configurable via RLX_FUSE_ATTN_THRESHOLD (default: 64).
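The threshold check can be sketched as follows. This is a minimal illustration, not the crate's implementation: the helper names are hypothetical, it assumes `RLX_FUSE_ATTN_THRESHOLD` is read as an environment variable, and falling back to the default on an unparsable value is an assumption rather than documented behavior.

```rust
use std::env;

/// Default threshold stated in the description above.
const DEFAULT_THRESHOLD: usize = 64;

/// Hypothetical helper: parse a raw threshold string.
/// Assumption: a missing or invalid value falls back to the default.
fn parse_threshold(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse::<usize>().ok())
        .unwrap_or(DEFAULT_THRESHOLD)
}

/// Hypothetical helper: read the threshold from the environment.
fn threshold_from_env() -> usize {
    parse_threshold(env::var("RLX_FUSE_ATTN_THRESHOLD").ok().as_deref())
}

/// Fuse only when batch*seq is at or below the threshold.
/// An overflowing product is treated as "too large to fuse".
fn should_fuse(batch: usize, seq: usize, threshold: usize) -> bool {
    match batch.checked_mul(seq) {
        Some(tokens) => tokens <= threshold,
        None => false,
    }
}
```

With the default of 64, a batch of 4 with sequence length 16 would qualify for fusion under this sketch, while a sequence length of 17 would not.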
Trait Implementations
Auto Trait Implementations
impl Freeze for FuseAttentionBlock
impl RefUnwindSafe for FuseAttentionBlock
impl Send for FuseAttentionBlock
impl Sync for FuseAttentionBlock
impl Unpin for FuseAttentionBlock
impl UnsafeUnpin for FuseAttentionBlock
impl UnwindSafe for FuseAttentionBlock
Blanket Implementations
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
Mutably borrows from an owned value.