Module flash_attention

Flash Attention — block-wise attention for O(N) memory usage.

A Triton-inspired CPU implementation that processes attention in blocks to maximize L1/L2 cache efficiency. The stated gains over naive attention at large sequence lengths are a 2-5× speedup and roughly 75% lower memory use.
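To make the block-wise idea concrete, here is a minimal sketch of single-query attention computed one key/value block at a time with an online softmax, next to a naive version for comparison. It is an illustration under stated assumptions, not this module's implementation: the function names `dot`, `naive_attention`, and `blockwise_attention`, the `Vec<Vec<f32>>` layout, and the `block_size` parameter are all hypothetical and are not taken from the crate's API.

```rust
/// Dot product of two equal-length vectors (hypothetical helper).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Naive attention for one query: holds all N scores in memory at once.
fn naive_attention(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let scores: Vec<f32> = keys.iter().map(|k| dot(q, k) * scale).collect();
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut out = vec![0.0; values[0].len()];
    for (e, v) in exps.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += e / sum * x;
        }
    }
    out
}

/// Block-wise attention for one query using an online softmax.
/// Only `block_size` scores exist at any time; a running maximum and
/// running sum let earlier partial results be rescaled as new blocks arrive.
/// Assumes `block_size > 0` and non-empty `keys`/`values` of equal length.
fn blockwise_attention(
    q: &[f32],
    keys: &[Vec<f32>],
    values: &[Vec<f32>],
    block_size: usize,
) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let mut running_max = f32::NEG_INFINITY;
    let mut running_sum = 0.0f32;
    let mut acc = vec![0.0f32; values[0].len()];

    for (k_block, v_block) in keys.chunks(block_size).zip(values.chunks(block_size)) {
        // Scores for this block only.
        let scores: Vec<f32> = k_block.iter().map(|k| dot(q, k) * scale).collect();
        let block_max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let new_max = running_max.max(block_max);

        // Rescale what has been accumulated so far to the new maximum.
        // On the first block this factor is exp(-inf) = 0, which is harmless
        // because the accumulators are still zero.
        let correction = (running_max - new_max).exp();
        running_sum *= correction;
        for a in acc.iter_mut() {
            *a *= correction;
        }

        // Fold this block's unnormalized weights into the accumulators.
        for (s, v) in scores.iter().zip(v_block) {
            let w = (s - new_max).exp();
            running_sum += w;
            for (a, x) in acc.iter_mut().zip(v) {
                *a += w * x;
            }
        }
        running_max = new_max;
    }

    // Normalize once at the end.
    acc.iter().map(|a| a / running_sum).collect()
}
```

Calling `blockwise_attention(&q, &keys, &values, 4)` should agree with `naive_attention(&q, &keys, &values)` up to floating-point rounding, since the online softmax is an exact reformulation rather than an approximation. The sketch handles one query row; a cache-aware implementation would also tile over queries and use contiguous buffers.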

Reference: “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness” (Dao et al., 2022)

Structs

BenchmarkResult
Result of a benchmark comparing naive vs flash attention.
FlashAttention
Block-wise attention computation optimized for CPU cache locality.
FlashAttentionConfig
Configuration for Flash Attention.
MemoryEstimate
Memory usage estimate.