Flash Attention — block-wise attention for O(N) memory usage.
A Triton-inspired CPU implementation that processes attention in blocks to improve L1/L2 cache efficiency. It achieves a 2-5× speedup and roughly 75% lower memory use than naive attention at large sequence lengths.
Reference: “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness” (Dao et al., 2022)
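The block-wise idea can be illustrated with a minimal single-head sketch. This is not this crate's implementation: the function names `naive_attention` and `blockwise_attention`, the `Vec<Vec<f32>>` layout, and the `block_size` parameter are assumptions made for illustration. It tiles only the key/value side and uses the online-softmax rescaling described in the FlashAttention paper, so each query row keeps just a running maximum, a running sum, and one accumulator instead of a full row of scores. Inputs are assumed to be non-empty.

```rust
/// Dot product of two equally sized slices.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Reference: softmax(Q K^T / sqrt(d)) V, computing every score of a row at once.
pub fn naive_attention(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let scale = 1.0 / (q[0].len() as f32).sqrt();
    q.iter()
        .map(|qi| {
            let scores: Vec<f32> = k.iter().map(|kj| dot(qi, kj) * scale).collect();
            let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
            let sum: f32 = exps.iter().sum();
            let mut out = vec![0.0f32; v[0].len()];
            for (e, vj) in exps.iter().zip(v) {
                for (o, x) in out.iter_mut().zip(vj) {
                    *o += e / sum * x;
                }
            }
            out
        })
        .collect()
}

/// Hypothetical block-wise variant: K and V are visited `block_size` rows at a time.
pub fn blockwise_attention(
    q: &[Vec<f32>],
    k: &[Vec<f32>],
    v: &[Vec<f32>],
    block_size: usize,
) -> Vec<Vec<f32>> {
    assert!(block_size > 0, "block_size must be positive");
    let scale = 1.0 / (q[0].len() as f32).sqrt();
    let mut output = Vec::with_capacity(q.len());
    for qi in q {
        let mut running_max = f32::NEG_INFINITY;
        let mut running_sum = 0.0f32;
        let mut acc = vec![0.0f32; v[0].len()];
        for (k_block, v_block) in k.chunks(block_size).zip(v.chunks(block_size)) {
            // Scores for this block only: at most `block_size` values live at once.
            let scores: Vec<f32> = k_block.iter().map(|kj| dot(qi, kj) * scale).collect();
            let block_max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let new_max = running_max.max(block_max);
            // Rescale earlier partial results to the new maximum.
            // On the first block this is exp(-inf) = 0 and the state is still zero.
            let correction = (running_max - new_max).exp();
            running_sum *= correction;
            for a in acc.iter_mut() {
                *a *= correction;
            }
            for (s, vj) in scores.iter().zip(v_block) {
                let p = (s - new_max).exp();
                running_sum += p;
                for (a, x) in acc.iter_mut().zip(vj) {
                    *a += p * x;
                }
            }
            running_max = new_max;
        }
        // Normalize once at the end, after all blocks have been accumulated.
        output.push(acc.iter().map(|a| a / running_sum).collect());
    }
    output
}
```

Both functions should agree up to floating-point rounding for any positive `block_size`. A cache-oriented implementation would typically also tile the query rows and use contiguous buffers rather than nested `Vec`s.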
Structs
- BenchmarkResult - Result of a benchmark comparing naive vs flash attention.
- FlashAttention - Block-wise attention computation optimized for CPU cache locality.
- FlashAttentionConfig - Configuration for Flash Attention.
- MemoryEstimate - Memory usage estimate.