Module flash_attention

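Flash Attention: block-wise attention with O(N) memory usage in the sequence length N, instead of the O(N²) needed to materialize the full attention score matrix.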

Triton-inspired CPU implementation that processes attention in blocks to maximize L1/L2 cache efficiency. Achieves 2-5× speedup and ~75% memory reduction compared to naive attention for large sequence lengths.

Reference: “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness” (Dao et al., 2022)
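
The block-wise idea described above can be illustrated with a minimal sketch. It is not this module's implementation: the function names `naive_attention` and `blockwise_attention`, the flat row-major `n x d` layout, and the `block` parameter are all assumptions made for the example. The sketch tiles only over keys and values, one query row at a time, and uses the online softmax rescaling from the referenced paper so that a single block of scores is live at any moment.

```rust
/// Hypothetical reference: materializes the full n x n score matrix.
pub fn naive_attention(q: &[f32], k: &[f32], v: &[f32], n: usize, d: usize) -> Vec<f32> {
    let scale = 1.0 / (d as f32).sqrt();
    let mut scores = vec![0.0f32; n * n]; // O(N^2) memory
    for i in 0..n {
        for j in 0..n {
            let dot: f32 = (0..d).map(|c| q[i * d + c] * k[j * d + c]).sum();
            scores[i * n + j] = dot * scale;
        }
    }
    let mut out = vec![0.0f32; n * d];
    for i in 0..n {
        let row = &scores[i * n..(i + 1) * n];
        let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let denom: f32 = row.iter().map(|s| (s - max).exp()).sum();
        for j in 0..n {
            let w = (row[j] - max).exp() / denom;
            for c in 0..d {
                out[i * d + c] += w * v[j * d + c];
            }
        }
    }
    out
}

/// Hypothetical sketch: same result, but keys/values are visited in blocks
/// and the softmax is accumulated incrementally (online softmax).
pub fn blockwise_attention(
    q: &[f32], k: &[f32], v: &[f32], n: usize, d: usize, block: usize,
) -> Vec<f32> {
    assert!(block > 0 && q.len() == n * d && k.len() == n * d && v.len() == n * d);
    let scale = 1.0 / (d as f32).sqrt();
    let mut out = vec![0.0f32; n * d];
    let mut scores = vec![0.0f32; block]; // only one block of scores is kept

    for i in 0..n {
        let qi = &q[i * d..(i + 1) * d];
        let acc = &mut out[i * d..(i + 1) * d]; // unnormalized output row
        let mut running_max = f32::NEG_INFINITY;
        let mut denom = 0.0f32;

        for start in (0..n).step_by(block) {
            let end = (start + block).min(n);
            // Scaled dot products for this block, plus the block maximum.
            let mut block_max = f32::NEG_INFINITY;
            for (s, j) in scores.iter_mut().zip(start..end) {
                let kj = &k[j * d..(j + 1) * d];
                *s = qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>() * scale;
                block_max = block_max.max(*s);
            }
            // Rescale earlier partial sums to the new running maximum.
            // On the first block this factor is exp(-inf) = 0.
            let new_max = running_max.max(block_max);
            let correction = (running_max - new_max).exp();
            denom *= correction;
            acc.iter_mut().for_each(|x| *x *= correction);
            // Accumulate this block's weighted values.
            for (s, j) in scores.iter().zip(start..end) {
                let w = (s - new_max).exp();
                denom += w;
                let vj = &v[j * d..(j + 1) * d];
                acc.iter_mut().zip(vj).for_each(|(x, y)| *x += w * y);
            }
            running_max = new_max;
        }
        // Final softmax normalization for this query row.
        acc.iter_mut().for_each(|x| *x /= denom);
    }
    out
}
```

Both functions should agree up to floating-point rounding. The block-wise version needs `block` scores of scratch space rather than `n * n`, which is where the memory saving comes from; a cache-tuned implementation would typically also tile the query rows.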

Structs

BenchmarkResult: Result of a benchmark comparing naive vs flash attention.
FlashAttention: Block-wise attention computation optimized for CPU cache locality.
FlashAttentionConfig: Configuration for Flash Attention.
MemoryEstimate: Memory usage estimate.
