§Portable SIMD Scan Kernels (Task 6)
Provides a family of SIMD kernels that avoid gather pathologies and work across diverse hardware:
- AVX-512: Gather or permute-based
- AVX2: Byte LUT via shuffle + partial sums
- NEON: Table lookup primitives
- Scalar: Universal fallback
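The tiering above can be illustrated with runtime feature detection from the standard library. This is only a minimal sketch: the `Level` enum, `detect_level`, `select_kernel`, and `scan_scalar` are hypothetical names introduced for illustration, not this module's actual API, and every tier is routed to the scalar routine so that the example stays portable.

```rust
/// Hypothetical SIMD tiers (illustrative only, not this crate's real enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Scalar,
    Neon,
    Avx2,
    Avx512,
}

/// Pick the most capable tier the running CPU reports.
pub fn detect_level() -> Level {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::arch::is_x86_feature_detected!("avx512f") {
            return Level::Avx512;
        }
        if std::arch::is_x86_feature_detected!("avx2") {
            return Level::Avx2;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return Level::Neon;
        }
    }
    // Universal fallback when no SIMD tier is detected.
    Level::Scalar
}

/// Signature shared by all kernels in this sketch (an assumption).
pub type ScanFn = fn(&[u8], &[u16; 256]) -> u32;

/// Scalar kernel: sum one table entry per byte code.
pub fn scan_scalar(codes: &[u8], lut: &[u16; 256]) -> u32 {
    codes.iter().map(|&c| u32::from(lut[usize::from(c)])).sum()
}

/// Map a tier to a kernel. A real build would return specialised
/// AVX-512/AVX2/NEON routines; this sketch always returns the scalar one.
pub fn select_kernel(level: Level) -> ScanFn {
    match level {
        Level::Avx512 | Level::Avx2 | Level::Neon | Level::Scalar => scan_scalar,
    }
}
```

Detecting once and caching the returned function pointer keeps the feature check out of the hot loop.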
§Design Principles
- Prefer layouts that allow structured loads (SoA)
- Use int16/int32 accumulation to reduce bandwidth
- Minimize unpredictable memory refs
- Maximize instruction-level parallelism (ILP)
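A scalar sketch of how the first two principles might combine is shown below. It assumes 4-bit codes stored subspace-major (SoA) and one 16-entry byte table per subspace; that layout, the function name `scan_soa`, and the flush interval are assumptions for illustration and are not taken from this module.

```rust
/// Scan `n` vectors whose codes are stored subspace-major (SoA):
/// `codes[m * n + i]` is the 4-bit code of vector `i` in subspace `m`,
/// and `luts[m]` holds 16 quantised partial distances for subspace `m`.
/// (Hypothetical layout, chosen for illustration.)
pub fn scan_soa(codes: &[u8], luts: &[[u8; 16]], n: usize) -> Vec<u32> {
    assert_eq!(codes.len(), luts.len() * n);
    let mut out = vec![0u32; n];
    if n == 0 {
        return out;
    }
    // Each add is at most 255, so 257 adds fit in a u16 (257 * 255 = 65535).
    const FLUSH: usize = 257;
    let mut acc = vec![0u16; n];
    let mut pending = 0usize;

    // One contiguous row of codes per subspace: a structured, sequential load.
    for (row, lut) in codes.chunks_exact(n).zip(luts) {
        for (a, &c) in acc.iter_mut().zip(row) {
            *a += u16::from(lut[usize::from(c & 0x0F)]);
        }
        pending += 1;
        if pending == FLUSH {
            flush(&mut acc, &mut out);
            pending = 0;
        }
    }
    flush(&mut acc, &mut out);
    out
}

/// Widen the narrow accumulators into the 32-bit totals and reset them.
fn flush(acc: &mut [u16], out: &mut [u32]) {
    for (o, a) in out.iter_mut().zip(acc.iter_mut()) {
        *o += u32::from(*a);
        *a = 0;
    }
}
```

Keeping the running sums in 16 bits halves the accumulator traffic relative to 32-bit sums, at the cost of a periodic widening pass.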
§Math/Algorithm
The inner loop is Θ(N_scanned), so work grows linearly with the number of items scanned. Performance is dominated by:
- Memory access patterns
- Instruction throughput
Kernel design therefore aims to minimize cache misses and maximize ILP.
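One common way to raise ILP in a linear scan is to split the reduction across several independent accumulators, so that consecutive adds do not wait on each other. The following is a sketch of that general technique, not this module's kernel; `sum_lut_ilp` and the 256-entry table are hypothetical.

```rust
/// Sum table entries for a stream of byte codes using four independent
/// accumulators, which shortens the dependency chain between adds.
/// (Illustrative helper; assumes the total fits in a u32.)
pub fn sum_lut_ilp(codes: &[u8], lut: &[u16; 256]) -> u32 {
    let mut acc = [0u32; 4];
    let mut chunks = codes.chunks_exact(4);

    for c in &mut chunks {
        // The four adds target different accumulators and can overlap.
        acc[0] += u32::from(lut[usize::from(c[0])]);
        acc[1] += u32::from(lut[usize::from(c[1])]);
        acc[2] += u32::from(lut[usize::from(c[2])]);
        acc[3] += u32::from(lut[usize::from(c[3])]);
    }

    // Handle the 0 to 3 codes left over after the unrolled loop.
    let tail: u32 = chunks
        .remainder()
        .iter()
        .map(|&c| u32::from(lut[usize::from(c)]))
        .sum();

    acc[0] + acc[1] + acc[2] + acc[3] + tail
}
```

The codes are read sequentially, which keeps the memory access pattern predictable; only the small table is indexed by data.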
Structs§
- Avx2Kernel
- CpuFeatures: Detected CPU features
- KernelDispatcher: Global kernel dispatcher that selects best implementation
- ScalarKernel: Scalar fallback implementation (works everywhere)
- ScanOps: High-level scan operations using best available SIMD
Enums§
- SimdLevel: SIMD level for kernel dispatch
Traits§
- DistanceKernel: Trait for portable distance kernels