Int8 Dot Product Kernel
This module provides SIMD-accelerated int8 dot product computation for reranking candidates after the initial BPS scan.
§Algorithm
dot(Q, V) = Σ_{d=0}^{D-1} Q[d] × V[d]
§Overflow Analysis
For D=768 dimensions with i8 values in [-127, 127]:
max_product = 127 × 127 = 16,129
max_sum = 768 × 16,129 = 12,387,072 < 2^31 - 1 (i32 max)
Thus, i32 accumulation is sufficient.
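As a point of reference, the formula and the overflow bound can be written out in scalar form. This is a minimal sketch, not the module's implementation: the names dot_i8_scalar and worst_case_sum are hypothetical, and the bound assumes inputs restricted to [-127, 127] as stated above.

```rust
/// Scalar reference (hypothetical name): widen each i8 to i32,
/// multiply element-wise, and accumulate the products in i32.
fn dot_i8_scalar(q: &[i8], v: &[i8]) -> i32 {
    assert_eq!(q.len(), v.len(), "vectors must have equal length");
    q.iter()
        .zip(v)
        .map(|(&a, &b)| i32::from(a) * i32::from(b))
        .sum()
}

/// Worst-case magnitude of the sum for `dims` dimensions,
/// assuming every value satisfies |x| <= 127 (hypothetical helper).
fn worst_case_sum(dims: i64) -> i64 {
    dims * 127 * 127
}
```

With dims = 768 the helper returns 12,387,072, which matches the bound above and leaves ample headroom below i32::MAX.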
§Implementation Strategy
§x86_64 AVX2
Uses sign-extension to i16 followed by _mm256_madd_epi16:
- Load 32 i8 values
- Sign-extend to 2×16 i16 values
- Multiply-add adjacent pairs: (a0×b0 + a1×b1) -> i32
- Accumulate i32 results
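The steps above can be sketched with the std::arch intrinsics as follows. This is an illustration under stated assumptions rather than the module's actual code: the function names are hypothetical, AVX2 is detected at runtime, and a scalar loop handles the tail and any CPU without AVX2.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// AVX2 kernel (hypothetical name). Caller must ensure AVX2 is available
/// and that `q` and `v` have the same length.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_i8_avx2_kernel(q: &[i8], v: &[i8]) -> i32 {
    let n = q.len();
    let mut acc = _mm256_setzero_si256(); // eight i32 partial sums
    let mut i = 0;
    while i + 32 <= n {
        // Load 32 i8 values from each vector (unaligned loads).
        let a = _mm256_loadu_si256(q.as_ptr().add(i) as *const __m256i);
        let b = _mm256_loadu_si256(v.as_ptr().add(i) as *const __m256i);
        // Sign-extend each 16-byte half to 16 i16 lanes.
        let a_lo = _mm256_cvtepi8_epi16(_mm256_castsi256_si128(a));
        let a_hi = _mm256_cvtepi8_epi16(_mm256_extracti128_si256::<1>(a));
        let b_lo = _mm256_cvtepi8_epi16(_mm256_castsi256_si128(b));
        let b_hi = _mm256_cvtepi8_epi16(_mm256_extracti128_si256::<1>(b));
        // Multiply i16 lanes and add adjacent pairs into i32, then accumulate.
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(a_lo, b_lo));
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(a_hi, b_hi));
        i += 32;
    }
    // Horizontal sum of the eight i32 lanes.
    let mut lanes = [0i32; 8];
    _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, acc);
    let mut sum: i32 = lanes.iter().sum();
    // Scalar tail for the remaining (< 32) elements.
    while i < n {
        sum += i32::from(q[i]) * i32::from(v[i]);
        i += 1;
    }
    sum
}

/// Safe wrapper (hypothetical name): uses AVX2 when detected, else scalar.
fn dot_i8_avx2_sketch(q: &[i8], v: &[i8]) -> i32 {
    assert_eq!(q.len(), v.len(), "vectors must have equal length");
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 was detected and the lengths were checked above.
            return unsafe { dot_i8_avx2_kernel(q, v) };
        }
    }
    q.iter().zip(v).map(|(&a, &b)| i32::from(a) * i32::from(b)).sum()
}
```

Each _mm256_madd_epi16 lane holds the sum of two products, which is at most 2 × 16,384 in magnitude, so the intermediate values stay well within i32.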
§ARM NEON
Uses vmull_s8 to multiply 8 i8 pairs to i16, then vpadalq_s16 to
widen and accumulate to i32.
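A comparable sketch of the NEON path, again with hypothetical function names and not taken from the module's source. It assumes the stable std::arch::aarch64 intrinsics and falls back to a scalar loop on other targets, so the same code compiles everywhere.

```rust
#[cfg(target_arch = "aarch64")]
use std::arch::aarch64::*;

/// NEON kernel (hypothetical name). Caller must ensure NEON is available
/// and that `q` and `v` have the same length.
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon")]
unsafe fn dot_i8_neon_kernel(q: &[i8], v: &[i8]) -> i32 {
    let n = q.len();
    let mut acc = vdupq_n_s32(0); // four i32 partial sums
    let mut i = 0;
    while i + 8 <= n {
        let a = vld1_s8(q.as_ptr().add(i)); // 8 x i8
        let b = vld1_s8(v.as_ptr().add(i)); // 8 x i8
        let prod = vmull_s8(a, b); // 8 x i16 widened products
        acc = vpadalq_s16(acc, prod); // add adjacent pairs into 4 x i32
        i += 8;
    }
    // Horizontal sum of the four i32 lanes.
    let mut sum = vaddvq_s32(acc);
    // Scalar tail for the remaining (< 8) elements.
    while i < n {
        sum += i32::from(q[i]) * i32::from(v[i]);
        i += 1;
    }
    sum
}

/// Safe wrapper (hypothetical name): uses NEON when detected, else scalar.
fn dot_i8_neon_sketch(q: &[i8], v: &[i8]) -> i32 {
    assert_eq!(q.len(), v.len(), "vectors must have equal length");
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            // SAFETY: NEON was detected and the lengths were checked above.
            return unsafe { dot_i8_neon_kernel(q, v) };
        }
    }
    q.iter().zip(v).map(|(&a, &b)| i32::from(a) * i32::from(b)).sum()
}
```

Because vmull_s8 widens before multiplying, each i16 product is exact for inputs in [-127, 127], and vpadalq_s16 moves the running total into i32 on every iteration.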
§Future: VNNI/SDOT
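- x86 VNNI: _mm256_dpbssd_epi32 (single instruction, i8×i8→i32). This signed×signed form belongs to the AVX-VNNI-INT8 extension; AVX-512 VNNI itself provides only the u8×i8 form, _mm512_dpbusd_epi32.
- ARM v8.2 SDOT: vdotq_s32 (single instruction, i8×i8→i32)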
Functions§
- dot_i8: Compute the dot product of two i8 vectors.
- dot_i8_batch: Compute dot products for a batch of vectors with dequantization.
- dot_i8_indexed: Compute dot products for indexed candidates.
- l2_distance_i8: Compute squared L2 distance between two i8 vectors.
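The signatures of these functions are not shown here. As an illustration only, a scalar squared L2 distance over i8 vectors might look like the sketch below; the name l2_sq_i8_sketch and its signature are assumptions, not the module's API.

```rust
/// Scalar squared L2 distance (hypothetical name and signature).
/// Each difference is taken in i32, so it cannot overflow i8.
fn l2_sq_i8_sketch(a: &[i8], b: &[i8]) -> i32 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            let d = i32::from(x) - i32::from(y);
            d * d
        })
        .sum()
}
```

For values in [-127, 127] each squared difference is at most 254² = 64,516, so 768 dimensions sum to at most 49,548,288 and i32 accumulation remains sufficient here as well.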