Module dot_i8

Int8 Dot Product Kernel

This module provides SIMD-accelerated int8 dot product computation for reranking candidates after the initial BPS scan.

§Algorithm

dot(Q, V) = Σ_{d=0}^{D-1} Q[d] × V[d]
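
As a point of reference, the formula above can be written as a plain scalar loop. This is a minimal sketch, not the module's implementation; the name `dot_i8_scalar` and its signature are assumptions made for illustration.

```rust
/// Scalar reference for dot(Q, V) = Σ Q[d] × V[d], accumulated in i32.
/// `dot_i8_scalar` is a hypothetical name, not part of this module's API.
pub fn dot_i8_scalar(q: &[i8], v: &[i8]) -> i32 {
    assert_eq!(q.len(), v.len(), "vectors must have the same dimension");
    q.iter()
        .zip(v.iter())
        // Widen both operands to i32 before multiplying so the product cannot wrap.
        .map(|(&a, &b)| i32::from(a) * i32::from(b))
        .sum()
}
```

A scalar version like this is mainly useful as a correctness oracle when testing SIMD paths.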

§Overflow Analysis

For D=768 dimensions with i8 values in [-127, 127]:

max_product = 127 × 127 = 16,129
max_sum = 768 × 16,129 = 12,387,072 < 2^31 - 1 (i32 max)

Thus, i32 accumulation is sufficient.
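
The bound above can be checked mechanically. The sketch below assumes the same value range of [-127, 127]; `max_abs_dot` and `MAX_SAFE_DIM` are hypothetical helpers introduced only for this check.

```rust
/// Worst-case |dot| for `dim` dimensions when every i8 value lies in [-127, 127].
/// Computed in i64 so that the check itself cannot overflow.
pub const fn max_abs_dot(dim: i64) -> i64 {
    dim * 127 * 127
}

/// Largest dimension for which the worst case still fits in an i32 accumulator
/// (hypothetical constant, derived from the bound above).
pub const MAX_SAFE_DIM: i64 = (i32::MAX as i64) / (127 * 127);
```

For D=768 the worst case uses well under 1% of the i32 range, so the same argument holds for considerably larger dimensions.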

§Implementation Strategy

§x86_64 AVX2

Uses sign-extension to i16 followed by _mm256_madd_epi16:

  1. Load 32 i8 values
  2. Sign-extend to 2×16 i16 values
  3. Multiply adjacent i16 pairs and add: (a0×b0 + a1×b1) → i32
  4. Accumulate i32 results
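
The four steps above can be sketched with `std::arch` intrinsics. This is an illustrative version under stated assumptions, not the module's code: the function names (`dot_scalar`, `dot_avx2`, `dot_i8_sketch`), the runtime feature check, and the scalar tail handling are all choices made for this example.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Portable fallback, also used for the tail that does not fill a 32-byte chunk.
fn dot_scalar(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| i32::from(x) * i32::from(y)).sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn dot_avx2(a: &[i8], b: &[i8]) -> i32 {
    let n = a.len().min(b.len());
    let chunks = n / 32;
    let mut acc = _mm256_setzero_si256();
    for i in 0..chunks {
        // 1. Load 32 i8 values from each input (unaligned loads).
        let va = _mm256_loadu_si256(a.as_ptr().add(i * 32) as *const __m256i);
        let vb = _mm256_loadu_si256(b.as_ptr().add(i * 32) as *const __m256i);
        // 2. Sign-extend each 16-byte half to 16 i16 lanes.
        let a_lo = _mm256_cvtepi8_epi16(_mm256_castsi256_si128(va));
        let a_hi = _mm256_cvtepi8_epi16(_mm256_extracti128_si256::<1>(va));
        let b_lo = _mm256_cvtepi8_epi16(_mm256_castsi256_si128(vb));
        let b_hi = _mm256_cvtepi8_epi16(_mm256_extracti128_si256::<1>(vb));
        // 3. Multiply adjacent i16 pairs and add them into i32 lanes.
        let p_lo = _mm256_madd_epi16(a_lo, b_lo);
        let p_hi = _mm256_madd_epi16(a_hi, b_hi);
        // 4. Accumulate the i32 results.
        acc = _mm256_add_epi32(acc, _mm256_add_epi32(p_lo, p_hi));
    }
    // Horizontal sum of the eight i32 lanes, then the scalar tail.
    let mut lanes = [0i32; 8];
    _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, acc);
    lanes.iter().sum::<i32>() + dot_scalar(&a[chunks * 32..n], &b[chunks * 32..n])
}

/// Dispatches to AVX2 when the CPU supports it, otherwise to the scalar loop.
pub fn dot_i8_sketch(a: &[i8], b: &[i8]) -> i32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safe to call: AVX2 availability was just verified at runtime.
            return unsafe { dot_avx2(a, b) };
        }
    }
    dot_scalar(a, b)
}
```

Because the inputs are sign-extended i8 values, each pair sum from `_mm256_madd_epi16` stays far below the i32 limit, so no intermediate saturation handling is needed in this sketch.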

§ARM NEON

Uses vmull_s8 to multiply 8 i8 pairs to i16, then vpadalq_s16 to widen and accumulate to i32.
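
A rough sketch of the NEON path described above, assuming 8 elements per iteration and a scalar tail. The function names and the fallback on non-aarch64 targets are assumptions for illustration; only `vmull_s8` and `vpadalq_s16` are taken from the description.

```rust
#[cfg(target_arch = "aarch64")]
use std::arch::aarch64::*;

/// Portable fallback, also used for the tail that does not fill an 8-byte chunk.
fn dot_scalar(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).map(|(&x, &y)| i32::from(x) * i32::from(y)).sum()
}

#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon")]
unsafe fn dot_neon(a: &[i8], b: &[i8]) -> i32 {
    let n = a.len().min(b.len());
    let chunks = n / 8;
    let mut acc = vdupq_n_s32(0);
    for i in 0..chunks {
        // Load 8 i8 values from each input.
        let va = vld1_s8(a.as_ptr().add(i * 8));
        let vb = vld1_s8(b.as_ptr().add(i * 8));
        // Widening multiply: 8 i8 pairs -> 8 i16 products.
        let prod = vmull_s8(va, vb);
        // Pairwise add adjacent i16 products and accumulate into 4 i32 lanes.
        acc = vpadalq_s16(acc, prod);
    }
    // Horizontal sum of the four i32 lanes, then the scalar tail.
    vaddvq_s32(acc) + dot_scalar(&a[chunks * 8..n], &b[chunks * 8..n])
}

/// Uses NEON on aarch64 when available, otherwise the scalar loop.
pub fn dot_i8_neon_sketch(a: &[i8], b: &[i8]) -> i32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            // Safe to call: NEON availability was just verified at runtime.
            return unsafe { dot_neon(a, b) };
        }
    }
    dot_scalar(a, b)
}
```

Each i8×i8 product has magnitude at most 16,384, so it always fits in the i16 lanes produced by `vmull_s8` before being widened to i32.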

§Future: VNNI/SDOT

  • AVX-VNNI-INT8: _mm256_dpbssd_epi32 (single instruction, i8×i8→i32); AVX-512 VNNI itself only provides the u8×i8 form (vpdpbusd)
  • ARM v8.2 SDOT (optional dot-product extension): vdotq_s32 (single instruction, i8×i8→i32)
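
To show what the single-instruction forms listed above compute, here is a scalar model of a 128-bit signed dot-product step: each of four i32 lanes accumulates the sum of four adjacent i8×i8 products. It does not call the intrinsics themselves; `dot_i8_dot4_model` is a hypothetical name, and the 16-byte chunking is an assumption chosen to mirror a 4-lane register.

```rust
/// Scalar model of a 4-lane signed dot-product accumulate (i8×i8→i32).
/// Hypothetical helper; it only mirrors the arithmetic, not the hardware path.
pub fn dot_i8_dot4_model(a: &[i8], b: &[i8]) -> i32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let mut lanes = [0i32; 4];
    let mut ca = a.chunks_exact(16);
    let mut cb = b.chunks_exact(16);
    for (xa, xb) in ca.by_ref().zip(cb.by_ref()) {
        for lane in 0..4 {
            // One lane: four adjacent products summed into a single i32.
            for k in 0..4 {
                let idx = lane * 4 + k;
                let p = i32::from(xa[idx]) * i32::from(xb[idx]);
                lanes[lane] = lanes[lane].wrapping_add(p);
            }
        }
    }
    // Elements that do not fill a 16-byte chunk are handled one by one.
    let tail: i32 = ca
        .remainder()
        .iter()
        .zip(cb.remainder())
        .map(|(&x, &y)| i32::from(x) * i32::from(y))
        .sum();
    lanes.iter().sum::<i32>() + tail
}
```

Compared with the widen-then-multiply approaches, this shape removes the separate sign-extension and pairwise-add steps, which is the motivation for adopting these instructions later.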

Functions§

dot_i8
Compute the dot product of two i8 vectors.
dot_i8_batch
Compute dot products for a batch of vectors with dequantization.
dot_i8_indexed
Compute dot products for indexed candidates.
l2_distance_i8
Compute squared L2 distance between two i8 vectors.
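
The signatures of these functions are not shown on this page. As a hedged illustration of the last entry only, a scalar squared L2 distance over i8 vectors might look like the following; `l2_sq_i8_scalar` is a hypothetical name and the i32 return type is an assumption.

```rust
/// Scalar squared L2 distance: Σ (a[d] - b[d])², accumulated in i32.
/// Hypothetical reference; not the module's `l2_distance_i8`.
pub fn l2_sq_i8_scalar(a: &[i8], b: &[i8]) -> i32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    a.iter()
        .zip(b)
        .map(|(&x, &y)| {
            // Widen before subtracting: the difference of two i8 values can exceed i8.
            let d = i32::from(x) - i32::from(y);
            d * d
        })
        .sum()
}
```

With values in [-127, 127] the largest per-dimension term is 254² = 64,516, so for D=768 the sum is at most 49,548,288, which again fits comfortably in i32.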