Module fastscan

Module fastscan 

Source
Expand description

FastScan SIMD-accelerated distance computation for quantized vectors

FastScan uses SIMD shuffle instructions (pshufb/vqtbl1q) to perform parallel LUT lookups, computing distances for 32 neighbors at once.

§Performance

Benchmark on M3 Max showed 5x speedup vs per-neighbor ADC:

  • Per-neighbor ADC: 1.93 µs for 32 neighbors
  • FastScan NEON: 390 ns for 32 neighbors

§Memory Layout

FastScan requires codes to be interleaved by sub-quantizer position:

[n0_sq0, n1_sq0, ..., n31_sq0]  // 32 bytes - sub-quantizer 0 for all neighbors
[n0_sq1, n1_sq1, ..., n31_sq1]  // 32 bytes - sub-quantizer 1 for all neighbors

For 4-bit RaBitQ with 768 dimensions:

  • code_size = 768 / 2 = 384 bytes per vector
  • 384 sub-quantizers, each holding 2 dimension codes (lo/hi nibbles)

§LUT Format

For 4-bit quantization, each sub-quantizer has TWO 16-entry u8 LUTs:

  • luts_lo[sq][code] for the lo nibble (even dimension)
  • luts_hi[sq][code] for the hi nibble (odd dimension)

Structs§

FastScanLUT
Quantized LUT for FastScan (u8 distances for SIMD efficiency)

Constants§

BATCH_SIZE
Batch size for FastScan - AVX2/NEON process 32 bytes at a time

Functions§

fastscan_batch
Choose the best FastScan implementation for the current platform
fastscan_batch_neon
Compute batched distances using FastScan NEON (ARM)
fastscan_batch_scalar
Scalar fallback for platforms without SIMD
fastscan_batch_with_lut
Convenience wrapper using FastScanLUT struct