Expand description
FastScan SIMD-accelerated distance computation for quantized vectors
FastScan uses SIMD shuffle instructions (pshufb/vqtbl1q) to perform parallel LUT lookups, computing distances for 32 neighbors at once.
§Performance
Benchmark on M3 Max showed 5x speedup vs per-neighbor ADC:
- Per-neighbor ADC: 1.93 µs for 32 neighbors
- FastScan NEON: 390 ns for 32 neighbors
§Memory Layout
FastScan requires codes to be interleaved by sub-quantizer position:
[n0_sq0, n1_sq0, ..., n31_sq0] // 32 bytes - sub-quantizer 0 for all neighbors
[n0_sq1, n1_sq1, ..., n31_sq1] // 32 bytes - sub-quantizer 1 for all neighborsFor 4-bit RaBitQ with 768 dimensions:
- code_size = 768 / 2 = 384 bytes per vector
- 384 sub-quantizers, each holding 2 dimension codes (lo/hi nibbles)
§LUT Format
For 4-bit quantization, each sub-quantizer has TWO 16-entry u8 LUTs:
luts_lo[sq][code]for the lo nibble (even dimension)luts_hi[sq][code]for the hi nibble (odd dimension)
Structs§
- Fast
ScanLUT - Quantized LUT for FastScan (u8 distances for SIMD efficiency)
Constants§
- BATCH_
SIZE - Batch size for FastScan - AVX2/NEON process 32 bytes at a time
Functions§
- fastscan_
batch - Choose the best FastScan implementation for the current platform
- fastscan_
batch_ neon - Compute batched distances using FastScan NEON (ARM)
- fastscan_
batch_ scalar - Scalar fallback for platforms without SIMD
- fastscan_
batch_ with_ lut - Convenience wrapper using FastScanLUT struct