Module fastscan

Expand description

FastScan SIMD-accelerated distance computation for quantized vectors

FastScan uses SIMD shuffle instructions (pshufb/vqtbl1q) to perform parallel LUT lookups, computing distances for 32 neighbors at once.

§Performance

Benchmark on M3 Max showed 5x speedup vs per-neighbor ADC:

Per-neighbor ADC: 1.93 µs for 32 neighbors
FastScan NEON: 390 ns for 32 neighbors

§Memory Layout

FastScan requires codes to be interleaved by sub-quantizer position:

[n0_sq0, n1_sq0, ..., n31_sq0]  // 32 bytes - sub-quantizer 0 for all neighbors
[n0_sq1, n1_sq1, ..., n31_sq1]  // 32 bytes - sub-quantizer 1 for all neighbors

For 4-bit RaBitQ with 768 dimensions:

code_size = 768 / 2 = 384 bytes per vector
384 sub-quantizers, each holding 2 dimension codes (lo/hi nibbles)

§LUT Format

For 4-bit quantization, each sub-quantizer has TWO 16-entry u8 LUTs:

luts_lo[sq][code] for the lo nibble (even dimension)
luts_hi[sq][code] for the hi nibble (odd dimension)

Structs§

FastScanLUT: Quantized LUT for FastScan (u8 distances for SIMD efficiency)

Constants§

BATCH_SIZE: Batch size for FastScan - AVX2/NEON process 32 bytes at a time

Functions§

fastscan_batch: Choose the best FastScan implementation for the current platform
fastscan_batch_neon: Compute batched distances using FastScan NEON (ARM)
fastscan_batch_scalar: Scalar fallback for platforms without SIMD
fastscan_batch_with_lut: Convenience wrapper using FastScanLUT struct

Module fastscan

Module fastscan Copy item path

§Performance

§Memory Layout

§LUT Format

Structs§

Constants§

Functions§

Module fastscan