Quantized BLAS: 4-bit and 8-bit tensor storage for memory-efficient inference, with i8/i4 dequantization into BinnedAccumulator.
§Design
Quantized integer products (i8×i8 → i32) are dequantized and fed directly into the BinnedAccumulator, bypassing intermediate f32 rounding entirely. This eliminates a major source of non-determinism in quantized inference.
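The design above can be sketched in a few lines. `BinnedAccumulator` is this module's own type, so a plain f64 sum stands in for it here; the function name `quantized_dot_i8_sketch` and the per-tensor scale parameters are assumptions for illustration, not this module's API:

```rust
// Sketch: i8 products are widened to i32 exactly, then dequantized once
// with the combined scale and fed to the accumulator. A plain f64 sum
// stands in for BinnedAccumulator in this sketch.
fn quantized_dot_i8_sketch(a: &[i8], b: &[i8], scale_a: f64, scale_b: f64) -> f64 {
    let mut acc: f64 = 0.0; // stand-in for BinnedAccumulator
    for (&x, &y) in a.iter().zip(b) {
        let prod = (x as i32) * (y as i32); // exact: i8 × i8 always fits in i32
        // Dequantize each exact integer product directly, with no
        // intermediate f32 rounding step.
        acc += prod as f64 * scale_a * scale_b;
    }
    acc
}

fn main() {
    let a = [10i8, -20, 30];
    let b = [3i8, 2, -1];
    // Integer products: 30, -40, -30; combined scale 0.5 * 0.25 = 0.125.
    println!("{}", quantized_dot_i8_sketch(&a, &b, 0.5, 0.25)); // -5
}
```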
§Saturation
Integer overflow is handled via saturating arithmetic: values clamp to i32::MAX / i32::MIN rather than wrapping silently.
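A minimal sketch of this saturation behavior, using hypothetical names (`saturating_mul_i8_sketch`, `saturating_dot_i8_sketch`) rather than this module's actual functions:

```rust
// A single i8 × i8 product fits in i32 (max magnitude 128 * 128 = 16384),
// so the multiply itself cannot overflow once widened; saturation matters
// when accumulating many products.
fn saturating_mul_i8_sketch(x: i8, y: i8) -> i32 {
    (x as i32) * (y as i32)
}

fn saturating_dot_i8_sketch(a: &[i8], b: &[i8]) -> i32 {
    a.iter().zip(b).fold(0i32, |acc, (&x, &y)| {
        // saturating_add clamps to i32::MAX / i32::MIN instead of wrapping.
        acc.saturating_add(saturating_mul_i8_sketch(x, y))
    })
}

fn main() {
    // 200_000 products of 16384 each would sum to ~3.28e9, past i32::MAX
    // (2147483647); the accumulator clamps there instead of wrapping.
    let a = vec![i8::MIN; 200_000];
    let b = vec![i8::MIN; 200_000];
    println!("{}", saturating_dot_i8_sketch(&a, &b)); // 2147483647
}
```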
Structs§
- QuantParamsI4 - Quantization parameters for i4 (nibble-packed) tensors.
- QuantParamsI8 - Quantization parameters for i8 tensors.
Functions§
- quantized_dot_i8 - Quantized dot product of two i8 vectors, returning f64.
- quantized_matmul_i8 - Quantized matrix multiply: C[m,n] = dequant(A[m,k]) × dequant(B[k,n]).
- quantized_sum_i4 - Sum dequantized i4 (packed) values using BinnedAccumulator.
- quantized_sum_i8 - Sum dequantized i8 values using BinnedAccumulator.
- saturating_dot_i8 - Saturating dot product of two i8 slices, accumulating into i32.
- saturating_mul_i8 - Saturating multiply of two i8 values, producing i32 without overflow.
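One detail worth illustrating is the nibble packing behind the i4 functions. The layout below (two signed 4-bit values per byte, low nibble first) is an assumption for the sketch, as is every function name; this module's actual packing may differ:

```rust
// Hypothetical i4 decoding: each byte holds two signed 4-bit values,
// low nibble first. Sign-extension is done with a shift pair.
fn unpack_i4(byte: u8) -> (i8, i8) {
    let lo = ((byte << 4) as i8) >> 4; // move low nibble to the top, then
    let hi = (byte as i8) >> 4;        // arithmetic-shift back to sign-extend
    (lo, hi)
}

// Sketch of summing dequantized i4 values; a plain f64 sum stands in
// for BinnedAccumulator, and `scale` is an assumed quantization scale.
fn quantized_sum_i4_sketch(packed: &[u8], scale: f64) -> f64 {
    packed
        .iter()
        .map(|&b| {
            let (lo, hi) = unpack_i4(b);
            (lo as i32 + hi as i32) as f64 * scale
        })
        .sum()
}

fn main() {
    // 0x7F packs lo = -1 (0xF) and hi = 7; 0x88 packs lo = -8 and hi = -8.
    let packed = [0x7Fu8, 0x88];
    println!("{}", quantized_sum_i4_sketch(&packed, 1.0)); // (-1 + 7) + (-8 + -8) = -10
}
```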