Module quantized

Quantized tensor storage (4-bit, 8-bit) for memory-efficient inference, plus quantized BLAS routines that dequantize i8/i4 products into a BinnedAccumulator.

§Design

Quantized integer products (i8×i8 → i32) are dequantized and fed directly into the BinnedAccumulator, bypassing intermediate f32 rounding entirely. This eliminates a major source of non-determinism in quantized inference.
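The idea can be illustrated with a minimal stand-in: products are accumulated exactly in integer arithmetic and dequantized once at the end, with a single multiply by the combined scale. The plain i64 accumulator below replaces the crate's BinnedAccumulator, and the `scale_a`/`scale_b` parameters are assumptions for illustration, not this module's real API:

```rust
// Sketch only: a plain i64 accumulator stands in for BinnedAccumulator,
// and per-tensor scales are assumed. i8×i8 products are widened to i64,
// so the integer accumulation is exact for any realistic vector length;
// the only rounding happens in the final dequantization multiply.
fn dequant_dot_i8(a: &[i8], b: &[i8], scale_a: f64, scale_b: f64) -> f64 {
    let acc: i64 = a
        .iter()
        .zip(b)
        .map(|(&x, &y)| (x as i64) * (y as i64))
        .sum();
    // Single dequantization step: no intermediate f32 rounding.
    acc as f64 * (scale_a * scale_b)
}
```

Because the integer sum is exact and the final multiply is a single f64 operation, the result is independent of accumulation order, which is the property the module relies on for determinism.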

§Saturation

Integer overflow is handled with saturating arithmetic: values clamp to i32::MAX / i32::MIN rather than wrapping silently.
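The clamping behavior can be sketched with Rust's built-in `saturating_add`; the i8-slice signature below mirrors what `saturating_dot_i8` presumably looks like, but is an assumption:

```rust
// Sketch of saturating i8 product accumulation. Each i8×i8 product is
// widened to i32 first (max magnitude 128 * 128 = 16384, so the multiply
// itself cannot overflow); only the running sum can saturate, clamping
// at i32::MAX / i32::MIN instead of wrapping.
fn saturating_dot_i8_sketch(a: &[i8], b: &[i8]) -> i32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| (x as i32) * (y as i32))
        .fold(0i32, |acc, p| acc.saturating_add(p))
}
```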

Structs§

QuantParamsI4
Quantization parameters for i4 (nibble-packed) tensors.
QuantParamsI8
Quantization parameters for i8 tensors.
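The fields of these structs are not shown on this page; a typical affine-quantization parameter set carries a scale and a zero point, mapping a stored integer `q` back to `scale * (q - zero_point)`. A hypothetical sketch (field names and layout are assumptions, not the real QuantParamsI8):

```rust
// Hypothetical shape of i8 quantization parameters; the actual
// QuantParamsI8 fields may differ. Affine dequantization maps a stored
// i8 value q back to the real number scale * (q - zero_point).
struct QuantParamsI8Sketch {
    scale: f64,
    zero_point: i32,
}

impl QuantParamsI8Sketch {
    fn dequant(&self, q: i8) -> f64 {
        self.scale * ((q as i32 - self.zero_point) as f64)
    }
}
```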

Functions§

quantized_dot_i8
Quantized dot product of two i8 vectors, returning f64.
quantized_matmul_i8
Quantized matrix multiply: C[m,n] = dequant(A[m,k]) × dequant(B[k,n])
quantized_sum_i4
Sum dequantized i4 (packed) values using BinnedAccumulator.
quantized_sum_i8
Sum dequantized i8 values using BinnedAccumulator.
saturating_dot_i8
Saturating dot product of two i8 slices, accumulating into i32.
saturating_mul_i8
Saturating multiply of two i8 values, producing i32 without overflow.
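For the i4 path, "nibble-packed" means each byte stores two signed 4-bit values that must be sign-extended on unpack. A sketch of how `quantized_sum_i4` might decode each byte (the helper name is illustrative, not part of this module):

```rust
// Illustrative nibble unpacking: each byte holds two signed 4-bit
// values in the range -8..=7. Sign extension is done with a shift-left
// followed by an arithmetic shift-right on i8.
fn unpack_i4(byte: u8) -> (i8, i8) {
    let lo = ((byte as i8) << 4) >> 4; // low nibble, sign-extended
    let hi = (byte as i8) >> 4;        // high nibble, sign-extended
    (lo, hi)
}
```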