Weight Quantization for Efficient Inference
Provides INT8 quantization for model weights to reduce memory usage and improve inference speed on integer-optimized hardware.
§Features
- INT8 quantization: Symmetric and asymmetric quantization
- Per-tensor quantization: Single scale/zero-point for entire tensor
- Per-channel quantization: Independent scale/zero-point per output channel
- Calibration: Automatic scale/zero-point computation from data
- Mixed precision: Selective quantization of layers
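The difference between per-tensor and per-channel granularity can be sketched as follows. This is an illustrative example, not the crate's API: the function names and the `Vec<Vec<f32>>` weight layout are assumptions made for the sketch.

```rust
// Hypothetical helpers (not the crate's API) contrasting quantization
// granularities for a 2D weight matrix, rows = output channels.

// Per-tensor: one scale covering every element of the matrix.
fn per_tensor_scale(w: &[Vec<f32>]) -> f32 {
    let max_abs = w.iter().flatten().fold(0.0f32, |m, v| m.max(v.abs()));
    max_abs / 127.0
}

// Per-channel: an independent scale per row (output channel).
fn per_channel_scales(w: &[Vec<f32>]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().fold(0.0f32, |m, v| m.max(v.abs())) / 127.0)
        .collect()
}

fn main() {
    // A channel with small weights gets a much finer scale per-channel,
    // instead of being dominated by the largest channel's range.
    let w = vec![vec![0.01f32, -0.02], vec![1.0, -1.27]];
    println!("per-tensor scale:   {}", per_tensor_scale(&w));
    println!("per-channel scales: {:?}", per_channel_scales(&w));
}
```

Per-channel granularity matters most for layers whose output channels have very different weight ranges; with a single per-tensor scale, small-magnitude channels lose most of their precision.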
§Theory
Quantization maps floating-point values to integers:
q = round(x / scale) + zero_point
x_approx = (q - zero_point) * scale

For symmetric quantization (zero_point = 0):

scale = max(|x|) / 127
q = clamp(round(x / scale), -128, 127)

§Structs
- ActivationQuantizer - Activation quantization for dynamic runtime quantization
- CalibrationStats - Statistics for quantization calibration
- QuantizationParams - Quantization parameters
- QuantizedWeight - Quantized weight tensor
§Enums

- QuantizationGranularity - Quantization granularity
- QuantizationMethod - Quantization method
§Functions

- quantize_asymmetric_1d - Quantize f32 array to INT8 using asymmetric quantization
- quantize_symmetric_1d - Quantize f32 array to INT8 using symmetric quantization
- quantize_symmetric_2d - Quantize f32 2D array to INT8 using symmetric per-tensor quantization
- quantize_symmetric_per_channel - Quantize f32 2D array to INT8 using symmetric per-channel quantization
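The symmetric formulas from the Theory section can be sketched as a standalone quantize/dequantize round trip. The function below mirrors the name `quantize_symmetric_1d` from the list above, but its exact signature in the crate is an assumption; the math follows the documented formulas directly.

```rust
// Sketch of symmetric per-tensor INT8 quantization, assuming a
// (Vec<i8>, f32) return of (quantized values, scale); the real crate
// signature may differ.
fn quantize_symmetric_1d(x: &[f32]) -> (Vec<i8>, f32) {
    // scale = max(|x|) / 127
    let max_abs = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    // q = clamp(round(x / scale), -128, 127)
    let q = x
        .iter()
        .map(|v| (v / scale).round().clamp(-128.0, 127.0) as i8)
        .collect();
    (q, scale)
}

fn dequantize_1d(q: &[i8], scale: f32) -> Vec<f32> {
    // x_approx = q * scale (zero_point = 0 for symmetric quantization)
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let x = [0.5f32, -1.0, 0.25, 1.27];
    let (q, scale) = quantize_symmetric_1d(&x);
    let x_approx = dequantize_1d(&q, scale);
    // The per-element round-trip error is bounded by scale / 2.
    for (a, b) in x.iter().zip(&x_approx) {
        assert!((a - b).abs() <= scale / 2.0 + 1e-6);
    }
    println!("q = {:?}, scale = {}", q, scale);
}
```

Note the zero-input guard: when all inputs are zero, `max(|x|)` is zero and a fallback scale avoids division by zero.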