Module quantization

Weight Quantization for Efficient Inference

Provides INT8 quantization for model weights to reduce memory usage and improve inference speed on integer-optimized hardware.

§Features

  • INT8 quantization: Symmetric and asymmetric quantization
  • Per-tensor quantization: Single scale/zero-point for entire tensor
  • Per-channel quantization: Independent scale/zero-point per output channel
  • Calibration: Automatic scale/zero-point computation from data
  • Mixed precision: Selective quantization of layers

§Theory

Quantization maps floating-point values to integers:

q = round(x / scale) + zero_point
x_approx = (q - zero_point) * scale
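As a sketch, the two mappings above translate directly into Rust. The helper names here are illustrative only, not part of this module's API:

```rust
// Quantize a single f32 value to i8 given scale and zero_point,
// then dequantize it back (illustrative helpers, not the module API).
fn quantize(x: f32, scale: f32, zero_point: i32) -> i8 {
    let q = (x / scale).round() as i32 + zero_point;
    q.clamp(-128, 127) as i8
}

fn dequantize(q: i8, scale: f32, zero_point: i32) -> f32 {
    (q as i32 - zero_point) as f32 * scale
}

fn main() {
    let (scale, zero_point) = (0.1_f32, 3);
    let x = 1.27_f32;
    let q = quantize(x, scale, zero_point);
    let x_approx = dequantize(q, scale, zero_point);
    // Round-trip error is bounded by half the quantization step (scale / 2).
    assert!((x - x_approx).abs() <= scale / 2.0);
    println!("q = {q}, x_approx = {x_approx}");
}
```

The reconstruction `x_approx` differs from `x` by at most `scale / 2`, which is why choosing a small scale (i.e., a tight value range) matters for accuracy.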

For symmetric quantization (zero_point = 0):

scale = max(|x|) / 127
q = clamp(round(x / scale), -128, 127)
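A minimal Rust sketch of this symmetric scheme over a slice, deriving the scale from the largest absolute value as in the formula above (illustrative, not the module's actual function):

```rust
// Symmetric INT8 quantization of a slice: zero_point is fixed at 0
// and the scale is derived from the largest absolute value.
fn quantize_symmetric(x: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = x.iter().fold(0.0_f32, |m, v| m.max(v.abs()));
    // Guard against an all-zero input, which would give scale = 0.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = x
        .iter()
        .map(|v| (v / scale).round().clamp(-128.0, 127.0) as i8)
        .collect();
    (q, scale)
}

fn main() {
    let x = [-3.0, 0.0, 1.0, 3.0];
    let (q, scale) = quantize_symmetric(&x);
    // max|x| = 3.0, so the extremes map to -127 and 127.
    assert_eq!(q, vec![-127, 0, 42, 127]);
    println!("scale = {scale}, q = {q:?}");
}
```

Because the zero point is fixed at 0, symmetric quantization uses only the range [-127, 127] for the data itself; the trade-off is simpler integer arithmetic at inference time.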

Structs§

ActivationQuantizer
Dynamic quantization of activations at runtime
CalibrationStats
Statistics for quantization calibration
QuantizationParams
Quantization parameters
QuantizedWeight
Quantized weight tensor

Enums§

QuantizationGranularity
Quantization granularity
QuantizationMethod
Quantization method

Functions§

quantize_asymmetric_1d
Quantize f32 array to INT8 using asymmetric quantization
quantize_symmetric_1d
Quantize f32 array to INT8 using symmetric quantization
quantize_symmetric_2d
Quantize f32 2D array to INT8 using symmetric per-tensor quantization
quantize_symmetric_per_channel
Quantize f32 2D array to INT8 using symmetric per-channel quantization
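The per-channel variant described for quantize_symmetric_per_channel computes an independent scale per output channel. A minimal sketch of the idea, assuming rows are channels (the module's actual signatures may differ):

```rust
// Per-channel symmetric quantization: each row gets its own scale,
// which preserves accuracy when channel magnitudes differ widely.
// (Illustrative sketch; not the module's actual API.)
fn quantize_per_channel(w: &[Vec<f32>]) -> (Vec<Vec<i8>>, Vec<f32>) {
    let mut scales = Vec::with_capacity(w.len());
    let mut q = Vec::with_capacity(w.len());
    for row in w {
        let max_abs = row.iter().fold(0.0_f32, |m, v| m.max(v.abs()));
        let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
        scales.push(scale);
        q.push(
            row.iter()
                .map(|v| (v / scale).round().clamp(-128.0, 127.0) as i8)
                .collect(),
        );
    }
    (q, scales)
}

fn main() {
    // Two channels with very different magnitudes: per-channel scaling
    // keeps the small channel from collapsing to a few integer levels.
    let w = vec![vec![100.0, -25.0], vec![0.01, -0.0025]];
    let (q, scales) = quantize_per_channel(&w);
    assert_eq!(q[0], vec![127, -32]);
    assert_eq!(q[1], vec![127, -32]);
    println!("scales = {scales:?}");
}
```

With a single per-tensor scale, the second channel above would quantize to values near zero and lose almost all of its information; per-channel scales give each row the full INT8 range.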