Module quantization

INT8/INT4 Quantization for inference acceleration

Provides quantization and dequantization primitives used in neural network inference to reduce memory bandwidth and leverage integer arithmetic units. Supports symmetric and asymmetric quantization schemes.
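As a rough illustration of the two schemes, the sketch below derives INT8 parameters for a symmetric mapping (zero point fixed at 0) and an asymmetric, affine mapping (zero point shifted so the observed range fills the integer range), then round-trips values through them. `SketchParams` and every function name here are hypothetical and chosen for this example only; they are not the module's `QuantParams` or its actual signatures, and the rounding and clamping choices are assumptions.

```rust
/// Hypothetical parameter struct for this sketch (not the module's `QuantParams`).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct SketchParams {
    pub scale: f32,
    pub zero_point: i32,
}

/// Symmetric scheme: zero point is 0 and values map onto [-127, 127].
pub fn symmetric_params(data: &[f32]) -> SketchParams {
    let max_abs = data.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    // Fall back to a scale of 1.0 for all-zero input to avoid dividing by zero.
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    SketchParams { scale, zero_point: 0 }
}

/// Asymmetric (affine) scheme: maps [min, max] onto [-128, 127].
pub fn asymmetric_params(data: &[f32]) -> SketchParams {
    // Folding from 0.0 keeps zero inside the range, so 0.0 stays exactly representable.
    let min = data.iter().fold(0.0f32, |m, &x| m.min(x));
    let max = data.iter().fold(0.0f32, |m, &x| m.max(x));
    let range = max - min;
    let scale = if range > 0.0 { range / 255.0 } else { 1.0 };
    let zero_point = ((-128.0 - min / scale).round() as i32).clamp(-128, 127);
    SketchParams { scale, zero_point }
}

/// q = clamp(round(x / scale) + zero_point, -128, 127)
pub fn quantize(data: &[f32], p: SketchParams) -> Vec<i8> {
    data.iter()
        .map(|&x| ((x / p.scale).round() as i32 + p.zero_point).clamp(-128, 127) as i8)
        .collect()
}

/// x ≈ (q - zero_point) * scale
pub fn dequantize(q: &[i8], p: SketchParams) -> Vec<f32> {
    q.iter()
        .map(|&v| (v as i32 - p.zero_point) as f32 * p.scale)
        .collect()
}

/// Mean squared error between two equally long slices (0.0 for empty input).
pub fn mse(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "slices must have equal length");
    if a.is_empty() {
        return 0.0;
    }
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum::<f32>() / a.len() as f32
}
```

In this sketch, values inside the calibrated range come back from `dequantize(&quantize(x, p), p)` within about half a `scale` step of the original, and `mse` gives a single-number summary of that round-trip error.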

Reference: “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference” — Jacob et al., CVPR 2018

Structs§

QuantError
Quantization error statistics.
QuantParams
Quantization parameters computed from calibration.

Enums§

QuantBits
Quantization bit width.
QuantScheme
Quantization scheme.

Functions§

dequantize_int4
Dequantize INT4 (packed) to f32.
dequantize_int8
Dequantize INT8 to f32.
quantization_error
Compute the quantization error (MSE) between the original tensor and its quantized-then-dequantized reconstruction.
quantize_int4
Quantize an f32 tensor to INT4 (packed, 2 values per byte).
quantize_int8
Quantize an f32 tensor to INT8.
quantized_gemm_int8
INT8 matrix multiply with f32 accumulation: C = A · B, where A is (m × k) as i8, B is (k × n) as i8, and C is (m × n) as i32 → f32.
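To make the shapes in the `quantized_gemm_int8` summary concrete, here is a minimal sketch of an INT8 matrix multiply. It is not the module's implementation: the function name, the row-major layout, the two per-tensor scale arguments, and the choice to accumulate in i32 before converting to f32 (one reading of "i32 → f32") are all assumptions made for illustration. It also assumes symmetric quantization, so no zero-point correction is applied.

```rust
/// Hypothetical row-major INT8 GEMM sketch: C = (A · B) * scale_a * scale_b.
/// A is m × k, B is k × n, C is m × n. Assumes zero points of 0 (symmetric scheme).
pub fn gemm_int8_sketch(
    a: &[i8],
    b: &[i8],
    m: usize,
    k: usize,
    n: usize,
    scale_a: f32,
    scale_b: f32,
) -> Vec<f32> {
    assert_eq!(a.len(), m * k, "A must hold m * k elements");
    assert_eq!(b.len(), k * n, "B must hold k * n elements");

    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            // Accumulate in i32: each i8 product has magnitude at most 16384,
            // so the sum stays in range for any realistic k.
            let mut acc: i32 = 0;
            for p in 0..k {
                acc += a[i * k + p] as i32 * b[p * n + j] as i32;
            }
            // Convert to f32 once per output element and undo both scales.
            c[i * n + j] = acc as f32 * scale_a * scale_b;
        }
    }
    c
}
```

Keeping the inner loop in integer arithmetic is the point of the technique: the f32 conversion and rescaling happen once per output element rather than once per multiply.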