INT8/INT4 Quantization for inference acceleration
Provides quantization and dequantization primitives used in neural network inference to reduce memory bandwidth and leverage integer arithmetic units. Supports symmetric and asymmetric quantization schemes.
Reference: “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference” — Jacob et al., CVPR 2018
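The difference between the symmetric and asymmetric schemes can be sketched as follows. This is a minimal illustration of the affine mapping r ≈ scale · (q − zero_point) from the referenced paper, not this module's implementation: `SketchParams`, `symmetric_params`, `asymmetric_params`, `quantize`, `dequantize`, and `round_trip_mse` are hypothetical names, and the clamping ranges ([-127, 127] for symmetric, [-128, 127] for asymmetric) are assumptions.

```rust
/// Hypothetical parameter pair for this sketch; not the module's `QuantParams`.
#[derive(Debug, Clone, Copy)]
struct SketchParams {
    scale: f32,
    zero_point: i32,
}

/// Symmetric scheme: the zero point is fixed at 0 and the scale maps
/// the largest magnitude onto 127.
fn symmetric_params(data: &[f32]) -> SketchParams {
    let max_abs = data.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    SketchParams { scale, zero_point: 0 }
}

/// Asymmetric scheme: the range [min, max] (widened to contain 0.0)
/// is mapped onto [-128, 127], so the zero point is generally non-zero.
fn asymmetric_params(data: &[f32]) -> SketchParams {
    let min = data.iter().fold(0.0f32, |m, &x| m.min(x));
    let max = data.iter().fold(0.0f32, |m, &x| m.max(x));
    let range = max - min;
    let scale = if range > 0.0 { range / 255.0 } else { 1.0 };
    let zero_point = ((-128.0 - min / scale).round() as i32).clamp(-128, 127);
    SketchParams { scale, zero_point }
}

/// q = clamp(round(x / scale) + zero_point)
fn quantize(x: f32, p: SketchParams) -> i8 {
    ((x / p.scale).round() as i32 + p.zero_point).clamp(-128, 127) as i8
}

/// x ≈ scale * (q - zero_point)
fn dequantize(q: i8, p: SketchParams) -> f32 {
    (q as i32 - p.zero_point) as f32 * p.scale
}

/// Mean squared error between the input and its quantize/dequantize round trip.
fn round_trip_mse(data: &[f32], p: SketchParams) -> f32 {
    if data.is_empty() {
        return 0.0;
    }
    let sum: f32 = data
        .iter()
        .map(|&x| {
            let d = x - dequantize(quantize(x, p), p);
            d * d
        })
        .sum();
    sum / data.len() as f32
}
```

For data that is all non-negative (for example, post-ReLU activations), the asymmetric mapping uses the full 8-bit range, whereas the symmetric mapping leaves the negative half unused. The symmetric form is cheaper at inference time because the zero-point terms vanish.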
Structs§
- QuantError - Quantization error statistics.
- QuantParams - Quantization parameters computed from calibration.
Enums§
- QuantBits - Quantization bit width.
- QuantScheme - Quantization scheme.
Functions§
- dequantize_int4 - Dequantize INT4 (packed) to f32.
- dequantize_int8 - Dequantize INT8 to f32.
- quantization_error - Compute quantization error (MSE) between original and quantized-dequantized.
- quantize_int4 - Quantize an f32 tensor to INT4 (packed, 2 values per byte).
- quantize_int8 - Quantize an f32 tensor to INT8.
- quantized_gemm_int8 - INT8 matrix multiply with f32 accumulation: C = A · B, where A is (m × k) as i8, B is (k × n) as i8, and C is (m × n) as i32 → f32.
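Two of the ideas in the list above, packing two INT4 values per byte and an INT8 matrix multiply whose result is converted to f32, can be sketched as follows. This is an illustration under stated assumptions, not the signatures or behavior of the functions listed: `pack_int4`, `unpack_int4`, and `gemm_int8_sketch` are hypothetical names; the low-nibble-first order is assumed; matrices are assumed row-major; both operands are assumed symmetrically quantized (zero point 0); and the `i32 → f32` note is read here as i32 accumulation followed by a single float rescale.

```rust
/// Pack signed 4-bit values (each expected in [-8, 7]) two per byte.
/// Assumption: the first value of each pair goes in the low nibble.
/// An odd trailing value is padded with a zero high nibble.
fn pack_int4(values: &[i8]) -> Vec<u8> {
    values
        .chunks(2)
        .map(|pair| {
            let lo = (pair[0] as u8) & 0x0F;
            let hi = (pair.get(1).copied().unwrap_or(0) as u8) & 0x0F;
            lo | (hi << 4)
        })
        .collect()
}

/// Unpack `len` signed 4-bit values, sign-extending each nibble.
/// Panics if `packed` holds fewer than `len` nibbles.
fn unpack_int4(packed: &[u8], len: usize) -> Vec<i8> {
    (0..len)
        .map(|i| {
            let byte = packed[i / 2];
            let nibble = if i % 2 == 0 { byte & 0x0F } else { byte >> 4 };
            // Move the nibble to the top of the byte, then shift back
            // arithmetically so bit 3 becomes the sign.
            ((nibble << 4) as i8) >> 4
        })
        .collect()
}

/// Row-major C = A · B for i8 inputs. Products are summed in i32, and each
/// sum is rescaled to f32 once using the two (assumed symmetric) scales.
fn gemm_int8_sketch(
    a: &[i8],
    b: &[i8],
    m: usize,
    k: usize,
    n: usize,
    scale_a: f32,
    scale_b: f32,
) -> Vec<f32> {
    assert_eq!(a.len(), m * k, "A must be m x k");
    assert_eq!(b.len(), k * n, "B must be k x n");
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc: i32 = 0;
            for p in 0..k {
                acc += (a[i * k + p] as i32) * (b[p * n + j] as i32);
            }
            c[i * n + j] = acc as f32 * scale_a * scale_b;
        }
    }
    c
}
```

Accumulating in i32 keeps the inner loop in integer arithmetic: each i8 × i8 product has magnitude at most 16384, so the sum cannot overflow i32 for any k below roughly 131,000.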