# quantize-rs
Neural network quantization toolkit for ONNX models, written in Rust with Python bindings.
quantize-rs converts float32 ONNX models to INT8 or INT4 representation using post-training quantization. It supports weight-only quantization, activation-based calibration, per-channel quantization, and outputs standard ONNX QDQ (DequantizeLinear) graphs compatible with ONNX Runtime.
## Features
- INT8 and INT4 quantization -- per-tensor or per-channel
- Activation-based calibration -- runs real inference on calibration data via tract to determine optimal quantization ranges
- Multiple calibration methods -- MinMax, Percentile (99.9th), Entropy (KL divergence), MSE
- ONNX QDQ output format -- quantized models use `DequantizeLinear` nodes and load directly in ONNX Runtime
- Graph connectivity validation -- verifies that every node input resolves after quantization
- Per-layer selection -- exclude layers by name, set per-layer bit widths, or skip small tensors via `min_elements`
- CLI -- single-model quantization, batch processing, validation, benchmarking, config-file driven workflows
- Python bindings -- via PyO3; install with `pip install quantization-rs`
- Typed error handling -- `QuantizeError` enum at all public API boundaries (no more string-parsing `anyhow` errors)
- Rust library -- usable as a crate dependency; all public items have doc comments
- Property-based tests -- 15 proptest cases covering quantization round-trips, error bounds, and bit-packing
- Criterion benchmarks -- throughput and per-channel comparison benchmarks in `benches/`
## Installation
### Python

Build from source (requires Rust toolchain):

```bash
pip install maturin
maturin develop --release
```
### Rust CLI

```bash
cargo install quantize-rs
```
### As a library dependency

```toml
[dependencies]
quantize-rs = "0.6"
```
## Quick start
### Python

The Python API covers weight-based INT8 quantization, activation-based calibration (better accuracy), and model metadata inspection. See the Python API documentation for the full reference.
### CLI

```bash
# INT8 quantization
quantize-rs quantize model.onnx -o model_int8.onnx

# INT4 with per-channel quantization
quantize-rs quantize model.onnx --bits 4 --per-channel

# Activation-based calibration
quantize-rs calibrate model.onnx --data calibration.npy

# Validate a quantized model (structure, connectivity, numerical sanity)
quantize-rs validate model.onnx model_int8.onnx

# Compare original vs quantized
quantize-rs benchmark model.onnx model_int8.onnx

# Batch processing
quantize-rs batch models/*.onnx -o quantized/

# Config-file driven workflow
quantize-rs config quantize.yaml
```
Rust library
use ;
use QdqWeightInput;
## CLI reference
### quantize

```
quantize-rs quantize <MODEL> [OPTIONS]

Options:
  -o, --output <FILE>   Output path [default: model_quantized.onnx]
  -b, --bits <4|8>      Bit width [default: 8]
  --per-channel         Per-channel quantization
  --exclude <LAYER>     Exclude a layer by name (repeatable)
  --min-elements <N>    Skip tensors with fewer than N elements
```
### calibrate

```
quantize-rs calibrate <MODEL> --data <DATA> [OPTIONS]

Options:
  --data <FILE>         Calibration data (.npy)
  -o, --output <FILE>   Output path [default: model_calibrated.onnx]
  -b, --bits <4|8>      Bit width [default: 8]
  --per-channel         Per-channel quantization
  --method <METHOD>     minmax | percentile | entropy | mse [default: percentile]
```
### batch

```
quantize-rs batch <MODELS>... -o <DIR> [OPTIONS]

Options:
  -o, --output <DIR>    Output directory (required)
  -b, --bits <4|8>      Bit width [default: 8]
  --per-channel         Per-channel quantization
  --skip-existing       Skip models that already have output files
  --continue-on-error   Do not abort on individual model failures
```
### validate

```
quantize-rs validate <ORIGINAL> <QUANTIZED> [--detailed]
```

Checks structure preservation, graph connectivity, weight shapes, and numerical sanity (all-zero detection, constant-value detection). With `--detailed`, prints per-layer error analysis.
### benchmark

```
quantize-rs benchmark <ORIGINAL> <QUANTIZED>
```

Compares node counts, weight counts, file sizes, and compression ratios.
### info

```
quantize-rs info <MODEL>
```

Prints model name, opset version, node count, inputs, and outputs.
### config

```
quantize-rs config <CONFIG_FILE> [--dry-run]
```

Runs quantization from a YAML or TOML configuration file. Example:

```yaml
bits: 8
per_channel: true
models:
  - input: models/resnet18.onnx
    output: quantized/resnet18_int8.onnx
  - input: models/mobilenet.onnx
    output: quantized/mobilenet_int8.onnx
batch:
  input_dir: "models/*.onnx"
  output_dir: quantized/
  skip_existing: true
```
## How it works
### Quantization
Each float32 weight tensor is mapped to a fixed-point integer representation:

```
scale       = (max - min) / (qmax - qmin)
zero_point  = round(qmin - min / scale)
quantized   = clamp(round(value / scale) + zero_point, qmin, qmax)
dequantized = (quantized - zero_point) * scale
```
For INT8, the quantized range is [-128, 127]. For INT4, it is [-8, 7]. INT4 values are bit-packed (two values per byte) in memory for 8x compression, but stored as INT8 in ONNX files (DequantizeLinear requires INT8 input in opsets < 21).
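The mapping and the INT4 bit-packing can be sketched in pure Python (an illustrative sketch only; the crate implements this in Rust):

```python
def quantize_tensor(values, bits=8):
    """Asymmetric affine quantization of a float list to `bits`-wide integers."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1   # [-128, 127] or [-8, 7]
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (qmax - qmin) or 1.0                # guard all-equal tensors
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]

def pack_int4(q):
    """Pack two signed 4-bit values per byte (low nibble first) for 8x compression."""
    padded = q + [0] * (len(q) % 2)
    return bytes((a & 0xF) | ((b & 0xF) << 4)
                 for a, b in zip(padded[::2], padded[1::2]))

weights = [-1.0, -0.5, 0.0, 0.5, 1.0]
q, scale, zp = quantize_tensor(weights, bits=8)
# per-element round-trip error is bounded by scale / 2
assert all(abs(w - d) <= scale / 2 + 1e-9
           for w, d in zip(weights, dequantize(q, scale, zp)))
```

The round-trip error bound of half a quantization step is what the crate's property-based tests exercise.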
### Per-channel quantization
Computes separate scale and zero_point for each output channel (axis 0). This is particularly effective when different channels have vastly different weight distributions, which is common in convolutional layers.
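A sketch of the per-channel parameter computation, assuming a 2-D weight laid out as one row per output channel (illustrative, not the crate's code):

```python
def per_channel_params(weight_rows, bits=8):
    """One (scale, zero_point) pair per output channel (axis 0)."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    params = []
    for row in weight_rows:                      # each row = one output channel
        lo, hi = min(row), max(row)
        scale = (hi - lo) / (qmax - qmin) or 1.0
        params.append((scale, round(qmin - lo / scale)))
    return params

# A narrow channel and a wide channel no longer share one coarse scale:
w = [[-0.01, 0.0, 0.01],    # channel 0: narrow range -> fine-grained scale
     [-10.0, 0.0, 10.0]]    # channel 1: wide range -> coarse scale
params = per_channel_params(w)
```

With a single per-tensor scale, channel 0 above would collapse to at most a few distinct quantized values; per-channel scales preserve its resolution.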
### Activation-based calibration
Instead of deriving quantization ranges from weight values alone, calibration runs forward passes on representative input samples and records the actual activation distributions at each layer. The observed ranges produce tighter quantization parameters. Four methods are available:
| Method | Strategy |
|---|---|
| MinMax | Use observed min/max directly |
| Percentile | Clip at 99.9th percentile to reduce outlier sensitivity |
| Entropy | Select the range that minimizes KL divergence between original and quantized distributions |
| MSE | Select the range that minimizes mean squared error |
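The percentile strategy, for example, can be sketched as a nearest-rank clip on observed activation magnitudes (a simplified stand-in for the crate's implementation):

```python
def percentile_range(activations, pct=99.9):
    """Symmetric calibration range clipped at the pct-th percentile of |activation|.

    Nearest-rank percentile over recorded activation magnitudes; large outliers
    beyond the percentile are excluded from the quantization range.
    """
    mags = sorted(abs(a) for a in activations)
    k = min(len(mags) - 1, int(len(mags) * pct / 100))
    bound = mags[k]
    return -bound, bound

# 1000 well-behaved activations plus one extreme outlier:
acts = [0.001 * i for i in range(1000)] + [50.0]
lo, hi = percentile_range(acts, pct=99.9)
# the outlier is clipped; the range stays near [-1, 1] instead of [-50, 50]
```

Clipping trades a little saturation error on rare outliers for much finer resolution over the bulk of the distribution.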
### Output format

Quantized models use the ONNX QDQ pattern. For each quantized weight, the original float32 initializer is replaced with:

- `{name}_quantized` -- INT8 tensor (same shape)
- `{name}_scale` -- float32 scalar
- `{name}_zp` -- INT8 scalar
- A `DequantizeLinear` node whose output is the original tensor name

Because the `DequantizeLinear` output carries the original name, all downstream nodes (Conv, MatMul, etc.) remain unchanged. The graph loads and runs in ONNX Runtime without modification.
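The replacement scheme can be illustrated as a toy graph rewrite (a dict-based sketch; the real code manipulates ONNX protobufs):

```python
def qdq_replace(initializers, weight_name):
    """Replace a float32 initializer with QDQ tensors plus a DequantizeLinear node.

    Toy model of the naming scheme described above: downstream nodes keep
    referring to `weight_name`, which the new node now produces.
    """
    new_inits = dict(initializers)
    del new_inits[weight_name]                         # drop the float32 weight
    new_inits[f"{weight_name}_quantized"] = "int8 tensor"
    new_inits[f"{weight_name}_scale"] = "float32 scalar"
    new_inits[f"{weight_name}_zp"] = "int8 scalar"
    dq_node = {
        "op": "DequantizeLinear",
        "inputs": [f"{weight_name}_quantized",
                   f"{weight_name}_scale",
                   f"{weight_name}_zp"],
        "output": weight_name,   # original name: no downstream rewiring needed
    }
    return new_inits, dq_node
```

Graph connectivity validation then amounts to checking that every node input is either a graph input, an initializer, or some node's output.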
## ONNX Runtime integration

Quantized models load and run with `onnxruntime.InferenceSession` like any other ONNX model; no special execution provider or session configuration is required.
## Testing

```bash
# Rust tests (90 passing: 63 unit + 12 integration + 15 property-based)
cargo test

# With output
cargo test -- --nocapture

# Integration tests requiring model files on disk
cargo test -- --ignored

# Criterion benchmarks
cargo bench

# Python tests (requires `maturin develop`)
maturin develop
pytest
```
## Known limitations

- ONNX input only. PyTorch and TensorFlow models must be exported to ONNX first.
- Per-channel `DequantizeLinear` writes 1-D scale/zero_point tensors with the `axis` attribute. ONNX Runtime supports this in opset >= 13.
- INT4 values are stored as INT8 bytes in the ONNX file. True 4-bit packing requires opset 21 or a custom operator.
- Activation calibration uses tract for inference; tract may not support all ONNX ops found in large production models.
## Contributing

- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure `cargo test` and `cargo clippy` pass
- Submit a pull request