# quantize-rs
Fast neural network quantization for ONNX models — now with Python support
quantize-rs compresses neural networks by 4-8× while maintaining accuracy. Convert float32 ONNX models to INT8/INT4 with activation-based calibration for optimal quality.
## What's New in v0.3.0

- Python bindings - Use from Python with `pip install quantize-rs`
- Activation-based calibration - Real inference for 3× better accuracy vs weight-only
- ONNX Runtime compatibility - Quantized models load and run in ONNX Runtime
- DequantizeLinear pattern - Standard ONNX QDQ format for broad compatibility
## Features
- INT8 & INT4 quantization - 4× to 8× compression
- Activation-based calibration - Real inference optimization (3× better accuracy)
- Per-channel quantization - 40-60% error reduction vs per-tensor
- Multiple calibration methods - MinMax, Percentile, Entropy, MSE
- ONNX Runtime compatible - Works out of the box with standard tooling
- Python + Rust - Use from Python or as a Rust library
- Complete CLI - Batch processing, validation, benchmarking
- Config files - YAML/TOML support for automation
## Quick Start

### Python
The argument names below are assumptions inferred from the options documented later in this README; treat them as a sketch, not exact signatures.

```python
import quantize_rs  # module name assumed to mirror the package name

# Basic quantization
quantize_rs.quantize("model.onnx", "model_int8.onnx", bits=8)

# With activation-based calibration (best accuracy)
quantize_rs.quantize_with_calibration("model.onnx", "model_int8.onnx",
                                      calibration_data="calib/", method="entropy")

# Get model info
info = quantize_rs.model_info("model.onnx")
```
See the Python documentation for the full API reference.
### Rust CLI

The flags below are assumptions reconstructed from the subcommand names; consult the CLI help for the exact syntax.

```bash
# INT8 quantization (4× compression)
quantize-rs quantize model.onnx -o model_int8.onnx --bits 8

# INT4 quantization (8× compression)
quantize-rs quantize model.onnx -o model_int4.onnx --bits 4

# Activation-based calibration
quantize-rs calibrate model.onnx -o model_int8.onnx --data calib/

# Validate quantized model
quantize-rs validate model_int8.onnx

# Benchmark
quantize-rs benchmark model.onnx model_int8.onnx
```
## Results

### Compression (MNIST CNN)
| Method | Size | Compression | MSE Error |
|---|---|---|---|
| Float32 | 26 KB | 1.0× | - |
| INT8 | 10 KB | 2.6× | 0.000002 |
| INT4 | 6 KB | 4.3× | 0.000124 |
### Accuracy (ResNet-18 on ImageNet)
| Method | Top-1 Accuracy | Accuracy Drop |
|---|---|---|
| Float32 | 69.76% | - |
| INT8 (weight-only) | 69.52% | -0.24% |
| INT8 (calibrated) | 69.68% | -0.08% |
Activation-based calibration reduces the accuracy drop by 3× compared with weight-only quantization (0.08% vs 0.24%).
## ONNX Runtime Integration
Quantized models load and run in ONNX Runtime without modifications:
```python
import numpy as np
import onnxruntime as ort

# Load quantized model
session = ort.InferenceSession("model_int8.onnx")

# Run inference (same API as float32)
input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # input shape is illustrative
outputs = session.run(None, {input_name: x})
```
Performance: 2-3× faster on CPU, 3-5× on mobile/edge devices.
## Python API

### quantize()
Basic weight-based quantization.
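A call sketch; the signature isn't shown in this README, so the arguments (paths, `bits`, `per_channel`) are assumptions based on the CLI and config options:

```python
import quantize_rs  # module name assumed

# Hypothetical signature: input path, output path, keyword options
quantize_rs.quantize("model.onnx", "model_int8.onnx", bits=8, per_channel=True)
```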
### quantize_with_calibration()
Activation-based calibration for optimal accuracy.
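Again a sketch with assumed argument names; `method` takes one of the calibration methods listed below:

```python
import quantize_rs  # module name assumed

# Hypothetical signature: the calibration data source and method are assumptions
quantize_rs.quantize_with_calibration("model.onnx", "model_int8.onnx",
                                      calibration_data="calib/", method="entropy")
```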
Calibration Methods:

- `minmax`: Uses observed min/max (fast, good baseline)
- `percentile`: Clips at the 99.9th percentile (reduces outlier impact)
- `entropy`: Minimizes KL divergence (best for CNNs)
- `mse`: Minimizes mean squared error (best for Transformers)
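To see why percentile clipping matters, here is a small self-contained NumPy comparison (not the library's code): a handful of outliers inflates the minmax range, which would waste most of the INT8 grid.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, 10_000)  # synthetic activations
acts[:10] = 50.0                     # a few extreme outliers

# MinMax follows the outliers; percentile clipping ignores them
lo, hi = acts.min(), acts.max()
p_lo, p_hi = np.percentile(acts, [0.1, 99.9])

print(f"minmax range:     [{lo:.2f}, {hi:.2f}]")      # roughly [-3.8, 50.0]
print(f"percentile range: [{p_lo:.2f}, {p_hi:.2f}]")  # roughly [-3.1, 3.1]
```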
### model_info()
Get model metadata.
```python
import quantize_rs  # module name assumed

# Illustrative call; the exact signature is assumed from the function name
info = quantize_rs.model_info("model.onnx")
print(info)
```
## CLI Reference

### quantize
Basic quantization command.
Options include the quantization bit width (8 or 4); see the CLI help for the full flag listing.
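Example (flags are assumptions):

```bash
quantize-rs quantize model.onnx -o model_int8.onnx --bits 8
```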
### calibrate

Activation-based calibration. Takes the same bit-width option as `quantize` plus a calibration data source; see the CLI help for the full flag listing.
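Example (flags are assumptions):

```bash
quantize-rs calibrate model.onnx -o model_int8.onnx --data calib/ --method entropy
```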
### batch

Process multiple models in one invocation. Takes the same bit-width option as `quantize`; see the CLI help for the full flag listing.
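Example (flags are assumptions):

```bash
quantize-rs batch "models/*.onnx" -o quantized/ --bits 8
```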
### validate
Verify quantized model structure.
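Example (syntax assumed):

```bash
quantize-rs validate model_int8.onnx
```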
### benchmark
Compare original vs quantized.
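Example (syntax assumed):

```bash
quantize-rs benchmark model.onnx model_int8.onnx
```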
### config

Run from a configuration file.

Example config (YAML):
```yaml
bits: 8
per_channel: true

models:
  - input: models/resnet18.onnx
    output: quantized/resnet18_int8.onnx
  - input: models/mobilenet.onnx
    output: quantized/mobilenet_int8.onnx

batch:
  input_dir: "models/*.onnx"
  output_dir: quantized/
  skip_existing: true
```
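To run it (invocation assumed; the file name is arbitrary):

```bash
quantize-rs config quantize.yaml
```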
## How It Works

### Quantization Formula

```text
scale       = (max - min) / (2^bits - 1)
quantized   = round(value / scale) + zero_point
dequantized = (quantized - zero_point) * scale
```
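A worked example of the formula in plain Python (just the arithmetic; no clipping or dtype handling):

```python
import numpy as np

def quantize(values, bits=8):
    vmin, vmax = values.min(), values.max()
    scale = (vmax - vmin) / (2**bits - 1)
    zero_point = round(-vmin / scale)  # maps vmin to 0
    return np.round(values / scale).astype(np.int64) + zero_point, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale

w = np.array([-0.50, -0.12, 0.03, 0.75])
q, scale, zp = quantize(w)      # q = [0, 78, 108, 255], scale ~ 0.0049
w_hat = dequantize(q, scale, zp)
print(np.abs(w - w_hat).max())  # round-trip error is at most scale / 2
```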
### Per-Channel Quantization

A separate `scale` and `zero_point` is calculated for each output channel (sketched after this list):
- 40-60% lower error on convolutional layers
- Essential for INT4 quality
- Handles varied weight distributions across channels
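A NumPy sketch of the idea (symmetric variant for brevity; this is not the library's internal code):

```python
import numpy as np

def per_channel_scales(weights, bits=8):
    # weights: (out_channels, in_channels, kH, kW); one scale per output channel
    flat = weights.reshape(weights.shape[0], -1)
    max_abs = np.abs(flat).max(axis=1)
    return max_abs / (2 ** (bits - 1) - 1)  # e.g. /127 for INT8

w = np.random.randn(64, 32, 3, 3).astype(np.float32)
w[0] *= 10.0  # channels often have very different ranges
print(per_channel_scales(w)[:3])  # channel 0's scale is ~10x the others'
```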
### Activation-Based Calibration

Instead of using the weight min/max, calibration runs real inference on calibration data (a sketch follows the steps below):
1. Load calibration samples (e.g., 100 images from the validation set)
2. Run a forward pass through the model
3. Capture the actual activation values at each layer
4. Use the observed min/max for the quantization ranges
Result: 3× better accuracy retention vs weight-only quantization.
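A minimal sketch of the loop with ONNX Runtime, tracking only the model's final output for brevity; capturing every layer's activations requires exposing intermediate tensors as graph outputs, which the library handles internally:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name

running_min, running_max = np.inf, -np.inf
for _ in range(100):  # e.g. 100 calibration samples
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # stand-in for real data
    (out,) = session.run(None, {input_name: x})             # assumes a single output
    running_min = min(running_min, float(out.min()))
    running_max = max(running_max, float(out.max()))

# These observed bounds replace the weight min/max in the formula above
print(running_min, running_max)
```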
### DequantizeLinear Pattern

Quantized models use the ONNX QDQ (Quantize-Dequantize) pattern:

```text
Float32 Weight → [Quantized INT8] → DequantizeLinear → Float32 → Conv/MatMul
```
This is the standard ONNX quantization format supported by:
- ONNX Runtime
- TensorFlow Lite
- TensorRT
- OpenVINO
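For illustration, here is the weight-side pattern built by hand with the `onnx` Python package (not this library's internals): an INT8 initializer plus scale and zero point feeding a DequantizeLinear node, whose float32 output would feed Conv/MatMul.

```python
import numpy as np
from onnx import helper, numpy_helper

# These three tensors would be registered as graph initializers
w_q = numpy_helper.from_array(np.array([[12, -34], [56, -78]], dtype=np.int8), name="W_q")
scale = numpy_helper.from_array(np.array(0.02, dtype=np.float32), name="W_scale")
zp = numpy_helper.from_array(np.array(0, dtype=np.int8), name="W_zp")

# DequantizeLinear(x, x_scale, x_zero_point) -> float32 weight for Conv/MatMul
dq = helper.make_node("DequantizeLinear", ["W_q", "W_scale", "W_zp"], ["W_f32"])
print(dq)
```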
## Rust Library Usage

The same functionality is available from Rust as a library.
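A sketch of what that could look like; the item names below are assumptions, not the crate's confirmed API:

```rust
// Hypothetical API: these names are assumptions; consult the crate docs instead.
use quantize_rs::{quantize_model, QuantizationConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = QuantizationConfig { bits: 8, per_channel: true };
    quantize_model("model.onnx", "model_int8.onnx", &config)?;
    Ok(())
}
```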
## Installation

### Python

Install from PyPI (`pip install quantize-rs`), or build from source:
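Given the PyO3 bindings, a source build presumably goes through maturin (an assumption about this project's build setup):

```bash
pip install maturin
maturin develop --release  # compile the Rust extension into the active environment
```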
### Rust

Install the CLI with `cargo install quantize-rs`, or add the crate to `Cargo.toml`:

```toml
[dependencies]
quantize-rs = "0.3"
```
## Testing

### Python Tests
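The exact test command isn't shown here; assuming a standard pytest layout:

```bash
pytest tests/
```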
### Rust Tests
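Matching the contributing checklist below:

```bash
cargo test
cargo clippy
```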
Test coverage: 60+ tests covering quantization, calibration, ONNX I/O, and Python bindings.
## Benchmarks

### Compression Ratios
| Model | Original | INT8 | INT4 |
|---|---|---|---|
| ResNet-18 | 44.7 MB | 11.2 MB (4.0×) | 5.6 MB (8.0×) |
| MobileNetV2 | 13.6 MB | 3.5 MB (3.9×) | 1.8 MB (7.6×) |
| BERT-Base | 438 MB | 110 MB (4.0×) | 55 MB (8.0×) |
### Accuracy Impact (ImageNet)
| Model | Float32 | INT8 (calibrated) | Drop |
|---|---|---|---|
| ResNet-18 | 69.76% | 69.68% | -0.08% |
| ResNet-50 | 76.13% | 76.02% | -0.11% |
| MobileNetV2 | 71.88% | 71.61% | -0.27% |
### Speed (CPU Inference)
| Model | Float32 | INT8 | Speedup |
|---|---|---|---|
| ResNet-18 | 45ms | 16ms | 2.8× |
| MobileNetV2 | 12ms | 4ms | 3.0× |
## Future Features
- Mixed precision (INT8 + INT4 hybrid)
- Dynamic quantization (runtime quantization)
- Quantization-aware training (QAT) integration
- Model optimization passes (fusion, pruning)
- More export formats (TFLite, CoreML)
- GPU acceleration for calibration
- WebAssembly support
## Contributing

Contributions welcome! Areas where we need help:
- Testing - More model architectures and edge cases
- Documentation - Tutorials, guides, examples
- Performance - Optimization and profiling
- Features - Dynamic quantization, mixed precision
Process:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Add tests for new features
4. Ensure `cargo test` and `cargo clippy` pass
5. Submit a pull request
## Resources

### Papers
- Quantization and Training of Neural Networks - Google's INT8 quantization
- A White Paper on Neural Network Quantization - Comprehensive survey
### Tools
- ONNX Runtime - Cross-platform inference
- Netron - Visualize ONNX models
- PyTorch Quantization - PyTorch's built-in quantization toolkit
## License
MIT License - see LICENSE for details.
## Acknowledgments
- Built with tract for ONNX inference
- PyO3 for Python bindings
- onnx-rs for ONNX parsing
- Thanks to the Rust ML community
## Contact
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- PyPI: pypi.org/project/quantize-rs
- Crates.io: crates.io/crates/quantize-rs