quantize-rs
Production-grade neural network quantization toolkit in pure Rust
quantize-rs reduces neural network size by up to 8x while preserving accuracy. It converts float32 weights to INT8 or INT4 using per-channel quantization, a statistical calibration framework, and a custom packed storage format.
Features
- INT8 & INT4 quantization - 4x to 8x compression
- Per-channel quantization - 40-60% error reduction vs per-tensor
- Calibration framework - Statistical optimization (MinMax, Percentile, Entropy, MSE)
- Custom packed storage - True 8x compression for INT4
- Fast - Pure Rust, no Python dependency
- Complete CLI - Batch processing, validation, benchmarking
- ONNX format - Works with PyTorch, TensorFlow, etc.
- Config files - YAML/TOML support for automation
Quick Start
Installation
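If the crate is published on crates.io under the project name (an assumption; verify the exact crate name on crates.io):

```bash
cargo install quantize-rs
```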
Or build from source:
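A typical from-source build; the repository URL is inferred from the author handle and should be treated as a placeholder:

```bash
git clone https://github.com/AR-Kamal/quantize-rs
cd quantize-rs
cargo build --release
```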
Basic Usage
# INT8 quantization (4x compression)
# INT4 quantization (8x compression)
# Per-channel for better quality
# Calibration-based quantization
# Validate quantized model
# Compare performance
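As a sketch of what the commands above might look like — the binary name and every flag are assumptions, not taken from the actual CLI; check `--help` for the real interface:

```bash
# All flag names below are hypothetical.
quantize-rs quantize model.onnx -o model_int8.onnx            # INT8 (4x)
quantize-rs quantize model.onnx -o model_int4.onnx --bits 4   # INT4 (8x)
quantize-rs quantize model.onnx -o model_pc.onnx --per-channel
quantize-rs calibrate model.onnx -o model_cal.onnx --data calib/
quantize-rs validate model.onnx model_int8.onnx
quantize-rs benchmark model.onnx model_int8.onnx
```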
Results
ResNet-18 Compression
| Method | Size | Compression | Avg MSE | Notes |
|---|---|---|---|---|
| Original | 44.65 MB | 1.0x | - | Float32 |
| INT8 | 11.18 MB | 4.0x | 0.000003 | Standard |
| INT8 Per-Channel | 11.18 MB | 4.0x | 0.000002 | 33% better |
| INT4 | 5.60 MB | 8.0x | 0.000907 | High compression |
| INT4 Per-Channel | 5.60 MB | 8.0x | 0.000862 | 5% better |
Real file sizes achieved with custom packed storage format.
Documentation
Commands
quantize - Quantize a model
Examples:
# Basic INT8
# INT4 with per-channel (best compression + quality)
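A hedged sketch of these two invocations (flag names are assumptions):

```bash
quantize-rs quantize resnet18.onnx -o resnet18_int8.onnx                 # basic INT8
quantize-rs quantize resnet18.onnx -o resnet18_int4.onnx --bits 4 --per-channel
```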
calibrate - Calibration-based quantization
Examples:
# Calibrate with sample data
# Use random data for testing
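Illustrative invocations only; the `--method`, `--data`, and `--random` flags are assumptions based on the feature list:

```bash
quantize-rs calibrate model.onnx -o model_cal.onnx --method percentile --data calib/
quantize-rs calibrate model.onnx -o model_cal.onnx --method entropy --random
```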
batch - Process multiple models
Example:
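A sketch under the same caveat (flag names assumed):

```bash
quantize-rs batch "models/*.onnx" --output-dir quantized/ --bits 4 --per-channel
```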
validate - Verify quantized model
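A hypothetical invocation (argument order assumed):

```bash
quantize-rs validate resnet18.onnx quantized/resnet18_int8.onnx
```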
Example output:
Structure Validation
------------------------------------------------------------
Node count matches: 69
Input count matches: 9
Output count matches: 1
Weight Validation
------------------------------------------------------------
Weight tensor count matches: 102
All weight shapes match
No numerical issues detected
Size Analysis
------------------------------------------------------------
Original: 44.65 MB
Quantized: 11.18 MB
Reduction: 75.0% (4.00x smaller)
VALIDATION PASSED
benchmark - Compare models
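Again a sketch, with the argument shape assumed:

```bash
quantize-rs benchmark resnet18.onnx quantized/resnet18_int8.onnx
```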
config - Run from config file
Example config (YAML):
```yaml
bits: 4
per_channel: true

models:
  - input: models/resnet18.onnx
    output: quantized/resnet18_int4.onnx
  - input: models/mobilenet.onnx
    output: quantized/mobilenet_int4.onnx

batch:
  input_dir: "models/*.onnx"
  output_dir: quantized/
  skip_existing: true
```
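Running from a config file might then look like this (the subcommand name comes from the list above; the file name is illustrative):

```bash
quantize-rs config quantize.yaml
```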
How It Works
Quantization Methods
Per-Tensor Quantization
Uses global min/max for entire tensor:
scale = (max - min) / 255
quantized = round(value / scale) + zero_point
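The two formulas above can be sketched in a few lines of Rust. This is a minimal self-contained illustration of asymmetric per-tensor quantization, not the crate's actual implementation:

```rust
/// One (scale, zero_point) pair for the whole tensor, from its global min/max.
/// 255 is the number of INT8 steps (it would be 15 for INT4).
fn quantize_params(min: f32, max: f32) -> (f32, i32) {
    let scale = (max - min) / 255.0;
    let zero_point = (-min / scale).round() as i32;
    (scale, zero_point)
}

/// quantized = round(value / scale) + zero_point, clamped to the u8 range.
fn quantize(value: f32, scale: f32, zero_point: i32) -> u8 {
    ((value / scale).round() as i32 + zero_point).clamp(0, 255) as u8
}

fn dequantize(q: u8, scale: f32, zero_point: i32) -> f32 {
    (q as i32 - zero_point) as f32 * scale
}

fn main() {
    let (scale, zp) = quantize_params(-1.0, 3.0);
    let q = quantize(1.0, scale, zp);
    let back = dequantize(q, scale, zp);
    println!("1.0 -> {q} -> {back}");
    // Round-trip error is bounded by one quantization step.
    assert!((back - 1.0).abs() < scale);
}
```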
Per-Channel Quantization
Calculates separate scale/zero-point for each output channel:
- 40-60% lower error on Conv layers
- Essential for INT4 quality
- Handles varied weight distributions
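The idea can be shown with a toy sketch (not the crate's code): each output channel gets its own (scale, zero_point) pair, so a narrow-range channel keeps a fine grid even when another channel has a much wider range:

```rust
/// One (scale, zero_point) pair per output channel (here: per row).
fn channel_params(weights: &[Vec<f32>]) -> Vec<(f32, i32)> {
    weights
        .iter()
        .map(|ch| {
            let min = ch.iter().cloned().fold(f32::INFINITY, f32::min);
            let max = ch.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let scale = (max - min) / 255.0;
            let zp = (-min / scale).round() as i32;
            (scale, zp)
        })
        .collect()
}

fn main() {
    // Two channels with very different ranges: a shared per-tensor scale
    // would force the narrow channel onto the wide channel's coarse grid.
    let w = vec![vec![-0.01f32, 0.02], vec![-2.0, 3.0]];
    let params = channel_params(&w);
    // The narrow channel gets a much finer scale.
    assert!(params[0].0 < params[1].0);
    println!("{params:?}");
}
```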
INT4 Bit Packing
Stores 2 INT4 values per byte:
Byte: [AAAA BBBB]
↑ ↑
val1 val2
True 8x compression with custom storage format.
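The packing scheme in the diagram is a standard nibble pack; a minimal sketch (again, illustrative rather than the crate's storage code):

```rust
/// Pack two 4-bit values into one byte: first value in the high nibble,
/// second in the low nibble, matching the [AAAA BBBB] diagram above.
fn pack_int4(a: u8, b: u8) -> u8 {
    debug_assert!(a < 16 && b < 16);
    (a << 4) | (b & 0x0F)
}

fn unpack_int4(byte: u8) -> (u8, u8) {
    (byte >> 4, byte & 0x0F)
}

fn main() {
    let byte = pack_int4(0b1010, 0b0011);
    assert_eq!(byte, 0b1010_0011);
    assert_eq!(unpack_int4(byte), (0b1010, 0b0011));
    println!("packed: {byte:#010b}");
}
```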
Calibration
Calibration optimizes quantization ranges for better accuracy:
- MinMax: Uses global min/max (baseline)
- Percentile: Clips outliers at specified percentile (default: 99.9%)
- Entropy: Minimizes KL divergence between original and quantized distributions
- MSE: Minimizes mean squared error
Calibration improves model accuracy without changing file size.
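To make the percentile method concrete, here is a toy sketch (not the crate's implementation) of clipping the symmetric range at a percentile of the absolute values, so a single outlier no longer stretches the quantization grid the way MinMax would:

```rust
/// Symmetric percentile calibration: take the given percentile of |value|
/// as the range bound, clipping outliers beyond it.
fn percentile_range(values: &[f32], pct: f32) -> (f32, f32) {
    let mut sorted: Vec<f32> = values.iter().map(|v| v.abs()).collect();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let idx = ((pct / 100.0) * (sorted.len() - 1) as f32).round() as usize;
    let bound = sorted[idx];
    (-bound, bound)
}

fn main() {
    // 999 values spread over [-1, 1], plus one extreme outlier.
    let mut vals: Vec<f32> = (0..999).map(|i| i as f32 / 999.0 * 2.0 - 1.0).collect();
    vals.push(100.0);
    let (lo, hi) = percentile_range(&vals, 99.9);
    // MinMax would report (-1.0, 100.0); the 99.9th percentile clips the outlier.
    assert!(hi < 2.0 && lo > -2.0);
    println!("calibrated range: ({lo:.3}, {hi:.3})");
}
```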
Library Usage
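A sketch of what library usage might look like. Every item name here (`quantize_rs`, `Quantizer`, `QuantConfig`, the method names) is an assumption, not taken from the crate's real API:

```rust
// Hypothetical API -- names are illustrative, not the crate's actual exports.
use quantize_rs::{QuantConfig, Quantizer};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let config = QuantConfig { bits: 8, per_channel: true, ..Default::default() };
    let model = Quantizer::new(config).quantize_file("model.onnx")?;
    model.save("model_int8.onnx")?;
    Ok(())
}
```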
Testing
Test Coverage
Real Model Tests
# Test on real models (requires ONNX files)
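Standard Cargo invocations; the test target name is an assumption:

```bash
cargo test                      # unit tests
cargo test --test real_models   # assumed target name; requires ONNX files on disk
```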
Tested on:
- ResNet-18 (44.65 MB to 5.60 MB)
- MNIST CNN (26 KB to 5.6 KB)
- MobileNet (13.4 MB to 3.4 MB)
Future Features
- Activation-based calibration (v2.0)
- Mixed precision (INT8 + INT4)
- Dynamic quantization (runtime)
- Quantization-aware training (QAT) support
- Model optimization passes (fusion, pruning)
- WebAssembly support
- Python bindings
- More export formats (TFLite, CoreML)
Contributing
Contributions are welcome! Areas we'd love help with:
- Testing - More model formats and edge cases
- Calibration - Activation-based methods, better data loading
- Documentation - Tutorials, guides, use cases
- Performance - Optimization and benchmarking
- More quantization methods - Dynamic, mixed-precision
Process:
- Fork the repository
- Create a feature branch (`git checkout -b feature/<name>`)
- Add tests for new features
- Ensure `cargo test` and `cargo clippy` pass
- Submit a pull request
Resources
Papers & References
- Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference - Google's INT8 quantization
- LUT-NN: Empower Efficient Neural Network Inference with Centroid Learning and Table Lookup - INT4 techniques
- A White Paper on Neural Network Quantization - Comprehensive overview
Tools & Frameworks
- ONNX - Open Neural Network Exchange
- TensorFlow Lite - Mobile quantization
- PyTorch Quantization - PyTorch approach
License
MIT License - see LICENSE for details.
Acknowledgments
- Built with onnx-rs for ONNX parsing
- Inspired by TensorFlow Lite and PyTorch quantization
- Thanks to the Rust ML community for feedback and support
Contact
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Author: @AR-Kamal