
quantize-rs

Production-grade neural network quantization toolkit in pure Rust


quantize-rs reduces neural network size by up to 8x while preserving accuracy. It converts float32 weights to INT8/INT4 using per-channel quantization, a calibration framework, and a custom packed storage format.


Features

  • INT8 & INT4 quantization - 4x to 8x compression
  • Per-channel quantization - 40-60% error reduction vs per-tensor
  • Calibration framework - Statistical optimization (MinMax, Percentile, Entropy, MSE)
  • Custom packed storage - True 8x compression for INT4
  • Fast - Pure Rust, no Python dependency
  • Complete CLI - Batch processing, validation, benchmarking
  • ONNX format - Works with models exported from PyTorch, TensorFlow, etc.
  • Config files - YAML/TOML support for automation

Quick Start

Installation

cargo install quantize-rs

Or build from source:

git clone https://github.com/AR-Kamal/quantize-rs

cd quantize-rs

cargo build --release

Basic Usage

# INT8 quantization (4x compression)

quantize-rs quantize model.onnx -o model_int8.onnx


# INT4 quantization (8x compression)

quantize-rs quantize model.onnx -o model_int4.onnx --bits 4


# Per-channel for better quality

quantize-rs quantize model.onnx -o model_int8.onnx --per-channel


# Calibration-based quantization

quantize-rs calibrate model.onnx --data calib.npy -o model_calibrated.onnx --bits 4 --method percentile


# Validate quantized model

quantize-rs validate model.onnx model_int8.onnx


# Compare performance

quantize-rs benchmark model.onnx model_int8.onnx


Results

ResNet-18 Compression

Method            Size      Compression  Avg MSE   Notes
Original          44.65 MB  1.0x         -         Float32
INT8              11.18 MB  4.0x         0.000003  Standard
INT8 Per-Channel  11.18 MB  4.0x         0.000002  33% better
INT4              5.60 MB   8.0x         0.000907  High compression
INT4 Per-Channel  5.60 MB   8.0x         0.000862  5% better

Real file sizes achieved with custom packed storage format.


Documentation

Commands

quantize - Quantize a model

quantize-rs quantize <MODEL> [OPTIONS]


Options:

  -o, --output <FILE>     Output path [default: model_quantized.onnx]

  -b, --bits <8|4>        Quantization bits [default: 8]

      --per-channel       Use per-channel quantization (better quality)
  -h, --help              Print help

Examples:

# Basic INT8

quantize-rs quantize resnet18.onnx -o resnet18_int8.onnx


# INT4 with per-channel (best compression + quality)

quantize-rs quantize resnet18.onnx -o resnet18_int4.onnx --bits 4 --per-channel


calibrate - Calibration-based quantization

quantize-rs calibrate <MODEL> --data <DATA> [OPTIONS]


Options:

      --data <FILE>       Calibration data (NPY file or 'random')
  -o, --output <FILE>     Output path [default: model_calibrated.onnx]

  -b, --bits <8|4>        Quantization bits [default: 8]

      --per-channel       Use per-channel quantization

      --method <METHOD>   Calibration method [default: percentile]

                          (minmax, percentile, entropy, mse)

  -h, --help              Print help

Examples:

# Calibrate with sample data

quantize-rs calibrate model.onnx --data calibration.npy -o model_cal.onnx --bits 4 --method percentile


# Use random data for testing

quantize-rs calibrate model.onnx --data random -o model_cal.onnx --bits 8 --method entropy


batch - Process multiple models

quantize-rs batch <MODELS>... --output <DIR> [OPTIONS]


Options:

  -o, --output <DIR>      Output directory (required)
  -b, --bits <8|4>        Quantization bits [default: 8]

      --per-channel       Use per-channel quantization

      --skip-existing     Skip already quantized models

      --continue-on-error Continue if some models fail

Example:

quantize-rs batch models/*.onnx --output quantized/ --bits 4 --per-channel


validate - Verify quantized model

quantize-rs validate <ORIGINAL> <QUANTIZED> [OPTIONS]


Options:

      --detailed          Show per-layer analysis

Example output:

Structure Validation
------------------------------------------------------------
  Node count matches: 69
  Input count matches: 9
  Output count matches: 1

Weight Validation
------------------------------------------------------------
  Weight tensor count matches: 102
  All weight shapes match
  No numerical issues detected

Size Analysis
------------------------------------------------------------
Original:  44.65 MB
Quantized: 11.18 MB
Reduction: 75.0% (4.00x smaller)

VALIDATION PASSED

benchmark - Compare models

quantize-rs benchmark <ORIGINAL> <QUANTIZED>

config - Run from config file

quantize-rs config <CONFIG_FILE> [--dry-run]

Example config (YAML):

bits: 4
per_channel: true

models:
  - input: models/resnet18.onnx
    output: quantized/resnet18_int4.onnx
  
  - input: models/mobilenet.onnx
    output: quantized/mobilenet_int4.onnx

batch:
  input_dir: "models/*.onnx"
  output_dir: quantized/
  skip_existing: true

How It Works

Quantization Methods

Per-Tensor Quantization

Uses a single scale and zero point derived from the tensor's global min/max:

scale = (max - min) / 255
zero_point = round(-min / scale)
quantized = round(value / scale) + zero_point
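The per-tensor scheme above can be sketched in a few lines of Rust. This is a minimal illustration of the math, not the crate's actual API: it maps floats onto the 0..=255 range using one global scale/zero-point pair, plus the inverse mapping.

```rust
// Sketch of per-tensor asymmetric quantization (illustrative, not the
// crate's real implementation).
fn quantize_per_tensor(values: &[f32]) -> (Vec<u8>, f32, i32) {
    let min = values.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = values.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let scale = (max - min) / 255.0;
    let zero_point = (-min / scale).round() as i32;
    let q = values
        .iter()
        .map(|v| ((v / scale).round() as i32 + zero_point).clamp(0, 255) as u8)
        .collect();
    (q, scale, zero_point)
}

// Dequantization recovers approximate float values from the stored bytes.
fn dequantize(q: &[u8], scale: f32, zero_point: i32) -> Vec<f32> {
    q.iter()
        .map(|&x| (x as i32 - zero_point) as f32 * scale)
        .collect()
}

fn main() {
    let w = [-1.0f32, 0.0, 0.5, 1.0];
    let (q, scale, zp) = quantize_per_tensor(&w);
    let back = dequantize(&q, scale, zp);
    println!("quantized: {:?}, scale={}, zero_point={}", q, scale, zp);
    println!("roundtrip: {:?}", back);
}
```

Note that the roundtrip error is bounded by roughly half a quantization step (scale / 2), which is why a tensor with a wide value range quantizes more coarsely.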

Per-Channel Quantization

Calculates separate scale/zero-point for each output channel:

  • 40-60% lower error on Conv layers
  • Essential for INT4 quality
  • Handles varied weight distributions
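To see why per-channel helps, consider a sketch (again illustrative, not the crate's API) that computes a symmetric INT8 scale per output channel for a weight tensor laid out as [out_channels, elements_per_channel]:

```rust
// Hypothetical helper: one symmetric INT8 scale per output channel.
fn per_channel_scales(weights: &[f32], out_channels: usize) -> Vec<f32> {
    let per_ch = weights.len() / out_channels;
    (0..out_channels)
        .map(|c| {
            let chan = &weights[c * per_ch..(c + 1) * per_ch];
            let max_abs = chan.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            max_abs / 127.0 // symmetric INT8 range [-127, 127]
        })
        .collect()
}

fn main() {
    // Two channels with very different magnitudes: a single per-tensor
    // scale would force the small channel onto the large channel's
    // coarse grid, destroying its precision.
    let w = [0.01f32, -0.02, 0.015, 5.0, -4.0, 3.0];
    let scales = per_channel_scales(&w, 2);
    println!("per-channel scales: {:?}", scales);
}
```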

INT4 Bit Packing

Stores 2 INT4 values per byte:

Byte: [AAAA BBBB]
      ↑    ↑
      val1 val2

True 8x compression with custom storage format.
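The [AAAA BBBB] layout can be sketched as follows. This assumes high-nibble-first packing; the crate's actual storage format may differ in nibble order and padding:

```rust
// Pack two 4-bit values per byte, first value in the high nibble.
fn pack_int4(values: &[u8]) -> Vec<u8> {
    values
        .chunks(2)
        .map(|pair| {
            let hi = pair[0] & 0x0F;
            // An odd trailing value is padded with 0 in the low nibble.
            let lo = pair.get(1).copied().unwrap_or(0) & 0x0F;
            (hi << 4) | lo
        })
        .collect()
}

// Unpack back to one value per element; `len` drops any padding nibble.
fn unpack_int4(packed: &[u8], len: usize) -> Vec<u8> {
    packed
        .iter()
        .flat_map(|b| [b >> 4, b & 0x0F])
        .take(len)
        .collect()
}

fn main() {
    let vals = [3u8, 12, 7, 0, 15];
    let packed = pack_int4(&vals);
    println!("{} values packed into {} bytes", vals.len(), packed.len());
    assert_eq!(unpack_int4(&packed, vals.len()), vals.to_vec());
}
```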

Calibration

Calibration optimizes quantization ranges for better accuracy:

  • MinMax: Uses global min/max (baseline)
  • Percentile: Clips outliers at specified percentile (default: 99.9%)
  • Entropy: Minimizes KL divergence between original and quantized distributions
  • MSE: Minimizes mean squared error

Calibration improves model accuracy without changing file size.
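As a rough idea of how percentile calibration clips outliers, here is an assumed-behavior sketch (not the crate's exact implementation): sort the observed values and take the range at the requested percentile instead of the raw min/max.

```rust
// Illustrative percentile-based range selection for calibration.
fn percentile_range(values: &[f32], pct: f32) -> (f32, f32) {
    let mut sorted: Vec<f32> = values.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let n = sorted.len();
    let lo_idx = (((1.0 - pct) * n as f32) as usize).min(n - 1);
    let hi_idx = ((pct * n as f32) as usize).min(n - 1);
    (sorted[lo_idx], sorted[hi_idx])
}

fn main() {
    // 1000 well-behaved values in [0, 1) plus one extreme outlier.
    let mut v: Vec<f32> = (0..1000).map(|i| i as f32 / 1000.0).collect();
    v.push(1000.0);
    let (lo, hi) = percentile_range(&v, 0.999);
    // The clipped range ignores the outlier, so the 255 (or 15) INT
    // levels are spent on the values that actually matter.
    println!("clip range: [{lo}, {hi}]");
}
```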


Library Usage

use quantize_rs::{OnnxModel, Quantizer, QuantConfig};

fn main() -> anyhow::Result<()> {
    // Load model
    let mut model = OnnxModel::load("model.onnx")?;
    let weights = model.extract_weights();
    
    // Configure quantization
    let config = QuantConfig {
        bits: 4,                // INT4 for 8x compression
        per_channel: true,      // Better quality
        calibration_method: None,
    };
    let quantizer = Quantizer::new(config);
    
    // Quantize each weight
    let mut quantized_data = Vec::new();
    for weight in &weights {
        let quantized = quantizer.quantize_tensor(
            &weight.data, 
            weight.shape.clone()
        )?;
        
        let (scale, zero_point) = quantized.get_scale_zero_point();
        let bits = quantized.bits();
        
        quantized_data.push((
            weight.name.clone(),
            quantized.data(),
            scale,
            zero_point,
            bits,
        ));
    }
    
    // Save quantized model
    model.save_quantized(&quantized_data, "model_int4.onnx")?;
    
    Ok(())
}

Testing

Test Coverage

cargo test                    # Run all tests (30+ tests)

cargo test --lib              # Unit tests only

cargo test -- --nocapture     # Show output

Real Model Tests

# Test on real models (requires ONNX files)

cargo test test_int4_real_model -- --ignored --nocapture

Tested on:

  • ResNet-18 (44.65 MB to 5.60 MB)
  • MNIST CNN (26 KB to 5.6 KB)
  • MobileNet (13.4 MB to 3.4 MB)

Future Features

  • Activation-based calibration (v2.0)
  • Mixed precision (INT8 + INT4)
  • Dynamic quantization (runtime)
  • Quantization-aware training (QAT) support
  • Model optimization passes (fusion, pruning)
  • WebAssembly support
  • Python bindings
  • More export formats (TFLite, CoreML)

Contributing

Contributions are welcome! Areas we'd love help with:

  • Testing - More model formats and edge cases
  • Calibration - Activation-based methods, better data loading
  • Documentation - Tutorials, guides, use cases
  • Performance - Optimization and benchmarking
  • More quantization methods - Dynamic, mixed-precision

Process:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/my-feature)
  3. Add tests for new features
  4. Ensure cargo test and cargo clippy pass
  5. Submit a pull request


License

MIT License - see LICENSE for details.


Acknowledgments

  • Built with onnx-rs for ONNX parsing
  • Inspired by TensorFlow Lite and PyTorch quantization
  • Thanks to the Rust ML community for feedback and support

Contact