# torsh-quantization
Quantization toolkit for ToRSh, enabling efficient model deployment with reduced precision.
## Overview

This crate provides comprehensive quantization support for deep learning models:

- **Post-Training Quantization**: Quantize trained models without retraining
- **Quantization-Aware Training**: Train models with simulated (fake) quantization
- **Dynamic Quantization**: Runtime quantization of activations for specific operations
- **Backends**: Support for multiple quantization backends (FBGEMM, QNNPACK)
- **Formats**: INT8, INT4, and custom quantization schemes
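
The INT8/INT4 formats above all reduce to an affine mapping between real values and integers. As a minimal, dependency-free sketch of that mapping (plain Rust, not this crate's API; `qparams`, `quantize`, and `dequantize` are illustrative names):

```rust
// Affine (asymmetric) INT8 quantization:
//   q = clamp(round(x / scale) + zero_point, -128, 127)
//   x ≈ (q - zero_point) * scale
fn qparams(min: f32, max: f32) -> (f32, i32) {
    // Widen the range so that 0.0 is exactly representable.
    let (min, max) = (min.min(0.0), max.max(0.0));
    let scale = (max - min) / 255.0;
    let zero_point = (-128.0 - min / scale).round() as i32;
    (scale, zero_point)
}

fn quantize(x: f32, scale: f32, zp: i32) -> i8 {
    ((x / scale).round() as i32 + zp).clamp(-128, 127) as i8
}

fn dequantize(q: i8, scale: f32, zp: i32) -> f32 {
    (q as i32 - zp) as f32 * scale
}

fn main() {
    let (scale, zp) = qparams(-1.0, 3.0);
    for x in [-1.0f32, 0.0, 1.5, 3.0] {
        let q = quantize(x, scale, zp);
        println!("{x:>5} -> {q:>4} -> {:.4}", dequantize(q, scale, zp));
    }
}
```

The round-trip error is bounded by roughly half a quantization step, which is why calibrating `min`/`max` on representative data matters.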
## Usage

### Post-Training Quantization
```rust
use torsh_quantization::prelude::*;

// Static quantization (requires calibration)
let model = load_model("model.pt")?;

// Prepare model for calibration
let prepared = prepare_static(&model, QConfig::default())?;

// Calibrate with representative data
for batch in calibration_loader {
    prepared.forward(&batch)?;
}

// Convert to quantized model
let quantized = convert(prepared)?;

// Dynamic quantization (no calibration needed)
let dynamic_quantized = quantize_dynamic(&model)?;
```
### Quantization-Aware Training (QAT)
```rust
use torsh_quantization::prelude::*;

// Prepare model for QAT
let model = create_model();
let mut qat_model = prepare_qat(&model, QConfig::default())?;

// Train with fake quantization
for epoch in 0..num_epochs {
    // ... standard training loop on qat_model ...
}

// Convert to actual quantized model
let quantized = convert(qat_model)?;
```
### Custom Quantization Configuration
```rust
use torsh_quantization::prelude::*;

// Per-layer configuration
let qconfig_dict = QConfigDict::new()
    .set_global(QConfig::default())
    .set_module_name("classifier", QConfig::dynamic())
    .build();

let quantized = quantize_fx(&model, &qconfig_dict)?;
```
### Quantization Schemes
```rust
// Symmetric vs asymmetric quantization
let symmetric_qconfig = QConfig::new()
    .activation(MinMaxObserver::symmetric())
    .weight(MinMaxObserver::symmetric());

let asymmetric_qconfig = QConfig::new()
    .activation(MinMaxObserver::affine())
    .weight(MinMaxObserver::affine());

// Custom bit widths
let int4_qconfig = QConfig::new()
    .activation(FakeQuantize::with_bits(4))
    .weight(FakeQuantize::with_bits(4));

// Mixed precision: INT8 globally, INT4 for selected modules
let mixed_qconfig = QConfigDict::new()
    .set_global(symmetric_qconfig)
    .set_module_name("classifier", int4_qconfig);
```
### Model Analysis
```rust
use torsh_quantization::analysis::*;

// Compare quantized vs original
let comparison = compare_models(&model, &quantized, &test_loader)?;
println!("Accuracy drop:  {:.2}%", comparison.accuracy_drop);
println!("Size reduction: {:.1}x", comparison.size_reduction);
println!("Speedup:        {:.1}x", comparison.speedup);

// Sensitivity analysis
let sensitivity = sensitivity_analysis(&model, &calibration_loader)?;

// Find layers sensitive to quantization
for (layer, error) in sensitivity {
    if error > threshold {
        println!("{layer}: error = {error:.4}");
    }
}
```
### Export and Deployment
```rust
// Export for mobile (QNNPACK backend)
let mobile_model = optimize_for_mobile(&quantized)?;
mobile_model.save("model_mobile.pt")?;

// Export to ONNX with quantization
let onnx_model = export_quantized_onnx(&quantized, "model.onnx")?;

// TensorRT export
let trt_model = export_tensorrt(&quantized)?;
```
### Debugging and Visualization
```rust
use torsh_quantization::debug::*;

// Visualize quantization ranges
let observer_dict = get_observer_dict(&prepared)?;
for (name, observer) in observer_dict {
    println!("{name}: [{:.4}, {:.4}]", observer.min_val(), observer.max_val());
}

// Debug quantization errors
let debugger = QuantizationDebugger::new(&model, &quantized);
let layer_errors = debugger.calculate_layer_errors(&test_loader)?;
debugger.plot_error_heatmap("quant_errors.png")?;
```
### Advanced Features
```rust
// Learnable quantization parameters
let learnable_fake_quant = LearnableFakeQuantize::new();

// Stochastic quantization
let stochastic_quant = StochasticFakeQuantize::new();

// Channel-wise quantization for Conv/Linear
let per_channel_qconfig = QConfig::new()
    .weight(PerChannelMinMaxObserver::new(/* axis */ 0));

// Group-wise quantization
let group_wise_qconfig = GroupWiseQConfig::new(/* group_size */ 128);
```
### Quantization Backends
```rust
// FBGEMM (x86 optimized)
let fbgemm_config = QConfig::default()
    .backend(Backend::Fbgemm);

// QNNPACK (mobile optimized)
let qnnpack_config = QConfig::default()
    .backend(Backend::Qnnpack);

// Custom backend
let custom_backend = CustomBackend::new("my_backend")
    .supported_ops(&["linear", "conv2d"])
    .kernel_library("libmy_kernels.so");
```
## Supported Operations
- **Linear layers**: Linear, Bilinear
- **Convolutional**: Conv1d, Conv2d, Conv3d, ConvTranspose
- **Recurrent**: LSTM, GRU (dynamic quantization)
- **Activations**: ReLU, ReLU6, Hardswish, ELU
- **Pooling**: MaxPool, AvgPool, AdaptiveAvgPool
- **Normalization**: BatchNorm (fused with Conv/Linear)
## Best Practices
- Use representative calibration data
- Start with INT8 before trying lower bit widths
- Use per-channel quantization for Conv/Linear layers
- Keep sensitive layers in higher precision
- Profile on target hardware
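
The per-channel recommendation above can be made concrete with a small experiment: when one channel's weight range dwarfs another's, a single per-tensor scale wastes the small channel's resolution. A self-contained sketch (plain Rust, illustrative names, not this crate's API):

```rust
// Symmetric INT8: the scale maps the largest magnitude onto 127.
fn sym_scale(w: &[f32]) -> f32 {
    w.iter().fold(0.0f32, |m, &v| m.max(v.abs())) / 127.0
}

// Worst-case round-trip error of quantizing `w` with a given scale.
fn max_quant_err(w: &[f32], scale: f32) -> f32 {
    w.iter()
        .map(|&v| {
            let q = (v / scale).round().clamp(-127.0, 127.0);
            (v - q * scale).abs()
        })
        .fold(0.0, f32::max)
}

fn main() {
    let ch_small = [0.01f32, -0.02, 0.015]; // small-range channel
    let ch_large = [1.0f32, -2.0, 1.5];     // large-range channel

    let per_tensor = sym_scale(&[ch_small.as_slice(), ch_large.as_slice()].concat());
    let per_channel = sym_scale(&ch_small);

    // The shared scale is dictated by the large channel, so the small
    // channel collapses onto a handful of quantization levels.
    println!("per-tensor  err: {:.6}", max_quant_err(&ch_small, per_tensor));
    println!("per-channel err: {:.6}", max_quant_err(&ch_small, per_channel));
}
```

The same reasoning explains the other practices: calibration data determines the observed ranges, and sensitivity analysis identifies which layers pay the largest error for a given scale.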
## License
Licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT license (LICENSE-MIT)
at your option.