# axonml-quant

## Overview
axonml-quant provides model quantization support for reducing model size and improving inference performance. It supports multiple quantization formats including 8-bit, 4-bit, and half-precision floating point, with calibration methods for determining optimal quantization parameters.
## Features
- **Multiple Quantization Formats**: Supports Q8_0 (8-bit), Q4_0/Q4_1 (4-bit), Q5_0/Q5_1 (5-bit), F16 (half-precision), and F32 (full precision)
- **Block Quantization**: Per-block scale factors for improved accuracy, with a 32-element block size
- **Calibration Methods**: MinMax, Percentile, Entropy, and MeanStd calibration for optimal quantization parameters
- **Parallel Processing**: Uses Rayon for parallel quantization and dequantization operations
- **Compression Statistics**: Tracks compression ratios and quantization error metrics (RMSE, max error, mean error)
- **Model Quantization**: Batch quantization of named tensor collections for full model compression
- **Round-trip Support**: Full dequantization support to restore tensors to floating point
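The per-block scheme described above (32-element blocks, each with its own scale factor) can be sketched in a few lines of plain Rust. This is an illustrative stand-in, not the crate's actual `Q8Block` implementation:

```rust
// Minimal sketch of Q8_0-style block quantization (illustrative only;
// the real axonml-quant block layout may differ).
const BLOCK_SIZE: usize = 32;

/// One quantized block: a per-block scale and 32 signed 8-bit values.
struct Q8Block {
    scale: f32,
    values: [i8; BLOCK_SIZE],
}

/// Quantize one block: scale = max|x| / 127, then round each element.
fn quantize_block(block: &[f32; BLOCK_SIZE]) -> Q8Block {
    let max_abs = block.iter().fold(0.0_f32, |m, &x| m.max(x.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let mut values = [0i8; BLOCK_SIZE];
    for (v, &x) in values.iter_mut().zip(block.iter()) {
        *v = (x / scale).round().clamp(-127.0, 127.0) as i8;
    }
    Q8Block { scale, values }
}

/// Dequantize: multiply each stored value by the block scale.
fn dequantize_block(q: &Q8Block) -> [f32; BLOCK_SIZE] {
    let mut out = [0.0_f32; BLOCK_SIZE];
    for (o, &v) in out.iter_mut().zip(q.values.iter()) {
        *o = v as f32 * q.scale;
    }
    out
}

fn main() {
    let mut block = [0.0_f32; BLOCK_SIZE];
    for (i, x) in block.iter_mut().enumerate() {
        *x = (i as f32 - 16.0) * 0.1;
    }
    let q = quantize_block(&block);
    let restored = dequantize_block(&q);
    let max_err = block
        .iter()
        .zip(restored.iter())
        .map(|(a, b)| (a - b).abs())
        .fold(0.0_f32, f32::max);
    println!("max round-trip error: {max_err}");
    // Rounding error is bounded by half the step size (the block scale).
    assert!(max_err <= q.scale * 0.5 + 1e-6);
}
```

Because each block gets its own scale, a single large value only degrades precision within its own 32-element block rather than across the whole tensor.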
## Modules
| Module | Description |
|---|---|
| `types` | Quantization type definitions, block structures (`Q8Block`, `Q4Block`, `Q4_1Block`), and `QuantizedTensor` |
| `quantize` | Functions for quantizing tensors to various formats with parallel processing |
| `dequantize` | Functions for converting quantized tensors back to floating point |
| `calibration` | Calibration data collection and methods for optimal quantization parameters |
| `error` | Error types and `Result` alias for quantization operations |
## Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
axonml-quant = "0.1.0"
```
### Basic Quantization

```rust
use axonml_quant::{quantize_tensor, dequantize_tensor, QuantType};
use axonml_core::Tensor;

// Create a tensor
let tensor = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;

// Quantize to 8-bit
let quantized = quantize_tensor(&tensor, QuantType::Q8_0)?;

// Check compression ratio
println!("Compression: {:.2}x", quantized.compression_ratio());

// Dequantize back to f32
let restored = dequantize_tensor(&quantized)?;
```
### Model Quantization

```rust
use axonml_quant::{quantize_model, QuantType};

// Quantize multiple named tensors
let tensors = vec![
    ("layer.0.weight".to_string(), weight0),
    ("layer.1.weight".to_string(), weight1),
];
let quantized_model = quantize_model(&tensors, QuantType::Q4_0)?;
```
### Calibration

```rust
use axonml_quant::{calibrate, CalibrationMethod};

// Calibrate using percentile method (99.9%)
let calib_data = calibrate(&samples, CalibrationMethod::Percentile(99.9))?;

// Get optimal scale for quantization
let scale = calib_data.symmetric_scale;

// Or use asymmetric quantization
let (scale, zero_point) = calib_data.asymmetric_scale;
```
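Percentile calibration clips outliers by choosing the value below which (say) 99.9% of the absolute activations fall, rather than the raw maximum. A stand-alone sketch of the idea, independent of the axonml-quant API (the function name here is hypothetical):

```rust
// Sketch of percentile-based symmetric scale selection (illustrative;
// not the actual axonml-quant calibration code).
/// Returns the symmetric quantization scale for signed 8-bit values,
/// using the given percentile of absolute sample values as the clip range.
fn percentile_scale(samples: &[f32], percentile: f32) -> f32 {
    let mut abs: Vec<f32> = samples.iter().map(|x| x.abs()).collect();
    abs.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Index of the chosen percentile (e.g. 99.9 discards the top 0.1%).
    let idx = ((percentile / 100.0) * (abs.len() - 1) as f32).round() as usize;
    let clip = abs[idx].max(f32::EPSILON);
    clip / 127.0 // map [-clip, clip] onto the i8 range [-127, 127]
}

fn main() {
    // 1000 well-behaved samples plus one extreme outlier.
    let mut samples: Vec<f32> = (0..1000).map(|i| (i as f32) / 1000.0).collect();
    samples.push(1_000.0);
    let minmax = samples.iter().fold(0.0_f32, |m, &x| m.max(x.abs())) / 127.0;
    let p999 = percentile_scale(&samples, 99.9);
    // The percentile scale ignores the outlier; min/max does not.
    println!("minmax scale = {minmax}, p99.9 scale = {p999}");
    assert!(p999 < minmax / 100.0);
}
```

This is why percentile calibration usually yields smaller quantization steps (and better accuracy) than MinMax on data with rare extreme values.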
### Quantization Error Analysis

```rust
use axonml_quant::compute_quantization_stats;

let stats = compute_quantization_stats(&original, &quantized);
println!("Compression ratio: {:.2}x", stats.compression_ratio);
println!("RMSE: {}", stats.rmse);
println!("Max error: {}", stats.max_error);
println!("Mean error: {}", stats.mean_error);
```
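The three error metrics are computed between the original tensor and its quantize-then-dequantize round trip. A self-contained sketch of the arithmetic (a hypothetical helper, not the crate's implementation):

```rust
// Sketch of the RMSE / max error / mean error metrics (illustrative).
/// Computes (rmse, max_error, mean_error) between an original tensor
/// and its quantize -> dequantize round trip.
fn quantization_errors(original: &[f32], restored: &[f32]) -> (f32, f32, f32) {
    assert_eq!(original.len(), restored.len());
    let n = original.len() as f32;
    let mut sum_sq = 0.0_f32;
    let mut max_err = 0.0_f32;
    let mut sum_abs = 0.0_f32;
    for (&a, &b) in original.iter().zip(restored.iter()) {
        let e = (a - b).abs();
        sum_sq += e * e;       // for RMSE
        max_err = max_err.max(e); // worst single element
        sum_abs += e;          // for mean absolute error
    }
    ((sum_sq / n).sqrt(), max_err, sum_abs / n)
}

fn main() {
    let original = [1.0_f32, 2.0, 3.0, 4.0];
    let restored = [1.1_f32, 1.9, 3.0, 4.0];
    let (rmse, max_err, mean_err) = quantization_errors(&original, &restored);
    println!("rmse={rmse} max={max_err} mean={mean_err}");
}
```

RMSE penalizes large individual errors more than the mean error does, while max error flags the single worst element, which is useful for catching outlier-sensitive layers.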
## Quantization Types
| Type | Bits | Block Size | Compression | Use Case |
|---|---|---|---|---|
| Q8_0 | 8 | 32 | 4x | High accuracy, moderate compression |
| Q4_0 | 4 | 32 | 8x | Good balance of size and accuracy |
| Q4_1 | 4 | 32 | ~6x | Better accuracy with min/max tracking |
| Q5_0 | 5 | 32 | ~6x | Middle ground between Q4 and Q8 |
| F16 | 16 | 1 | 2x | Minimal accuracy loss |
| F32 | 32 | 1 | 1x | No compression (reference) |
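The ratios in the table follow from block-layout arithmetic: a 32-element f32 block occupies 128 bytes, and each format replaces it with packed low-bit values plus per-block parameters. A quick sanity check, assuming an f16 scale (and, for Q4_1, an f16 minimum) per block; this is one common layout, and the crate's exact struct sizes may differ:

```rust
// Back-of-the-envelope compression ratios for 32-element blocks,
// assuming f16 per-block parameters (illustrative layout).
fn ratio(quantized_bytes_per_block: f32) -> f32 {
    let f32_bytes = 32.0 * 4.0; // 128 bytes of f32 per block
    f32_bytes / quantized_bytes_per_block
}

fn main() {
    // Q8_0: 32 x 1 byte + 2-byte scale = 34 bytes   -> ~3.8x (listed as 4x)
    println!("Q8_0: {:.1}x", ratio(32.0 + 2.0));
    // Q4_0: 32 x 0.5 byte + 2-byte scale = 18 bytes -> ~7.1x (listed as 8x)
    println!("Q4_0: {:.1}x", ratio(16.0 + 2.0));
    // Q4_1: 16 bytes + scale + min = 20 bytes       -> ~6.4x (~6x)
    println!("Q4_1: {:.1}x", ratio(16.0 + 2.0 + 2.0));
    // Q5_0: 32 x 5 bits = 20 bytes + 2-byte scale   -> ~5.8x (~6x)
    println!("Q5_0: {:.1}x", ratio(20.0 + 2.0));
}
```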
## Tests

Run the test suite:

```sh
cargo test -p axonml-quant
```
## License
Licensed under either of:
- MIT License
- Apache License, Version 2.0
at your option.