# axonml-quant

## Overview
`axonml-quant` provides model quantization for AxonML. It covers the GGUF-family
block formats (Q8_0, Q4_0, Q4_1, Q5_0, Q5_1, F16, F32) plus Microsoft's BitNet
b1.58 I2_S ternary format (GGUF dtype 36, 128-weight blocks, group-strided
2-bit layout). It exposes a `QuantizedLinear` drop-in layer, a `QuantizedModel`
wrapper with custom AXQT serialization, calibration methods (MinMax,
Percentile, Entropy/KL-divergence, MeanStd), and error analysis.
## Features
- GGUF-family formats: Q8_0 (8-bit), Q4_0 / Q4_1 (4-bit), Q5_0 / Q5_1 (5-bit), F16, F32 — all with 32-element block size except F16/F32
- BitNet b1.58 I2_S: 128-weight ternary blocks in Microsoft's group-strided 2-bit layout, verified against the microsoft/BitNet reference (2026-04-14)
  - `matmul_i2s` — fused add-only f32-activation matmul (ternary: +act / −act / skip)
  - `matmul_i2s_i8` — int8-activation fused path with runtime AVX-VNNI dispatch + scalar fallback (plus a `matmul_i2s_i8_avxvnni` feature-gated stub, currently delegating to scalar)
  - `quantize_row_to_int8` — per-row absmax int8 activation quantizer
- Calibration methods: `MinMax`, `Percentile` (p×10), `Entropy` (TensorRT-style KL divergence), `MeanStd` (k×10)
- Parallel processing: Rayon-parallel block quantization/dequantization and per-column matmul
- Error analysis: `compute_quantization_stats` returns RMSE, max error, mean error, compression ratio
- Inference layer: `QuantizedLinear::forward_f32` supports batch matmul with Q8/Q4/Q4_1/Q5/Q5_1/F16/F32 weights; `forward_var` integrates with `axonml-autograd` (forward-only, no grad)
- Model-level API: `QuantizedModel::from_module` quantizes all `axonml_nn::Module` parameters; `load_into_module` dequantizes back for inference
- Serialization: custom AXQT binary format (magic `AXQT`, version 1) via `serialize_quantized` / `deserialize_quantized`
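All of the block formats above share one mechanic: a short run of weights stores a single scale plus narrow integer codes. As a rough, self-contained sketch of that mechanic (not the crate's API; `quantize_q8_block` and `dequantize_q8_block` are illustrative names), here is a Q8_0-style 32-element block roundtrip where `scale = absmax / 127`:

```rust
// Illustrative Q8_0-style block quantization: 32 weights share one f32 scale,
// each weight is stored as a signed 8-bit code. Not the crate's actual types.
fn quantize_q8_block(block: &[f32; 32]) -> (f32, [i8; 32]) {
    let absmax = block.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if absmax > 0.0 { absmax / 127.0 } else { 1.0 };
    let mut q = [0i8; 32];
    for (dst, &x) in q.iter_mut().zip(block.iter()) {
        // Round to the nearest code; clamp guards against rounding overflow.
        *dst = (x / scale).round().clamp(-127.0, 127.0) as i8;
    }
    (scale, q)
}

fn dequantize_q8_block(scale: f32, q: &[i8; 32]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (dst, &v) in out.iter_mut().zip(q.iter()) {
        *dst = v as f32 * scale;
    }
    out
}

fn main() {
    let block: [f32; 32] = core::array::from_fn(|i| (i as f32 - 16.0) / 4.0);
    let (scale, q) = quantize_q8_block(&block);
    let restored = dequantize_q8_block(scale, &q);
    let max_err = block
        .iter()
        .zip(restored.iter())
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max);
    // Roundtrip error is bounded by half a quantization step.
    assert!(max_err <= scale * 0.5 + 1e-6);
    println!("scale = {scale}, max roundtrip error = {max_err}");
}
```

The same pattern scales down to Q4/Q5 (smaller code range, optional per-block minimum) and up to the storage figures in the table below.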
## Modules

| Module | Description |
|---|---|
| `types` | `QuantType` enum, `Q8Block`, `Q4Block`, `Q4_1Block`, `Q5Block`, `Q5_1Block`, `QuantizedBlock`, `QuantizedTensor` |
| `quantize` | Tensor/model quantization (Q8_0, Q4_0, Q4_1, Q5_0, Q5_1, F16, F32) + `compute_quantization_error` / `compute_quantization_stats` |
| `dequantize` | Block and tensor reconstruction to f32 |
| `bitnet` | I2_S 1.58-bit ternary — `I2sBlock`, `dequantize_i2s(_block)`, `matmul_i2s`, `matmul_i2s_i8`, `quantize_row_to_int8`, `decode_trit`/`encode_trit`, `bytes_for_elements` |
| `calibration` | `CalibrationData` (Welford streaming mean/variance + histogram percentiles), `CalibrationMethod`, `calibrate`, `calibrate_batch` |
| `inference` | `QuantizedLinear`, `QuantizedModel`, `quantize_parameters`, `serialize_quantized` / `deserialize_quantized` |
| `error` | `QuantError` and `QuantResult` |
## Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
axonml-quant = "0.6.1"
```
### Basic Quantization

```rust
use axonml_quant::{quantize_tensor, dequantize_tensor, QuantType};
use axonml_tensor::Tensor;

// Create a tensor
let tensor = Tensor::from_vec(vec![1.0, 2.5, -3.0, 4.2], &[2, 2])?;

// Quantize to 8-bit
let quantized = quantize_tensor(&tensor, QuantType::Q8_0)?;

// Check compression ratio
println!("compression: {:.2}x", quantized.compression_ratio());

// Dequantize back to f32
let restored = dequantize_tensor(&quantized)?;
```
### Model Quantization

```rust
use axonml_quant::{quantize_model, QuantType};

// Quantize multiple named tensors (weight1/weight2 created elsewhere)
let tensors = vec![
    ("layer1.weight".to_string(), weight1),
    ("layer2.weight".to_string(), weight2),
];
let quantized_model = quantize_model(&tensors, QuantType::Q4_0)?;
```
### Calibration

```rust
use axonml_quant::calibrate;
use axonml_quant::CalibrationMethod;

// Calibrate using percentile method (99.9%) — argument is percentile * 10
let calib_data = calibrate(&activations, CalibrationMethod::Percentile(999))?;

// Get optimal scale for quantization
let scale = calib_data.symmetric_scale(8);

// Or use asymmetric quantization
let (scale, zero_point) = calib_data.asymmetric_scale(8);
```
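The idea behind percentile calibration is outlier clipping: instead of the raw absmax, the scale comes from, say, the 99.9th-percentile magnitude, so one extreme activation cannot blow up the quantization step for everything else. A self-contained sketch of that idea (illustrative code, assuming nothing about the crate's internals; `p_times_10` mirrors the `Percentile(p×10)` convention above):

```rust
// Pick a symmetric scale from the p-th percentile of |x| rather than absmax.
// p_times_10 follows the Percentile(p * 10) convention: 999 => 99.9%.
fn percentile_symmetric_scale(samples: &[f32], p_times_10: u32, qmax: f32) -> f32 {
    let mut mags: Vec<f32> = samples.iter().map(|x| x.abs()).collect();
    mags.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let frac = p_times_10 as f32 / 1000.0;
    let idx = (((mags.len() - 1) as f32) * frac).round() as usize;
    mags[idx] / qmax
}

fn main() {
    // 999 well-behaved samples in [0, 1) plus one extreme outlier.
    let mut samples: Vec<f32> = (0..999).map(|i| i as f32 / 999.0).collect();
    samples.push(1000.0);
    // A MinMax (absmax) scale would be dominated by the outlier.
    let minmax_scale = 1000.0 / 127.0;
    let p999_scale = percentile_symmetric_scale(&samples, 999, 127.0);
    // The percentile scale ignores the outlier almost entirely.
    assert!(p999_scale < minmax_scale / 100.0);
    println!("minmax scale = {minmax_scale}, p99.9 scale = {p999_scale}");
}
```

The Entropy (KL-divergence) method refines this further by searching for the clip threshold that best preserves the activation distribution, in the style of TensorRT's INT8 calibration.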
### Quantization Error Analysis

```rust
use axonml_quant::compute_quantization_stats;
use axonml_quant::QuantType;

let stats = compute_quantization_stats(&tensor, QuantType::Q4_0)?;
println!("RMSE:              {:.6}", stats.rmse);
println!("Max error:         {:.6}", stats.max_error);
println!("Mean error:        {:.6}", stats.mean_error);
println!("Compression ratio: {:.2}x", stats.compression_ratio);
```
### BitNet I2_S Ternary

```rust
use axonml_quant::bitnet::{matmul_i2s, matmul_i2s_i8, quantize_row_to_int8, I2S_BLOCK_SIZE};

// f32-activation path
let mut output = vec![0.0f32; m * n];
matmul_i2s(&packed_weights, &activations, &mut output, m, n, k, weight_scale);

// int8-activation path (AVX-VNNI-dispatched, scalar fallback)
let mut acts_i8 = vec![0i8; m * k];
let mut act_scales = vec![0.0f32; m];
for i in 0..m {
    act_scales[i] = quantize_row_to_int8(
        &activations[i * k..(i + 1) * k],
        &mut acts_i8[i * k..(i + 1) * k],
    );
}
matmul_i2s_i8(&packed_weights, &acts_i8, &act_scales, &mut output, m, n, k, weight_scale);
```

`k` must be a multiple of `I2S_BLOCK_SIZE` (128).
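For intuition, the 2-bit ternary packing can be sketched independently of the crate: each weight in {−1, 0, +1} needs two bits, so four weights fit in one byte, which is where the ~16x compression comes from. This toy version packs trits linearly with an assumed code mapping; the real I2_S layout is group-strided, and the crate's `encode_trit`/`decode_trit` define the actual codes:

```rust
// Toy 2-bit ternary packing. Code mapping 0 => -1, 1 => 0, 2 => +1 is
// illustrative only; I2_S proper uses a group-strided layout.
fn encode_trit_toy(w: i8) -> u8 {
    (w + 1) as u8
}

fn decode_trit_toy(code: u8) -> i8 {
    code as i8 - 1
}

fn pack_trits(weights: &[i8]) -> Vec<u8> {
    weights
        .chunks(4)
        .map(|chunk| {
            chunk
                .iter()
                .enumerate()
                .fold(0u8, |byte, (i, &w)| byte | (encode_trit_toy(w) << (2 * i)))
        })
        .collect()
}

fn unpack_trits(packed: &[u8], n: usize) -> Vec<i8> {
    (0..n)
        .map(|i| decode_trit_toy((packed[i / 4] >> (2 * (i % 4))) & 0b11))
        .collect()
}

fn main() {
    let weights: Vec<i8> = vec![-1, 0, 1, 1, 0, -1, -1, 0];
    let packed = pack_trits(&weights);
    assert_eq!(packed.len(), 2); // 8 trits -> 2 bytes, vs 32 bytes as f32
    assert_eq!(unpack_trits(&packed, 8), weights);
    println!("packed bytes: {packed:?}");
}
```

The fused `matmul_i2s` kernels never materialize the decoded weights: a +1 trit adds the activation, a −1 trit subtracts it, and a 0 trit is skipped, with the single tensor-wide scale applied at the end.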
### Quantized Inference Layer

```rust
use axonml_quant::{QuantizedLinear, QuantType};

let qlinear = QuantizedLinear::from_linear_params(&weight, Some(&bias), QuantType::Q8_0)?;
let output = qlinear.forward_f32(&input)?;
```
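Conceptually, a quantized linear forward dequantizes weight blocks on the fly inside the dot product rather than materializing an f32 weight matrix. A minimal sketch under assumed Q8_0-style storage (illustrative names and layout, not the crate's internals):

```rust
// Sketch of y = W x + b where W is stored as (per-block scale, i8 codes).
// Row-major weights, 32-element blocks, k % 32 == 0 assumed.
fn qlinear_forward(
    scales: &[f32],   // one scale per 32-weight block, row-major
    qweights: &[i8],  // quantized weight codes, row-major
    bias: &[f32],     // one bias per output row
    input: &[f32],    // k input activations
) -> Vec<f32> {
    let k = input.len();
    let rows = bias.len();
    let blocks_per_row = k / 32;
    (0..rows)
        .map(|r| {
            let mut acc = bias[r];
            for b in 0..blocks_per_row {
                let scale = scales[r * blocks_per_row + b];
                for j in 0..32 {
                    // Dequantize each weight just-in-time: w = code * scale.
                    let idx = r * k + b * 32 + j;
                    acc += qweights[idx] as f32 * scale * input[b * 32 + j];
                }
            }
            acc
        })
        .collect()
}

fn main() {
    // One output row, k = 32: every weight is 1.0 (code 2 x scale 0.5).
    let out = qlinear_forward(&[0.5], &[2i8; 32], &[0.25], &[1.0f32; 32]);
    assert!((out[0] - 32.25).abs() < 1e-5);
    println!("output = {out:?}");
}
```

The crate parallelizes this shape of loop with Rayon across columns, as noted in the Features list.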
## Quantization Types
| Type | Bits | Block Size | Block Bytes | Compression | Use Case |
|---|---|---|---|---|---|
| Q8_0 | 8 | 32 | 34 | ~3.76x | High accuracy, moderate compression |
| Q4_0 | 4 | 32 | 18 | ~7.11x | Signed 4-bit, symmetric |
| Q4_1 | 4 | 32 | 20 | ~6.40x | Unsigned 4-bit, with per-block min |
| Q5_0 | 5 | 32 | 22 | ~5.82x | 5-bit signed, symmetric |
| Q5_1 | 5 | 32 | 24 | ~5.33x | 5-bit unsigned, with per-block min |
| I2_S (BitNet b1.58) | 2 | 128 | 32 | ~16x | Ternary {−1, 0, +1} with one tensor-wide scale |
| F16 | 16 | 1 | 2 | 2x | Minimal accuracy loss |
| F32 | 32 | 1 | 4 | 1x | No compression (reference) |
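The compression column is plain block arithmetic: a block of `n` f32 weights occupies `4n` bytes uncompressed, so a 34-byte Q8_0 block covering 32 weights gives 128 / 34 ≈ 3.76x. A quick check of every row (self-contained, no crate dependency):

```rust
// Compression ratio = (block elements * 4 bytes of f32) / stored block bytes.
fn compression_ratio(block_elems: usize, block_bytes: usize) -> f64 {
    (block_elems * 4) as f64 / block_bytes as f64
}

fn main() {
    // (name, block elements, block bytes) as listed in the table above.
    let formats: [(&str, usize, usize); 8] = [
        ("Q8_0", 32, 34),
        ("Q4_0", 32, 18),
        ("Q4_1", 32, 20),
        ("Q5_0", 32, 22),
        ("Q5_1", 32, 24),
        ("I2_S", 128, 32),
        ("F16", 1, 2),
        ("F32", 1, 4),
    ];
    for (name, elems, bytes) in formats {
        println!("{name}: {:.2}x", compression_ratio(elems, bytes));
    }
    assert!((compression_ratio(32, 34) - 3.7647).abs() < 1e-3); // Q8_0
    assert!((compression_ratio(128, 32) - 16.0).abs() < 1e-9);  // I2_S
}
```

The per-block overhead (scale, and a minimum for the `_1` variants) is why Q8_0 lands at ~3.76x rather than an even 4x.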
## Tests
Test coverage includes: quant-type properties and string parsing, Q4 / Q5 / Q8
block pack-unpack roundtrips, Q-type quantization shape + compression checks,
calibration (MinMax, Percentile, symmetric/asymmetric scales, dynamic range),
AXQT serialize/deserialize roundtrip, QuantizedLinear forward parity vs f32
reference, and for BitNet: trit encode/decode, block roundtrip, group-strided
layout correctness, single/multi-block dequantization, fused vs reference
matmul agreement (f32 and int8), int8 activation roundtrip, and misaligned-k
rejection.
## License
Licensed under either of:
- MIT License
- Apache License, Version 2.0
at your option.