Module quant


Quantization: QAT and PTQ

Provides quantization utilities for QLoRA, Quantization-Aware Training (QAT), and Post-Training Quantization (PTQ):

  • 4-bit block-wise quantization for QLoRA
  • Fake quantization with STE for QAT
  • PTQ calibration (min-max, percentile, moving average)
  • GGUF-compatible Q4_0/Q8_0 formats
  • Per-channel vs per-tensor quantization granularity
  • Quantization error analysis and metrics
  • Accuracy degradation benchmarks
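
As a rough illustration of the block-wise 4-bit idea in the first bullet, the sketch below quantizes each block with its own absmax-derived scale and packs two codes per byte. It is not this module's implementation: `SKETCH_BLOCK`, `Packed4Bit`, `quantize_blockwise_4bit`, and `dequantize_blockwise_4bit` are hypothetical names, and the symmetric signed grid [-7, 7] is an assumption (QLoRA-style schemes may use a non-uniform code book instead).

```rust
/// Hypothetical block size for this sketch (the module documents BLOCK_SIZE as 64).
const SKETCH_BLOCK: usize = 64;

/// Illustrative layout only: packed 4-bit codes plus one f32 scale per block.
pub struct Packed4Bit {
    pub codes: Vec<u8>,   // two 4-bit codes per byte, low nibble first
    pub scales: Vec<f32>, // one scale per block
    pub len: usize,       // number of original values
}

/// Quantize with a symmetric absmax scale computed independently per block.
pub fn quantize_blockwise_4bit(values: &[f32]) -> Packed4Bit {
    let mut scales = Vec::new();
    let mut nibbles: Vec<u8> = Vec::with_capacity(values.len());
    for block in values.chunks(SKETCH_BLOCK) {
        let absmax = block.iter().fold(0.0f32, |m, v| m.max(v.abs()));
        // Map [-absmax, absmax] onto the integer levels [-7, 7].
        let scale = if absmax > 0.0 { absmax / 7.0 } else { 1.0 };
        scales.push(scale);
        for &v in block {
            let q = (v / scale).round().clamp(-7.0, 7.0) as i8;
            nibbles.push((q + 8) as u8); // shift to the unsigned range 1..=15
        }
    }
    let mut codes = Vec::with_capacity((nibbles.len() + 1) / 2);
    for pair in nibbles.chunks(2) {
        let lo = pair[0];
        let hi = if pair.len() == 2 { pair[1] } else { 8 }; // pad with the code for 0.0
        codes.push(lo | (hi << 4));
    }
    Packed4Bit { codes, scales, len: values.len() }
}

/// Unpack each nibble and multiply by the scale of the block it belongs to.
pub fn dequantize_blockwise_4bit(p: &Packed4Bit) -> Vec<f32> {
    (0..p.len)
        .map(|i| {
            let byte = p.codes[i / 2];
            let nib = if i % 2 == 0 { byte & 0x0F } else { byte >> 4 };
            let q = nib as i8 - 8;
            f32::from(q) * p.scales[i / SKETCH_BLOCK]
        })
        .collect()
}
```

Because each block has its own scale, a single large value only coarsens the grid of the block that contains it, which is the usual motivation for block-wise over per-tensor scaling.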

Structs

BenchmarkSuite
Suite of benchmark results
CalibrationResult
Calibration result containing scale and zero_point
Calibrator
PTQ Calibrator for collecting statistics and computing quantization parameters
DoubleQuantized4Bit
Double-quantized 4-bit representation
FakeQuantConfig
Fake quantization configuration
FakeQuantize
Fake quantization operation with Straight-Through Estimator (STE)
Q4_0
Q4_0 quantized tensor (GGUF format)
Q8_0
Q8_0 quantized tensor (GGUF format)
QuantBenchmarkResult
Benchmark results for quantization accuracy
QuantErrorStats
Error statistics for quantization analysis
QuantParams
Quantization parameters for a tensor
Quantized4Bit
4-bit quantized representation with block-wise scale factors
QuantizedTensor
Quantized tensor with per-channel or per-tensor quantization
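
To make the FakeQuantize entry above more concrete, here is a minimal sketch of fake quantization with a Straight-Through Estimator. `SketchParams`, `fake_quant_forward`, and `ste_backward_sketch` are hypothetical names and do not reflect the signatures of this module's `FakeQuantize`, `fake_quantize`, or `ste_backward`. The backward pass shown is the clipped STE variant, which is an assumption; a plain STE would return the incoming gradient unchanged.

```rust
/// Hypothetical parameters for this sketch; not the crate's QuantParams.
#[derive(Clone, Copy)]
pub struct SketchParams {
    pub scale: f32,
    pub zero_point: i32,
    pub qmin: i32,
    pub qmax: i32,
}

/// Forward pass: snap each value to the integer grid, then dequantize
/// immediately, so the output stays in f32 but carries quantization error.
pub fn fake_quant_forward(x: &[f32], p: SketchParams) -> Vec<f32> {
    x.iter()
        .map(|&v| {
            let q = ((v / p.scale).round() as i32)
                .saturating_add(p.zero_point)
                .clamp(p.qmin, p.qmax);
            (q - p.zero_point) as f32 * p.scale
        })
        .collect()
}

/// Backward pass (clipped STE): treat round() as the identity, so the
/// gradient passes through where the input was inside the representable
/// range and is zeroed where the forward pass clipped it.
pub fn ste_backward_sketch(x: &[f32], grad_out: &[f32], p: SketchParams) -> Vec<f32> {
    let lo = (p.qmin - p.zero_point) as f32 * p.scale;
    let hi = (p.qmax - p.zero_point) as f32 * p.scale;
    x.iter()
        .zip(grad_out)
        .map(|(&v, &g)| if v >= lo && v <= hi { g } else { 0.0 })
        .collect()
}
```

The rounding step has zero gradient almost everywhere, so without the estimator no learning signal would reach the weights during QAT.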

Enums

CalibrationMethod
Calibration method for PTQ
GGUFQuantType
Quantization type enum for GGUF export
QuantGranularity
Quantization granularity options
QuantMode
Quantization mode: symmetric or asymmetric
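
The symmetric/asymmetric and granularity distinctions above can be sketched with min-max calibration for a signed 8-bit grid. Everything here is illustrative: `SketchMode`, `min_max_params`, and `per_channel_params` are hypothetical names, and the 8-bit ranges ([-127, 127] symmetric, [-128, 127] asymmetric) are assumptions rather than this module's documented behavior.

```rust
/// Hypothetical stand-in for the symmetric/asymmetric choice.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum SketchMode {
    Symmetric,
    Asymmetric,
}

/// Min-max calibration: returns (scale, zero_point) for a signed 8-bit grid.
pub fn min_max_params(values: &[f32], mode: SketchMode) -> (f32, i32) {
    if values.is_empty() {
        return (1.0, 0);
    }
    let min = values.iter().copied().fold(f32::INFINITY, f32::min);
    let max = values.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    match mode {
        SketchMode::Symmetric => {
            // Range is centered on zero, so the zero point is always 0.
            let absmax = min.abs().max(max.abs());
            let scale = if absmax > 0.0 { absmax / 127.0 } else { 1.0 };
            (scale, 0)
        }
        SketchMode::Asymmetric => {
            // Extend the range to include 0.0 so zero is exactly representable.
            let (lo, hi) = (min.min(0.0), max.max(0.0));
            let scale = if hi > lo { (hi - lo) / 255.0 } else { 1.0 };
            // Chosen so that `lo` maps to -128 and `hi` maps to 127.
            let zero_point = (-128.0 - lo / scale).round() as i32;
            (scale, zero_point)
        }
    }
}

/// Per-channel granularity: one (scale, zero_point) pair per equal-sized
/// contiguous slice. Returns an empty Vec if the input cannot be split evenly.
pub fn per_channel_params(
    values: &[f32],
    num_channels: usize,
    mode: SketchMode,
) -> Vec<(f32, i32)> {
    if num_channels == 0 || values.is_empty() || values.len() % num_channels != 0 {
        return Vec::new();
    }
    values
        .chunks(values.len() / num_channels)
        .map(|channel| min_max_params(channel, mode))
        .collect()
}
```

Per-tensor granularity corresponds to calling `min_max_params` once on the whole slice; per-channel trades extra parameter storage for a tighter grid on channels with small ranges.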

Constants

BLOCK_SIZE
Block size for quantization (64 elements per block)
DOUBLE_QUANT_BLOCK_SIZE
Block size for second-level scale quantization (256 scales per super-block)
GGUF_BLOCK_SIZE
GGUF block size (standard for llama.cpp)
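
The two block sizes above determine how much storage the scale factors add on top of the 4-bit codes. The arithmetic below is a sketch under stated assumptions: first-level scales are taken to be f32 when not double-quantized, and 8-bit with one f32 scale per super-block when double-quantized. The module's actual storage types are not given here, and both function names are hypothetical.

```rust
/// Bits per weight for 4-bit codes with one f32 scale per block.
/// Assumption: scales are stored as 32-bit floats.
pub fn bits_per_weight_single(block_size: usize) -> f64 {
    4.0 + 32.0 / block_size as f64
}

/// Bits per weight when first-level scales are re-quantized to 8 bits and
/// every `scales_per_super_block` of them share one f32 second-level scale.
/// Assumption: 8-bit first-level scales, f32 second-level scales.
pub fn bits_per_weight_double(block_size: usize, scales_per_super_block: usize) -> f64 {
    let block = block_size as f64;
    4.0 + 8.0 / block + 32.0 / (block * scales_per_super_block as f64)
}
```

With the documented values (64 elements per block, 256 scales per super-block), these assumptions give 4.5 bits per weight without double quantization and roughly 4.127 bits with it.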

Functions

accuracy_retention
Calculate accuracy retention percentage
analyze_error
Analyze quantization error for given values and parameters
analyze_outlier_impact
Analyze impact of outliers on quantization error
calibrate_min_max
Convenience function for min-max calibration
calibrate_per_channel
Calibrate quantization parameters for per-channel quantization
calibrate_per_group
Calibrate quantization parameters for per-group quantization
calibrate_per_tensor
Calibrate quantization parameters for per-tensor quantization
calibrate_percentile
Convenience function for percentile calibration
compare_bit_width_degradation
Compare accuracy degradation across bit widths
compare_bit_widths
Compare error between different bit widths
compare_granularities
Compare per-channel vs per-tensor quantization error
dequantize_4bit
Dequantize 4-bit values back to f32
dequantize_4bit_double
Dequantize double-quantized 4-bit values back to f32
dequantize_tensor
Dequantize tensor
dequantize_with_params
Dequantize values using given parameters
error_within_bounds
Check if error is within expected bounds
fake_quantize
Convenience function for fake quantization forward pass
generate_gaussian_weights
Generate Gaussian-like weight distribution (common in neural networks)
generate_multi_channel_weights
Generate multi-channel weights (like conv/linear layer)
generate_uniform_weights
Generate uniform weights in range
generate_weights_with_outliers
Generate weights with outliers (to test robustness)
quantization_mse
Compute quantization error (MSE)
quantize_4bit
Quantize f32 values to 4-bit with block-wise scaling
quantize_4bit_double
Quantize values to 4-bit with double quantization of scale factors
quantize_tensor
Quantize tensor with specified granularity
quantize_with_params
Quantize values using given parameters
run_benchmark
Run benchmark on given values with specified configuration
run_full_benchmark_suite
Run full benchmark suite on various weight patterns
scale_sensitivity
Analyze sensitivity of error to scale perturbation
ste_backward
Convenience function for STE backward pass
theoretical_max_error
Calculate theoretical maximum error for given quantization parameters
theoretical_sqnr
Calculate expected SQNR for uniform quantization
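
For readers unfamiliar with the error metrics named above, the following sketch shows the standard definitions of MSE and signal-to-quantization-noise ratio (SQNR), plus two textbook bounds. The names `mse`, `sqnr_db`, `rule_of_thumb_sqnr_db`, and `max_rounding_error` are hypothetical, and the formulas used by this module's `quantization_mse`, `theoretical_sqnr`, and `theoretical_max_error` may differ.

```rust
/// Mean squared error between original and reconstructed values.
pub fn mse(original: &[f32], reconstructed: &[f32]) -> f64 {
    let n = original.len().min(reconstructed.len());
    if n == 0 {
        return 0.0;
    }
    let sum: f64 = original
        .iter()
        .zip(reconstructed)
        .map(|(&a, &b)| {
            let d = f64::from(a) - f64::from(b);
            d * d
        })
        .sum();
    sum / n as f64
}

/// Empirical SQNR in decibels: 10 * log10(signal power / noise power).
/// Returns NaN for empty input and +infinity for a perfect reconstruction
/// of a non-zero signal.
pub fn sqnr_db(original: &[f32], reconstructed: &[f32]) -> f64 {
    let n = original.len().min(reconstructed.len());
    if n == 0 {
        return f64::NAN;
    }
    let signal: f64 =
        original[..n].iter().map(|&a| f64::from(a).powi(2)).sum::<f64>() / n as f64;
    10.0 * (signal / mse(original, reconstructed)).log10()
}

/// Textbook rule of thumb for an ideal uniform quantizer driven by a
/// full-scale sine wave: about 6.02 dB per bit plus 1.76 dB.
pub fn rule_of_thumb_sqnr_db(bits: u32) -> f64 {
    6.02 * f64::from(bits) + 1.76
}

/// Worst-case rounding error for a value inside the representable range:
/// half of one quantization step.
pub fn max_rounding_error(scale: f32) -> f32 {
    scale / 2.0
}
```

Real weight distributions are rarely full-scale sinusoids, so a measured SQNR below the rule-of-thumb figure is expected, particularly when outliers stretch the quantization range.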