Dynamic Quantization for On-the-Fly Model Compression
Provides automatic weight quantization during model loading for memory-efficient inference.
§Features
- Weight-Only Quantization: Quantize weights while keeping activations in FP32
- Dynamic Quantization: Quantize weights and activations at runtime
- Mixed Precision: Selective quantization based on layer sensitivity (see the sketch after this list)
- Multiple Backends: INT8, FP16, BF16 support
- HuggingFace Integration: Automatic quantization on model load
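The selective, sensitivity-driven path can be pictured as a per-layer match; the enum variants and the mapping below are illustrative assumptions, not the crate's actual definitions of LayerSensitivity and QuantStrategy.
// Hypothetical per-layer strategy selection for mixed precision.
enum Sensitivity {
    High,   // e.g. embeddings and layer norms: keep full precision
    Medium, // moderately sensitive layers: FP16 is usually safe
    Low,    // large linear projections: tolerate INT8 weight-only
}

enum Strategy {
    KeepFP32,
    FP16,
    INT8WeightOnly,
}

// Map each layer's sensitivity to the most aggressive strategy it tolerates.
fn strategy_for(s: &Sensitivity) -> Strategy {
    match s {
        Sensitivity::High => Strategy::KeepFP32,
        Sensitivity::Medium => Strategy::FP16,
        Sensitivity::Low => Strategy::INT8WeightOnly,
    }
}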
§Quantization Strategies
§INT8 Weight-Only Quantization
- Quantize weights to INT8 (4x compression)
- Keep activations in FP32 for accuracy
- Best for memory-bound workloads
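A minimal sketch of symmetric per-tensor INT8 quantization, the scheme implied by the 4x figure above; the function names and the per-tensor scaling choice are assumptions, not the crate's internals.
// Map FP32 weights onto [-127, 127] with a single per-tensor scale.
fn quantize_int8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

// Recover approximate FP32 values for layers whose activations stay in FP32.
fn dequantize_int8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| f32::from(v) * scale).collect()
}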
§FP16 Mixed Precision
- Convert weights to FP16 (2x compression)
- Better accuracy than INT8
- Hardware acceleration on modern GPUs
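The FP16 path amounts to a straight element-wise cast; a sketch assuming the external half crate is available (it is not a dependency stated on this page).
use half::f16;

// Halve weight memory by storing FP16; convert back for FP32-only kernels.
fn to_fp16(weights: &[f32]) -> Vec<f16> {
    weights.iter().map(|&w| f16::from_f32(w)).collect()
}

fn to_fp32(weights: &[f16]) -> Vec<f32> {
    weights.iter().map(|w| w.to_f32()).collect()
}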
§Dynamic Quantization
- Quantize both weights and activations
- Maximum memory savings (8x with INT8)
- Automatic calibration from data
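A sketch of how calibration can derive the activation scale that runtime quantization then reuses; the type and method names are illustrative stand-ins for what with_calibration_samples does internally.
// Illustrative runtime activation quantizer calibrated from sample batches.
struct ActQuantizer {
    scale: f32,
}

impl ActQuantizer {
    // Derive the scale from the largest activation magnitude seen during calibration.
    fn calibrate(samples: &[Vec<f32>]) -> Self {
        let max_abs = samples
            .iter()
            .flat_map(|batch| batch.iter())
            .fold(0.0f32, |m, a| m.max(a.abs()));
        Self { scale: if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 } }
    }

    // Quantize live activations to INT8 with the calibrated scale.
    fn quantize(&self, activations: &[f32]) -> Vec<i8> {
        activations
            .iter()
            .map(|a| (a / self.scale).round().clamp(-127.0, 127.0) as i8)
            .collect()
    }
}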
§Example
use kizzasi_model::dynamic_quantization::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load and quantize a HuggingFace model; `weights` stands in for the loaded tensors.
    let weights = vec![0.5f32, -1.2, 0.03, 2.4];
    let quantizer = DynamicQuantizer::new()
        .with_strategy(QuantStrategy::INT8WeightOnly)
        .with_calibration_samples(100);
    let quantized_weights = quantizer.quantize_weights(&weights)?;
    Ok(())
}
Structs§
- DynamicQuantizer - Dynamic quantizer for automatic model compression
- QuantizationStats - Quantization statistics
Enums§
- LayerSensitivity - Layer sensitivity classification for mixed precision
- QuantStrategy - Quantization strategy
- QuantizedWeightStorage - Quantized model weights storage