Module dynamic_quantization


Dynamic Quantization for On-the-Fly Model Compression

Provides automatic weight quantization during model loading for memory-efficient inference.

§Features

  • Weight-Only Quantization: Quantize weights while keeping activations in FP32
  • Dynamic Quantization: Quantize weights and activations at runtime
  • Mixed Precision: Selective quantization based on layer sensitivity
  • Multiple Backends: INT8, FP16, BF16 support
  • HuggingFace Integration: Automatic quantization on model load

§Quantization Strategies

§INT8 Weight-Only Quantization

  • Quantize weights to INT8 (4x compression)
  • Keep activations in FP32 for accuracy
  • Best for memory-bound workloads (the arithmetic is sketched below)
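
The arithmetic behind weight-only INT8 quantization is simple; the following standalone sketch (plain Rust, independent of this module's types) uses symmetric per-tensor absmax scaling and keeps the scale so weights can be widened back to FP32 at matmul time:

/// Symmetric per-tensor INT8 quantization: scale = absmax / 127.
fn quantize_int8(weights: &[f32]) -> (Vec<i8>, f32) {
    let absmax = weights.iter().fold(0.0f32, |m, &w| m.max(w.abs()));
    let scale = if absmax == 0.0 { 1.0 } else { absmax / 127.0 };
    let q = weights
        .iter()
        .map(|&w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Widen back to FP32 before (or fused into) the matmul, since
/// activations stay in FP32 under this strategy.
fn dequantize_int8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}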

§FP16 Mixed Precision

  • Convert weights to FP16 (2x compression)
  • Better accuracy than INT8
  • Hardware acceleration on modern GPUs (conversion sketched below)
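
As a sketch of the FP16 path, the conversion below uses the half crate; that crate is an assumption here, and the module may store FP16 weights differently:

use half::f16;

/// Convert FP32 weights to FP16 (2x smaller), rounding to nearest.
fn to_fp16(weights: &[f32]) -> Vec<f16> {
    weights.iter().map(|&w| f16::from_f32(w)).collect()
}

/// Widen back to FP32 where no hardware FP16 kernels are available.
fn to_fp32(weights: &[f16]) -> Vec<f32> {
    weights.iter().map(|&w| f32::from(w)).collect()
}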

§Dynamic Quantization

  • Quantize both weights and activations
  • Maximum memory savings (both weights and activations shrink 4x with INT8)
  • Automatic calibration from sample data (see the sketch below)
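
A standalone sketch of calibration-based activation quantization (illustrative helper names, not this module's API): a running absmax over calibration batches fixes the activation scale, which is then reused to quantize activations at runtime:

/// Track the largest activation magnitude seen across calibration batches.
struct Calibrator {
    absmax: f32,
}

impl Calibrator {
    fn new() -> Self {
        Self { absmax: 0.0 }
    }

    /// Feed one calibration batch of activations.
    fn observe(&mut self, activations: &[f32]) {
        for &a in activations {
            self.absmax = self.absmax.max(a.abs());
        }
    }

    /// Scale fixed after calibration; reused for every inference batch.
    fn scale(&self) -> f32 {
        if self.absmax == 0.0 { 1.0 } else { self.absmax / 127.0 }
    }
}

/// Quantize a runtime activation tensor with the calibrated scale.
fn quantize_activations(activations: &[f32], scale: f32) -> Vec<i8> {
    activations
        .iter()
        .map(|&a| (a / scale).round().clamp(-127.0, 127.0) as i8)
        .collect()
}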

§Example

use kizzasi_model::dynamic_quantization::*;

// Build a quantizer: INT8 weight-only strategy, calibrated on 100 samples.
let quantizer = DynamicQuantizer::new()
    .with_strategy(QuantStrategy::INT8WeightOnly)
    .with_calibration_samples(100);

// `weights` holds the model's FP32 weights (e.g. loaded from a
// HuggingFace checkpoint).
let quantized_weights = quantizer.quantize_weights(&weights)?;
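
For mixed precision, the LayerSensitivity enum below classifies layers so that sensitive ones keep higher precision. As a standalone illustration of one such heuristic (not this module's implementation), a layer whose weights span a wide dynamic range can be kept in FP16 while the rest drop to INT8:

/// Illustrative sensitivity heuristic: ratio of absmax to mean magnitude.
/// Wide-ranged layers lose more accuracy under INT8, so keep them in FP16.
fn pick_precision(weights: &[f32]) -> &'static str {
    let absmax = weights.iter().fold(0.0f32, |m, &w| m.max(w.abs()));
    let mean = weights.iter().map(|w| w.abs()).sum::<f32>() / weights.len() as f32;
    if mean > 0.0 && absmax / mean > 20.0 {
        "fp16" // sensitive layer: keep higher precision
    } else {
        "int8" // robust layer: take maximum compression
    }
}

The threshold (20.0 here) is arbitrary; in practice it would be tuned against accuracy on a validation set.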

Structs§

DynamicQuantizer
Dynamic quantizer for automatic model compression
QuantizationStats
Quantization statistics

Enums§

LayerSensitivity
Layer sensitivity classification for mixed precision
QuantStrategy
Quantization strategy
QuantizedWeightStorage
Quantized model weights storage