Quantization Pipeline for RuvLTRA Models
This module provides quantization capabilities for converting full-precision models to optimized quantized formats suitable for edge inference on Apple Silicon.
§Supported Quantization Formats
| Format | Bits per weight | Memory (0.5B params) | Quality | Use Case |
|---|---|---|---|---|
| Q4_K_M | 4.5 | ~300 MB | Good | Best quality/size tradeoff |
| Q5_K_M | 5.5 | ~375 MB | Better | Higher quality, still compact |
| Q8_0 | 8.5 | ~500 MB | Best | Near-lossless quantization |
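The fractional bit counts in the table can be derived from the block sizes listed under Structs (for example, 144 bytes per 256 elements is 4.5 bits per weight). The sketch below reproduces that arithmetic and a raw weight-storage estimate. Both helper names are hypothetical and not part of the `ruvllm` API. The estimate counts only quantized weights, so the table's rounded figures need not match it exactly.

```rust
/// Effective bits per weight for a block of `block_bytes` bytes
/// that stores `elements` quantized values (scales included).
/// Hypothetical helper for illustration only.
fn bits_per_weight(block_bytes: u32, elements: u32) -> f64 {
    f64::from(block_bytes) * 8.0 / f64::from(elements)
}

/// Raw weight storage in megabytes (1 MB = 1,000,000 bytes) for
/// `params` weights at `bits` effective bits each.
/// Ignores metadata and any tensors kept at higher precision.
fn weight_megabytes(params: u64, bits: f64) -> f64 {
    (params as f64) * bits / 8.0 / 1.0e6
}
```

For a 0.5B-parameter model, `weight_megabytes(500_000_000, bits_per_weight(144, 256))` comes to roughly 281 MB for Q4_K_M.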
§Apple Neural Engine (ANE) Optimization
The quantization pipeline produces weights optimized for ANE inference:
- 16-byte aligned weight layouts
- Blocked quantization compatible with ANE tile operations
- Optimized memory access patterns for M4 Pro’s unified memory
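This page does not say how the 16-byte alignment is achieved. One common way to get it in Rust is a `#[repr(align(16))]` wrapper type, sketched below. `AlignedChunk` and `to_aligned` are hypothetical names for illustration, not types or functions from this module.

```rust
/// A 16-byte chunk whose address is always a multiple of 16.
/// Hypothetical type for illustration only.
#[repr(C, align(16))]
#[derive(Clone, Copy)]
struct AlignedChunk([u8; 16]);

/// Copy `bytes` into a buffer of 16-byte aligned chunks,
/// zero-padding the final chunk if the length is not a multiple of 16.
fn to_aligned(bytes: &[u8]) -> Vec<AlignedChunk> {
    bytes
        .chunks(16)
        .map(|src| {
            let mut chunk = AlignedChunk([0u8; 16]);
            chunk.0[..src.len()].copy_from_slice(src);
            chunk
        })
        .collect()
}
```

Because the element type carries the alignment, the `Vec` allocation starts on a 16-byte boundary and every chunk after it stays aligned.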
§Example
use ruvllm::quantize::{RuvltraQuantizer, QuantConfig, TargetFormat};
use std::path::Path;

// Create quantizer for Q4_K_M format
let config = QuantConfig::default()
    .with_format(TargetFormat::Q4_K_M)
    .with_ane_optimization(true);
let quantizer = RuvltraQuantizer::new(config)?;

// Quantize a model
quantizer.quantize_model(
    Path::new("qwen-0.5b.safetensors"),
    Path::new("ruvltra-small-q4.gguf"),
)?;

Structs§
- MemoryEstimate - Memory usage estimate for a quantized model
- Q4KMBlock - Q4_K_M block structure (144 bytes for 256 elements)
- Q5KMBlock - Q5_K_M block structure (176 bytes for 256 elements)
- Q8Block - Q8_0 block structure (34 bytes for 32 elements)
- QuantConfig - Configuration for quantization pipeline
- QuantProgress - Quantization progress information
- QuantStats - Quantization statistics
- RuvltraQuantizer - RuvLTRA model quantizer
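The block sizes quoted above match the block layouts used by ggml/GGUF for its Q4_K, Q5_K, and Q8_0 types. The sketch below shows one field breakdown that adds up to 144, 176, and 34 bytes. It assumes the ggml-style layout. The struct and field names here are hypothetical, and the actual `Q4KMBlock`, `Q5KMBlock`, and `Q8Block` definitions may differ.

```rust
use std::mem::size_of;

/// Elements per K-quant super-block and per Q8_0 block.
const QK_K: usize = 256;
const QK8_0: usize = 32;

/// Assumed ggml-style Q4_K layout: 2 + 2 + 12 + 128 = 144 bytes.
#[allow(dead_code)]
#[repr(C)]
struct Q4KBlockSketch {
    d: u16,             // f16 bit pattern: super-block scale
    dmin: u16,          // f16 bit pattern: super-block minimum scale
    scales: [u8; 12],   // packed 6-bit sub-block scales and mins
    qs: [u8; QK_K / 2], // 4-bit quants, two per byte
}

/// Assumed ggml-style Q5_K layout: 2 + 2 + 12 + 32 + 128 = 176 bytes.
#[allow(dead_code)]
#[repr(C)]
struct Q5KBlockSketch {
    d: u16,
    dmin: u16,
    scales: [u8; 12],
    qh: [u8; QK_K / 8], // fifth (high) bit of each quant
    qs: [u8; QK_K / 2], // low 4 bits of each quant
}

/// Assumed ggml-style Q8_0 layout: 2 + 32 = 34 bytes.
#[allow(dead_code)]
#[repr(C)]
struct Q8BlockSketch {
    d: u16,           // f16 bit pattern: block scale
    qs: [i8; QK8_0],  // signed 8-bit quants
}

/// Sizes of the three sketch blocks, in bytes.
fn sketch_block_sizes() -> (usize, usize, usize) {
    (
        size_of::<Q4KBlockSketch>(),
        size_of::<Q5KBlockSketch>(),
        size_of::<Q8BlockSketch>(),
    )
}
```

With `#[repr(C)]` and 2-byte alignment there is no padding, so the in-memory size equals the sum of the fields.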
Enums§
- TargetFormat - Target quantization format
Functions§
- dequantize_for_ane - Dequantize Q4_K_M blocks for ANE inference
- estimate_memory_q4 - Estimate memory for Q4_K_M quantization
- estimate_memory_q5 - Estimate memory for Q5_K_M quantization
- estimate_memory_q8 - Estimate memory for Q8_0 quantization
- quantize_ruvltra_q4 - Quantize FP32 values to Q4_K_M format
- quantize_ruvltra_q5 - Quantize FP32 values to Q5_K_M format
- quantize_ruvltra_q8 - Quantize FP32 values to Q8_0 format (symmetric 8-bit)
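To illustrate what "symmetric 8-bit" means for Q8_0, here is a minimal sketch of quantizing and dequantizing one 32-element block. It is not the implementation of `quantize_ruvltra_q8`, whose signature is not shown on this page. The function names are hypothetical, and the scale is kept as `f32` for simplicity, whereas a 34-byte block would store it as a 16-bit float.

```rust
/// Symmetrically quantize one block of 32 values to i8.
/// The scale maps the largest magnitude in the block to 127,
/// and zero always quantizes to zero (no offset term).
fn quantize_block_q8(values: &[f32; 32]) -> (f32, [i8; 32]) {
    let amax = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = amax / 127.0;
    // Guard against an all-zero block to avoid dividing by zero.
    let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
    let mut quants = [0i8; 32];
    for (q, &v) in quants.iter_mut().zip(values.iter()) {
        *q = (v * inv).round().clamp(-127.0, 127.0) as i8;
    }
    (scale, quants)
}

/// Reconstruct approximate f32 values from a quantized block.
fn dequantize_block_q8(scale: f32, quants: &[i8; 32]) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (o, &q) in out.iter_mut().zip(quants.iter()) {
        *o = scale * f32::from(q);
    }
    out
}
```

Each reconstructed value differs from the original by at most about half the scale, which is why Q8_0 is described as near-lossless.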