Module quantize
Quantization Pipeline for RuvLTRA Models

This module provides quantization capabilities for converting full-precision models to optimized quantized formats suitable for edge inference on Apple Silicon.

§Supported Quantization Formats

| Format | Bits/weight | Memory (0.5B) | Quality | Use Case                       |
|--------|-------------|---------------|---------|--------------------------------|
| Q4_K_M | 4.5         | ~300 MB       | Good    | Best quality/size tradeoff     |
| Q5_K_M | 5.5         | ~375 MB       | Better  | Higher quality, still compact  |
| Q8_0   | 8.5         | ~500 MB       | Best    | Near-lossless quantization     |
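
The memory figures above follow directly from bits-per-weight arithmetic (parameters × bits ÷ 8). A minimal sketch of that calculation, mirroring what the `estimate_memory_q4`/`_q5`/`_q8` functions report; `estimate_bytes` is an illustrative name, not the crate's API:

```rust
// Rough memory estimate from parameter count and bits-per-weight.
// Hypothetical helper for illustration; not the crate's actual API.
fn estimate_bytes(num_params: u64, bits_per_weight: f64) -> u64 {
    ((num_params as f64) * bits_per_weight / 8.0) as u64
}

fn main() {
    // 0.5B parameters at 4.5 bits/weight ≈ 281 MB, matching the ~300 MB row
    // in the table (the table adds headroom for metadata and activations).
    let q4_mb = estimate_bytes(500_000_000, 4.5) / 1_000_000;
    println!("Q4_K_M: ~{} MB", q4_mb);
}
```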

§Apple Neural Engine (ANE) Optimization

The quantization pipeline produces weights optimized for ANE inference:

  • 16-byte aligned weight layouts
  • Blocked quantization compatible with ANE tile operations
  • Optimized memory access patterns for M4 Pro’s unified memory

§Example

```rust
use ruvllm::quantize::{RuvltraQuantizer, QuantConfig, TargetFormat};
use std::path::Path;

// Create quantizer for Q4_K_M format
let config = QuantConfig::default()
    .with_format(TargetFormat::Q4_K_M)
    .with_ane_optimization(true);

let quantizer = RuvltraQuantizer::new(config)?;

// Quantize a model
quantizer.quantize_model(
    Path::new("qwen-0.5b.safetensors"),
    Path::new("ruvltra-small-q4.gguf"),
)?;
```

§Structs

  • MemoryEstimate: Memory usage estimate for a quantized model
  • Q4KMBlock: Q4_K_M block structure (144 bytes for 256 elements)
  • Q5KMBlock: Q5_K_M block structure (176 bytes for 256 elements)
  • Q8Block: Q8_0 block structure (34 bytes for 32 elements)
  • QuantConfig: Configuration for the quantization pipeline
  • QuantProgress: Quantization progress information
  • QuantStats: Quantization statistics
  • RuvltraQuantizer: RuvLTRA model quantizer

§Enums

  • TargetFormat: Target quantization format

§Functions

  • dequantize_for_ane: Dequantize Q4_K_M blocks for ANE inference
  • estimate_memory_q4: Estimate memory for Q4_K_M quantization
  • estimate_memory_q5: Estimate memory for Q5_K_M quantization
  • estimate_memory_q8: Estimate memory for Q8_0 quantization
  • quantize_ruvltra_q4: Quantize FP32 values to Q4_K_M format
  • quantize_ruvltra_q5: Quantize FP32 values to Q5_K_M format
  • quantize_ruvltra_q8: Quantize FP32 values to Q8_0 format (symmetric 8-bit)
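
The symmetric 8-bit scheme behind `quantize_ruvltra_q8` (one scale per 32-element block: scale = max(|x|)/127, values stored as `i8`) can be sketched as follows. The function name is illustrative, and the scale is kept as `f32` here for simplicity, whereas the 34-byte `Q8Block` stores it as `f16`:

```rust
// Sketch of symmetric Q8_0 quantization for one 32-element block.
// Illustrative only; the crate's quantize_ruvltra_q8 operates on whole tensors.
fn quantize_q8_block(x: &[f32; 32]) -> (f32, [i8; 32]) {
    // Absolute maximum determines the symmetric range [-amax, amax].
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = if amax > 0.0 { amax / 127.0 } else { 1.0 };
    let mut q = [0i8; 32];
    for (qi, &v) in q.iter_mut().zip(x.iter()) {
        // Round to nearest integer step of the scale, clamped to i8 range.
        *qi = (v / scale).round().clamp(-127.0, 127.0) as i8;
    }
    (scale, q)
}

fn main() {
    let mut x = [0.0f32; 32];
    x[0] = 1.0;
    x[1] = -0.5;
    let (scale, q) = quantize_q8_block(&x);
    assert_eq!(q[0], 127); // the absolute maximum maps to the full i8 range
    // Dequantization is simply q[i] as f32 * scale.
    println!("scale = {scale}, q[1] = {}", q[1]);
}
```

Dequantization (as in `dequantize_for_ane` for the Q4 path) is the inverse: multiply each stored integer by the block's scale.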