# oxicuda-quant
Part of the OxiCUDA ecosystem: a pure-Rust CUDA replacement for the COOLJAPAN ecosystem.
## Overview
oxicuda-quant is a GPU-accelerated quantization and model compression engine for OxiCUDA. It provides a comprehensive suite of post-training quantization (PTQ), quantization-aware training (QAT), pruning, knowledge distillation, and mixed-precision analysis tools — all backed by PTX kernels for GPU-side execution.
## Features
- **Quantization schemes** (`scheme`): MinMax INT4/INT8, NF4 (QLoRA-style), FP8 E4M3/E5M2, GPTQ, and SmoothQuant, with per-tensor and per-channel granularity
- **QAT support** (`qat`): MinMax, MovingAvg, and Histogram observers; `FakeQuantize` with Straight-Through Estimator (STE) for gradient-preserving fake-quantization
- **Pruning** (`pruning`): magnitude-based unstructured pruning; structured channel, filter, and attention-head pruning
- **Knowledge distillation** (`distill`): KL-divergence, MSE, and cosine response distillation; intermediate feature distillation for layer-to-layer alignment
- **Sensitivity analysis** (`analysis`): per-layer quantization sensitivity scores, compression metrics (ratio, sparsity, bit-width), and mixed-precision policy generation
- **PTX kernels** (`ptx_kernels`): GPU kernel source strings for fake-quant, INT8 quant/dequant, NF4 encoding, and pruning mask application
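As a reference for what the MinMax INT8 symmetric scheme computes, here is a minimal plain-Rust sketch (independent of this crate's API; the function names are illustrative only): the scale is derived from the maximum absolute value, values are rounded and clamped to the INT8 range, and dequantization multiplies the code back by the scale.

```rust
// Minimal sketch of per-tensor MinMax symmetric INT8 quantization.
// NOTE: illustrative names only; this is not the oxicuda-quant API.

/// Derive the scale from the maximum absolute value: scale = max|x| / 127.
fn calibrate_scale(data: &[f32]) -> f32 {
    let max_abs = data.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 }
}

/// Quantize: round(x / scale), clamped to the symmetric INT8 range.
fn quantize(data: &[f32], scale: f32) -> Vec<i8> {
    data.iter()
        .map(|&x| (x / scale).round().clamp(-127.0, 127.0) as i8)
        .collect()
}

/// Dequantize: code * scale.
fn dequantize(codes: &[i8], scale: f32) -> Vec<f32> {
    codes.iter().map(|&c| c as f32 * scale).collect()
}

fn main() {
    let data = [0.1f32, -0.5, 0.8, 0.3];
    let scale = calibrate_scale(&data);
    let codes = quantize(&data, scale);
    let deq = dequantize(&codes, scale);
    // Round-trip error is bounded by roughly scale / 2 per element.
    for (x, y) in data.iter().zip(&deq) {
        assert!((x - y).abs() <= scale / 2.0 + f32::EPSILON);
    }
    println!("scale = {scale}, codes = {codes:?}");
}
```

The same arithmetic is what a GPU kernel applies per element; per-channel granularity simply computes one such scale per output channel instead of one per tensor.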
## Usage
Add to your `Cargo.toml`:
```toml
[dependencies]
oxicuda-quant = "0.1.0"
```
```rust
use oxicuda_quant::scheme::*;

// Calibrate an INT8 symmetric quantizer on a set of activations.
let q = int8_symmetric();
let data = vec![0.1f32, -0.5, 0.8, 0.3];
let params = q.calibrate(&data)?;
let codes = q.quantize(&data, &params)?;
let deq = q.dequantize(&codes, &params);
```
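For context on the pruning feature, the following plain-Rust sketch shows what magnitude-based unstructured pruning computes (illustrative names only, not this crate's API): build a binary mask that keeps a weight when its absolute value meets a threshold, then apply the mask elementwise.

```rust
// Minimal sketch of magnitude-based unstructured pruning.
// NOTE: illustrative names only; this is not the oxicuda-quant API.

/// Build a binary mask: 1.0 keeps the weight, 0.0 prunes it.
fn magnitude_mask(weights: &[f32], threshold: f32) -> Vec<f32> {
    weights
        .iter()
        .map(|&w| if w.abs() >= threshold { 1.0 } else { 0.0 })
        .collect()
}

/// Apply the mask elementwise (a GPU kernel does the same, one thread per weight).
fn apply_mask(weights: &[f32], mask: &[f32]) -> Vec<f32> {
    weights.iter().zip(mask).map(|(&w, &m)| w * m).collect()
}

fn main() {
    let weights = [0.02f32, -0.7, 0.001, 0.4, -0.05];
    let mask = magnitude_mask(&weights, 0.05);
    let pruned = apply_mask(&weights, &mask);
    let sparsity = mask.iter().filter(|&&m| m == 0.0).count() as f32 / mask.len() as f32;
    println!("pruned = {pruned:?}, sparsity = {sparsity}");
}
```

Structured variants (channel, filter, attention-head) zero whole slices instead of individual elements, which keeps dense kernels usable after pruning.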
## License
Apache-2.0 — © 2026 COOLJAPAN OU (Team KitaSan)