# qlora-rs
4-bit quantized LoRA (QLoRA) implementation for Rust with GGUF export.
## Overview
qlora-rs provides efficient 4-bit quantization and QLoRA training capabilities:
- **NF4 Quantization** - 4-bit NormalFloat format optimized for neural network weights
- **Double Quantization** - further compress the per-block scale factors for memory efficiency
- **QLoRA Training** - train LoRA adapters on frozen quantized base weights
- **GGUF Export** - export models for inference with llama.cpp
## Features
- 🦀 Pure Rust implementation
- 📉 ~4x memory reduction for base model weights
- ⚡ Fast quantization and dequantization
- 📦 GGUF format support for deployment
- 🔗 Integrates with peft-rs for adapter management
## Installation

```toml
[dependencies]
qlora-rs = "0.1"
```
## Quick Start

### Quantize Weights

A minimal sketch of round-tripping a weight tensor through NF4; treat `NF4Quantizer` and its methods as illustrative names rather than the exact API:
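```rust
// Illustrative sketch: `NF4Quantizer` and its methods are assumed names,
// not a confirmed API.
use qlora_rs::nf4::NF4Quantizer;

fn main() {
    // A flat fp32 weight tensor (e.g. one linear layer, row-major).
    let weights: Vec<f32> = vec![0.12, -0.55, 0.98, -0.07, 0.33, -0.91, 0.44, 0.02];

    // Quantize in blocks of 64 values, each with its own absmax scale.
    let quantizer = NF4Quantizer::new(64);
    let quantized = quantizer.quantize(&weights);

    // Round-trip to check the reconstruction error stays small.
    let restored = quantizer.dequantize(&quantized);
    for (w, r) in weights.iter().zip(restored.iter()) {
        println!("{w:+.3} -> {r:+.3}");
    }
}
```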
### QLoRA Layer

A sketch of wrapping a quantized base weight in a trainable QLoRA layer; `QLoraLinear` and its constructor are illustrative names:
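```rust
// Illustrative sketch: `QLoraLinear` and its constructor are assumed names.
use qlora_rs::nf4::NF4Quantizer;
use qlora_rs::qlora::QLoraLinear;

fn main() {
    // Quantize a frozen base weight matrix (out_features x in_features).
    let base: Vec<f32> = vec![0.0; 512 * 512];
    let quantized = NF4Quantizer::new(64).quantize(&base);

    // The 4-bit base stays frozen; only the low-rank A/B adapter
    // matrices (rank 8, scaled by alpha / rank) receive gradients.
    let layer = QLoraLinear::new(quantized, 512, 512, /* rank */ 8, /* alpha */ 16.0);

    // Forward: dequantize the base on the fly and add the adapter path.
    let x = vec![0.5f32; 512];
    let y = layer.forward(&x);
    println!("output dim: {}", y.len());
}
```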
### Export to GGUF

A sketch of writing quantized tensors to a GGUF file; `GgufExporter` and its methods are illustrative names:
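```rust
// Illustrative sketch: `GgufExporter` and its methods are assumed names.
use qlora_rs::gguf::GgufExporter;
use qlora_rs::nf4::NF4Quantizer;

fn main() -> std::io::Result<()> {
    // Quantize a tensor and register it under its llama.cpp tensor name.
    let weights: Vec<f32> = vec![0.0; 4096 * 4096];
    let quantized = NF4Quantizer::new(64).quantize(&weights);

    let mut exporter = GgufExporter::new();
    exporter.add_metadata("general.architecture", "llama");
    exporter.add_tensor("blk.0.attn_q.weight", &quantized);

    // Write a file llama.cpp can load directly.
    exporter.write("model.gguf")?;
    Ok(())
}
```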
## NF4 Quantization
NF4 (4-bit NormalFloat) uses 16 quantization levels optimized for normally-distributed data:
```text
-1.0, -0.696, -0.525, -0.395, -0.284, -0.185, -0.091, 0.0,
 0.080, 0.161, 0.246, 0.338, 0.441, 0.563, 0.723, 1.0
```
This provides better accuracy than uniform quantization for neural network weights.
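To make this concrete, here is an illustrative (not crate-exact) routine that quantizes one block with absmax scaling, as in the QLoRA paper:

```rust
// Illustrative only: how one block of weights maps onto the 16 NF4 levels.
const NF4_LEVELS: [f32; 16] = [
    -1.0, -0.696, -0.525, -0.395, -0.284, -0.185, -0.091, 0.0,
    0.080, 0.161, 0.246, 0.338, 0.441, 0.563, 0.723, 1.0,
];

/// Quantize one block: divide by the block's absmax, then pick the
/// nearest NF4 level for each value. Returns the scale and the 4-bit
/// codes (one per byte here; packed two per byte in practice).
fn quantize_block(block: &[f32]) -> (f32, Vec<u8>) {
    let scale = block.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let codes = block
        .iter()
        .map(|&w| {
            let n = if scale > 0.0 { w / scale } else { 0.0 };
            let mut best = 0u8;
            for (i, level) in NF4_LEVELS.iter().enumerate() {
                if (n - level).abs() < (n - NF4_LEVELS[best as usize]).abs() {
                    best = i as u8;
                }
            }
            best
        })
        .collect();
    (scale, codes)
}

fn main() {
    let (scale, codes) = quantize_block(&[0.9, -0.3, 0.05, -0.7]);
    println!("scale = {scale}, codes = {codes:?}");
    // Dequantization is just scale * NF4_LEVELS[code].
    for c in codes {
        println!("{:+.3}", scale * NF4_LEVELS[c as usize]);
    }
}
```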
## Memory Comparison
| Model Size | FP16 | NF4 (qlora-rs) | Reduction |
|---|---|---|---|
| 7B params | 14GB | ~4GB | 3.5x |
| 13B params | 26GB | ~7GB | 3.7x |
| 70B params | 140GB | ~35GB | 4.0x |
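The reduction falls short of a perfect 4x because each block of weights carries a scale factor and some tensors typically stay in higher precision. Assuming the block sizes from the QLoRA paper: an fp32 scale per 64-weight block adds 32/64 = 0.5 bits per weight (4.5 bits total), and double quantization shrinks that overhead to roughly 8/64 + 32/(64 * 256) ≈ 0.127 bits by storing the scales as 8-bit values with one second-level fp32 scale per 256 blocks.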
## Contributing
See workspace AGENTS.md for coding conventions.
## License
Licensed under MIT or Apache-2.0 at your option.