qlora-rs
4-bit quantized LoRA (QLoRA) implementation for Rust with dual GGUF and Candle native export.
Overview
qlora-rs provides efficient 4-bit quantization and QLoRA inference capabilities for Rust:
- NF4 Quantization - 4-bit NormalFloat format optimized for neural network weights
- Double Quantization - Further compress scale factors for additional memory efficiency
- Advanced Quantization - Per-channel and zero-point asymmetric quantization strategies
- QLoRA Inference Layer - Forward pass with frozen quantized weights + LoRA adapters
- Dual Export Formats - GGUF (llama.cpp compatible) and Candle native (QNAT) formats
Status: Alpha - Active Development. Core quantization and inference are functional. Training support planned for Phase 2.
Features
- 🦀 Pure Rust implementation
- 📉 ~4x expected memory reduction for base model weights
- ⚡ Fast quantization and dequantization
- 📦 Dual export: GGUF format (llama.cpp) and Candle native (QNAT)
- 🔗 Integrates with peft-rs for LoRA adapter management
- ✅ 24/24 tests passing (100% coverage)
Installation
[dependencies]
qlora-rs = "0.1"
Quick Start
Quantize Weights
use ;
use ;
QLoRA Layer
use ;
use ;
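Conceptually, a QLoRA layer adds a low-rank LoRA update to the output of the frozen base weights: y = W x + (alpha / r) B (A x). The following is a minimal, library-independent sketch of that forward pass. It does not use the qlora-rs API: `matvec`, `qlora_forward`, and all parameter names are hypothetical, and `w` is assumed to have already been dequantized to `f32`.

```rust
/// Row-major matrix-vector product: `m` holds `rows` rows of `x.len()` columns.
fn matvec(m: &[f32], rows: usize, x: &[f32]) -> Vec<f32> {
    let cols = x.len();
    assert_eq!(m.len(), rows * cols, "matrix shape does not match input");
    (0..rows)
        .map(|r| {
            m[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(a, b)| a * b)
                .sum::<f32>()
        })
        .collect()
}

/// Hypothetical QLoRA forward pass: y = W x + (alpha / rank) * B (A x).
/// `w` is the frozen base weight (out_dim x in_dim), assumed already dequantized;
/// `a` (rank x in_dim) and `b` (out_dim x rank) are the LoRA adapter matrices.
fn qlora_forward(
    w: &[f32],
    a: &[f32],
    b: &[f32],
    out_dim: usize,
    rank: usize,
    alpha: f32,
    x: &[f32],
) -> Vec<f32> {
    let base = matvec(w, out_dim, x); // W x
    let down = matvec(a, rank, x); // A x, length = rank
    let up = matvec(b, out_dim, &down); // B (A x), length = out_dim
    let scaling = alpha / rank as f32;
    base.iter().zip(&up).map(|(y, d)| y + scaling * d).collect()
}

fn main() {
    // 2x2 identity base weight, rank-1 adapter.
    let w = [1.0, 0.0, 0.0, 1.0];
    let a = [1.0, 1.0];
    let b = [1.0, 2.0];
    let y = qlora_forward(&w, &a, &b, 2, 1, 2.0, &[3.0, 4.0]);
    println!("y = {y:?}");
}
```

Only `a` and `b` would be updated during fine-tuning; the base weight stays frozen and can therefore remain in its 4-bit form between forward passes.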
Export to GGUF
use ;
use ;
NF4 Quantization
NF4 (4-bit NormalFloat) uses 16 quantization levels optimized for normally-distributed data:
-1.0, -0.696, -0.525, -0.395, -0.284, -0.185, -0.091, 0.0,
0.080, 0.161, 0.246, 0.338, 0.441, 0.563, 0.723, 1.0
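Because the levels are spaced more densely near zero, where most weights lie, NF4 gives lower quantization error than uniform 4-bit quantization for approximately normally distributed neural network weights.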
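To make the lookup concrete, here is a minimal sketch of block-wise NF4 quantization using the 16 levels listed above. It is an illustration, not the qlora-rs implementation: the function names are hypothetical, absmax scaling per block is an assumption, and codes are stored one per byte for readability rather than packed two per byte.

```rust
/// The 16 NF4 levels listed above (rounded to three decimals).
const NF4_LEVELS: [f32; 16] = [
    -1.0, -0.696, -0.525, -0.395, -0.284, -0.185, -0.091, 0.0,
    0.080, 0.161, 0.246, 0.338, 0.441, 0.563, 0.723, 1.0,
];

/// Index of the NF4 level closest to `x` (expects `x` in [-1, 1]).
fn nearest_level(x: f32) -> u8 {
    let mut best = 0usize;
    for (i, &level) in NF4_LEVELS.iter().enumerate() {
        if (x - level).abs() < (x - NF4_LEVELS[best]).abs() {
            best = i;
        }
    }
    best as u8
}

/// Quantize one block: returns the absmax scale and one 4-bit code per weight.
/// Assumption: each block is normalized by its largest absolute value.
fn quantize_block(block: &[f32]) -> (f32, Vec<u8>) {
    let absmax = block.iter().fold(0.0f32, |m, &w| m.max(w.abs()));
    // Avoid dividing by zero for an all-zero block.
    let scale = if absmax == 0.0 { 1.0 } else { absmax };
    let codes = block.iter().map(|&w| nearest_level(w / scale)).collect();
    (scale, codes)
}

/// Reverse the mapping: look up each code and multiply by the block scale.
fn dequantize_block(scale: f32, codes: &[u8]) -> Vec<f32> {
    codes.iter().map(|&c| NF4_LEVELS[c as usize] * scale).collect()
}

fn main() {
    let weights = [0.5f32, -1.0, 0.0, 2.0];
    let (scale, codes) = quantize_block(&weights);
    let restored = dequantize_block(scale, &codes);
    println!("scale = {scale}, codes = {codes:?}");
    println!("restored = {restored:?}");
}
```

The restored values differ slightly from the inputs wherever a normalized weight falls between two levels; that rounding error is the cost of storing 4 bits per weight.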
Expected Memory Reduction
Theoretical memory usage based on NF4 quantization (actual results may vary):
| Model Size | FP16 | NF4 (Expected) | Reduction |
|---|---|---|---|
| 7B params | 14GB | ~4GB | 3.5x |
| 13B params | 26GB | ~7GB | 3.7x |
| 70B params | 140GB | ~35GB | 4.0x |
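The arithmetic behind such estimates can be sketched as follows. FP16 uses 16 bits per weight and the NF4 payload uses 4; storing one scale per block adds a small overhead on top. The block size of 64 and the 32-bit scale below are assumptions for illustration, not confirmed qlora-rs defaults, and the function names are hypothetical.

```rust
/// Size in GB (10^9 bytes) of `params` weights stored at `bits_per_weight`.
fn size_gb(params: f64, bits_per_weight: f64) -> f64 {
    params * bits_per_weight / 8.0 / 1e9
}

/// Effective bits per weight for 4-bit codes plus one scale of `scale_bits`
/// shared by every `block_size` weights.
fn nf4_bits_per_weight(block_size: f64, scale_bits: f64) -> f64 {
    4.0 + scale_bits / block_size
}

fn main() {
    // Assumed layout: blocks of 64 weights, one 32-bit scale per block.
    let bits_with_scales = nf4_bits_per_weight(64.0, 32.0);
    for params in [7e9, 13e9, 70e9] {
        let fp16 = size_gb(params, 16.0);
        let payload = size_gb(params, 4.0);
        let with_scales = size_gb(params, bits_with_scales);
        println!(
            "{:>4.0}B params: FP16 {:.1} GB, NF4 payload {:.1} GB, with scales {:.1} GB",
            params / 1e9,
            fp16,
            payload,
            with_scales
        );
    }
}
```

Double quantization shrinks the per-block scale overhead, which moves the total closer to the 4-bit payload figure.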
Known Issues
- Unmaintained paste dependency: The paste crate (used by gemm → candle-core) is unmaintained (RUSTSEC-2024-0436). This is a transitive dependency and does not affect functionality.

  Solution: A maintained fork, qlora-paste (v1.0.17), has been created and published to crates.io. To resolve this issue:

  - The gemm-fork/ directory contains patched versions of the gemm crates that use qlora-paste instead of paste.
  - To use them in your project, add the following to your workspace root Cargo.toml:

        [patch.crates-io]
        gemm = { path = "qlora-rs/gemm-fork/gemm" }
        gemm-common = { path = "qlora-rs/gemm-fork/gemm-common" }
        gemm-f16 = { path = "qlora-rs/gemm-fork/gemm-f16" }
        gemm-f32 = { path = "qlora-rs/gemm-fork/gemm-f32" }
        gemm-f64 = { path = "qlora-rs/gemm-fork/gemm-f64" }
        gemm-c32 = { path = "qlora-rs/gemm-fork/gemm-c32" }
        gemm-c64 = { path = "qlora-rs/gemm-fork/gemm-c64" }

  The security audit warning is currently ignored in CI because the crate remains functional. Future updates to Candle may resolve this.
Contributing
See workspace AGENTS.md for coding conventions.
License
Dual licensed under MIT OR Apache-2.0 at your option.