qlora-rs 1.0.5


qlora-rs

4-bit quantized LoRA (QLoRA) implementation for Rust with dual GGUF and Candle native export.


Overview

qlora-rs provides efficient 4-bit quantization and QLoRA inference capabilities for Rust:

  • NF4 Quantization - 4-bit NormalFloat format optimized for neural network weights
  • Double Quantization - Further compress scale factors for additional memory efficiency
  • Advanced Quantization - Per-channel and zero-point asymmetric quantization strategies
  • QLoRA Inference Layer - Forward pass with frozen quantized weights + LoRA adapters
  • Dual Export Formats - GGUF (llama.cpp compatible) and Candle native (QNAT) formats

Status: 1.0.0 Release - Core quantization, QLoRA inference, training support, and dual export formats are fully functional.

Features

  • 🦀 Pure Rust implementation
  • 📉 ~4x expected memory reduction for base model weights
  • ⚡ Fast quantization and dequantization
  • 📦 Dual export: GGUF format (llama.cpp) and Candle native (QNAT)
  • 🔗 Integrates with peft-rs for LoRA adapter management
  • ✅ 59/59 tests passing

Installation

[dependencies]
qlora-rs = "1.0"

Quick Start

Quantize Weights

use qlora_rs::{quantize_nf4, dequantize_nf4};
use candle_core::{Device, Tensor};

fn main() -> anyhow::Result<()> {
    let device = Device::Cpu;
    
    // Create some weights
    let weights = Tensor::randn(0f32, 1.0, (4096, 4096), &device)?;
    
    // Quantize to 4-bit NF4
    let quantized = quantize_nf4(&weights, 64)?;  // block_size = 64
    
    println!("Original size: {} bytes", 4096 * 4096 * 4);
    println!("Quantized size: {} bytes", quantized.size_bytes());
    
    // Dequantize for computation
    let _restored = dequantize_nf4(&quantized, &device)?;
    
    Ok(())
}

QLoRA Layer

use qlora_rs::{QLoraConfig, QuantizedLinear};
use candle_core::{Device, Tensor, DType};

fn main() -> anyhow::Result<()> {
    let device = Device::Cpu;
    let config = QLoraConfig::default();
    
    // Create layer from existing weights
    let weights = Tensor::randn(0f32, 1.0, (768, 768), &device)?;
    let layer = QuantizedLinear::from_weight(&weights, None, config, &device)?;
    
    // Forward pass
    let input = Tensor::zeros(&[1, 10, 768], DType::F32, &device)?;
    let output = layer.forward(&input)?;
    
    println!("Trainable parameters: {}", layer.num_trainable_parameters());
    
    Ok(())
}
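Conceptually, the forward pass computes y = dequant(W)·x + (α/r)·B(A·x): the 4-bit base weight stays frozen, and only the small A and B adapter matrices train. A hand-rolled sketch on plain vectors (the names and the α/r scaling convention follow the QLoRA paper, not necessarily this crate's API):

```rust
/// Naive matrix-vector product over row-major `Vec<Vec<f32>>`.
fn matvec(w: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum())
        .collect()
}

/// QLoRA forward on a single vector: the frozen (already dequantized)
/// base weight plus the scaled low-rank LoRA update B(Ax).
fn qlora_forward(
    w: &[Vec<f32>], // frozen base weight, dequantized from 4-bit
    a: &[Vec<f32>], // LoRA down-projection (r x in_features), trainable
    b: &[Vec<f32>], // LoRA up-projection (out_features x r), trainable
    x: &[f32],
    alpha: f32,
    r: f32,
) -> Vec<f32> {
    let base = matvec(w, x);
    let update = matvec(b, &matvec(a, x)); // rank-r bottleneck
    base.iter()
        .zip(&update)
        .map(|(y, u)| y + (alpha / r) * u)
        .collect()
}

fn main() {
    let w = vec![vec![1.0f32, 0.0], vec![0.0, 1.0]]; // identity base
    let a = vec![vec![1.0f32, 0.0]]; // r = 1
    let b = vec![vec![0.0f32], vec![1.0]];
    let y = qlora_forward(&w, &a, &b, &[1.0, 2.0], 2.0, 1.0);
    println!("y = {y:?}");
}
```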

Export to GGUF

use qlora_rs::{quantize_nf4, export_gguf};
use candle_core::{Device, Tensor};

fn main() -> anyhow::Result<()> {
    let device = Device::Cpu;
    
    // Quantize model weights
    let q_proj = Tensor::randn(0f32, 1.0, (4096, 4096), &device)?;
    let q_proj_quantized = quantize_nf4(&q_proj, 64)?;
    
    // Export to GGUF
    export_gguf(
        &[("model.layers.0.self_attn.q_proj.weight", &q_proj_quantized)],
        "model.gguf",
    )?;
    
    Ok(())
}

NF4 Quantization

NF4 (4-bit NormalFloat) uses 16 quantization levels optimized for normally-distributed data:

-1.0, -0.696, -0.525, -0.395, -0.284, -0.185, -0.091, 0.0,
 0.080, 0.161, 0.246, 0.338, 0.441, 0.563, 0.723, 1.0

This provides better accuracy than uniform quantization for neural network weights.
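To illustrate the idea, here is a hand-rolled single-block sketch (not the crate's implementation): each block is divided by its absmax, and every value is mapped to the nearest of the 16 levels.

```rust
/// The 16 NF4 levels (rounded), normalized to [-1, 1].
const NF4_LEVELS: [f32; 16] = [
    -1.0, -0.696, -0.525, -0.395, -0.284, -0.185, -0.091, 0.0,
    0.080, 0.161, 0.246, 0.338, 0.441, 0.563, 0.723, 1.0,
];

/// Quantize one block: divide by the block's absmax, then store the
/// index of the nearest NF4 level for each value.
fn quantize_block(block: &[f32]) -> (f32, Vec<u8>) {
    let absmax = block.iter().fold(0f32, |m, v| m.max(v.abs()));
    let scale = if absmax == 0.0 { 1.0 } else { absmax };
    let codes = block
        .iter()
        .map(|v| {
            let x = v / scale;
            let mut best = 0u8;
            let mut best_dist = f32::INFINITY;
            for (i, level) in NF4_LEVELS.iter().enumerate() {
                let dist = (*level - x).abs();
                if dist < best_dist {
                    best_dist = dist;
                    best = i as u8;
                }
            }
            best
        })
        .collect();
    (scale, codes)
}

/// Dequantize: look up each level and multiply back by the block scale.
fn dequantize_block(scale: f32, codes: &[u8]) -> Vec<f32> {
    codes.iter().map(|&c| NF4_LEVELS[c as usize] * scale).collect()
}

fn main() {
    let block = [0.5f32, -2.0, 0.0, 1.0];
    let (scale, codes) = quantize_block(&block);
    println!("scale = {scale}, codes = {codes:?}");
    println!("restored = {:?}", dequantize_block(scale, &codes));
}
```

Because the levels cluster near zero, small weights (the bulk of a normal distribution) get finer resolution than they would under uniform 4-bit quantization.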

Expected Memory Reduction

Theoretical memory usage based on NF4 quantization (actual results may vary):

| Model Size | FP16   | NF4 (Expected) | Reduction |
|------------|--------|----------------|-----------|
| 7B params  | 14 GB  | ~4 GB          | 3.5x      |
| 13B params | 26 GB  | ~7 GB          | 3.7x      |
| 70B params | 140 GB | ~35 GB         | 4.0x      |
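The overhead above 4 bits per weight comes from the per-block scales. Assuming one f32 scale per 64-value block and no double quantization (an illustrative accounting, not this crate's exact layout), the effective rate is 4.5 bits per weight, or roughly 3.6x versus FP16; the figures above are rounded to whole gigabytes.

```rust
/// Expected bytes for NF4-quantized weights: 4 bits (half a byte) per
/// value, plus one f32 scale per block. Double quantization would
/// shrink the scale term further.
fn nf4_bytes(params: u64, block_size: u64) -> u64 {
    params / 2 + (params / block_size) * 4
}

fn main() {
    for params in [7_000_000_000u64, 13_000_000_000, 70_000_000_000] {
        let fp16 = params * 2; // 2 bytes per weight
        let nf4 = nf4_bytes(params, 64);
        println!(
            "{}B params: FP16 {:.1} GB -> NF4 {:.1} GB ({:.2}x)",
            params / 1_000_000_000,
            fp16 as f64 / 1e9,
            nf4 as f64 / 1e9,
            fp16 as f64 / nf4 as f64
        );
    }
}
```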

Known Issues

  • Unmaintained paste dependency: The paste crate (used by gemm → candle-core) is unmaintained (RUSTSEC-2024-0436). This is a transitive dependency and does not affect functionality.

    Solution: A maintained fork qlora-paste (v1.0.17) has been created and published to crates.io. To resolve this issue:

    1. The gemm-fork/ directory contains a patched version of the gemm crates that use qlora-paste instead of paste.

    2. To use this in your project, add the following to your workspace root Cargo.toml:

      [patch.crates-io]
      gemm = { path = "qlora-rs/gemm-fork/gemm" }
      gemm-common = { path = "qlora-rs/gemm-fork/gemm-common" }
      gemm-f16 = { path = "qlora-rs/gemm-fork/gemm-f16" }
      gemm-f32 = { path = "qlora-rs/gemm-fork/gemm-f32" }
      gemm-f64 = { path = "qlora-rs/gemm-fork/gemm-f64" }
      gemm-c32 = { path = "qlora-rs/gemm-fork/gemm-c32" }
      gemm-c64 = { path = "qlora-rs/gemm-fork/gemm-c64" }
      

    The security audit warning is currently ignored in CI as the crate remains functional. Future updates to Candle may resolve this.

Contributing

See workspace AGENTS.md for coding conventions.

License

Dual licensed under MIT OR Apache-2.0 at your option.