
bitnet-quantize

Microsoft BitNet b1.58 implementation in Rust with ternary weight quantization.


Overview

bitnet-quantize implements the BitNet b1.58 architecture for efficient neural network inference:

  • Ternary Weights: Quantized to {-1, 0, +1} using AbsMean
  • INT8 Activations: Per-token AbsMax quantization
  • BitLinear Layer: Drop-in replacement for nn::Linear
  • Straight-Through Estimator: For training with quantization
  • peft-rs Integration: Use as a PEFT adapter
  • GGUF Export: Compatible with llama.cpp

Installation

Add to your Cargo.toml:

[dependencies]
bitnet-quantize = "0.1"

Optional Features

[dependencies]
bitnet-quantize = { version = "0.1", features = ["cuda", "peft", "gguf-export"] }

Feature       Description
cuda          GPU acceleration via CubeCL
peft          peft-rs adapter integration
gguf-export   Export to GGUF format

Quick Start

use bitnet_quantize::{BitLinear, BitNetConfig};
use candle_core::{Device, Tensor};
use candle_nn::Module;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let device = Device::Cpu;
    let config = BitNetConfig::default();

    // Create layer from existing weights
    let weight = Tensor::randn(0.0f32, 1.0, (512, 256), &device)?;
    let layer = BitLinear::from_weight(&weight, None, &config)?;

    // Forward pass
    let input = Tensor::randn(0.0f32, 1.0, (4, 256), &device)?;
    let output = layer.forward(&input)?;

    println!("Input shape: {:?}", input.shape());
    println!("Output shape: {:?}", output.shape());
    println!("Compression ratio: {:.2}x", layer.compression_ratio());
    println!("Weight sparsity: {:.1}%", layer.sparsity() * 100.0);

    Ok(())
}

BitNet b1.58 Algorithm

Weight Quantization (AbsMean)

Weights are quantized to ternary values:

W_q = round(W / mean(|W|)), clamped to {-1, 0, +1}

  • Values near zero become 0 (sparse)
  • Large positive values become +1
  • Large negative values become -1
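
For reference, here is a minimal standalone sketch of the AbsMean rule on a flat f32 slice. It is plain Rust, independent of the crate's kernels, and ternary_quantize is a hypothetical helper, not part of the public API:

// Hypothetical helper illustrating the AbsMean ternary rule (not the crate's API)
fn ternary_quantize(w: &[f32]) -> (Vec<i8>, f32) {
    // Scale is the mean absolute value of the weights; guard against all-zero input
    let scale = (w.iter().map(|x| x.abs()).sum::<f32>() / w.len() as f32).max(f32::EPSILON);
    let q = w
        .iter()
        .map(|x| (x / scale).round().clamp(-1.0, 1.0) as i8)
        .collect();
    (q, scale)
}

fn main() {
    let w = [0.03f32, -0.9, 0.4, -0.02, 1.7];
    let (q, scale) = ternary_quantize(&w);
    // Values near zero map to 0, large magnitudes to +/-1
    println!("scale = {scale:.3}, q = {q:?}");
}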

Activation Quantization (AbsMax)

Activations are quantized per token to INT8:

X_q = round(X * 127 / max(|X|)), clamped to [-127, +127]
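
And a matching sketch for the per-token AbsMax step; int8_quantize_token is again a hypothetical helper operating on one token (one row of the activation matrix), and the crate's actual kernels may differ:

// Hypothetical helper: AbsMax INT8 quantization for a single token
fn int8_quantize_token(x: &[f32]) -> (Vec<i8>, f32) {
    // One scale per token, derived from that token's largest magnitude
    let max_abs = x.iter().fold(0.0f32, |m, v| m.max(v.abs())).max(f32::EPSILON);
    let scale = 127.0 / max_abs;
    let q = x
        .iter()
        .map(|v| (v * scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    // Dequantize later with q[i] as f32 / scale
    (q, scale)
}

Because each token gets its own scale, a single token with large outliers does not crush the resolution of every other token in the batch.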

Compression Benefits

Ternary weights carry log2(3) ≈ 1.58 bits of information (hence "b1.58"), but in practice they are packed at 2 bits per weight:

Original         Quantized        Compression
FP32 (32 bits)   2 bits/weight    16x
FP16 (16 bits)   2 bits/weight    8x
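
As a quick sanity check on the table, here is the arithmetic for a hypothetical 1024x4096 layer, ignoring the small per-group scale overhead:

fn main() {
    let params = 1024 * 4096;           // weights in the layer
    let fp16_bytes = params * 2;        // 16 bits per weight
    let packed_bytes = params * 2 / 8;  // 2 bits per weight, packed
    println!("FP16:   {} KiB", fp16_bytes / 1024);   // 8192 KiB
    println!("Packed: {} KiB", packed_bytes / 1024); // 1024 KiB, i.e. 8x smaller
}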

Configuration

use bitnet_quantize::BitNetConfig;

let config = BitNetConfig::builder()
    .group_size(128)           // Weights per scale group
    .activation_bits(8)        // INT8 activations
    .per_token(true)           // Per-token scaling
    .use_ste(true)             // Straight-Through Estimator
    .build()?;
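
The resulting config plugs into the same constructor shown in Quick Start. A sketch, assuming the builder's error type converts into Box<dyn std::error::Error>:

use bitnet_quantize::{BitLinear, BitNetConfig};
use candle_core::{Device, Tensor};

fn build_layer() -> Result<BitLinear, Box<dyn std::error::Error>> {
    let device = Device::Cpu;
    let config = BitNetConfig::builder()
        .group_size(128)   // weights per scale group, as above
        .build()?;
    let weight = Tensor::randn(0.0f32, 1.0, (512, 256), &device)?;
    // Same constructor as in Quick Start, now with a custom config
    Ok(BitLinear::from_weight(&weight, None, &config)?)
}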

Training with STE

The Straight-Through Estimator enables training through quantization:

use bitnet_quantize::layer::{ternary_ste, int8_ste};

// Forward: quantize to ternary
let quantized = ternary_ste(&weights)?;

// Backward: gradients pass through unchanged
// (handled automatically by Candle's autograd)
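
Conceptually, the estimator behaves as in the framework-free sketch below (ste_forward and ste_backward are illustrative names; in the crate the backward rule is supplied through Candle's autograd, as noted above):

// Forward: the layer computes with the quantized (ternary) weight
fn ste_forward(w: f32, scale: f32) -> f32 {
    (w / scale).round().clamp(-1.0, 1.0)
}

// Backward: round/clamp have zero gradient almost everywhere, so the STE
// treats quantization as the identity and passes the incoming gradient
// straight through to the latent full-precision weight
fn ste_backward(grad_wrt_quantized: f32) -> f32 {
    grad_wrt_quantized
}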

peft-rs Integration

Use BitNet as a PEFT adapter:

use bitnet_quantize::BitNetAdapter;
use peft_rs::Adapter;

let adapter = BitNetAdapter::new(config)?;
let adapted_weight = adapter.forward(&base_weight)?;

Performance

Benchmarks on CPU (Intel i7):

Layer Size   Forward Pass   Quantization
256x512      0.8ms          0.2ms
512x1024     2.1ms          0.5ms
1024x4096    12ms           2.1ms

Run benchmarks:

cargo bench -p bitnet-quantize

Documentation

Full API documentation: docs.rs/bitnet-quantize

References

  • Ma, S., et al. (2024). "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits"
  • Wang, H., et al. (2023). "BitNet: Scaling 1-bit Transformers for Large Language Models"

License

MIT License - see LICENSE for details.