# bitnet-quantize
Microsoft BitNet b1.58 implementation in Rust with ternary weight quantization.
## Overview
bitnet-quantize implements the BitNet b1.58 architecture for efficient neural network inference:
- Ternary Weights: Quantized to {-1, 0, +1} using AbsMean
- INT8 Activations: Per-token AbsMax quantization
- BitLinear Layer: Drop-in replacement for `nn::Linear`
- Straight-Through Estimator: For training through quantization
- peft-rs Integration: Use as a PEFT adapter
- GGUF Export: Compatible with llama.cpp
## Installation
Add to your Cargo.toml:
```toml
[dependencies]
bitnet-quantize = "0.1"
```
### Optional Features
```toml
[dependencies]
bitnet-quantize = { version = "0.1", features = ["cuda", "peft", "gguf-export"] }
```
| Feature | Description |
|---|---|
| `cuda` | GPU acceleration via CubeCL |
| `peft` | peft-rs adapter integration |
| `gguf-export` | Export to GGUF format |
## Quick Start
```rust
use bitnet_quantize::{BitLinear, BitNetConfig};
use candle_core::{Device, Tensor};
use candle_nn::Module;
```
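Continuing from those imports, here is a minimal sketch of how the pieces fit together. The `BitNetConfig::default()` call and the `BitLinear::new(in_features, out_features, &config, &device)` constructor signature are assumptions for illustration; check docs.rs/bitnet-quantize for the actual API.

```rust
let device = Device::Cpu;
let config = BitNetConfig::default();

// Hypothetical constructor; see the API docs for the real signature.
let layer = BitLinear::new(512, 1024, &config, &device)?;

// BitLinear implements candle_nn::Module, so it forwards like nn::Linear.
let input = Tensor::randn(0f32, 1f32, (1, 512), &device)?;
let output = layer.forward(&input)?;
println!("{:?}", output.shape()); // [1, 1024]
```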
## BitNet b1.58 Algorithm

### Weight Quantization (AbsMean)
Weights are quantized to ternary values:
```
W_q = clamp(round(W / mean(|W|)), -1, +1)
```
- Values near zero become 0 (sparse)
- Large positive values become +1
- Large negative values become -1
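A minimal, dependency-free sketch of the same math on a flat slice of weights (the crate itself operates on Candle tensors; the function name here is just for illustration):

```rust
/// AbsMean ternary quantization: scale by the mean absolute value,
/// round, and clamp to {-1, 0, +1}. Returns the quantized values and
/// the scale needed to approximately reconstruct the originals.
fn quantize_absmean(weights: &[f32]) -> (Vec<i8>, f32) {
    let scale = weights.iter().map(|w| w.abs()).sum::<f32>() / weights.len() as f32;
    let quantized = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-1.0, 1.0) as i8)
        .collect();
    (quantized, scale)
}

fn main() {
    let (q, scale) = quantize_absmean(&[0.9, -0.05, -1.2, 0.3]);
    // Values near zero become 0, large magnitudes saturate to ±1.
    println!("{q:?} (scale = {scale})"); // [1, 0, -1, 0]
}
```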
### Activation Quantization (AbsMax)
Activations are quantized to INT8 per-token:
```
X_q = clamp(round(X * 127 / max(|X|)), -127, +127)
```
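A matching sketch for a single token's activation vector; with per-token scaling, each token (row) gets its own scale like this:

```rust
/// AbsMax INT8 quantization for one token's activations: scale so the
/// largest magnitude maps to 127, then round and clamp.
fn quantize_absmax(activations: &[f32]) -> (Vec<i8>, f32) {
    // Small floor avoids division by zero for an all-zero token.
    let max_abs = activations
        .iter()
        .fold(0.0f32, |m, x| m.max(x.abs()))
        .max(1e-5);
    let scale = 127.0 / max_abs;
    let quantized = activations
        .iter()
        .map(|x| (x * scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    // Return max_abs / 127 so callers can dequantize: x ≈ q * (max_abs / 127).
    (quantized, max_abs / 127.0)
}
```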
### Compression Benefits
| Original | Quantized | Compression |
|---|---|---|
| FP32 (32 bits) | 2 bits/weight | 16x |
| FP16 (16 bits) | 2 bits/weight | 8x |
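As a worked example, a 7B-parameter model needs roughly 28 GB for its weights in FP32 (7 × 10⁹ × 4 bytes) but only about 1.75 GB at 2 bits/weight.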
## Configuration
```rust
use bitnet_quantize::BitNetConfig;

let config = BitNetConfig::builder()
    .group_size(128)      // Weights per scale group (illustrative value)
    .activation_bits(8)   // INT8 activations
    .per_token(true)      // Per-token scaling
    .use_ste(true)        // Straight-Through Estimator
    .build()?;
```
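Smaller group sizes give each scale factor fewer weights to cover, which generally preserves accuracy better at the cost of storing more scale metadata.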
## Training with STE
The Straight-Through Estimator enables training through quantization:
```rust
use bitnet_quantize::ternary_ste;

// Forward: quantize to ternary (`weights` is a full-precision Candle tensor)
let quantized = ternary_ste(&weights)?;

// Backward: gradients pass through unchanged
// (handled automatically by Candle's autograd)
```
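For intuition, here is a dependency-free sketch of what the estimator does conceptually; the crate wires the same behavior into Candle's autograd, so you never call a backward function yourself.

```rust
/// Forward pass: AbsMean ternary quantization (non-differentiable on its own).
fn ste_forward(w: &[f32]) -> Vec<f32> {
    let mean_abs = w.iter().map(|x| x.abs()).sum::<f32>() / w.len() as f32;
    w.iter()
        .map(|x| (x / mean_abs).round().clamp(-1.0, 1.0))
        .collect()
}

/// Backward pass: the quantizer is treated as the identity function,
/// so the upstream gradient reaches the full-precision weights unchanged.
fn ste_backward(upstream_grad: &[f32]) -> Vec<f32> {
    upstream_grad.to_vec()
}
```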
## peft-rs Integration
Use BitNet as a PEFT adapter:
```rust
use bitnet_quantize::BitNetAdapter;
use peft_rs::Adapter;

// `config` is a BitNetConfig as shown above; `base_weight` is the
// frozen weight tensor being adapted.
let adapter = BitNetAdapter::new(config)?;
let adapted_weight = adapter.forward(&base_weight)?;
```
## Performance
Benchmarks on CPU (Intel i7):
| Layer Size | Forward Pass | Quantization |
|---|---|---|
| 256x512 | 0.8ms | 0.2ms |
| 512x1024 | 2.1ms | 0.5ms |
| 1024x4096 | 12ms | 2.1ms |
Run benchmarks:
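```bash
cargo bench
```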
## Documentation

Full API documentation: [docs.rs/bitnet-quantize](https://docs.rs/bitnet-quantize)

## References
- Ma, S., et al. (2024). "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits"
- Wang, H., et al. (2023). "BitNet: Scaling 1-bit Transformers for Large Language Models"
## License
MIT License - see LICENSE for details.