# vsa-optim-rs
Deterministic training optimization using Vector Symbolic Architecture (VSA), ternary quantization, and closed-form gradient prediction.
A pure Rust implementation enabling efficient large model fine-tuning on consumer hardware through mathematically principled gradient compression and prediction.
## Key Properties
- Deterministic: Identical inputs produce identical outputs—no stochastic variance in predictions
- Closed-form: Weighted least squares with Cramer's rule—no iterative optimization
- Memory-efficient: ~90% gradient storage reduction via VSA compression
- Compute-efficient: ~80% backward pass reduction via gradient prediction
## Installation

```toml
[dependencies]
vsa-optim-rs = "0.1"
```
## Quick Start

### Deterministic Phase Training (Recommended)
The DeterministicPhaseTrainer orchestrates training through mathematically
rigorous phases with guaranteed reproducibility:
```rust
use vsa_optim_rs::{DeterministicPhaseConfig, DeterministicPhaseTrainer}; // crate path assumed from the crate name
use candle_core::Device;
use std::collections::HashMap; // typically used to key gradients by parameter name

// Define parameter shapes (concrete names and shapes are illustrative)
let shapes = vec![("model.layer.weight".to_string(), vec![768usize, 768])];

// Configure deterministic training (see Configuration Reference below)
let config = DeterministicPhaseConfig::default();
let mut trainer = DeterministicPhaseTrainer::new(config, &shapes, &Device::Cpu)?; // constructor arguments assumed

// Training loop
for step in 0..100 {
    // forward pass, loss, and the trainer-driven backward/predict step go here;
    // predicted-phase steps skip the backward pass entirely
}

let stats = trainer.get_stats();
println!("{stats:?}");
```
### VSA Gradient Compression
Compress gradients using hyperdimensional computing with bind/bundle/unbind operations:
```rust
use vsa_optim_rs::{VSAConfig, VSAGradientCompressor}; // crate path and compressor type name assumed

let config = VSAConfig::builder()
    .dimension(10_000)       // Hypervector dimension
    .compression_ratio(10.0) // 10x compression target
    .seed(42)                // Reproducible projections
    .build();

let param_shapes = vec![("model.layer.weight".to_string(), vec![768usize, 768])];
let mut compressor = VSAGradientCompressor::new(config, &param_shapes)?; // constructor arguments assumed

// Compress gradients (`gradients` is the per-parameter gradient map from your training loop)
let compressed = compressor.compress(&gradients)?;
println!("compressed {} parameter tensors", param_shapes.len());

// Decompress when needed
let restored = compressor.decompress(&compressed)?;
```
### Ternary Gradient Accumulation
Memory-efficient accumulation using balanced ternary {-1, 0, +1}:
```rust
use vsa_optim_rs::{TernaryAccumulator, TernaryConfig}; // crate path and accumulator type name assumed

let config = TernaryConfig::builder()
    .accumulation_steps(8)         // micro-batches per optimizer step (illustrative value)
    .use_stochastic_rounding(true) // Unbiased quantization
    .build();

let mut accumulator = TernaryAccumulator::new(config, &param_shapes)?; // constructor arguments assumed

for micro_batch in micro_batches {
    // backward pass on the micro-batch, then fold its gradients into the accumulator
}

// Retrieve accumulated gradients for optimizer step
let accumulated = accumulator.get_accumulated()?;
optimizer.step(&accumulated)?;
accumulator.reset()?;
```
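The stochastic rounding that keeps this accumulation unbiased can be pictured with a short standalone sketch. This is an illustration only, not the crate's internal code: `quantize_ternary`, the LCG, and the absmean scaling choice (as in BitNet b1.58) are assumptions made for the example.

```rust
/// Deterministic linear congruential generator, so repeated runs with the same
/// seed produce identical results (illustrative helper, not the crate's RNG).
fn lcg_uniform(state: &mut u64) -> f32 {
    *state = state.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
    ((*state >> 40) as f32) / (1u64 << 24) as f32
}

/// Scale by the mean absolute value (absmean, as in BitNet b1.58) and round
/// stochastically so the quantized value is unbiased in expectation.
fn quantize_ternary(grads: &[f32], state: &mut u64) -> (Vec<i8>, f32) {
    let scale = grads.iter().map(|g| g.abs()).sum::<f32>() / grads.len() as f32;
    let mut out = Vec::with_capacity(grads.len());
    for &g in grads {
        let x = (g / scale).clamp(-1.0, 1.0); // normalized value in [-1, 1]
        let lower = x.floor();                // -1.0, 0.0, or 1.0
        // Round up with probability equal to the fractional part.
        let q = if lcg_uniform(state) < x - lower { lower + 1.0 } else { lower };
        out.push(q as i8);
    }
    (out, scale)
}

fn main() {
    let grads = [0.03f32, -0.12, 0.0, 0.25, -0.07];
    let mut state = 42u64;
    let (ternary, scale) = quantize_ternary(&grads, &mut state);
    println!("scale = {scale:.4}, ternary = {ternary:?}"); // entries are -1, 0, or +1
}
```

Because the rounding probability equals the fractional part, the quantized value times the scale matches the original gradient in expectation, which is what keeps accumulation over many micro-batches unbiased.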
## Architecture

### Deterministic Gradient Prediction
The core innovation: predict gradients using weighted least squares model fitting with a closed-form solution (no iterative optimization):
Gradient Model: g(t) = baseline + velocity × t + residual
Where:
- baseline: Weighted mean of historical gradients
- velocity: Gradient change rate (fitted via normal equations)
- residual: Exponentially-averaged prediction error for drift correction
Warmup Phase: Collect initial gradient samples to establish prediction baseline.
Prediction Fitting: Solve normal equations using Cramer's rule:
```
[ Σw    Σwt  ] [b]   [ Σwg  ]
[ Σwt   Σwt² ] [v] = [ Σwtg ]
```
Residual Tracking: Maintain exponentially-decayed average of prediction errors to correct systematic drift without stochastic noise.
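As a concrete illustration of the closed-form fit, the sketch below solves the 2×2 weighted normal equations with Cramer's rule for a single parameter's gradient history, then forms a prediction with an exponentially averaged residual. Function names, the decay weighting, and the 0.9/0.1 residual coefficients are illustrative assumptions, not the crate's API.

```rust
/// Minimal sketch of the closed-form fit (not the crate's API): solve the 2x2
/// weighted normal equations with Cramer's rule for baseline `b` and velocity `v`.
fn fit_baseline_velocity(samples: &[(f32, f32)], decay: f32) -> (f32, f32) {
    let n = samples.len() as f32;
    let (mut sw, mut swt, mut swt2, mut swg, mut swtg) = (0.0f32, 0.0, 0.0, 0.0, 0.0);
    for (i, &(t, g)) in samples.iter().enumerate() {
        let w = decay.powf(n - 1.0 - i as f32); // exponentially decayed weight: newer samples count more
        sw += w;
        swt += w * t;
        swt2 += w * t * t;
        swg += w * g;
        swtg += w * t * g;
    }
    // Cramer's rule on:  [ sw   swt  ] [b]   [ swg  ]
    //                    [ swt  swt2 ] [v] = [ swtg ]
    let det = sw * swt2 - swt * swt; // assumes at least two distinct steps in the history
    let b = (swg * swt2 - swt * swtg) / det;
    let v = (sw * swtg - swt * swg) / det;
    (b, v)
}

fn main() {
    // Gradient history for one parameter: (step, gradient value).
    let history = [(0.0f32, 0.50), (1.0, 0.48), (2.0, 0.46), (3.0, 0.44)];
    let (baseline, velocity) = fit_baseline_velocity(&history, 0.9);

    // Prediction with drift correction: g(t) = baseline + velocity * t + residual,
    // where the residual is an exponentially averaged prediction error (coefficients assumed).
    let mut residual = 0.0f32;
    let observed_at_4 = 0.425f32; // actual gradient computed at a correction step
    let predicted_at_4 = baseline + velocity * 4.0 + residual;
    residual = 0.9 * residual + 0.1 * (observed_at_4 - predicted_at_4);

    let predicted_at_5 = baseline + velocity * 5.0 + residual;
    println!("b = {baseline:.4}, v = {velocity:.4}, g(5) ≈ {predicted_at_5:.4}");
}
```

Because the system is only 2×2, Cramer's rule gives the exact solution in a handful of multiplications, with no iterative optimizer and no stochastic noise.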
### Training Phase Cycle

```
WARMUP ──► FULL ──► PREDICT ──► CORRECT ──► FULL ──► ...
(N steps)  (M)      (P steps)   (periodic)  (M)
                       │            │
                       └────────────┘
                      (correction cycle)
```
| Phase | Description | Backward Pass |
|---|---|---|
| Warmup | Collect gradients to initialize predictor | ✓ |
| Full | Standard training with gradient recording | ✓ |
| Predict | Use predicted gradients | ✗ |
| Correct | Compute actual gradient, update residuals | ✓ |
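A toy scheduler shows how such a cycle can decide, per step, which phase applies and whether a backward pass is needed. The enum, function names, and in particular the placement of correction steps inside the predict window are assumptions for illustration, not the crate's logic:

```rust
#[derive(Debug)]
enum Phase {
    Warmup,
    Full,
    Predict,
    Correct,
}

/// Toy schedule matching the cycle above: `warmup` steps first, then repeating
/// cycles of `full` real-gradient steps followed by `predict` steps, with every
/// `correct_every`-th predict step replaced by a correction (placement assumed).
fn phase_for_step(step: usize, warmup: usize, full: usize, predict: usize, correct_every: usize) -> Phase {
    if step < warmup {
        return Phase::Warmup;
    }
    let pos = (step - warmup) % (full + predict); // position inside the current cycle
    if pos < full {
        Phase::Full
    } else if (pos - full + 1) % correct_every == 0 {
        Phase::Correct
    } else {
        Phase::Predict
    }
}

fn needs_backward(phase: &Phase) -> bool {
    // Matches the table above: only pure Predict steps skip the backward pass.
    !matches!(phase, Phase::Predict)
}

fn main() {
    let (warmup, full, predict, correct_every) = (10, 5, 20, 5);
    for step in [0usize, 9, 10, 14, 15, 18, 19, 34, 35] {
        let phase = phase_for_step(step, warmup, full, predict, correct_every);
        println!("step {step:>2}: {phase:?} (backward: {})", needs_backward(&phase));
    }
}
```

Only the scheduling decision is sketched here; in the trainer, Correct steps also feed the residual tracker described above.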
### VSA Compression Pipeline

```
Gradients ──► Project to HD ──► Bind with keys ──► Bundle (majority) ──► Compressed
                                                                            │
Decompressed ◄── Inverse Project ◄── Unbind with keys ◄────────────────────┘
```
Operations leverage the quasi-orthogonality of random vectors in high dimensions (Johnson-Lindenstrauss lemma) for information-preserving compression.
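A generic, self-contained VSA sketch makes the pipeline concrete: bind is elementwise multiplication of bipolar (±1) hypervectors, bundle is a per-dimension majority vote, and unbind is binding with the same key. The conventions and helper names below are a standard VSA illustration, not the crate's internal representation.

```rust
/// Generate a random bipolar {-1, +1} hypervector from a deterministic LCG seed.
fn random_hypervector(dim: usize, seed: &mut u64) -> Vec<i8> {
    (0..dim)
        .map(|_| {
            *seed = seed.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
            if (*seed >> 63) == 0 { 1 } else { -1 }
        })
        .collect()
}

/// Bind: elementwise multiplication (its own inverse for bipolar vectors).
fn bind(a: &[i8], b: &[i8]) -> Vec<i8> {
    a.iter().zip(b).map(|(x, y)| x * y).collect()
}

/// Bundle: per-dimension majority vote; ties broken toward +1.
fn bundle(vectors: &[Vec<i8>]) -> Vec<i8> {
    (0..vectors[0].len())
        .map(|i| {
            let sum: i32 = vectors.iter().map(|v| v[i] as i32).sum();
            if sum >= 0 { 1 } else { -1 }
        })
        .collect()
}

/// Normalized dot product: near 0 for unrelated random hypervectors.
fn similarity(a: &[i8], b: &[i8]) -> f32 {
    let dot: i32 = a.iter().zip(b).map(|(x, y)| (*x as i32) * (*y as i32)).sum();
    dot as f32 / a.len() as f32
}

fn main() {
    let (dim, mut seed) = (10_000, 7u64);
    let key_a = random_hypervector(dim, &mut seed);
    let key_b = random_hypervector(dim, &mut seed);
    let value_a = random_hypervector(dim, &mut seed);
    let value_b = random_hypervector(dim, &mut seed);

    // Bind each value to its key, then bundle the pairs into one compressed vector.
    let memory = bundle(&[bind(&key_a, &value_a), bind(&key_b, &value_b)]);

    // Unbinding with key_a recovers something clearly similar to value_a and
    // nearly orthogonal to value_b (quasi-orthogonality of random hypervectors).
    let recovered = bind(&key_a, &memory);
    println!("sim(recovered, value_a) = {:.3}", similarity(&recovered, &value_a));
    println!("sim(recovered, value_b) = {:.3}", similarity(&recovered, &value_b));
}
```

With a dimension of 10,000, the recovered vector is clearly similar to the value bound to the queried key and close to orthogonal to the other value, which is the property the compression pipeline relies on.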
## Performance Characteristics
| Metric | Value | Notes |
|---|---|---|
| Gradient Storage | ~90% reduction | VSA compression |
| Backward Passes | ~80% reduction | Prediction phases |
| Accumulation Memory | ~93% reduction | Ternary quantization |
| Prediction Overhead | O(history_window × params) | Linear in tracked history |
| Determinism | 100% | Bit-exact reproducibility |
### Speedup Analysis

With default configuration (warmup=10, full=5, predict=20, correct_every=5):

```
100 steps = 10 warmup + (5 full + 20 predict) × cycles
          = 10 warmup + ~25 full + ~65 predict
          ≈ 35 backward passes instead of 100
          = 2.9x theoretical speedup
```
Actual speedup depends on backward pass cost relative to forward pass.
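As a rough way to quantify that dependence, the sketch below applies an assumed cost model (forward pass costs 1, a real backward pass costs `backward_cost`, predicted gradients are essentially free) to the 35-of-100 breakdown above; the cost ratios are assumptions, not measurements.

```rust
/// Illustrative cost model (an assumption, not a benchmark): the forward pass
/// always runs, a real backward pass costs `backward_cost` relative to a forward
/// pass of cost 1, and predicted gradients are treated as free.
fn wall_clock_speedup(total_steps: f64, backward_steps: f64, backward_cost: f64) -> f64 {
    let baseline = total_steps * (1.0 + backward_cost);
    let accelerated = total_steps + backward_steps * backward_cost;
    baseline / accelerated
}

fn main() {
    // 35 real backward passes out of 100 steps, as in the breakdown above.
    for backward_cost in [1.0, 2.0, 3.0] {
        println!(
            "backward/forward cost = {backward_cost}: ~{:.2}x end-to-end speedup",
            wall_clock_speedup(100.0, 35.0, backward_cost)
        );
    }
}
```

If the backward pass costs about twice the forward pass, the ~2.9x reduction in backward passes translates to roughly a 1.8x end-to-end speedup under this model.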
## Configuration Reference

### DeterministicPhaseConfig
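The authoritative field list is in the crate's rustdoc; as a hedged sketch, the four knobs quoted in the speedup analysis above map onto a configuration roughly like this (field names assumed):

```rust
// Hedged sketch only; field names are assumed from the parameters quoted in the
// speedup analysis (warmup=10, full=5, predict=20, correct_every=5).
let config = DeterministicPhaseConfig {
    warmup_steps: 10,   // backward passes used to initialize the predictor
    full_steps: 5,      // real-gradient steps at the start of each cycle
    predict_steps: 20,  // steps served by closed-form gradient prediction
    correct_every: 5,   // periodic correction cadence during prediction
    ..Default::default()
};
```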
### VSAConfig

```rust
// Values shown are illustrative.
VSAConfig::builder()
    .dimension(10_000)       // HD space dimension (higher = better reconstruction)
    .compression_ratio(10.0) // Target compression factor
    .seed(42)                // RNG seed for reproducibility
    .build()
```
### TernaryConfig

```rust
// Values shown are illustrative.
TernaryConfig::builder()
    .accumulation_steps(8)          // Micro-batches per optimizer step
    .use_stochastic_rounding(true)  // Unbiased quantization to {-1, 0, +1}
    .build()
```
## Requirements
- Rust: 1.92+ (2021 edition)
- Dependencies: candle-core 0.9+, trit-vsa 0.1+
## Integration with axolotl-rs
For YAML-driven LLM fine-tuning with automatic VSA acceleration:
```rust
use vsa_optim_rs::{VSAAccelerator, VSAAcceleratorConfig}; // crate path and type names assumed

let config = VSAAcceleratorConfig::default(); // Or ::conservative(), ::aggressive()
let mut accel = VSAAccelerator::new(config)?; // constructor signature assumed

for batch in dataloader {
    // run the VSA-accelerated training step on each batch
}
println!("{}", accel.stats()); // method name assumed; e.g. "VSA: 100 steps (35 full, 65 predicted), 2.86x speedup"
```
## Sister Crates
| Crate | Description |
|---|---|
| trit-vsa | Balanced ternary arithmetic with VSA operations |
| bitnet-quantize | BitNet b1.58 quantization for neural networks |
| axolotl-rs | YAML-driven LLM fine-tuning toolkit |
| qlora-rs | 4-bit QLoRA with double quantization |
| peft-rs | Parameter-efficient fine-tuning adapters |
## License
MIT License. See LICENSE-MIT for details.
## References
- Kanerva, P. (2009). Hyperdimensional Computing: An Introduction to Computing in Distributed Representation with High-Dimensional Random Vectors.
- Rahimi, A. et al. (2016). High-Dimensional Computing as a Nanoscalable Paradigm.
- Johnson, W. & Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space.
- Ma, S. et al. (2024). The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits.
"Simplicity is the ultimate sophistication." — Leonardo da Vinci