# ruvector-sparse-inference
PowerInfer-style Activation Locality Inference Engine for RuVector.
A high-performance sparse inference engine that exploits neural network activation patterns to achieve 2×–10× speedups with <1% accuracy loss.
## Features

### Core Capabilities
- Activation Locality: Exploits power-law distribution where ~10% of neurons handle ~90% of activations
- Low-Rank Prediction: Fast P·Q matrix factorization predicts active neurons in O(r·d) time (see the sketch after this list)
- Sparse FFN: Computes only active neurons, skipping cold weights entirely
- SIMD Optimization: AVX2/FMA (GELU, SiLU, axpy), SSE4.1, NEON, and WASM SIMD backends
- GGUF Support: Full compatibility with quantized Llama models (Q4_0 through Q6_K)
- Hot/Cold Caching: LRU/LFU strategies for intelligent neuron weight management
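The low-rank prediction step can be sketched in a few lines: score every hidden neuron via s = (x·P)·Q, then keep the top-k. All names and shapes below are illustrative, not the crate's API.

```rust
/// Illustrative low-rank activation predictor: s = (x · P) · Q.
/// P is d×r, Q is r×hidden, so the cost is O(r·d + r·h) instead of the
/// O(d·h) cost of running the dense FFN just to learn which neurons fire.
fn predict_active(x: &[f32], p: &[Vec<f32>], q: &[Vec<f32>], k: usize) -> Vec<usize> {
    // t = x · P  (length r)
    let r = p[0].len();
    let mut t = vec![0.0f32; r];
    for (xi, p_row) in x.iter().zip(p) {
        for (tj, pj) in t.iter_mut().zip(p_row) {
            *tj += xi * pj;
        }
    }
    // s = t · Q  (length hidden)
    let hidden = q[0].len();
    let mut s = vec![0.0f32; hidden];
    for (ti, q_row) in t.iter().zip(q) {
        for (sj, qj) in s.iter_mut().zip(q_row) {
            *sj += ti * qj;
        }
    }
    // Keep the k highest-scoring neuron indices.
    let mut idx: Vec<usize> = (0..hidden).collect();
    idx.sort_unstable_by(|&a, &b| s[b].total_cmp(&s[a]));
    idx.truncate(k);
    idx
}
```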
## Precision Lanes (3/5/7-bit)
Layered quantization that turns activation selectivity into anatomical control:
| Lane | Bits | Range | Use Case |
|---|---|---|---|
| Bit3 | 3 | -4..3 | Reflex signals, gating, anomaly triggers |
| Bit5 | 5 | -16..15 | Streaming embeddings, drift detection |
| Bit7 | 7 | -64..63 | Reasoning, synthesis, micro-LoRA |
| Float | 32 | Full | Training, offline calibration |
Graduation Rules: Signals move UP lanes on novelty/drift, DOWN on stability/stall.
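The lane ranges above are ordinary signed two's-complement ranges; a tiny illustrative clamp (a hypothetical helper, not the crate's API) makes the mapping concrete:

```rust
/// Clamp a value into a lane's integer range:
/// Bit3 → -4..=3, Bit5 → -16..=15, Bit7 → -64..=63 (per the table above).
fn clamp_to_lane(value: f32, bits: u32) -> i32 {
    let max = (1i32 << (bits - 1)) - 1; // e.g.  3 for Bit3
    let min = -(1i32 << (bits - 1));    // e.g. -4 for Bit3
    (value.round() as i32).clamp(min, max)
}
```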
## π Integration
π (pi) provides structural constants for low-precision systems:
π breaks symmetry.
| Module | Purpose |
|---|---|
| Calibration | π-derived constants avoid power-of-2 resonance |
| Drift Detection | Quantization honesty signals via π transforms |
| Angular Embeddings | Hyperspherical projections with π phase encoding |
| Chaos Seeding | Deterministic pseudo-randomness from π digits |
## Performance (v0.1.31)
6× speedup over previous version through W2 transpose optimization and SIMD-accelerated activations.
| Sparsity Level | Latency | vs Dense | vs Previous Version |
|---|---|---|---|
| 10% active | 130µs | 52× faster | 83% reduction |
| 30% active | 383µs | 18× faster | 83% reduction |
| 50% active | 651µs | 10× faster | 83% reduction |
| 70% active | 912µs | 7× faster | 83% reduction |
### Key Optimizations (v0.1.31)
- W2 Transpose Storage: Column access becomes contiguous row access
- SIMD GELU/SiLU: AVX2 polynomial approximations for activations
- Cached Feature Detection: OnceLock eliminates runtime CPUID calls
- SIMD axpy: Vectorized accumulation in sparse second layer
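To see why the transpose matters, here is an illustrative version of the sparse second-layer product (not the crate's kernel). With W2ᵀ stored row-major, the column W2[:, j] is the contiguous row w2_t[j], so the inner loop is a straight axpy over a contiguous slice, which caches and vectorizes well:

```rust
/// out += Σᵢ a[i] · W2[:, active[i]], with W2 stored transposed so every
/// needed column is a contiguous row.
fn sparse_w2(out: &mut [f32], w2_t: &[Vec<f32>], active: &[usize], a: &[f32]) {
    for (&j, &aj) in active.iter().zip(a) {
        for (o, &w) in out.iter_mut().zip(&w2_t[j]) {
            *o += aj * w; // axpy over a contiguous slice
        }
    }
}
```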
## Target Performance
| Model | Target Latency | Speedup | Memory Reduction |
|---|---|---|---|
| LFM2 350M | ~5-10ms/sentence | 2.5× | 40% |
| Sentence-transformers | ~2-5ms/sentence | 2× | 30% |
| Llama 7B | 50-100ms/token | 5-10× | 50% |
## Quick Start

```rust
// NOTE: the identifiers below were lost from the original snippet and are
// reconstructed from context — check the crate docs for the exact API.
use ruvector_sparse_inference::prelude::*;

// Create sparse inference engine
let engine = SparseEngine::new_sparse(&config)?;

// Run inference
let input = vec![0.0f32; 768]; // illustrative input dimension
let output = engine.infer(&input)?;

// Use π context for calibration
let pi_ctx = PiContext::new();
let calibrated = pi_ctx.calibrate(&output);

// Check quantization honesty
let honesty = pi_ctx.check_honesty(&calibrated);
if !honesty.is_honest {
    // e.g. escalate to a higher-precision lane
}
```
## Architecture

```
┌─────────────────────────────────────────────────────────────┐
│ Input Embedding │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Low-Rank Predictor (P·Q) │
│ ┌───────────┐ ┌───────────┐ ┌──────────────────┐ │
│ │ Input x │───▶│ P matrix │───▶│ Q matrix │ │
│ │ [d×1] │ │ [d×r] │ │ [r×hidden] │ │
│ └───────────┘ └───────────┘ └──────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────┐ │
│ │ Threshold/Top-K Selection │ │
│ │ Active Neuron Indices │ │
│ └──────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Sparse FFN Forward │
│ ┌─────────────────┐ │
│ │ Hot Weights │◀── Always in memory │
│ │ (20% neurons) │ │
│ └─────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ ┌──────────────────────────────┐ │
│ │ W1[active] @ x │───▶│ Activation (ReLU/GELU/SiLU) │ │
│ └─────────────────┘ └──────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ W2 @ activated │───▶ Output │
│ └─────────────────┘ │
└─────────────────────────────────────────────────────────────┘
```
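Putting the two stages together, a minimal end-to-end sketch of this pipeline (standalone, assuming ReLU and the illustrative `predict_active` and `sparse_w2` helpers sketched earlier; not the crate's actual API):

```rust
/// Illustrative sparse FFN forward: predict the active set, run only those
/// rows of W1, apply the activation, accumulate the matching W2 columns.
fn sparse_ffn(
    x: &[f32],
    p: &[Vec<f32>],
    q: &[Vec<f32>],
    w1: &[Vec<f32>],   // hidden × d
    w2_t: &[Vec<f32>], // W2 transposed: hidden × d_out
    k: usize,
) -> Vec<f32> {
    let active = predict_active(x, p, q, k);
    // a[i] = relu(W1[active[i]] · x): only k of the hidden rows are touched.
    let a: Vec<f32> = active
        .iter()
        .map(|&j| {
            let dot: f32 = w1[j].iter().zip(x).map(|(w, xi)| w * xi).sum();
            dot.max(0.0) // ReLU; GELU/SiLU slot in here as configured
        })
        .collect();
    let mut out = vec![0.0f32; w2_t[0].len()];
    sparse_w2(&mut out, w2_t, &active, &a);
    out
}
```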
## π-Based Systems

### Why π Matters
In 3/5/7-bit math, you deliberately throw away bits. π lets you check whether the system is still behaving honestly.
```rust
// NOTE: receiver types are reconstructed from the module table above;
// the method names come from the original snippet.
use ruvector_sparse_inference::pi::*;

// π as calibration constant
let calibration = PiCalibration::for_lane(Lane::Bit5);
let normalized = calibration.normalize(&values);

// π as drift detector
let mut detector = PiDriftDetector::new();
let honesty = detector.check(&normalized);
if honesty.should_escalate {
    // move the signal up a precision lane
}

// π for angular embeddings
let angular = AngularEmbedding::new(dims);
let projected = angular.project(&input);
let distance = angular.angular_distance(&projected, &other);

// π for deterministic chaos
let chaos = PiChaos::new(seed);
let jitter = chaos.jitter(step); // Same input = same output, always
let schedule = chaos.schedule_order(n_tasks);
```
### Key Constants

```rust
use std::f32::consts::PI;

// π-based scale factors (avoid power-of-2 resonance)
pub const PI_SCALE_3BIT: f32 = PI / 4.0;  // ~0.785
pub const PI_SCALE_5BIT: f32 = PI / 16.0; // ~0.196
pub const PI_SCALE_7BIT: f32 = PI / 64.0; // ~0.049
```
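A quick worked use of such a scale (hypothetical helper, not the crate's calibration path): quantizing with a π-derived step means the representable grid never lines up with binary fractions of the input.

```rust
/// Illustrative 3-bit quantization with the π-derived step size.
fn quantize_3bit(x: f32) -> i32 {
    let step = std::f32::consts::PI / 4.0; // PI_SCALE_3BIT
    ((x / step).round() as i32).clamp(-4, 3)
}

fn dequantize_3bit(q: i32) -> f32 {
    q as f32 * (std::f32::consts::PI / 4.0)
}
```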
## Precision Lane Graduation

```rust
// NOTE: type and variant names reconstructed from context; the method
// names come from the original snippet.
use ruvector_sparse_inference::graduation::*;

// Configure graduation policy
let config = GraduationConfig::default();
let mut policy = GraduationPolicy::new(config);

// Update metrics during inference
policy.update_metrics(&metrics);

// Check graduation decision
match policy.decide() {
    LaneDecision::Up => { /* novelty/drift: widen the lane */ }
    LaneDecision::Down => { /* stability/stall: narrow the lane */ }
    LaneDecision::Stay => {}
}
```
## Configuration Options

### Sparsity Selection

```rust
// NOTE: the receiver type and arguments are illustrative; the method
// names come from the original snippet.

// Top-K selection
let config = SparseConfig::default().with_top_k(256);

// Threshold-based selection
let config = SparseConfig::default().with_threshold(0.5);

// Target sparsity ratio
let config = SparseConfig::default().with_target_sparsity(0.95); // 95% sparse
```
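For reference, the top-k index selection that `with_top_k` implies can be done in O(n) average time with the standard library alone; a standalone sketch:

```rust
/// Indices of the k largest scores, O(n) average via quickselect.
fn top_k_indices(scores: &[f32], k: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    if k < idx.len() {
        // Partition so the k highest-scoring indices come first.
        idx.select_nth_unstable_by(k, |&a, &b| scores[b].total_cmp(&scores[a]));
        idx.truncate(k);
    }
    idx
}
```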
### Activation Functions

- Relu: max(0, x)
- Gelu: Gaussian Error Linear Unit
- Silu/Swish: x * sigmoid(x)
- Identity: No activation
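Scalar reference versions of the two non-trivial ones (the SIMD backends approximate the same functions with polynomials; the GELU shown is the common tanh approximation):

```rust
/// x * sigmoid(x)
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// Gaussian Error Linear Unit, tanh approximation.
fn gelu(x: f32) -> f32 {
    const SQRT_2_OVER_PI: f32 = 0.797_884_56;
    0.5 * x * (1.0 + (SQRT_2_OVER_PI * (x + 0.044_715 * x * x * x)).tanh())
}
```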
### Quantization

```rust
// NOTE: import path and arguments are illustrative; function and method
// names come from the original snippet.
use ruvector_sparse_inference::{quantize_int4, quantize_int8, QuantizedWeights};

// Int8 quantization
let weights: QuantizedWeights = quantize_int8(&dense_matrix);
let dequantized = weights.dequantize_row(0);

// Int4 quantization (GGUF-style)
let weights: QuantizedWeights = quantize_int4(&dense_matrix);
```
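Under the hood, symmetric int8 quantization reduces to one scale factor per row; a standalone sketch of the round trip (illustrative, not the crate's storage layout):

```rust
/// Symmetric per-row int8 quantization: q = round(v / scale).
fn quantize_row_int8(row: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = row.iter().fold(0.0f32, |m, &v| m.max(v.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let q = row.iter().map(|&v| (v / scale).round() as i8).collect();
    (q, scale)
}

/// Inverse: v ≈ q * scale.
fn dequantize_row_int8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}
```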
## WASM Support

```rust
// In ruvector-sparse-inference-wasm. The exported bindings were lost from
// the original snippet; see that crate's documentation.
use ruvector_sparse_inference_wasm::*;
```
## Integration

### With RuVector (EmbeddingProvider)

```rust
// NOTE: import path and constructor arguments are illustrative.
use ruvector_sparse_inference::SparseEmbeddingProvider;

let provider = SparseEmbeddingProvider::new(&config)?;
let embedding = provider.embed("some text")?;
```

### With RuvLLM (InferenceBackend)

```rust
// NOTE: import path and constructor arguments are illustrative.
use ruvector_sparse_inference::SparseInferenceBackend;

let backend = SparseInferenceBackend::new(&config)?;
let output = backend.generate(&prompt)?;
```
## Benchmarks

Run benchmarks:

```bash
cargo bench
```

SIMD kernel benchmarks (bench target name assumed):

```bash
cargo bench --bench simd
```

## Testing

```bash
# Unit tests
cargo test --lib

# Integration tests
cargo test --tests
```
## Hardware Targets
| Platform | SIMD Backend | Precision Lanes |
|---|---|---|
| x86_64 (AVX2) | 256-bit vectors | All |
| x86_64 (SSE4.1) | 128-bit vectors | All |
| ARM (NEON) | 128-bit vectors | All |
| WASM | 128-bit SIMD | Bit5, Bit7 |
| ESP32 | Scalar | Bit3 only |
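The cached feature detection listed under Key Optimizations needs nothing beyond the standard library; a minimal sketch (not the crate's dispatch code):

```rust
use std::sync::OnceLock;

/// First call pays for CPUID; every later call is a cached load.
#[cfg(target_arch = "x86_64")]
fn has_avx2() -> bool {
    static AVX2: OnceLock<bool> = OnceLock::new();
    *AVX2.get_or_init(|| is_x86_feature_detected!("avx2"))
}
```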
## The Deeper Insight
π is not about geometry here. It is about injecting infinite structure into finite machines without breaking determinism.
Low-bit quantization simplifies the math. π reintroduces richness without cost.
- Quantization makes systems stable
- π makes them expressive
- Together: the math stays boring, the behavior stays interesting, the proofs stay simple
## Cargo Features

Default: `["simd"]`

- `simd`: Enable SIMD optimizations
- `parallel`: Enable parallel computation with rayon
- `quantization`: Enable quantization support
- `npu`: Enable ARM NPU support (experimental)
## License
MIT OR Apache-2.0