ruvector-sparse-inference

PowerInfer-style Activation Locality Inference Engine for RuVector.

A high-performance sparse inference engine that exploits neural network activation patterns to achieve 2×–10× speedups with <1% accuracy loss.

Features

Core Capabilities

  • Activation Locality: Exploits power-law distribution where ~10% of neurons handle ~90% of activations
  • Low-Rank Prediction: Fast P·Q matrix factorization predicts active neurons in O(r·(d + h)) time for rank r ≪ d (sketched after this list)
  • Sparse FFN: Computes only active neurons, skipping cold weights entirely
  • SIMD Optimization: AVX2/FMA (GELU, SiLU, axpy), SSE4.1, NEON, and WASM SIMD backends
  • GGUF Support: Full compatibility with quantized Llama models (Q4_0 through Q6_K)
  • Hot/Cold Caching: LRU/LFU strategies for intelligent neuron weight management
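
A minimal sketch of the low-rank prediction step, using the dimensions shown in the architecture diagram below (hypothetical standalone code, not the crate's API): scores ≈ x·P·Q, then top-k picks the active set.

// Hypothetical sketch, not the crate API: predict active neurons from a
// low-rank factorization. p is [d×r], q is [r×hidden], both row-major.
// Cost is O(r·(d + hidden)) per token, versus O(d·hidden) for the dense FFN.
fn predict_active(x: &[f32], p: &[f32], q: &[f32],
                  d: usize, r: usize, hidden: usize, k: usize) -> Vec<usize> {
    // t = xᵀ·P  (length r)
    let mut t = vec![0.0f32; r];
    for i in 0..d {
        for j in 0..r {
            t[j] += x[i] * p[i * r + j];
        }
    }
    // scores = tᵀ·Q  (length hidden): predicted pre-activations
    let mut scores = vec![0.0f32; hidden];
    for i in 0..r {
        for j in 0..hidden {
            scores[j] += t[i] * q[i * hidden + j];
        }
    }
    // Keep the k neurons with the largest predicted activations
    let mut idx: Vec<usize> = (0..hidden).collect();
    idx.sort_unstable_by(|&a, &b| scores[b].total_cmp(&scores[a]));
    idx.truncate(k);
    idx
}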

Precision Lanes (3/5/7-bit)

Layered quantization that turns activation selectivity into anatomical control:

Lane   Bits  Range    Use Case
Bit3   3     -4..3    Reflex signals, gating, anomaly triggers
Bit5   5     -16..15  Streaming embeddings, drift detection
Bit7   7     -64..63  Reasoning, synthesis, micro-LoRA
Float  32    Full     Training, offline calibration

Graduation Rules: Signals move UP lanes on novelty/drift, DOWN on stability/stall.
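
A minimal sketch of what a lane's range means numerically (hypothetical helper, not the crate API): values are scaled, rounded, and clamped into the lane's signed integer range.

// Hypothetical sketch, not the crate API. Bit3 = -4..3, Bit5 = -16..15,
// Bit7 = -64..63, matching the table above.
fn quantize_to_lane(value: f32, scale: f32, bits: u32) -> i8 {
    let max = (1i32 << (bits - 1)) - 1; // 3, 15, or 63
    let min = -(1i32 << (bits - 1));    // -4, -16, or -64
    (value / scale).round().clamp(min as f32, max as f32) as i8
}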

π Integration

π (pi) provides structural constants for low-precision systems: its irrationality breaks the symmetry of power-of-2 grids.

Module              Purpose
Calibration         π-derived constants avoid power-of-2 resonance
Drift Detection     Quantization honesty signals via π transforms
Angular Embeddings  Hyperspherical projections with π phase encoding
Chaos Seeding       Deterministic pseudo-randomness from π digits

Performance (v0.1.31)

6× speedup over previous version through W2 transpose optimization and SIMD-accelerated activations.

Sparsity Level  Latency  vs Dense    vs Previous Version
10% active      130µs    52× faster  83% reduction
30% active      383µs    18× faster  83% reduction
50% active      651µs    10× faster  83% reduction
70% active      912µs     7× faster  83% reduction

Key Optimizations (v0.1.31)

  • W2 Transpose Storage: Column access becomes contiguous row access (sketched after this list)
  • SIMD GELU/SiLU: AVX2 polynomial approximations for activations
  • Cached Feature Detection: OnceLock eliminates runtime CPUID calls
  • SIMD axpy: Vectorized accumulation in sparse second layer
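
A minimal sketch of the first and last optimizations together (assuming dense f32 weights; not the crate's internals): with W2 stored transposed, each active neuron's output weights are one contiguous row, so the second layer reduces to axpy updates that the SIMD backends vectorize.

// Hypothetical sketch, not the crate's internals. w2_t is W2 transposed:
// row i holds the output weights of hidden neuron i (len = hidden * out_dim).
fn sparse_second_layer(w2_t: &[f32], out_dim: usize,
                       active: &[(usize, f32)], output: &mut [f32]) {
    output.fill(0.0);
    for &(i, a) in active {
        let row = &w2_t[i * out_dim..(i + 1) * out_dim];
        // axpy: output += a * row — contiguous, cache-friendly access
        for (o, &w) in output.iter_mut().zip(row) {
            *o += a * w;
        }
    }
}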

Target Performance

Model                  Target Latency    Speedup  Memory Reduction
LFM2 350M              ~5-10ms/sentence  2.5×     40%
Sentence-transformers  ~2-5ms/sentence   —        30%
Llama 7B               50-100ms/token    5-10×    50%

Quick Start

use ruvector_sparse_inference::{
    SparseInferenceEngine, SparsityConfig, PiContext, PrecisionLane
};

// Create sparse inference engine
let engine = SparseInferenceEngine::new_sparse(512, 2048, 0.1)?;

// Run inference
let input = vec![0.1f32; 512];
let output = engine.infer(&input)?;

// Use π context for calibration
let pi_ctx = PiContext::new(PrecisionLane::Bit5);
let calibrated = pi_ctx.calibrate(1.0);

// Check quantization honesty (`original` and `quantized` are f32 slices
// from before and after a quantization round trip)
let honesty = pi_ctx.check_honesty(&original, &quantized);
if !honesty.is_honest {
    // Escalate to a higher precision lane
}

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Input Embedding                          │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│              Low-Rank Predictor (P·Q)                       │
│  ┌───────────┐    ┌───────────┐    ┌──────────────────┐    │
│  │ Input x   │───▶│  P matrix │───▶│  Q matrix        │    │
│  │ [d×1]     │    │  [d×r]    │    │  [r×hidden]      │    │
│  └───────────┘    └───────────┘    └──────────────────┘    │
│                                             │               │
│                                             ▼               │
│                              ┌──────────────────────────┐  │
│                              │ Threshold/Top-K Selection │  │
│                              │ Active Neuron Indices    │  │
│                              └──────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                    Sparse FFN Forward                       │
│  ┌─────────────────┐                                        │
│  │ Hot Weights     │◀── Always in memory                    │
│  │ (20% neurons)   │                                        │
│  └─────────────────┘                                        │
│           │                                                 │
│           ▼                                                 │
│  ┌─────────────────┐    ┌──────────────────────────────┐   │
│  │ W1[active] @ x  │───▶│ Activation (ReLU/GELU/SiLU)  │   │
│  └─────────────────┘    └──────────────────────────────┘   │
│           │                                                 │
│           ▼                                                 │
│  ┌─────────────────┐                                        │
│  │ W2 @ activated  │───▶ Output                             │
│  └─────────────────┘                                        │
└─────────────────────────────────────────────────────────────┘
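
The "Hot Weights" box above corresponds to the Hot/Cold Caching feature: frequently active neurons stay resident while cold rows are fetched on demand. A simplified sketch (hypothetical types, not the crate API; eviction here is least-recently-inserted, standing in for the real LRU/LFU strategies):

use std::collections::{HashMap, VecDeque};

// Hypothetical sketch, not the crate API.
struct NeuronCache {
    hot: HashMap<usize, Vec<f32>>,  // hot neurons (~20%), always resident
    cold: HashMap<usize, Vec<f32>>, // cold rows, cached on demand
    order: VecDeque<usize>,         // eviction order for cold entries
    capacity: usize,
}

impl NeuronCache {
    fn row(&mut self, idx: usize, load: impl Fn(usize) -> Vec<f32>) -> &[f32] {
        if self.hot.contains_key(&idx) {
            return &self.hot[&idx];
        }
        if !self.cold.contains_key(&idx) {
            // Evict the oldest cold entry when the cache is full
            if self.cold.len() >= self.capacity {
                if let Some(old) = self.order.pop_front() {
                    self.cold.remove(&old);
                }
            }
            self.cold.insert(idx, load(idx));
            self.order.push_back(idx);
        }
        &self.cold[&idx]
    }
}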

π-Based Systems

Why π Matters

In 3/5/7-bit math, you deliberately throw away bits. π lets you check whether the system is still behaving honestly.

use ruvector_sparse_inference::pi::*;

// π as calibration constant
let calibration = PiCalibration::for_lane(PrecisionLane::Bit5);
let normalized = calibration.normalize(value);

// π as drift detector
let mut detector = DriftDetector::new(PrecisionLane::Bit5);
let honesty = detector.check(&original, &quantized);
if honesty.should_escalate {
    // Precision too low or hardware misbehaving
}

// π for angular embeddings
let angular = AngularEmbedding::new(PrecisionLane::Bit7);
let projected = angular.project(&vector);
let distance = angular.angular_distance(&a, &b);

// π for deterministic chaos
let chaos = PiChaos::new();
let jitter = chaos.jitter(index);  // Same input = same output, always
let schedule = chaos.schedule_order(n_agents, round);

Key Constants

// π-based scale factors (avoid power-of-2 resonance)
pub const PI_SCALE_3BIT: f32 = std::f32::consts::PI / 4.0;  // ~0.785
pub const PI_SCALE_5BIT: f32 = std::f32::consts::PI / 16.0; // ~0.196
pub const PI_SCALE_7BIT: f32 = std::f32::consts::PI / 64.0; // ~0.049
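
A hedged illustration of the effect (standalone sketch, using the Bit5 constant defined above): with a step of π/16 instead of 1/16, quantized levels sit on irrational multiples, so they never coincide with power-of-2 grid points.

// Standalone illustration; PI_SCALE_5BIT as defined above.
let value = 0.7f32;
let q = (value / PI_SCALE_5BIT).round().clamp(-16.0, 15.0) as i8; // Bit5 range
let roundtrip = q as f32 * PI_SCALE_5BIT; // ~0.785, an irrational multiple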

Precision Lane Graduation

use ruvector_sparse_inference::precision::*;

// Configure graduation policy
let config = GraduationConfig {
    novelty_threshold: 0.3,
    drift_persistence_threshold: 5,
    confidence_threshold: 0.8,
    escalation_budget: 0.2,
};

let mut policy = GraduationPolicy::new(PrecisionLane::Bit5, config);

// Update metrics during inference
policy.update_metrics(GraduationMetrics {
    novelty: 0.4,      // High novelty detected
    drift_steps: 3,
    confidence: 0.9,
    cost_usage: 0.1,
    ..Default::default()
});

// Check graduation decision
match policy.decide() {
    GraduationDecision::Stay => { /* Continue at Bit5 */ }
    GraduationDecision::Escalate(lane) => { /* Upgrade to `lane`, e.g. Bit7 */ }
    GraduationDecision::Demote(lane) => { /* Downgrade to `lane`, e.g. Bit3 */ }
}

Configuration Options

Sparsity Selection

// Top-K selection
let config = SparsityConfig::with_top_k(100);

// Threshold-based selection
let config = SparsityConfig::with_threshold(0.01);

// Target sparsity ratio
let config = SparsityConfig::with_target_sparsity(0.95); // 95% sparse

Activation Functions

  • Relu: max(0, x)
  • Gelu: Gaussian Error Linear Unit
  • Silu/Swish: x * sigmoid(x)
  • Identity: No activation
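
Scalar reference versions of the smooth activations (a sketch; the SIMD backends use polynomial approximations of these):

// Reference implementations; the AVX2/NEON kernels approximate these.
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp()) // x * sigmoid(x)
}

fn gelu(x: f32) -> f32 {
    // tanh approximation; 0.79788456 ≈ sqrt(2/π)
    0.5 * x * (1.0 + (0.79788456 * (x + 0.044715 * x * x * x)).tanh())
}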

Quantization

use ruvector_sparse_inference::memory::QuantizedWeights;

// Int8 quantization
let weights = QuantizedWeights::quantize_int8(&original);
let dequantized = weights.dequantize_row(0);

// Int4 quantization (GGUF-style)
let weights = QuantizedWeights::quantize_int4(&original, 32);

WASM Support

// In ruvector-sparse-inference-wasm
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub fn create_sparse_engine(
    input_dim: usize,
    hidden_dim: usize,
    sparsity: f32,
) -> Result<SparseEngineWasm, JsValue>;

#[wasm_bindgen]
pub fn infer(
    engine: &SparseEngineWasm,
    input: &[f32],
) -> Result<Vec<f32>, JsValue>;

Integration

With RuVector (EmbeddingProvider)

use ruvector_sparse_inference::integration::SparseEmbeddingProvider;

let provider = SparseEmbeddingProvider::new(config)?;
let embedding = provider.embed("Hello world")?;

With RuvLLM (InferenceBackend)

use ruvector_sparse_inference::integration::SparseInferenceBackend;

let backend = SparseInferenceBackend::new(model_path)?;
let output = backend.generate(tokens, &config)?;

Benchmarks

Run benchmarks:

cargo bench -p ruvector-sparse-inference

SIMD kernel benchmarks:

cargo bench -p ruvector-sparse-inference --bench simd_kernels

Testing

# Unit tests
cargo test -p ruvector-sparse-inference

# Integration tests
cargo test -p ruvector-sparse-inference --test '*'

Hardware Targets

Platform         SIMD Backend     Precision Lanes
x86_64 (AVX2)    256-bit vectors  All
x86_64 (SSE4.1)  128-bit vectors  All
ARM (NEON)       128-bit vectors  All
WASM             128-bit SIMD     Bit5, Bit7
ESP32            Scalar           Bit3 only
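
On x86_64, choosing between these backends requires runtime feature detection; a minimal sketch of the OnceLock caching mentioned under Key Optimizations (hypothetical helper name, not the crate's internals):

use std::sync::OnceLock;

// CPUID is queried once and cached; later calls are a plain load.
#[cfg(target_arch = "x86_64")]
fn has_avx2_fma() -> bool {
    static CACHED: OnceLock<bool> = OnceLock::new();
    *CACHED.get_or_init(|| {
        is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma")
    })
}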

The Deeper Insight

π is not about geometry here. It is about injecting infinite structure into finite machines without breaking determinism.

Low-bit quantization simplifies the math. π reintroduces richness without cost.

  • Quantization makes systems stable
  • π makes them expressive
  • Together: the math stays boring, the behavior stays interesting, the proofs stay simple

Features

  • default = ["simd"]
  • simd: Enable SIMD optimizations
  • parallel: Enable parallel computation with rayon
  • quantization: Enable quantization support
  • npu: Enable ARM NPU support (experimental)

License

MIT OR Apache-2.0