Crate ruvector_mincut_gated_transformer

§Mincut Gated Transformer

Ultra-low-latency transformer inference designed for continuous systems. Inference is governed by a coherence controller driven by dynamic minimum-cut signals and, optionally, by a spiking scheduler that skips work when nothing meaningful is happening.

§Academic Foundations

This crate integrates multiple state-of-the-art optimization techniques:

  1. Mixture-of-Depths (Raposo et al., 2024) - Dynamic compute allocation with 50% FLOPs reduction
  2. Early Exit (Elhoushi et al., 2024) - Layer-skipping with 30-50% latency reduction
  3. Sparse Attention (Jiang et al., 2024) - 90% attention FLOPs reduction for long contexts
  4. Energy-Based Transformers (Gladstone et al., 2025) - Principled compute-quality tradeoffs
  5. Spike-Driven Inference (Yao et al., 2023, 2024) - 87× energy reduction via event-driven compute
  6. Spectral Methods (Kreuzer et al., 2021) - Graph-based coherence via spectral partitioning

See docs/THEORY.md for detailed academic references and theoretical analysis.

§Primary Outcomes

  1. Deterministic, bounded inference - Same inputs yield same outputs
  2. Allocation-free hot path - Zero heap allocations after initialization (see the sketch after this list)
  3. Predictable tail latency - Bounded p99 latency guarantees
  4. Explainable interventions - Every gate decision produces a witness
  5. Easy integration - Works with RuVector, ruvector-mincut, and agent orchestration
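
To illustrate outcomes 1 and 2, here is a minimal sketch of the intended steady-state usage. It uses only types shown in the full example below, and it assumes `InferOutput::new` borrows a preallocated `&mut [i32]` slice (as in that example); the helper function name is illustrative, not part of the crate API.

use ruvector_mincut_gated_transformer::{InferInput, InferOutput, MincutGatedTransformer};

// Steady-state step: the logits buffer is allocated once at startup and
// reused on every call, so the hot path performs no heap allocation.
fn steady_state_step(
    transformer: &mut MincutGatedTransformer,
    input: &InferInput,
    logits: &mut [i32], // preallocated once, outside this function
) {
    // `InferOutput::new` only borrows the caller's buffer.
    let mut output = InferOutput::new(logits);
    // Deterministic: the same input and gate packet yield the same logits
    // and the same witness.
    transformer.infer(input, &mut output).unwrap();
}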

§Core Concepts

The system has three roles:

  1. Transformer Kernel - Produces logits or scores under fixed compute budgets
  2. Spike Scheduler (optional) - Decides whether to run and selects compute tier
  3. Mincut Gate (authoritative) - Decides what state changes are allowed

§Example

use ruvector_mincut_gated_transformer::{
    MincutGatedTransformer, TransformerConfig, GatePolicy,
    GatePacket, InferInput, InferOutput,
};

// Create configuration
let config = TransformerConfig::micro();
let policy = GatePolicy::default();

// Load weights (pseudo-code): build the `weights` value for the model,
// e.g. from the `model` module's `QuantizedWeights` / `WeightsLoader` types.
// let weights = /* ... */;

// Create transformer
let mut transformer = MincutGatedTransformer::new(config, policy, weights).unwrap();

// Create gate packet from mincut signals
let gate = GatePacket {
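    // Current and previous minimum-cut values (λ) from the mincut analysis.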
    lambda: 100,
    lambda_prev: 95,
    boundary_edges: 5,
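    // Q15 fixed-point fraction: 8192 / 32768 = 0.25.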
    boundary_concentration_q15: 8192,
    partition_count: 3,
    flags: 0,
};

// Prepare input
let input = InferInput {
    tokens: Some(&[1, 2, 3, 4]),
    embedding_q: None,
    embedding_scale: 1.0,
    input_signature: None,
    gate,
    spikes: None,
};

// Allocate output buffer
let mut logits = vec![0i32; 1024];
let mut output = InferOutput::new(&mut logits);

// Run inference
transformer.infer(&input, &mut output).unwrap();

// Check witness for allowed actions
if output.witness.external_writes_enabled == 1 {
    // Safe to persist memory
}

Re-exports§

pub use arena::calculate_arena_size;
pub use arena::LayerWeights;
pub use arena::WeightArena;
pub use arena::WeightRef;
pub use config::GatePolicy;
pub use config::TransformerConfig;
pub use early_exit::CoherenceEarlyExit;
pub use early_exit::EarlyExitConfig;
pub use early_exit::EarlyExitDecision;
pub use early_exit::ExitReason;
pub use error::Error;
pub use error::Result;
pub use flash_attention::flash_attention_forward;
pub use flash_attention::flash_attention_forward_i8;
pub use flash_attention::flash_mha;
pub use flash_attention::FlashAttentionConfig;
pub use gate::GateController;
pub use gate::TierDecision;
pub use kv_cache::HadamardTransform;
pub use kv_cache::QuantBits;
pub use kv_cache::QuantizedKVCache;
pub use mamba::MambaConfig;
pub use mamba::MambaLayer;
pub use mamba::MambaState;
pub use mamba::MambaWeights;
pub use mod_routing::MincutDepthRouter;
pub use mod_routing::ModRoutingConfig;
pub use mod_routing::RoutingStats;
pub use mod_routing::TokenRoute;
pub use model::MincutGatedTransformer;
pub use model::QuantizedWeights;
pub use model::WeightsLoader;
pub use packets::GateDecision;
pub use packets::GatePacket;
pub use packets::GateReason;
pub use packets::InferInput;
pub use packets::InferOutput;
pub use packets::InferStats;
pub use packets::SpikePacket;
pub use packets::Witness;
pub use q15::f32_to_q15_batch;
pub use q15::q15_batch_add;
pub use q15::q15_batch_lerp;
pub use q15::q15_batch_mul;
pub use q15::q15_dot;
pub use q15::q15_to_f32_batch;
pub use q15::Q15;
pub use rope::RopeConfig;
pub use rope::RopeEmbedding;
pub use rope::RopeScaling;
pub use speculative::generate_tree_attention_mask;
pub use speculative::DraftToken;
pub use speculative::DraftTree;
pub use speculative::SpeculativeConfig;
pub use speculative::SpeculativeDecoder;
pub use speculative::VerificationResult;
pub use spike::SpikeScheduler;
pub use state::RuntimeState;

Modules§

arena
Arena allocator for efficient weight storage.
attention
Attention mechanisms for the transformer.
config
Configuration types for the mincut gated transformer.
configs
Supported model configurations.
early_exit
Coherence-driven early exit for self-speculative inference.
error
Error types for mincut gated transformer.
ffn
Quantized Feed-Forward Network (FFN) layer.
flash_attention
FlashAttention-style tiled attention for CPU.
gate
Gate controller for coherence-based intervention.
kernel
Kernel operations for quantized inference.
kv_cache
KV Cache quantization with Hadamard transforms.
mamba
Mamba State Space Model layer.
mod_routing
λ-based Mixture-of-Depths (MoD) routing.
model
Transformer model and weights.
packets
Packet types for gate and spike signaling.
prelude
Prelude module for convenient imports.
q15
Q15 fixed-point arithmetic.
rope
Rotary Position Embeddings (RoPE).
speculative
Speculative decoding with EAGLE-3 style draft trees.
spike
Spike scheduler for event-driven inference.
state
Runtime state and memory management.

Constants§

VERSION
Crate version.