§Mincut Gated Transformer
Ultra-low-latency transformer inference designed for continuous systems. Inference is governed by a coherence controller driven by dynamic minimum-cut signals and, optionally, by a spiking scheduler that skips work when nothing meaningful is happening.
§Academic Foundations
This crate integrates multiple state-of-the-art optimization techniques:
- Mixture-of-Depths (Raposo et al., 2024) - Dynamic compute allocation with 50% FLOPs reduction
- Early Exit (Elhoushi et al., 2024) - Layer-skipping with 30-50% latency reduction (sketched below)
- Sparse Attention (Jiang et al., 2024) - 90% attention FLOPs reduction for long contexts
- Energy-Based Transformers (Gladstone et al., 2025) - Principled compute-quality tradeoffs
- Spike-Driven Inference (Yao et al., 2023, 2024) - 87× energy reduction via event-driven compute
- Spectral Methods (Kreuzer et al., 2021) - Graph-based coherence via spectral partitioning
See docs/THEORY.md for detailed academic references and theoretical analysis.
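To make the early-exit idea referenced above concrete, here is a minimal, crate-agnostic sketch: run layers until a per-layer confidence score crosses a threshold, then skip the rest. The Layer type, confidence function, and threshold are illustrative placeholders, not this crate's CoherenceEarlyExit API, which drives the decision from coherence signals instead.
// Hypothetical layer: identity plus a bias, only to make the sketch runnable.
struct Layer { bias: f32 }
impl Layer {
    fn forward(&self, h: &[f32]) -> Vec<f32> {
        h.iter().map(|x| x + self.bias).collect()
    }
}
// Toy confidence score: mean activation magnitude (illustrative only).
fn confidence(h: &[f32]) -> f32 {
    h.iter().map(|x| x.abs()).sum::<f32>() / h.len() as f32
}
// Run layers until the score crosses `threshold`; returns the hidden state
// and how many layers actually executed.
fn forward_with_early_exit(layers: &[Layer], mut hidden: Vec<f32>, threshold: f32) -> (Vec<f32>, usize) {
    for (i, layer) in layers.iter().enumerate() {
        hidden = layer.forward(&hidden);
        if confidence(&hidden) >= threshold {
            return (hidden, i + 1); // the remaining layers are skipped
        }
    }
    let count = layers.len();
    (hidden, count)
}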
§Primary Outcomes
- Deterministic, bounded inference - Same inputs yield same outputs
- Allocation-free hot path - Zero heap allocations after initialization (see the sketch after this list)
- Predictable tail latency - Bounded p99 latency guarantees
- Explainable interventions - Every gate decision produces a witness
- Easy integration - Works with RuVector, ruvector-mincut, and agent orchestration
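A minimal sketch of the allocation-free pattern referred to above: allocate the output buffer once and reuse it every iteration. It assumes `transformer` and `input` are constructed as in the example below, and error handling is elided.
// One-time allocation at startup; the steady-state loop below never allocates.
let mut logits = vec![0i32; 1024];
loop {
    // `output` borrows the same buffer on every iteration.
    let mut output = InferOutput::new(&mut logits);
    transformer.infer(&input, &mut output).unwrap();
    // Inspect output.witness, then rebuild `input` from the next gate/spike signals.
}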
§Core Concepts
The system has three roles, whose interaction is sketched after this list:
- Transformer Kernel - Produces logits or scores under fixed compute budgets
- Spike Scheduler (optional) - Decides whether to run and selects compute tier
- Mincut Gate (authoritative) - Decides what state changes are allowed
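The hypothetical control loop below illustrates the order in which the three roles act; none of these names are the crate's API (see the example below for the real types). It only shows that the scheduler runs first, the kernel runs under the chosen budget, and the gate has the final say over state changes.
// Stand-ins for the three roles, expressed as closures.
enum Tier { Skip, Low, Full }

fn step(
    mut scheduler: impl FnMut() -> Tier,        // spike scheduler (optional)
    mut kernel: impl FnMut(&Tier) -> Vec<i32>,  // transformer kernel
    mut gate: impl FnMut(&[i32]) -> bool,       // mincut gate (authoritative)
) {
    // 1. The scheduler decides whether to run at all, and at which compute tier.
    let tier = scheduler();
    if matches!(tier, Tier::Skip) {
        return; // nothing meaningful is happening: spend no compute
    }
    // 2. The kernel produces logits or scores under the fixed budget for this tier.
    let logits = kernel(&tier);
    // 3. The gate decides which state changes those outputs are allowed to cause.
    if gate(&logits) {
        // e.g. external writes / memory persistence are permitted
    }
}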
§Example
use ruvector_mincut_gated_transformer::{
    MincutGatedTransformer, TransformerConfig, GatePolicy,
    GatePacket, InferInput, InferOutput,
};
// Create configuration
let config = TransformerConfig::micro();
let policy = GatePolicy::default();
// Load weights (pseudo-code): obtain a `QuantizedWeights` value, e.g. via `WeightsLoader`
// let weights = ...;
// Create transformer
let mut transformer = MincutGatedTransformer::new(config, policy, weights).unwrap();
// Create gate packet from mincut signals
let gate = GatePacket {
    lambda: 100,                       // current minimum-cut value (λ)
    lambda_prev: 95,                   // λ from the previous update
    boundary_edges: 5,                 // edges crossing the cut boundary
    boundary_concentration_q15: 8192,  // Q15 fixed point: 8192 / 32768 = 0.25
    partition_count: 3,                // number of graph partitions
    flags: 0,
};
// Prepare input
let input = InferInput {
    tokens: Some(&[1, 2, 3, 4]),  // token IDs (a quantized embedding may be supplied instead)
    embedding_q: None,
    embedding_scale: 1.0,
    input_signature: None,
    gate,
    spikes: None,                 // no spike scheduler packet for this call
};
// Allocate output buffer
let mut logits = vec![0i32; 1024];
let mut output = InferOutput::new(&mut logits);
// Run inference
transformer.infer(&input, &mut output).unwrap();
// Check witness for allowed actions
if output.witness.external_writes_enabled == 1 {
    // Safe to persist memory
}
Re-exports§
pub use arena::{calculate_arena_size, LayerWeights, WeightArena, WeightRef};
pub use config::{GatePolicy, TransformerConfig};
pub use early_exit::{CoherenceEarlyExit, EarlyExitConfig, EarlyExitDecision, ExitReason};
pub use error::{Error, Result};
pub use flash_attention::{flash_attention_forward, flash_attention_forward_i8, flash_mha, FlashAttentionConfig};
pub use gate::{GateController, TierDecision};
pub use kv_cache::{HadamardTransform, QuantBits, QuantizedKVCache};
pub use mamba::{MambaConfig, MambaLayer, MambaState, MambaWeights};
pub use mod_routing::{MincutDepthRouter, ModRoutingConfig, RoutingStats, TokenRoute};
pub use model::{MincutGatedTransformer, QuantizedWeights, WeightsLoader};
pub use packets::{GateDecision, GatePacket, GateReason, InferInput, InferOutput, InferStats, SpikePacket, Witness};
pub use q15::{f32_to_q15_batch, q15_batch_add, q15_batch_lerp, q15_batch_mul, q15_dot, q15_to_f32_batch, Q15};
pub use rope::{RopeConfig, RopeEmbedding, RopeScaling};
pub use speculative::{generate_tree_attention_mask, DraftToken, DraftTree, SpeculativeConfig, SpeculativeDecoder, VerificationResult};
pub use spike::SpikeScheduler;
pub use state::RuntimeState;
Modules§
- arena
- Arena allocator for efficient weight storage.
- attention
- Attention mechanisms for the transformer.
- config
- Configuration types for the mincut gated transformer.
- configs
- Supported model configurations
- early_exit
- Coherence-driven early exit for self-speculative inference.
- error
- Error types for mincut gated transformer.
- ffn
- Quantized Feed-Forward Network (FFN) layer.
- flash_attention
- FlashAttention-style tiled attention for CPU.
- gate
- Gate controller for coherence-based intervention.
- kernel
- Kernel operations for quantized inference.
- kv_cache
- KV Cache quantization with Hadamard transforms.
- mamba
- Mamba State Space Model layer.
- mod_routing
- λ-based Mixture-of-Depths (MoD) routing.
- model
- Transformer model and weights.
- packets
- Packet types for gate and spike signaling.
- prelude
- Prelude module for convenient imports
- q15
- Q15 Fixed-Point Arithmetic
- rope
- Rotary Position Embeddings (RoPE).
- speculative
- Speculative decoding with EAGLE-3 style draft trees.
- spike
- Spike scheduler for event-driven inference.
- state
- Runtime state and memory management.
Constants§
- VERSION
- Crate version