ruvector-attention
Advanced attention mechanisms for vector search and geometric AI, implementing 7 mathematical theories for transformer attention.
Overview
ruvector-attention provides production-ready implementations of advanced attention mechanisms based on mathematical foundations from differential geometry, information theory, and optimal transport. The library combines theoretical rigor with practical optimizations including SIMD acceleration, caching, and quantization.
Features
- 🚀 High-Performance: SIMD-accelerated with 4-way unrolled accumulators
- 🎯 Ergonomic API: Fluent builder pattern and preset configurations
- 📦 Modular Design: Mix and match attention mechanisms
- 🔧 Flexible: Support for standard, sparse, graph, and geometric attention
- 🧠 7 Mathematical Theories: Optimal Transport, Mixed Curvature, Topology, Information Geometry, Information Bottleneck, PDE/Diffusion, and Unified Diagnostics
- 📊 Unified Reporting: Health monitoring and automatic attention mode selection
- 🔢 Quantization-Friendly: Component-wise precision control (8-bit Euclidean, 5-bit Hyperbolic/Spherical)
Supported Attention Mechanisms
Standard Attention
- Scaled Dot-Product: `softmax(QK^T / √d)V`
- Multi-Head: Parallel attention heads with diverse representations
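For reference, the formula can be computed directly. The following self-contained sketch evaluates scaled dot-product attention for a single query; it is purely illustrative and does not use the ruvector-attention API.

```rust
// Plain-Rust sketch of softmax(q·Kᵀ / √d) · V for one query (illustrative only).
fn scaled_dot_product(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let d = q.len() as f32;
    // Attention logits: qᵀk / √d for each key
    let logits: Vec<f32> = keys
        .iter()
        .map(|k| q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() / d.sqrt())
        .collect();
    // Numerically stable softmax
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    // Weighted sum of values
    let mut out = vec![0.0; values[0].len()];
    for (w, v) in exps.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += (w / sum) * x;
        }
    }
    out
}
```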
Sparse Attention (Memory Efficient)
- Flash Attention: O(n) memory complexity with tiled computation
- Linear Attention: O(n) complexity using kernel approximation
- Local-Global: Sliding window + global tokens (Longformer-style)
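The O(n) behaviour of linear attention comes from replacing the softmax with a kernel feature map. The sketch below is a standalone illustration for a single query, assuming the common φ(x) = elu(x) + 1 feature map; it is not the crate's implementation.

```rust
// Linear attention via the kernel trick: softmax(QKᵀ)V ≈ φ(q)ᵀ(Σ φ(k_j) v_jᵀ) / (φ(q)ᵀ Σ φ(k_j)).
// Illustrative sketch for one query, with φ(x) = elu(x) + 1.
fn linear_attention(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let phi = |x: f32| if x > 0.0 { x + 1.0 } else { x.exp() }; // elu(x) + 1
    let d_v = values[0].len();
    // Accumulate S = Σ_j φ(k_j) v_jᵀ and z = Σ_j φ(k_j) in a single pass over the keys
    let mut s = vec![vec![0.0_f32; d_v]; q.len()];
    let mut z = vec![0.0_f32; q.len()];
    for (k, v) in keys.iter().zip(values) {
        for (i, ki) in k.iter().enumerate() {
            let fk = phi(*ki);
            z[i] += fk;
            for (j, vj) in v.iter().enumerate() {
                s[i][j] += fk * vj;
            }
        }
    }
    // Output = φ(q)ᵀ S / (φ(q)ᵀ z)
    let fq: Vec<f32> = q.iter().map(|&x| phi(x)).collect();
    let denom: f32 = fq.iter().zip(&z).map(|(a, b)| a * b).sum::<f32>() + 1e-6;
    (0..d_v)
        .map(|j| fq.iter().enumerate().map(|(i, a)| a * s[i][j]).sum::<f32>() / denom)
        .collect()
}
```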
Geometric Attention
- Hyperbolic Attention: Attention in hyperbolic space for hierarchical data
- Mixed Curvature: Dynamic curvature for complex geometries
Graph Attention
- Edge-Featured GAT: Graph attention with edge features
- RoPE: Rotary Position Embeddings for graphs
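As a refresher on the mechanism, rotary embeddings rotate consecutive dimension pairs by a position-dependent angle. The sketch below uses the common 10000^(-2i/d) frequency schedule and is illustrative only, not the crate's `rope` module.

```rust
// Rotary position embedding: rotate each pair (x_{2i}, x_{2i+1}) by pos * θ_i,
// with θ_i = 10000^(-2i/d). Illustrative sketch only.
fn apply_rope(x: &mut [f32], pos: f32) {
    let d = x.len();
    for i in 0..d / 2 {
        let theta = pos * 10000_f32.powf(-2.0 * i as f32 / d as f32);
        let (sin, cos) = theta.sin_cos();
        let (a, b) = (x[2 * i], x[2 * i + 1]);
        x[2 * i] = a * cos - b * sin;
        x[2 * i + 1] = a * sin + b * cos;
    }
}
```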
Mixture-of-Experts
- MoE Attention: Learned routing to specialized expert modules
- Top-k Routing: Efficient expert selection
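Top-k routing itself is a small operation: pick the k experts with the highest router scores and re-normalize their weights. A standalone sketch (illustrative, not the crate's router):

```rust
// Select the top-k experts by router logit and softmax-normalize their scores.
// Returns (expert_index, weight) pairs. Illustrative sketch only.
fn top_k_route(router_logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut indexed: Vec<(usize, f32)> = router_logits.iter().cloned().enumerate().collect();
    indexed.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    indexed.truncate(k);
    let max = indexed.iter().map(|(_, l)| *l).fold(f32::NEG_INFINITY, f32::max);
    let sum: f32 = indexed.iter().map(|(_, l)| (l - max).exp()).sum();
    indexed.into_iter().map(|(i, l)| (i, (l - max).exp() / sum)).collect()
}
```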
7 Mathematical Theories
This crate implements attention mechanisms grounded in 7 distinct mathematical theories:
| # | Theory | Module | Key Types | Use Case |
|---|---|---|---|---|
| 1 | Optimal Transport | `transport` | `SlicedWassersteinAttention`, `CentroidOTAttention` | Distribution matching, Earth mover distance |
| 2 | Mixed Curvature | `curvature` | `MixedCurvatureFusedAttention`, `TangentSpaceMapper` | Product spaces E^e × H^h × S^s |
| 3 | Topology | `topology` | `TopologyGatedAttention`, `WindowCoherence` | Coherence-based mode switching |
| 4 | Information Geometry | `info_geometry` | `FisherMetric`, `NaturalGradient` | Natural gradient descent |
| 5 | Information Bottleneck | `info_bottleneck` | `InformationBottleneck`, `KLDivergence` | Compression via KL minimization |
| 6 | PDE/Diffusion | `pde_attention` | `DiffusionAttention`, `GraphLaplacian` | Heat equation on similarity graph |
| 7 | Unified Diagnostics | `unified_report` | `GeometryReport`, `ReportBuilder` | Health monitoring & mode selection |
Theory 1: Optimal Transport Attention
Attention as mass transport between query and key distributions using Wasserstein distance.
```rust
use ruvector_attention::transport::{SlicedWassersteinAttention, SlicedWassersteinConfig};

// Configure Sliced Wasserstein with 16 random projections
// (field names and method signatures below are illustrative; see the crate docs for the exact API)
let config = SlicedWassersteinConfig { num_projections: 16, ..Default::default() };
let ot_attention = SlicedWassersteinAttention::new(config);

// Compute OT-based attention scores (key_data / value_data come from your index)
let query = vec![0.1_f32; 64];
let keys: Vec<Vec<f32>> = key_data.iter().map(|k| k.to_vec()).collect();
let values: Vec<Vec<f32>> = value_data.iter().map(|v| v.to_vec()).collect();
let output = ot_attention.compute_sliced(&query, &keys, &values)?;
```
Key Features:
- Sliced Wasserstein with cached sorted projections
- Two-stage filtering: cheap dot-product → expensive OT kernel
- Centroid OT: cluster keys into M centroids for O(M) transport
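The heart of the sliced approach is that a 1-D Wasserstein distance along each random projection reduces to sorting. A minimal sketch of that idea, independent of the crate API and assuming equal-sized point sets:

```rust
// Sliced Wasserstein-1 between two equal-sized point sets along random directions.
// Illustrative sketch only; the crate caches the sorted projections instead of resorting.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn sliced_wasserstein(xs: &[Vec<f32>], ys: &[Vec<f32>], directions: &[Vec<f32>]) -> f32 {
    let mut total = 0.0;
    for dir in directions {
        // Project both sets onto the direction
        let mut px: Vec<f32> = xs.iter().map(|x| dot(x, dir)).collect();
        let mut py: Vec<f32> = ys.iter().map(|y| dot(y, dir)).collect();
        // 1-D optimal transport = match sorted projections
        px.sort_by(|a, b| a.partial_cmp(b).unwrap());
        py.sort_by(|a, b| a.partial_cmp(b).unwrap());
        total += px.iter().zip(&py).map(|(a, b)| (a - b).abs()).sum::<f32>() / px.len() as f32;
    }
    total / directions.len() as f32
}
```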
Theory 2: Mixed Curvature Attention
Attention in product manifolds combining Euclidean (E), Hyperbolic (H), and Spherical (S) spaces.
```rust
use ruvector_attention::curvature::{FusedCurvatureConfig, MixedCurvatureFusedAttention, TangentSpaceMapper};

// Configure mixed curvature with component dimensions
// (field names and constructor arguments are illustrative; see the crate docs for the exact API)
let config = FusedCurvatureConfig { euclidean_dim: 32, hyperbolic_dim: 16, spherical_dim: 16, ..Default::default() };
let mixed_attention = MixedCurvatureFusedAttention::new(config);

// Map hyperbolic vectors to tangent space for efficient computation
let mapper = TangentSpaceMapper::new();
let tangent_keys = mapper.map_to_tangent(&hyperbolic_keys);
```
Key Features:
- Tangent space mapping (avoids expensive geodesic computations)
- Fused dot kernel: single vectorized loop for E+H+S similarities
- Per-head learned mixing weights
- Component quantization: 8-bit E, 5-bit H/S
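For intuition about the tangent space mapping, the log map at the origin of the unit Poincaré ball sends a hyperbolic point to a Euclidean tangent vector. The sketch below uses one common convention for curvature -1 (conventions differ by a constant factor) and is not the crate's `TangentSpaceMapper`.

```rust
// Log map at the origin of the Poincaré ball (curvature -1): x ↦ artanh(‖x‖) · x/‖x‖.
// Illustrative sketch; the crate's tangent space mapping may differ in detail.
fn log_map_origin(x: &[f32]) -> Vec<f32> {
    let norm = x.iter().map(|v| v * v).sum::<f32>().sqrt();
    if norm < 1e-7 {
        return x.to_vec(); // near the origin the map is approximately the identity
    }
    let scale = norm.atanh() / norm;
    x.iter().map(|v| v * scale).collect()
}
```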
Theory 3: Topology-Gated Attention
Adaptive attention that switches modes based on local coherence metrics.
```rust
use ruvector_attention::topology::{TopologyGatedAttention, TopologyGatedConfig};

// Illustrative configuration; set coherence thresholds via the config fields (see the crate docs)
let config = TopologyGatedConfig::default();
let gated = TopologyGatedAttention::new(config);

// Attention automatically adjusts based on window coherence
let output = gated.compute_gated(&query, &keys, &values)?;
let mode = gated.current_mode(); // Stable, Cautious, or Freeze
```
Coherence Metrics:
| Metric | Description |
|---|---|
| `BoundaryMass` | Mass near window boundaries |
| `CutProxy` | Proxy for graph cut quality |
| `Disagreement` | Variance in attention weights |
| `SimilarityVariance` | Local similarity variance |
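For intuition, a signal like `SimilarityVariance` can be sketched as the variance of pairwise similarities inside a window. This is illustrative only; the crate's `WindowCoherence` may compute it differently.

```rust
// Variance of pairwise dot-product similarities within a window of key vectors.
// Illustrative sketch; not the crate's exact definition.
fn similarity_variance(window: &[Vec<f32>]) -> f32 {
    let mut sims = Vec::new();
    for i in 0..window.len() {
        for j in (i + 1)..window.len() {
            let dot: f32 = window[i].iter().zip(&window[j]).map(|(a, b)| a * b).sum();
            sims.push(dot);
        }
    }
    if sims.is_empty() {
        return 0.0;
    }
    let mean = sims.iter().sum::<f32>() / sims.len() as f32;
    sims.iter().map(|s| (s - mean).powi(2)).sum::<f32>() / sims.len() as f32
}
```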
Theory 4: Information Geometry
Natural gradient optimization using the Fisher Information Matrix.
```rust
use ruvector_attention::info_geometry::{FisherMetric, NaturalGradient};

// Fisher metric for probability distributions
// (constructor and method signatures are illustrative; see the crate docs for the exact API)
let fisher = FisherMetric::new();

// Compute F * v (Fisher-vector product)
let probs = vec![0.5, 0.3, 0.2];
let direction = vec![0.1, -0.05, -0.05];
let fv = fisher.apply(&probs, &direction);

// Natural gradient optimizer
let ng = NaturalGradient::new(0.01); // learning rate

// Update logits using natural gradient: θ ← θ - lr * F^{-1} * ∇L
let new_logits = ng.step_logits(&logits, &grad);
```
Key Features:
- Conjugate gradient solver for F^{-1} * v
- Diagonal approximation for speed
- SIMD-accelerated matrix-vector operations
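For a softmax/categorical distribution over logits, the Fisher matrix has the closed form F = diag(p) - p pᵀ, so the Fisher-vector product never needs an explicit matrix. A standalone sketch of that identity (independent of the crate's `FisherMetric`):

```rust
// Fisher-vector product for a categorical distribution parameterized by logits:
// F = diag(p) - p pᵀ, hence F·v = p ⊙ v - p (pᵀ v). Illustrative sketch only.
fn fisher_vector_product(probs: &[f32], v: &[f32]) -> Vec<f32> {
    let pv: f32 = probs.iter().zip(v).map(|(p, x)| p * x).sum();
    probs
        .iter()
        .zip(v)
        .map(|(p, x)| p * x - p * pv)
        .collect()
}
```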
Theory 5: Information Bottleneck
Attention compression via the Information Bottleneck principle.
```rust
use ruvector_attention::info_bottleneck::{gaussian_to_unit, DiagonalGaussian, InformationBottleneck};

// Information bottleneck layer
// (constructor arguments and field names are illustrative; see the crate docs for the exact API)
let ib = InformationBottleneck::new(64, 16); // input dim, bottleneck dim

// Compute KL divergence between a diagonal Gaussian and the unit normal
let gaussian = DiagonalGaussian { mean: vec![0.0; 16], log_var: vec![0.0; 16] };
let kl = gaussian_to_unit(&gaussian);

// Compress attention weights
let compressed = ib.compress_attention_weights(&attention_weights);

// Reparameterized sampling
let z = ib.sample(&gaussian);
```
Key Features:
- KL divergence: Gaussian→Unit, Categorical, Jensen-Shannon
- Variational Information Bottleneck (VIB)
- Temperature annealing for curriculum learning
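The Gaussian-to-unit KL term used by the VIB loss has a simple closed form. The sketch below assumes a mean and log-variance parameterization (not necessarily the crate's `DiagonalGaussian` layout):

```rust
// KL( N(μ, diag(σ²)) ‖ N(0, I) ) = ½ Σ (σ² + μ² - 1 - ln σ²), with σ² = exp(log_var).
// Illustrative sketch only.
fn kl_gaussian_to_unit(mean: &[f32], log_var: &[f32]) -> f32 {
    mean.iter()
        .zip(log_var)
        .map(|(mu, lv)| 0.5 * (lv.exp() + mu * mu - 1.0 - lv))
        .sum()
}
```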
Theory 6: PDE/Diffusion Attention
Attention as heat diffusion on the key similarity graph.
```rust
use ruvector_attention::pde_attention::{DiffusionAttention, DiffusionConfig, GraphLaplacian};

// Build graph Laplacian from keys
// (constructor/field names and arguments are illustrative; see the crate docs for the exact API)
let laplacian = GraphLaplacian::from_keys(&keys);

// Diffusion attention with heat equation
let config = DiffusionConfig { num_steps: 4, time_step: 0.1, ..Default::default() };
let diffusion = DiffusionAttention::new(config);

// Compute diffused attention
let output = diffusion.compute_diffusion(&query, &keys, &values)?;

// Multi-scale diffusion (captures different granularities)
let scales = diffusion.compute_multiscale(&query, &keys, &values);
```
Laplacian Types:
| Type | Formula | Properties |
|---|---|---|
| `Unnormalized` | L = D - W | Graph spectrum analysis |
| `SymmetricNormalized` | L = I - D^{-1/2} W D^{-1/2} | Symmetric, eigenvalues in [0, 2] |
| `RandomWalk` | L = I - D^{-1} W | Probability transitions |
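For reference, the symmetric normalized Laplacian can be built directly from a similarity (weight) matrix W as L = I - D^{-1/2} W D^{-1/2}. A minimal dense sketch, independent of the crate's `GraphLaplacian`:

```rust
// Build L = I - D^{-1/2} W D^{-1/2} from a dense similarity matrix W (n×n).
// Illustrative sketch only; real implementations typically use sparse W.
fn symmetric_normalized_laplacian(w: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let n = w.len();
    // Degree vector: row sums of W
    let d_inv_sqrt: Vec<f32> = w
        .iter()
        .map(|row| {
            let deg: f32 = row.iter().sum();
            if deg > 0.0 { 1.0 / deg.sqrt() } else { 0.0 }
        })
        .collect();
    let mut l = vec![vec![0.0_f32; n]; n];
    for i in 0..n {
        for j in 0..n {
            let identity = if i == j { 1.0 } else { 0.0 };
            l[i][j] = identity - d_inv_sqrt[i] * w[i][j] * d_inv_sqrt[j];
        }
    }
    l
}
```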
Theory 7: Unified Geometry Report
Diagnostic dashboard combining all metrics for intelligent attention mode selection.
```rust
use ruvector_attention::unified_report::{GeometryReport, ReportBuilder};

// Build comprehensive geometry report
// (builder method arguments and accessor names are illustrative; see the crate docs)
let report = ReportBuilder::new()
    .with_ot_distance(0.2)
    .with_topology_coherence(0.8)
    .with_ib_kl(0.1)
    .with_diffusion_energy(0.5)
    .with_attention_entropy(2.3)
    .build();

// Get health score (0-1)
println!("health score: {:.2}", report.health_score());

// Get automatic attention mode recommendation
match report.recommendation() {
    // e.g. switch between standard, sparse, and geometric attention here
    mode => println!("recommended mode: {:?}", mode),
}

// Check individual metrics
for metric in &report.metrics {
    println!("{:?}", metric);
}
```
Metrics Tracked:
| Metric | Healthy Range | Warning | Critical |
|---|---|---|---|
| OT Distance | 0.0 - 0.5 | > 0.3 | > 0.7 |
| Topology Coherence | 0.5 - 1.0 | < 0.3 | < 0.1 |
| IB KL | 0.0 - 0.2 | > 0.5 | > 1.0 |
| Diffusion Energy | 0.0 - 1.0 | > 2.0 | > 5.0 |
| Attention Entropy | 1.0 - 4.0 | < 0.5 | < 0.1 |
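A threshold policy following this table could look like the sketch below; it is illustrative only, since the crate's `GeometryReport` encapsulates its own thresholds.

```rust
// Classify one metric against the warning/critical thresholds from the table above
// (shown for OT Distance). Illustrative sketch; not the crate's API.
enum MetricStatus { Healthy, Warning, Critical }

fn classify_ot_distance(value: f32) -> MetricStatus {
    if value > 0.7 {
        MetricStatus::Critical
    } else if value > 0.3 {
        MetricStatus::Warning
    } else {
        MetricStatus::Healthy
    }
}
```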
Quick Start
```rust
use ruvector_attention::sdk::*;

// Simple multi-head attention
// (builder arguments and values below are illustrative: dim = 512, 8 heads; see the crate docs)
let attention = multi_head(512, 8)
    .dropout(0.1)
    .causal(true)
    .build()?;

// Use preset configurations
let bert = Bert.builder().build()?;
let gpt = Gpt.builder().build()?;

// Build pipelines with normalization
let pipeline = AttentionPipeline::new()
    .add_attention(attention)
    .add_norm()
    .add_residual();

// Compute attention
let query = vec![0.1_f32; 512];
let keys = vec![vec![0.2_f32; 512]; 10];
let values = vec![vec![0.3_f32; 512]; 10];
let output = pipeline.run(&query, &keys, &values)?;
```
Installation
Add to your Cargo.toml:
```toml
[dependencies]
ruvector-attention = "0.1"
```
Or with specific features:
```toml
[dependencies]
ruvector-attention = { version = "0.1", features = ["simd", "wasm"] }
```
SDK Overview
Builder API
The builder provides a fluent interface for configuring attention:
```rust
use ruvector_attention::sdk::*;

// Builder arguments follow the (dim, ...) hints in the comments; the values are illustrative.

// Flash attention for long sequences
let flash = flash(512, 64) // dim, block_size
    .causal(true)
    .dropout(0.1)
    .build()?;

// Linear attention for O(n) complexity
let linear = linear(512, 256) // dim, num_features
    .build()?;

// MoE attention with 8 experts
let moe = moe(512, 8, 2) // dim, num_experts, top_k
    .expert_capacity(64)
    .jitter_noise(0.01)
    .build()?;

// Hyperbolic attention for hierarchies
let hyperbolic = hyperbolic(512, -1.0) // dim, curvature
    .build()?;
```
Pipeline API
Compose attention with pre/post processing:
```rust
use ruvector_attention::sdk::*;

// Method arguments below are illustrative; see the crate docs for exact signatures.
let attention = multi_head(512, 8).build()?;

let pipeline = AttentionPipeline::new()
    .add_norm()               // Pre-normalization
    .add_attention(attention) // Attention layer
    .add_dropout(0.1)         // Dropout
    .add_residual()           // Residual connection
    .add_norm();              // Post-normalization

let output = pipeline.run(&query, &keys, &values)?;
```
Preset Configurations
Pre-configured attention for popular models:
```rust
use ruvector_attention::sdk::*;

// Model-specific presets (method arguments and model-name strings are illustrative)
let bert = Bert.builder().build()?;
let gpt = Gpt.builder().build()?;
let longformer = Longformer.builder().build()?;
let flash = FlashOptimized.builder().build()?;
let t5 = T5.builder().build()?;
let vit = ViT.builder().build()?;

// Smart selection based on use case
let attention = for_sequences(4096).build()?;  // Auto-select by length
let graph_attn = for_graphs(256).build()?;     // Graph attention
let fast_attn = for_large_scale(512).build()?; // Flash attention

// By model name
let bert = from_model_name("bert-base-uncased")?;
let gpt2 = from_model_name("gpt2")?;
```
Architecture
ruvector-attention/
├── src/
│ ├── lib.rs # Main crate entry
│ ├── error.rs # Error types
│ ├── traits.rs # Core attention traits
│ │
│ ├── attention/ # Standard attention
│ │ ├── scaled_dot_product.rs
│ │ └── multi_head.rs
│ │
│ ├── sparse/ # Sparse attention (O(n) memory)
│ │ ├── flash.rs # Flash attention (tiled)
│ │ ├── linear.rs # Kernel approximation
│ │ └── local_global.rs # Longformer-style
│ │
│ ├── graph/ # Graph attention
│ │ ├── edge_featured.rs # GAT with edge features
│ │ ├── dual_space.rs # Dual-space attention
│ │ └── rope.rs # Rotary embeddings
│ │
│ ├── hyperbolic/ # Hyperbolic geometry
│ │ ├── hyperbolic_attention.rs
│ │ ├── mixed_curvature.rs
│ │ └── poincare.rs
│ │
│ ├── moe/ # Mixture-of-Experts
│ │ ├── expert.rs # Expert modules
│ │ ├── router.rs # Top-k routing
│ │ └── moe_attention.rs
│ │
│ ├── transport/ # [Theory 1] Optimal Transport
│ │ ├── sliced_wasserstein.rs # Sliced OT attention
│ │ ├── centroid_ot.rs # Centroid-based OT
│ │ └── cached_projections.rs # Projection caching
│ │
│ ├── curvature/ # [Theory 2] Mixed Curvature
│ │ ├── tangent_space.rs # Tangent space mapping
│ │ ├── fused_attention.rs # Fused E+H+S kernel
│ │ └── component_quantizer.rs # 8-bit/5-bit quantization
│ │
│ ├── topology/ # [Theory 3] Topology Gating
│ │ ├── coherence.rs # Window coherence metrics
│ │ ├── policy.rs # 3-mode policy (Stable/Cautious/Freeze)
│ │ └── gated_attention.rs # Adaptive gated attention
│ │
│ ├── info_geometry/ # [Theory 4] Information Geometry
│ │ ├── fisher.rs # Fisher information matrix
│ │ └── natural_gradient.rs # Natural gradient descent
│ │
│ ├── info_bottleneck/ # [Theory 5] Information Bottleneck
│ │ ├── kl_divergence.rs # KL, JS divergences
│ │ └── bottleneck.rs # VIB layer
│ │
│ ├── pde_attention/ # [Theory 6] PDE/Diffusion
│ │ ├── laplacian.rs # Graph Laplacian construction
│ │ └── diffusion.rs # Heat equation attention
│ │
│ ├── unified_report/ # [Theory 7] Unified Diagnostics
│ │ ├── metrics.rs # Metric types and values
│ │ ├── report.rs # Geometry report builder
│ │ └── recommendation.rs # Attention mode recommendations
│ │
│ ├── training/ # Training utilities
│ │ ├── loss.rs # InfoNCE, contrastive losses
│ │ ├── optimizer.rs # SGD, Adam, AdamW
│ │ └── curriculum.rs # Curriculum scheduling
│ │
│ └── sdk/ # High-level SDK
│ ├── builder.rs # Fluent builder API
│ ├── pipeline.rs # Composable pipelines
│ └── presets.rs # Model presets (BERT, GPT, etc.)
Examples
Transformer Block
Long Context Processing
Graph Neural Network
Performance
Complexity Comparison
| Mechanism | Time | Memory | Use Case |
|---|---|---|---|
| Scaled Dot-Product | O(n²) | O(n²) | Short sequences |
| Multi-Head | O(n²) | O(n²) | Standard transformers |
| Flash Attention | O(n²) | O(n) | Long sequences |
| Linear Attention | O(n) | O(n) | Very long sequences |
| Local-Global | O(n·w) | O(n·w) | Document processing |
| Hyperbolic | O(n²) | O(n²) | Hierarchical data |
| MoE | O(n²/E) | O(n²) | Specialized tasks |
Advanced Mechanisms Complexity
| Theory | Mechanism | Time | Memory | Notes |
|---|---|---|---|---|
| OT | Sliced Wasserstein | O(n·P·log n) | O(n·P) | P = num projections |
| OT | Centroid OT | O(n + M²) | O(M·d) | M = num centroids |
| Curvature | Mixed Curvature | O(n²) | O(n²) | Fused E+H+S kernel |
| Topology | Gated Attention | O(n²) | O(n²) | + O(n) coherence |
| Info Geo | Natural Gradient | O(n²) | O(n) | CG solver |
| Info Bottle | VIB | O(n·z) | O(z) | z = bottleneck dim |
| PDE | Diffusion | O(n²·T) | O(n²) | T = diffusion steps |
Where:
- `n` = sequence length
- `w` = local window size
- `E` = number of experts
- `P` = number of random projections (typically 8-16)
- `M` = number of centroids (typically 16-32)
- `z` = bottleneck dimension
- `T` = number of diffusion time steps
Benchmarks
On a typical workload (batch_size=32, seq_len=512, dim=768):
- Flash Attention: 2.3x faster, 5x less memory than standard
- Linear Attention: O(n) scaling for sequences >4096
- Local-Global: 60% of standard attention cost for w=256
- Sliced Wasserstein: 1.8x slower than standard, but better distribution matching
- Mixed Curvature: ~1.3x standard with tangent space optimization
- Diffusion Attention: 2-10x slower depending on T, but captures multi-scale structure
Tutorials
Tutorial 1: Building a Geometry-Aware Transformer
Combine multiple geometric attention mechanisms for hierarchical data.
Tutorial 2: Adaptive Attention with Unified Report
Use the unified report to automatically select the best attention mode.
Tutorial 3: Information Bottleneck for Attention Compression
Use VIB to learn compressed attention representations.
Tutorial 4: Multi-Scale Diffusion for Document Understanding
Use diffusion attention at multiple scales for long documents.
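The core idea can be sketched without the crate API: run the discrete heat equation x ← x - τ L x on the similarity graph and keep one smoothed signal per diffusion scale. The helper below is illustrative only.

```rust
// Explicit-Euler heat diffusion x_{t+1} = x_t - τ L x_t on a dense Laplacian,
// evaluated at several diffusion scales (step counts, assumed sorted ascending).
// Illustrative sketch only.
fn multiscale_diffusion(laplacian: &[Vec<f32>], x0: &[f32], tau: f32, scales: &[usize]) -> Vec<Vec<f32>> {
    let mut x = x0.to_vec();
    let mut outputs = Vec::new();
    let mut steps_done = 0;
    for &target_steps in scales {
        while steps_done < target_steps {
            // One explicit Euler step of the heat equation
            let lx: Vec<f32> = laplacian
                .iter()
                .map(|row| row.iter().zip(&x).map(|(a, b)| a * b).sum::<f32>())
                .collect();
            for (xi, lxi) in x.iter_mut().zip(&lx) {
                *xi -= tau * lxi;
            }
            steps_done += 1;
        }
        outputs.push(x.clone()); // coarser scale = more diffusion steps
    }
    outputs
}
```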
Tutorial 5: Natural Gradient Training Loop
Train attention parameters with geometry-aware optimization.
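A minimal version of the loop can be written with the diagonal Fisher approximation for softmax logits mentioned in Theory 4. This sketch is illustrative only and does not use the crate's optimizer.

```rust
// Natural-gradient-style update on logits using the diagonal Fisher approximation
// F ≈ diag(p (1 - p)): θ ← θ - lr * g / (p(1-p) + ε). Illustrative sketch only.
fn natural_gradient_step(logits: &mut [f32], grad: &[f32], lr: f32) {
    // Softmax probabilities
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let probs: Vec<f32> = exps.iter().map(|e| e / sum).collect();

    for ((theta, g), p) in logits.iter_mut().zip(grad).zip(&probs) {
        let fisher_diag = p * (1.0 - p) + 1e-6; // diagonal Fisher approximation
        *theta -= lr * g / fisher_diag;
    }
}

// Training loop: repeatedly compute a loss gradient and apply the preconditioned step.
fn train(mut logits: Vec<f32>, grads: impl Iterator<Item = Vec<f32>>, lr: f32) -> Vec<f32> {
    for grad in grads {
        natural_gradient_step(&mut logits, &grad, lr);
    }
    logits
}
```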
Features
- `simd` - SIMD acceleration (enabled by default)
- `wasm` - WebAssembly support
- `napi` - Node.js bindings
Documentation
- SDK Guide - Comprehensive SDK usage guide
- API Documentation - Full API reference
- Examples - Working code examples
Contributing
Contributions are welcome! Please see CONTRIBUTING.md.
License
Licensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE)
- MIT License (LICENSE-MIT)
at your option.
Citation
If you use this crate in your research, please cite:
Related Projects
- ruvector - Core vector search engine
- ruvector-graph - Graph neural networks
- ruvector-gnn - Geometric neural networks