tensorlogic-trustformers
Transformer architectures as TensorLogic einsum graphs
This crate provides implementations of transformer components (self-attention, multi-head attention, feed-forward networks) as einsum operations that compile to TensorLogic IR and execute on any TensorLogic backend.
Features
- Self-Attention - Scaled dot-product attention as einsum operations
- Multi-Head Attention - Parallel attention heads with automatic head splitting/merging
- Feed-Forward Networks - Position-wise FFN with configurable activations (GELU, ReLU, etc.)
- Gated FFN - GLU-style gated feed-forward networks
- Position Encodings - Sinusoidal, learned, relative, RoPE, and ALiBi position encodings
- Layer Normalization - Standard LayerNorm and RMSNorm implementations
- Encoder Layers - Complete transformer encoder layers with pre/post-norm variants
- Decoder Layers - Complete transformer decoder layers with masked self-attention
- Encoder/Decoder Stacks - Multi-layer transformer stacks with flexible configuration
- Rule-Based Attention - Logical rules guiding attention patterns (hard/soft/gated)
- Sparse Attention - Efficient attention for long sequences (strided, local, block-sparse)
- Flash Attention - Memory-efficient attention with tiled SRAM computation (O(n) memory instead of O(n^2))
- Grouped-Query Attention (GQA) - Reduce KV cache memory (MHA/GQA/MQA support)
- Sliding Window Attention - Efficient long-context with O(n*w) complexity
- LoRA - Low-Rank Adaptation for parameter-efficient fine-tuning
- Mixture-of-Experts (MoE) - Sparse expert routing (TopK, Softmax, Switch, ExpertChoice)
- Vision Transformers (ViT) - Patch embedding and ViT configurations (Tiny/Small/Base/Large/Huge)
- Gradient Checkpointing - Memory-efficient training with uniform/selective/dynamic strategies
- KV-Cache - Efficient autoregressive inference with 10-1000x speedup
- TrustformeRS Integration - Bidirectional conversion with TrustformeRS ecosystem
- Utility Functions - Parameter counting, FLOP calculations, model presets
- Performance Benchmarks - Criterion-based benchmark suite with HTML reports
- Type-Safe Configuration - Builder pattern with validation
- Einsum-Native - All operations expressed as einsum for maximum flexibility
- Zero Warnings - Strict code quality enforcement
- 346 Tests - Comprehensive test coverage (100% passing)
Quick Start
use ;
use EinsumGraph;
// Configure and build self-attention
let attn_config = new.unwrap;
let self_attn = new.unwrap;
let mut graph = new;
graph.add_tensor;
graph.add_tensor;
graph.add_tensor;
let outputs = self_attn.build_attention_graph.unwrap;
// Configure feed-forward network
let ffn_config = new
.with_activation
.with_dropout;
let ffn = new.unwrap;
Architecture
Self-Attention Formula
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
Einsum breakdown:
- Query-Key scores: einsum("bqd,bkd->bqk", Q, K)
- Scale: scores / sqrt(d_k)
- Softmax: softmax(scores, axis=-1)
- Attention-Value: einsum("bqk,bkv->bqv", attn, V)
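The four steps above can be sketched in plain Rust for a single batch element. This is an illustrative, std-only sketch and not this crate's API: the `softmax` and `attention` functions and the nested `Vec` layout are assumptions made for the example.

```rust
/// Softmax over one row, stabilised by subtracting the row maximum.
fn softmax(row: &[f32]) -> Vec<f32> {
    let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = row.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Attention for one batch element (hypothetical helper).
/// Shapes: q is [Q, D], k is [K, D], v is [K, V]; the result is [Q, V].
fn attention(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let d_k = k[0].len() as f32;
    let v_dim = v[0].len();
    q.iter()
        .map(|q_row| {
            // "qd,kd->qk", followed by the 1/sqrt(d_k) scale.
            let scores: Vec<f32> = k
                .iter()
                .map(|k_row| {
                    let dot: f32 = q_row.iter().zip(k_row).map(|(a, b)| a * b).sum();
                    dot / d_k.sqrt()
                })
                .collect();
            let attn = softmax(&scores);
            // "qk,kv->qv": weighted sum of the value rows.
            (0..v_dim)
                .map(|j| attn.iter().zip(v).map(|(a, v_row)| a * v_row[j]).sum::<f32>())
                .collect::<Vec<f32>>()
        })
        .collect()
}
```

With two identical keys the attention weights are uniform, so the output is the mean of the two value rows.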
Multi-Head Attention
1. Reshape: [B, S, D] -> [B, H, S, D_k] where D_k = D/H
2. Attention per head: einsum("bhqd,bhkd->bhqk", Q, K)
3. Scale and softmax
4. Apply to values: einsum("bhqk,bhkv->bhqv", attn, V)
5. Concatenate heads: [B, H, S, D_k] -> [B, S, D]
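Steps 1 and 5 are pure index remappings. A minimal sketch, assuming a row-major flat buffer and omitting the batch dimension for brevity; `split_heads` and `merge_heads` are hypothetical names, not functions exported by this crate:

```rust
/// Split a row-major [S, D] buffer into [H, S, D_k], with D_k = D / H.
fn split_heads(x: &[f32], s: usize, d: usize, h: usize) -> Vec<f32> {
    assert_eq!(x.len(), s * d);
    assert_eq!(d % h, 0, "d_model must be divisible by the head count");
    let d_k = d / h;
    let mut out = vec![0.0; x.len()];
    for head in 0..h {
        for pos in 0..s {
            for j in 0..d_k {
                // Feature index head * d_k + j of token `pos` moves to head `head`.
                out[(head * s + pos) * d_k + j] = x[pos * d + head * d_k + j];
            }
        }
    }
    out
}

/// Inverse of `split_heads`: [H, S, D_k] back to [S, D].
fn merge_heads(x: &[f32], s: usize, d: usize, h: usize) -> Vec<f32> {
    assert_eq!(x.len(), s * d);
    assert_eq!(d % h, 0, "d_model must be divisible by the head count");
    let d_k = d / h;
    let mut out = vec![0.0; x.len()];
    for head in 0..h {
        for pos in 0..s {
            for j in 0..d_k {
                out[pos * d + head * d_k + j] = x[(head * s + pos) * d_k + j];
            }
        }
    }
    out
}
```

Merging after splitting returns the original buffer, which is why the reshape adds no parameters of its own.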
Configuration
Attention Configuration
use AttentionConfig;
let config = new?
.with_causal // Enable causal masking
.with_dropout; // Set dropout probability
assert_eq!;
assert_eq!;
assert_eq!; // Automatically computed
Complete Transformer Layer
use TransformerLayerConfig;
let config = new?
.with_pre_norm; // Use pre-layer normalization
assert!;
Position Encodings
Five types of position encodings for sequence modeling:
use ;
// Sinusoidal (fixed) encoding
let config = sinusoidal;
let pe = new.unwrap;
// Rotary Position Embedding (RoPE) - used in LLaMA
// Attention with Linear Biases (ALiBi) - used in BLOOM
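For reference, the fixed sinusoidal encoding from "Attention Is All You Need" can be written in a few lines. This is a std-only sketch of the formula, not this crate's implementation; `sinusoidal_encoding` is a hypothetical name.

```rust
/// Sinusoidal position table of shape [max_len, d_model]:
/// PE(pos, 2k) = sin(pos / 10000^(2k / d_model)), PE(pos, 2k+1) = cos(same angle).
fn sinusoidal_encoding(max_len: usize, d_model: usize) -> Vec<Vec<f32>> {
    (0..max_len)
        .map(|pos| {
            (0..d_model)
                .map(|i| {
                    // Dimensions 2k and 2k+1 share one frequency.
                    let exponent = (2 * (i / 2)) as f32 / d_model as f32;
                    let angle = pos as f32 / 10000_f32.powf(exponent);
                    if i % 2 == 0 { angle.sin() } else { angle.cos() }
                })
                .collect::<Vec<f32>>()
        })
        .collect()
}
```

Position 0 always encodes to alternating 0 and 1, since sin(0) = 0 and cos(0) = 1.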
Flash Attention
Memory-efficient attention with tiled SRAM computation:
use ;
// A100 GPU preset
let config = a100;
let flash = new?;
// Custom tiling
let config = new
.with_block_size_q
.with_block_size_kv
.with_causal;
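The tiling relies on an online softmax: keys are visited block by block while a running maximum and running normaliser are rescaled, so the full score row never needs to be materialised. The sketch below shows that idea for one query with scalar values. It is an assumption-laden illustration (the name `online_softmax_attention` is hypothetical) and says nothing about how this crate schedules tiles.

```rust
/// Attention output for one query, computed block by block.
/// `scores` holds the scaled query-key scores, `values` one scalar per key.
/// `block` must be non-zero (`chunks` panics on 0).
fn online_softmax_attention(scores: &[f32], values: &[f32], block: usize) -> f32 {
    let mut running_max = f32::NEG_INFINITY;
    let mut running_sum = 0.0_f32; // softmax denominator, relative to running_max
    let mut acc = 0.0_f32; // unnormalised weighted sum of values
    for (s_blk, v_blk) in scores.chunks(block).zip(values.chunks(block)) {
        let blk_max = s_blk.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let new_max = running_max.max(blk_max);
        // Rescale what was accumulated under the old maximum.
        let correction = (running_max - new_max).exp();
        running_sum *= correction;
        acc *= correction;
        for (s, v) in s_blk.iter().zip(v_blk) {
            let w = (s - new_max).exp();
            running_sum += w;
            acc += w * v;
        }
        running_max = new_max;
    }
    acc / running_sum
}
```

The result matches an ordinary softmax-weighted sum up to floating-point rounding, regardless of the block size chosen.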
Grouped-Query Attention (GQA)
Reduce KV cache memory for efficient inference:
use ;
// LLaMA 2 70B style (8 KV heads, 64 query heads)
let config = llama2_70b;
let gqa = new?;
// Memory savings compared to MHA
println!;
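The saving comes from caching fewer K/V heads while keeping all query heads. A back-of-the-envelope sketch, assuming contiguous grouping of query heads onto KV heads; `kv_cache_bytes` and `kv_head_for` are hypothetical helpers, not part of this crate:

```rust
/// Bytes needed to cache keys and values for one sequence.
/// The leading factor 2 covers the separate K and V tensors.
fn kv_cache_bytes(layers: u64, kv_heads: u64, head_dim: u64, seq_len: u64, bytes_per_elem: u64) -> u64 {
    2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
}

/// KV head shared by a given query head, assuming contiguous grouping.
fn kv_head_for(query_head: usize, query_heads: usize, kv_heads: usize) -> usize {
    assert_eq!(query_heads % kv_heads, 0);
    query_head / (query_heads / kv_heads)
}
```

With 64 query heads and 8 KV heads, the cache is 8x smaller than full multi-head attention for the same layer count, head dimension, and sequence length.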
Sliding Window Attention
Efficient long-context handling:
use ;
// Mistral 7B style
let config = mistral_7b;
let swa = new?;
// O(n*w) complexity instead of O(n^2)
println!;
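The complexity claim can be checked by counting which (query, key) pairs a causal window admits. A small sketch with hypothetical helper names; the real mask construction in this crate may differ:

```rust
/// True if `query` may attend to `key` under a causal window of `window` tokens.
fn can_attend(query: usize, key: usize, window: usize) -> bool {
    key <= query && query - key < window
}

/// Number of (query, key) pairs that are actually scored for a length-n sequence.
fn attended_pairs(n: usize, window: usize) -> usize {
    (0..n)
        .map(|q| (0..n).filter(|&k| can_attend(q, k, window)).count())
        .sum::<usize>()
}
```

Each query scores at most `window` keys, so the count grows as n*w. Setting `window` to n recovers ordinary causal attention with n(n+1)/2 pairs.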
LoRA (Low-Rank Adaptation)
Parameter-efficient fine-tuning:
use ;
// Standard LoRA configuration
let config = standard?;
let lora_attn = new?;
// Compression ratio
println!;
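LoRA keeps the frozen weight W and learns a low-rank update, computing y = Wx + (alpha / r) * B(Ax). The sketch below shows that forward pass and the resulting parameter ratio in std-only Rust. All names here are hypothetical and the matrix layout is an assumption, not this crate's representation.

```rust
/// Dense matrix-vector product; `m` is stored as rows.
fn matvec(m: &[Vec<f32>], x: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// y = W x + (alpha / r) * B (A x), with W: [d_out, d_in], A: [r, d_in], B: [d_out, r].
fn lora_forward(w: &[Vec<f32>], a: &[Vec<f32>], b: &[Vec<f32>], alpha: f32, x: &[f32]) -> Vec<f32> {
    let rank = a.len() as f32;
    let base = matvec(w, x);
    let delta = matvec(b, &matvec(a, x));
    base.iter().zip(&delta).map(|(y, d)| y + (alpha / rank) * d).collect()
}

/// Full-matrix parameters divided by the parameters of the low-rank pair.
fn lora_compression_ratio(d_in: usize, d_out: usize, rank: usize) -> f64 {
    (d_in * d_out) as f64 / (rank * (d_in + d_out)) as f64
}
```

For a 768x768 projection at rank 8, the adapter trains 12,288 parameters instead of 589,824, a ratio of 48.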
Mixture-of-Experts (MoE)
Sparse conditional computation:
use ;
// Mixtral 8x7B style
let config = mixtral_8x7b;
let moe = new?;
// Custom MoE
let config = new?
.with_load_balancing;
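As an illustration of TopK routing, the sketch below picks the k highest-scoring experts for one token and renormalises their weights with a softmax over the selected logits. Renormalising over only the selected experts is an assumption of this example, and `top_k_route` is a hypothetical name rather than this crate's router.

```rust
/// Select the top-k experts for one token and return (expert index, weight) pairs.
/// Weights are a softmax over the selected logits only, so they sum to 1.
fn top_k_route(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut indexed: Vec<(usize, f32)> = logits.iter().cloned().enumerate().collect();
    // Sort by logit, highest first.
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    indexed.truncate(k);
    let max = indexed.iter().map(|&(_, l)| l).fold(f32::NEG_INFINITY, f32::max);
    let denom: f32 = indexed.iter().map(|&(_, l)| (l - max).exp()).sum();
    indexed
        .into_iter()
        .map(|(i, l)| (i, (l - max).exp() / denom))
        .collect()
}
```

Only the selected experts run for that token, which is what makes the computation sparse: with 8 experts and k = 2, a quarter of the expert parameters are active per token.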
Vision Transformers (ViT)
Image recognition with transformer architecture:
use ;
// ViT-Base/16 configuration
let config = base;
let vit = new?;
println!;
Available presets: Tiny (5.7M), Small (22M), Base (86M), Large (307M), Huge (632M)
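The sequence length a ViT sees follows directly from the image and patch sizes. A minimal sketch, assuming square images and non-overlapping patches; `patch_grid` is a hypothetical helper:

```rust
/// Returns (number of patches, flattened patch dimension) for a square image.
fn patch_grid(image_size: usize, patch_size: usize, channels: usize) -> (usize, usize) {
    assert_eq!(image_size % patch_size, 0, "image size must be a multiple of the patch size");
    let per_side = image_size / patch_size;
    (per_side * per_side, patch_size * patch_size * channels)
}
```

For a 224x224 RGB image with 16x16 patches this gives 196 patches of dimension 768, or 197 tokens once a class token is prepended.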
Gradient Checkpointing
Memory-efficient training for large models:
use ;
let config = new?;
// Uniform checkpointing: checkpoint every 2 layers
let checkpoint = uniform;
println!;
println!;
// Selective checkpointing: checkpoint specific layers
let checkpoint = selective;
// Dynamic checkpointing: automatically balance memory vs. compute
let checkpoint = dynamic?;
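To make the uniform strategy concrete, the sketch below lists which layers would keep their activations for a given interval. It assumes checkpoints fall on layers 0, k, 2k, and so on, which may not match this crate's exact convention; both function names are hypothetical.

```rust
/// Layers whose activations are kept under uniform checkpointing.
/// `interval` must be non-zero (`step_by` panics on 0).
fn uniform_checkpoints(num_layers: usize, interval: usize) -> Vec<usize> {
    (0..num_layers).step_by(interval).collect()
}

/// Fraction of layer activations kept in memory between forward and backward.
fn stored_fraction(num_layers: usize, interval: usize) -> f64 {
    uniform_checkpoints(num_layers, interval).len() as f64 / num_layers as f64
}
```

Activations for the remaining layers are recomputed from the nearest earlier checkpoint during the backward pass, trading extra compute for lower memory.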
KV-Cache for Fast Inference
Enable efficient autoregressive generation with dramatic speedups:
use ;
// Create cache for 12-layer model (GPT-2 small)
let mut cache = new;
// Monitor cache usage
let stats = cache.stats;
println!;
Benefits:
- 10-1000x speedup depending on sequence length
- Minimal memory cost: ~2-10 MB for typical models
- Essential for production text generation
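One way to see why the speedup depends on sequence length is to count key/value projections. Without a cache, generation step t re-projects all t tokens seen so far; with a cache, each token is projected once. This is a rough operation count under that assumption, not a measured benchmark, and the function names are hypothetical.

```rust
/// Token projections needed to generate n tokens when K/V are recomputed every step.
fn kv_projections_without_cache(n: u64) -> u64 {
    n * (n + 1) / 2
}

/// Token projections needed when previously computed K/V are reused.
fn kv_projections_with_cache(n: u64) -> u64 {
    n
}
```

The ratio is (n + 1) / 2: about 10x for a 19-token sequence and about 1000x for roughly 2000 tokens.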
Rule-Based Attention
Integrate logical rules with attention mechanisms:
use ;
// Hard constraint: only attend where rule is satisfied
let base_attn = new?;
let config = hard;
// Soft constraint: bias attention towards rule-satisfying positions
let config = soft;
// Gated: interpolate between content and rule attention
let config = gated;
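The three modes differ only in how a boolean rule mask is combined with the content scores. The sketch below gives one plausible reading of each mode for a single query row. The exact formulas used by this crate are not stated here, so treat these functions (all hypothetical names) as assumptions.

```rust
/// Softmax over one row, stabilised by subtracting the row maximum.
fn softmax(row: &[f32]) -> Vec<f32> {
    let max = row.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = row.iter().map(|x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Hard: positions that violate the rule are masked to -inf before softmax.
/// Assumes at least one position satisfies the rule.
fn hard_rule_attention(scores: &[f32], rule: &[bool]) -> Vec<f32> {
    let masked: Vec<f32> = scores
        .iter()
        .zip(rule)
        .map(|(&s, &ok)| if ok { s } else { f32::NEG_INFINITY })
        .collect();
    softmax(&masked)
}

/// Soft: add `strength` to the score of every rule-satisfying position.
fn soft_rule_attention(scores: &[f32], rule: &[bool], strength: f32) -> Vec<f32> {
    let biased: Vec<f32> = scores
        .iter()
        .zip(rule)
        .map(|(&s, &ok)| if ok { s + strength } else { s })
        .collect();
    softmax(&biased)
}

/// Gated: gate * content attention + (1 - gate) * uniform attention over
/// rule-satisfying positions. Assumes at least one position satisfies the rule.
fn gated_rule_attention(scores: &[f32], rule: &[bool], gate: f32) -> Vec<f32> {
    let content = softmax(scores);
    let satisfied = rule.iter().filter(|&&ok| ok).count() as f32;
    content
        .iter()
        .zip(rule)
        .map(|(&c, &ok)| {
            let rule_weight = if ok { 1.0 / satisfied } else { 0.0 };
            gate * c + (1.0 - gate) * rule_weight
        })
        .collect()
}
```

A gate of 1.0 reduces to plain content attention and a gate of 0.0 to the rule distribution alone, so the gate acts as an interpolation weight.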
TrustformeRS Integration
Bidirectional conversion with the TrustformeRS ecosystem:
use ;
// Convert TrustformeRS architectures (BERT, GPT, T5) to TLExpr
let converter = new?;
let tlexpr = converter.convert_bert_encoder?;
// Load pretrained weights
let loader = new;
let weights = loader.load_checkpoint?;
Model Presets
use ;
// Standard presets
let gpt2 = gpt2_small;
let bert = bert_base;
let = transformer_base;
// Get model statistics
let stats = encoder_stack_stats;
println!;
// ModelStats:
// Total params: 117.00M
// Trainable: 117.00M
// Layers: 12
// d_model: 768
// Memory: 468 MB
Integration with TensorLogic
The einsum graphs produced by this crate integrate seamlessly with the TensorLogic ecosystem:
use CompilerContext;
use Scirs2Executor;
// Compile the transformer graph
let mut ctx = new;
// ... compile transformer einsum graph
// Execute on SciRS2 backend
let executor = new;
// ... execute the graph
Design Philosophy
- Backend Independence: Same graph works on CPU, GPU, TPU
- Einsum-Native: Clear mathematical semantics
- Composability: Mix transformer layers with logical rules
- Type Safety: Compile-time dimension checking where possible
- Zero Cost Abstractions: No runtime overhead
Examples
See the examples directory for 10 complete examples:
- 01_basic_encoder.rs - Basic transformer encoder usage
- 02_trustformers_integration.rs - TrustformeRS integration
- 03_rule_based_attention.rs - Rule-based attention patterns
- 04_sparse_attention.rs - Sparse attention for long sequences
- 05_gradient_checkpointing.rs - Memory-efficient training strategies
- 06_kv_cache_inference.rs - Fast autoregressive generation with KV-cache
- 07_vision_transformers.rs - Vision Transformer (ViT) for image classification
- 08_mixture_of_experts.rs - Mixture-of-Experts for sparse models
- 09_modern_llm_optimizations.rs - GQA, Sliding Window, LoRA
- 10_modern_llm_complete.rs - Complete modern LLM configurations
Testing
# 346 tests, all passing, zero warnings
Benchmarking
Running the Criterion benchmark suite generates HTML reports in target/criterion/ with detailed performance metrics.
Performance
The einsum-based approach enables:
- Operation Fusion: Compiler can fuse consecutive operations
- Memory Efficiency: Minimal intermediate tensors
- Parallelization: Natural SIMD/GPU mapping
- Optimization: Graph-level optimizations
References
- Attention Is All You Need - Original transformer paper
- Tensor Logic Paper - TensorLogic framework
- Flash Attention - Memory-efficient attention
- LoRA - Low-rank adaptation
License
Apache-2.0
Status: Stable (v0.1.0)
Last Updated: 2026-04-06
Tests: 346/346 passing (100%)
Examples: 10 comprehensive examples
Benchmarks: Criterion suite with HTML reports
Features: Complete transformer implementation with modern LLM optimizations
Part of: TensorLogic Ecosystem