tensorlogic-trustformers 0.1.0

Transformer-as-rules: Self-attention and FFN layers as einsum expressions
tensorlogic-trustformers

Transformer architectures as TensorLogic einsum graphs

This crate provides implementations of transformer components (self-attention, multi-head attention, feed-forward networks) as einsum operations that compile to TensorLogic IR and execute on any TensorLogic backend.

Features

  • Self-Attention - Scaled dot-product attention as einsum operations
  • Multi-Head Attention - Parallel attention heads with automatic head splitting/merging
  • Feed-Forward Networks - Position-wise FFN with configurable activations (GELU, ReLU, etc.)
  • Gated FFN - GLU-style gated feed-forward networks
  • Position Encodings - Sinusoidal, learned, relative, RoPE, and ALiBi position encodings
  • Layer Normalization - Standard LayerNorm and RMSNorm implementations
  • Encoder Layers - Complete transformer encoder layers with pre/post-norm variants
  • Decoder Layers - Complete transformer decoder layers with masked self-attention
  • Encoder/Decoder Stacks - Multi-layer transformer stacks with flexible configuration
  • Rule-Based Attention - Logical rules guiding attention patterns (hard/soft/gated)
  • Sparse Attention - Efficient attention for long sequences (strided, local, block-sparse)
  • Flash Attention - Memory-efficient attention (O(n) memory instead of O(n^2)) with tiled SRAM computation
  • Grouped-Query Attention (GQA) - Reduce KV cache memory (MHA/GQA/MQA support)
  • Sliding Window Attention - Efficient long-context with O(n*w) complexity
  • LoRA - Low-Rank Adaptation for parameter-efficient fine-tuning
  • Mixture-of-Experts (MoE) - Sparse expert routing (TopK, Softmax, Switch, ExpertChoice)
  • Vision Transformers (ViT) - Patch embedding and ViT configurations (Tiny/Small/Base/Large/Huge)
  • Gradient Checkpointing - Memory-efficient training with uniform/selective/dynamic strategies
  • KV-Cache - Efficient autoregressive inference with 10-1000x speedup
  • TrustformeRS Integration - Bidirectional conversion with TrustformeRS ecosystem
  • Utility Functions - Parameter counting, FLOP calculations, model presets
  • Performance Benchmarks - Criterion-based benchmark suite with HTML reports
  • Type-Safe Configuration - Builder pattern with validation
  • Einsum-Native - All operations expressed as einsum for maximum flexibility
  • Zero Warnings - Strict code quality enforcement
  • 346 Tests - Comprehensive test coverage (100% passing)

Quick Start

use tensorlogic_trustformers::{
    AttentionConfig, SelfAttention, MultiHeadAttention,
    FeedForwardConfig, FeedForward,
};
use tensorlogic_ir::EinsumGraph;

// Configure and build self-attention
let attn_config = AttentionConfig::new(512, 8).unwrap();
let self_attn = SelfAttention::new(attn_config).unwrap();

let mut graph = EinsumGraph::new();
graph.add_tensor("Q");
graph.add_tensor("K");
graph.add_tensor("V");

let outputs = self_attn.build_attention_graph(&mut graph).unwrap();

// Configure feed-forward network
let ffn_config = FeedForwardConfig::new(512, 2048)
    .with_activation("gelu")
    .with_dropout(0.1);
let ffn = FeedForward::new(ffn_config).unwrap();

Architecture

Self-Attention Formula

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

Einsum breakdown:

  1. Query-Key scores: einsum("bqd,bkd->bqk", Q, K)
  2. Scale: scores / sqrt(d_k)
  3. Softmax: softmax(scores, axis=-1)
  4. Attention-Value: einsum("bqk,bkv->bqv", attn, V)
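The four steps above can be sketched in plain Rust for a single batch element. This is only an illustration of the arithmetic, not the crate's graph builder; the function name `scaled_dot_product_attention` and the nested-`Vec` layout are assumptions made for the example.

```rust
/// Scaled dot-product attention for one batch element (illustrative sketch).
/// `q` is [n_q][d], `k` is [n_k][d], `v` is [n_k][d_v]; returns [n_q][d_v].
fn scaled_dot_product_attention(
    q: &[Vec<f64>],
    k: &[Vec<f64>],
    v: &[Vec<f64>],
) -> Vec<Vec<f64>> {
    let d_k = k[0].len() as f64;
    let d_v = v[0].len();
    q.iter()
        .map(|q_row| {
            // Steps 1-2: score[j] = sum_d q[d] * k[j][d], divided by sqrt(d_k)
            let scores: Vec<f64> = k
                .iter()
                .map(|k_row| {
                    q_row.iter().zip(k_row).map(|(a, b)| a * b).sum::<f64>() / d_k.sqrt()
                })
                .collect();
            // Step 3: numerically stable softmax over the key axis
            let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
            let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
            let sum: f64 = exps.iter().sum();
            // Step 4: out[c] = sum_j attn[j] * v[j][c]
            let mut out = vec![0.0; d_v];
            for (e, v_row) in exps.iter().zip(v) {
                for (o, x) in out.iter_mut().zip(v_row) {
                    *o += (e / sum) * x;
                }
            }
            out
        })
        .collect()
}
```

With an all-zero query every key gets the same score, so the output is the plain average of the value rows.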

Multi-Head Attention

1. Reshape: [B, S, D] -> [B, H, S, D_k] where D_k = D/H
2. Attention per head: einsum("bhqd,bhkd->bhqk", Q, K)
3. Scale and softmax
4. Apply to values: einsum("bhqk,bhkv->bhqv", attn, V)
5. Concatenate heads: [B, H, S, D_k] -> [B, S, D]
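Steps 1 and 5 are pure reshapes. A minimal sketch for one sequence, assuming a row-major [S][D] layout and the hypothetical helper names `split_heads` and `merge_heads` (neither is taken from the crate):

```rust
/// Split a [S][D] matrix into H heads of shape [S][D_k], where D_k = D / H.
fn split_heads(x: &[Vec<f64>], n_heads: usize) -> Vec<Vec<Vec<f64>>> {
    let d_model = x[0].len();
    assert_eq!(d_model % n_heads, 0, "d_model must be divisible by n_heads");
    let d_k = d_model / n_heads;
    (0..n_heads)
        .map(|h| {
            x.iter()
                .map(|row| row[h * d_k..(h + 1) * d_k].to_vec())
                .collect()
        })
        .collect()
}

/// Inverse of `split_heads`: concatenate [H][S][D_k] back into [S][D].
fn merge_heads(heads: &[Vec<Vec<f64>>]) -> Vec<Vec<f64>> {
    let seq_len = heads[0].len();
    (0..seq_len)
        .map(|s| heads.iter().flat_map(|head| head[s].iter().copied()).collect())
        .collect()
}
```

Splitting and then merging returns the original matrix, which is why the head dimension can be introduced and removed without any learned parameters.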

Configuration

Attention Configuration

use tensorlogic_trustformers::AttentionConfig;

let config = AttentionConfig::new(512, 8)?
    .with_causal(true)      // Enable causal masking
    .with_dropout(0.1);      // Set dropout probability

assert_eq!(config.d_model, 512);
assert_eq!(config.n_heads, 8);
assert_eq!(config.d_k, 64);  // Automatically computed

Complete Transformer Layer

use tensorlogic_trustformers::TransformerLayerConfig;

let config = TransformerLayerConfig::new(512, 8, 2048)?
    .with_pre_norm(true);   // Use pre-layer normalization

assert!(config.validate().is_ok());

Position Encodings

Five types of position encodings for sequence modeling:

use tensorlogic_trustformers::{
    SinusoidalPositionEncoding, PositionEncodingConfig,
    RotaryPositionEncoding, AlibiPositionEncoding,
};

// Sinusoidal (fixed) encoding
let config = PositionEncodingConfig::sinusoidal(512, 2048);
let pe = SinusoidalPositionEncoding::new(config).unwrap();

// Rotary Position Embedding (RoPE) - used in LLaMA
// Attention with Linear Biases (ALiBi) - used in BLOOM
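For reference, the standard formulas behind two of these encodings can be written out directly. The sketch below is not the crate's implementation; `sinusoidal_table` and `alibi_slopes` are hypothetical names, and the ALiBi slope formula shown is the usual one for a power-of-two head count.

```rust
/// Sinusoidal position table of shape [max_len][d_model]:
/// PE(pos, 2i) = sin(pos / 10000^(2i / d_model)), PE(pos, 2i+1) = cos(same angle).
fn sinusoidal_table(d_model: usize, max_len: usize) -> Vec<Vec<f64>> {
    (0..max_len)
        .map(|pos| {
            (0..d_model)
                .map(|j| {
                    let pair = (j / 2) as f64; // index i of the sin/cos pair
                    let angle = pos as f64 / 10000f64.powf(2.0 * pair / d_model as f64);
                    if j % 2 == 0 { angle.sin() } else { angle.cos() }
                })
                .collect()
        })
        .collect()
}

/// ALiBi per-head slopes for a power-of-two head count: 2^(-8h / n_heads), h = 1..=n_heads.
fn alibi_slopes(n_heads: usize) -> Vec<f64> {
    (1..=n_heads)
        .map(|h| 2f64.powf(-8.0 * h as f64 / n_heads as f64))
        .collect()
}
```

For 8 heads the slopes run from 1/2 down to 1/256; each head then subtracts its slope times the query-key distance from the attention scores.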

Flash Attention

Memory-efficient attention with tiled SRAM computation:

use tensorlogic_trustformers::{FlashAttention, FlashAttentionConfig, FlashAttentionPreset};

// A100 GPU preset
let config = FlashAttentionPreset::a100();
let flash = FlashAttention::new(config)?;

// Custom tiling
let config = FlashAttentionConfig::new(512, 8)
    .with_block_size_q(64)
    .with_block_size_kv(64)
    .with_causal(true);
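The idea behind the block sizes is that softmax can be computed one key/value tile at a time by carrying a running maximum and a running denominator, so the full score matrix never has to be materialized. The sketch below shows that online-softmax recurrence for a single query row; it is a simplified illustration (scores are passed in precomputed, and `tiled_attention_row` is a hypothetical name), not the crate's kernel.

```rust
/// Attention output for one query, consuming keys/values in tiles of `block`
/// rows while keeping only a running max, denominator, and output accumulator.
/// `block` must be greater than zero.
fn tiled_attention_row(scores: &[f64], values: &[Vec<f64>], block: usize) -> Vec<f64> {
    let d_v = values[0].len();
    let mut running_max = f64::NEG_INFINITY;
    let mut denom = 0.0;
    let mut acc = vec![0.0; d_v];
    for (s_blk, v_blk) in scores.chunks(block).zip(values.chunks(block)) {
        let blk_max = s_blk.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        let new_max = running_max.max(blk_max);
        // Rescale everything accumulated under the previous maximum.
        let correction = (running_max - new_max).exp();
        denom *= correction;
        acc.iter_mut().for_each(|a| *a *= correction);
        for (s, v_row) in s_blk.iter().zip(v_blk) {
            let w = (s - new_max).exp();
            denom += w;
            for (a, x) in acc.iter_mut().zip(v_row) {
                *a += w * x;
            }
        }
        running_max = new_max;
    }
    acc.iter().map(|a| a / denom).collect()
}
```

Calling it with a tile size equal to the sequence length reproduces ordinary softmax attention, and smaller tiles give the same result up to floating-point rounding.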

Grouped-Query Attention (GQA)

Reduce KV cache memory for efficient inference:

use tensorlogic_trustformers::{GroupedQueryAttention, GQAConfig, GQAPreset};

// LLaMA 2 70B style (8 KV heads, 64 query heads)
let config = GQAPreset::llama2_70b();

// Memory savings compared to MHA (read before `config` is moved into the layer)
println!("KV cache memory: {:.1}x of MHA", config.memory_factor());

let gqa = GroupedQueryAttention::new(config)?;
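The saving comes from several query heads sharing one key/value head. A small sketch of the arithmetic, using hypothetical helper names (`kv_memory_factor`, `kv_head_for_query`) rather than the crate's API:

```rust
/// KV-cache size relative to full multi-head attention (1.0 = MHA).
fn kv_memory_factor(n_heads: usize, n_kv_heads: usize) -> f64 {
    assert_eq!(n_heads % n_kv_heads, 0, "n_heads must be a multiple of n_kv_heads");
    n_kv_heads as f64 / n_heads as f64
}

/// Index of the shared KV head that a given query head reads from.
fn kv_head_for_query(q_head: usize, n_heads: usize, n_kv_heads: usize) -> usize {
    q_head / (n_heads / n_kv_heads)
}
```

With 64 query heads and 8 KV heads the factor is 8/64 = 0.125, and query heads 0 through 7 all read KV head 0. Setting `n_kv_heads` to 1 gives multi-query attention, and setting it equal to `n_heads` gives ordinary MHA.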

Sliding Window Attention

Efficient long-context handling:

use tensorlogic_trustformers::{SlidingWindowAttention, SlidingWindowPreset};

// Mistral 7B style
let config = SlidingWindowPreset::mistral_7b();

// O(n*w) complexity instead of O(n^2) (read before `config` is moved into the layer)
println!("Complexity reduction: {:.1}x", config.complexity_reduction(4096));

let swa = SlidingWindowAttention::new(config)?;
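As a rough illustration of where O(n*w) comes from, the sketch below builds a causal sliding-window mask and estimates the cost ratio. The function names and the simple n/w ratio are assumptions made for the example; the crate's `complexity_reduction` may be defined differently.

```rust
/// Causal sliding-window mask: query i may attend to key j when
/// j <= i and i - j < window.
fn sliding_window_mask(seq_len: usize, window: usize) -> Vec<Vec<bool>> {
    (0..seq_len)
        .map(|i| (0..seq_len).map(|j| j <= i && i - j < window).collect())
        .collect()
}

/// Rough cost ratio of full attention (n^2 pairs) to windowed attention (n*w pairs).
/// The window is clamped to the sequence length, so the ratio never drops below 1.
fn complexity_reduction(seq_len: usize, window: usize) -> f64 {
    seq_len as f64 / window.min(seq_len) as f64
}
```

For example, a 16384-token sequence with a 4096-token window touches roughly a quarter of the query-key pairs that full attention would.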

LoRA (Low-Rank Adaptation)

Parameter-efficient fine-tuning:

use tensorlogic_trustformers::{LoRAConfig, LoRAAttention, LoRAPreset};

// Standard LoRA configuration
let config = LoRAPreset::standard(512, 8)?;

// Compression ratio (read before `config` is moved into the layer)
println!("Parameter reduction: {:.0}x", config.compression_ratio());

let lora_attn = LoRAAttention::new(config)?;
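LoRA freezes the dense weight W and trains a low-rank update B·A instead. The sketch below shows the parameter arithmetic and the forward pass y = W x + (alpha / r) · B (A x); all names here are hypothetical, and the crate's `compression_ratio` may count parameters differently.

```rust
/// Trainable parameters of a dense d_out x d_in weight.
fn full_params(d_in: usize, d_out: usize) -> usize {
    d_in * d_out
}

/// Trainable parameters of the low-rank pair B (d_out x r) and A (r x d_in).
fn lora_params(d_in: usize, d_out: usize, rank: usize) -> usize {
    rank * (d_in + d_out)
}

/// Matrix-vector product for a row-major matrix.
fn matvec(m: &[Vec<f64>], v: &[f64]) -> Vec<f64> {
    m.iter()
        .map(|row| row.iter().zip(v).map(|(p, q)| p * q).sum::<f64>())
        .collect()
}

/// y = W x + (alpha / r) * B (A x), where r is the number of rows of A.
fn lora_forward(w: &[Vec<f64>], a: &[Vec<f64>], b: &[Vec<f64>], alpha: f64, x: &[f64]) -> Vec<f64> {
    let rank = a.len() as f64;
    let base = matvec(w, x);
    let delta = matvec(b, &matvec(a, x));
    base.iter().zip(&delta).map(|(y, d)| y + (alpha / rank) * d).collect()
}
```

For a 512 x 512 projection at rank 8, the dense weight has 262144 parameters and the LoRA pair has 8192, a 32x reduction under this counting.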

Mixture-of-Experts (MoE)

Sparse conditional computation:

use tensorlogic_trustformers::{MoeConfig, MoeLayer, MoePreset, RouterType};

// Mixtral 8x7B style
let config = MoePreset::mixtral_8x7b();
let moe = MoeLayer::new(config)?;

// Custom MoE
let config = MoeConfig::new(512, 8, RouterType::TopK(2))?
    .with_load_balancing(0.01);
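Top-k routing sends each token to only the k experts with the largest gate logits. A minimal sketch for one token, assuming the weights are a softmax over the selected logits (the function name `top_k_route` is hypothetical, and load balancing is not shown):

```rust
/// Top-k routing for one token: returns (expert index, weight) pairs whose
/// weights are a softmax over the k selected gate logits. Requires k >= 1.
fn top_k_route(gate_logits: &[f64], k: usize) -> Vec<(usize, f64)> {
    let mut order: Vec<usize> = (0..gate_logits.len()).collect();
    // Sort expert indices by descending logit (stable, so ties keep index order).
    order.sort_by(|&a, &b| gate_logits[b].total_cmp(&gate_logits[a]));
    order.truncate(k);
    let max = gate_logits[order[0]];
    let exps: Vec<f64> = order.iter().map(|&e| (gate_logits[e] - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    order.into_iter().zip(exps).map(|(e, w)| (e, w / sum)).collect()
}
```

The layer output is then the weighted sum of the selected experts' outputs, so only k of the experts run for any given token.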

Vision Transformers (ViT)

Image recognition with transformer architecture:

use tensorlogic_trustformers::{VisionTransformer, ViTPreset};

// ViT-Base/16 configuration
let config = ViTPreset::base();

// Report the parameter count before `config` is moved into the model
println!("Parameters: {:.1}M", config.num_parameters() as f64 / 1e6);

let vit = VisionTransformer::new(config)?;

Available presets: Tiny (5.7M), Small (22M), Base (86M), Large (307M), Huge (632M)
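Patch embedding turns an image into a token sequence. The sketch below works through the arithmetic for square images and patches; the helper names are hypothetical, and a single class token is assumed.

```rust
/// Number of tokens a ViT encoder sees: one per patch plus one class token.
fn vit_sequence_length(image_size: usize, patch_size: usize) -> usize {
    assert_eq!(image_size % patch_size, 0, "image size must be a multiple of patch size");
    let per_side = image_size / patch_size;
    per_side * per_side + 1
}

/// Parameters of the linear patch-embedding projection (weights plus bias).
fn patch_embedding_params(patch_size: usize, channels: usize, d_model: usize) -> usize {
    patch_size * patch_size * channels * d_model + d_model
}
```

A 224 x 224 RGB image with 16 x 16 patches yields 14 x 14 = 196 patches, so 197 tokens, and a 768-wide projection of those patches has 590592 parameters.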

Gradient Checkpointing

Memory-efficient training for large models:

use tensorlogic_trustformers::{CheckpointConfig, EncoderStackConfig};

let config = EncoderStackConfig::new(12, 768, 12, 3072, 512)?;

// Uniform checkpointing: checkpoint every 2 layers
let checkpoint = CheckpointConfig::uniform(2);
println!("Memory savings: {:.1}%", checkpoint.memory_savings(12) * 100.0);
println!("Compute overhead: {:.2}x", checkpoint.compute_overhead(12));

// Selective checkpointing: checkpoint specific layers
let checkpoint = CheckpointConfig::selective(vec![0, 3, 6, 9]);

// Dynamic checkpointing: automatically balance memory vs. compute
let checkpoint = CheckpointConfig::dynamic(12, 0.3)?;
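To see why fewer checkpoints save memory, consider a deliberately simplified model in which every layer's activations cost the same and recomputation buffers are ignored. The helpers below are hypothetical and are not the formulas behind `memory_savings` or `compute_overhead`, which may differ.

```rust
/// Layer indices whose activations are stored when checkpointing every
/// `interval` layers. `interval` must be greater than zero.
fn uniform_checkpoints(n_layers: usize, interval: usize) -> Vec<usize> {
    (0..n_layers).step_by(interval).collect()
}

/// Fraction of per-layer activations that are not stored, under the
/// simplified equal-cost model described above.
fn stored_activation_savings(n_layers: usize, checkpoints: &[usize]) -> f64 {
    1.0 - checkpoints.len() as f64 / n_layers as f64
}
```

Under this model, checkpointing every 2 of 12 layers stores 6 layers and saves half of the activation memory; the price is re-running the forward pass for the layers that were not stored.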

KV-Cache for Fast Inference

Enable efficient autoregressive generation with dramatic speedups:

use tensorlogic_trustformers::{KVCache, KVCacheConfig};

// Create cache for GPT-2 small: 12 layers, 12 heads, head dimension 64
let mut cache = KVCache::new(12, 12, 64);

// Monitor cache usage
let stats = cache.stats();
println!("{}", stats.summary());

Benefits:

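  • 10-1000x speedup, growing with sequence length, because each step processes only the new token instead of the whole prefix
  • Modest memory cost: ~2-10 MB for small models at short sequence lengths; the cache grows linearly with sequence length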
  • Essential for production text generation
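Both figures can be estimated with back-of-the-envelope arithmetic. The sketch below uses hypothetical helper names and a simple cost model (one unit of work per token pushed through the model); it is not a benchmark of the crate.

```rust
/// Bytes needed to cache keys and values for `seq_len` tokens.
fn kv_cache_bytes(
    n_layers: usize,
    n_heads: usize,
    head_dim: usize,
    seq_len: usize,
    bytes_per_elem: usize,
) -> usize {
    // Factor 2: one tensor for keys and one for values per layer.
    2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem
}

/// Token forward passes needed to generate `n` tokens one at a time.
fn tokens_processed(n: usize, cached: bool) -> usize {
    if cached {
        n // each step feeds only the newest token
    } else {
        n * (n + 1) / 2 // step t re-encodes all t tokens of the prefix
    }
}
```

For the 12-layer, 12-head, 64-dimensional cache above, 128 tokens in 32-bit floats take 9437184 bytes (9 MiB), and 1024 tokens take eight times that. Under this cost model, generating 1024 tokens without a cache does 524800 token passes instead of 1024.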

Rule-Based Attention

Integrate logical rules with attention mechanisms:

use tensorlogic_trustformers::{AttentionConfig, RuleAttentionConfig, RuleBasedAttention};

// The base config is cloned for each variant (assumes `AttentionConfig: Clone`)
let base_attn = AttentionConfig::new(512, 8)?;

// Hard constraint: only attend where rule is satisfied
let config = RuleAttentionConfig::hard(base_attn.clone());

// Soft constraint: bias attention towards rule-satisfying positions
let config = RuleAttentionConfig::soft(base_attn.clone(), 0.7);

// Gated: interpolate between content and rule attention
let config = RuleAttentionConfig::gated(base_attn, 0.5);
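One plausible reading of the three modes, for a single query row, is sketched below. The exact semantics inside the crate are not documented here, so the function names and the way the strength and gate values are applied are assumptions. At least one position must satisfy the rule, otherwise the hard mask leaves nothing to normalize.

```rust
/// Numerically stable softmax; entries at negative infinity become 0.
fn softmax(scores: &[f64]) -> Vec<f64> {
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Hard: positions where the rule is false receive zero attention.
fn hard_rule_attention(scores: &[f64], rule: &[bool]) -> Vec<f64> {
    let masked: Vec<f64> = scores
        .iter()
        .zip(rule)
        .map(|(&s, &ok)| if ok { s } else { f64::NEG_INFINITY })
        .collect();
    softmax(&masked)
}

/// Soft: rule-satisfying positions receive an additive bias of `strength`.
fn soft_rule_attention(scores: &[f64], rule: &[bool], strength: f64) -> Vec<f64> {
    let biased: Vec<f64> = scores
        .iter()
        .zip(rule)
        .map(|(&s, &ok)| if ok { s + strength } else { s })
        .collect();
    softmax(&biased)
}

/// Gated: gate * content attention + (1 - gate) * uniform attention over
/// the rule-satisfying positions.
fn gated_rule_attention(scores: &[f64], rule: &[bool], gate: f64) -> Vec<f64> {
    let content = softmax(scores);
    let rule_only = hard_rule_attention(&vec![0.0; scores.len()], rule);
    content
        .iter()
        .zip(&rule_only)
        .map(|(c, r)| gate * c + (1.0 - gate) * r)
        .collect()
}
```

All three variants return a proper distribution over positions; they differ only in how strongly the rule overrides the content scores.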

TrustformeRS Integration

Bidirectional conversion with the TrustformeRS ecosystem:

use tensorlogic_trustformers::{
    IntegrationConfig, TrustformersConverter, TrustformersWeightLoader,
};

// `config` (presumably an IntegrationConfig) and `bert_config` are built elsewhere

// Convert TrustformeRS architectures (BERT, GPT, T5) to TLExpr
let converter = TrustformersConverter::new(config)?;
let tlexpr = converter.convert_bert_encoder(bert_config)?;

// Load pretrained weights
let loader = TrustformersWeightLoader::new();
let weights = loader.load_checkpoint("model.bin")?;

Model Presets

use tensorlogic_trustformers::{presets, utils::encoder_stack_stats};

// Standard presets
let gpt2 = presets::gpt2_small();
let bert = presets::bert_base();
let (encoder, decoder) = presets::transformer_base();

// Get model statistics
let stats = encoder_stack_stats(&gpt2);
println!("{}", stats.summary());
// ModelStats:
//   Total params: 117.00M
//   Trainable: 117.00M
//   Layers: 12
//   d_model: 768
//   Memory: 468 MB
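The memory line in that summary is consistent with 4 bytes per parameter and decimal megabytes. A one-line check, with `param_memory_mb` as a hypothetical helper rather than a crate function:

```rust
/// Parameter memory in decimal megabytes (1 MB = 1_000_000 bytes).
fn param_memory_mb(n_params: u64, bytes_per_param: u64) -> f64 {
    (n_params * bytes_per_param) as f64 / 1e6
}
```

With 117000000 parameters at 4 bytes each (32-bit floats) this gives 468 MB; at 2 bytes per parameter it would be 234 MB.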

Integration with TensorLogic

The einsum graphs produced by this crate integrate seamlessly with the TensorLogic ecosystem:

use tensorlogic_compiler::CompilerContext;
use tensorlogic_scirs_backend::Scirs2Executor;

// Compile the transformer graph
let mut ctx = CompilerContext::new();
// ... compile transformer einsum graph

// Execute on SciRS2 backend
let executor = Scirs2Executor::new();
// ... execute the graph

Design Philosophy

  1. Backend Independence: Same graph works on CPU, GPU, TPU
  2. Einsum-Native: Clear mathematical semantics
  3. Composability: Mix transformer layers with logical rules
  4. Type Safety: Compile-time dimension checking where possible
  5. Zero Cost Abstractions: No runtime overhead

Examples

See the examples directory for 10 complete examples:

  • 01_basic_encoder.rs - Basic transformer encoder usage
  • 02_trustformers_integration.rs - TrustformeRS integration
  • 03_rule_based_attention.rs - Rule-based attention patterns
  • 04_sparse_attention.rs - Sparse attention for long sequences
  • 05_gradient_checkpointing.rs - Memory-efficient training strategies
  • 06_kv_cache_inference.rs - Fast autoregressive generation with KV-cache
  • 07_vision_transformers.rs - Vision Transformer (ViT) for image classification
  • 08_mixture_of_experts.rs - Mixture-of-Experts for sparse models
  • 09_modern_llm_optimizations.rs - GQA, Sliding Window, LoRA
  • 10_modern_llm_complete.rs - Complete modern LLM configurations

Testing

cargo nextest run -p tensorlogic-trustformers
# 346 tests, all passing, zero warnings

Benchmarking

cargo bench --bench model_benchmarks

This generates HTML reports in target/criterion/ with detailed performance metrics.

Performance

The einsum-based approach enables:

  • Operation Fusion: Compiler can fuse consecutive operations
  • Memory Efficiency: Minimal intermediate tensors
  • Parallelization: Natural SIMD/GPU mapping
  • Optimization: Graph-level optimizations

License

Apache-2.0


Status: Stable (v0.1.0)
Last Updated: 2026-04-06
Tests: 346/346 passing (100%)
Examples: 10 comprehensive examples
Benchmarks: Criterion suite with HTML reports
Features: Complete transformer implementation with modern LLM optimizations
Part of: TensorLogic Ecosystem