tensorlogic-trustformers 0.1.0

Transformer-as-rules: Self-attention and FFN layers as einsum expressions
# tensorlogic-trustformers

**Transformer architectures as TensorLogic einsum graphs**

[![Crate](https://img.shields.io/badge/crates.io-tensorlogic--trustformers-orange)](https://crates.io/crates/tensorlogic-trustformers)
[![Documentation](https://img.shields.io/badge/docs-latest-blue)](https://docs.rs/tensorlogic-trustformers)
[![Tests](https://img.shields.io/badge/tests-346%2F346-brightgreen)](#)
[![Production](https://img.shields.io/badge/status-stable-success)](#)

This crate provides implementations of transformer components (self-attention, multi-head attention, feed-forward networks) as einsum operations that compile to TensorLogic IR and execute on any TensorLogic backend.

## Features

- **Self-Attention** - Scaled dot-product attention as einsum operations
- **Multi-Head Attention** - Parallel attention heads with automatic head splitting/merging
- **Feed-Forward Networks** - Position-wise FFN with configurable activations (GELU, ReLU, etc.)
- **Gated FFN** - GLU-style gated feed-forward networks
- **Position Encodings** - Sinusoidal, learned, relative, RoPE, and ALiBi position encodings
- **Layer Normalization** - Standard LayerNorm and RMSNorm implementations
- **Encoder Layers** - Complete transformer encoder layers with pre/post-norm variants
- **Decoder Layers** - Complete transformer decoder layers with masked self-attention
- **Encoder/Decoder Stacks** - Multi-layer transformer stacks with flexible configuration
- **Rule-Based Attention** - Logical rules guiding attention patterns (hard/soft/gated)
- **Sparse Attention** - Efficient attention for long sequences (strided, local, block-sparse)
- **Flash Attention** - Memory-efficient exact attention with tiled SRAM computation; extra memory grows linearly with sequence length instead of quadratically
- **Grouped-Query Attention (GQA)** - Reduce KV cache memory (MHA/GQA/MQA support)
- **Sliding Window Attention** - Efficient long-context with O(n*w) complexity
- **LoRA** - Low-Rank Adaptation for parameter-efficient fine-tuning
- **Mixture-of-Experts (MoE)** - Sparse expert routing (TopK, Softmax, Switch, ExpertChoice)
- **Vision Transformers (ViT)** - Patch embedding and ViT configurations (Tiny/Small/Base/Large/Huge)
- **Gradient Checkpointing** - Memory-efficient training with uniform/selective/dynamic strategies
- **KV-Cache** - Efficient autoregressive inference with 10-1000x speedup, depending on sequence length
- **TrustformeRS Integration** - Bidirectional conversion with TrustformeRS ecosystem
- **Utility Functions** - Parameter counting, FLOP calculations, model presets
- **Performance Benchmarks** - Criterion-based benchmark suite with HTML reports
- **Type-Safe Configuration** - Builder pattern with validation
- **Einsum-Native** - All operations expressed as einsum for maximum flexibility
- **Zero Warnings** - Strict code quality enforcement
- **346 Tests** - Comprehensive test coverage (100% passing)

## Quick Start

```rust
use tensorlogic_trustformers::{
    AttentionConfig, SelfAttention, MultiHeadAttention,
    FeedForwardConfig, FeedForward,
};
use tensorlogic_ir::EinsumGraph;

// Configure and build self-attention
let attn_config = AttentionConfig::new(512, 8).unwrap();
let self_attn = SelfAttention::new(attn_config).unwrap();

let mut graph = EinsumGraph::new();
graph.add_tensor("Q");
graph.add_tensor("K");
graph.add_tensor("V");

let outputs = self_attn.build_attention_graph(&mut graph).unwrap();

// Configure feed-forward network
let ffn_config = FeedForwardConfig::new(512, 2048)
    .with_activation("gelu")
    .with_dropout(0.1);
let ffn = FeedForward::new(ffn_config).unwrap();
```

## Architecture

### Self-Attention Formula

```
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
```

**Einsum breakdown:**
1. Query-Key scores: `einsum("bqd,bkd->bqk", Q, K)`
2. Scale: `scores / sqrt(d_k)`
3. Softmax: `softmax(scores, axis=-1)`
4. Attention-Value: `einsum("bqk,bkv->bqv", attn, V)`
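
The four steps above can be sketched in plain Rust for a single batch element. This is an illustrative reference only: it uses nested `Vec<f64>` instead of TensorLogic tensors, and the function names `softmax_rows` and `attention` are hypothetical, not part of this crate.

```rust
/// Row-wise softmax over the last axis, stabilised by subtracting the row maximum.
fn softmax_rows(scores: &mut [Vec<f64>]) {
    for row in scores.iter_mut() {
        let max = row.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        let mut sum = 0.0;
        for x in row.iter_mut() {
            *x = (*x - max).exp();
            sum += *x;
        }
        for x in row.iter_mut() {
            *x /= sum;
        }
    }
}

/// Scaled dot-product attention for one batch element (hypothetical helper).
/// Shapes: q is [n_q, d_k], k is [n_k, d_k], v is [n_k, d_v]; the result is [n_q, d_v].
fn attention(q: &[Vec<f64>], k: &[Vec<f64>], v: &[Vec<f64>]) -> Vec<Vec<f64>> {
    let scale = 1.0 / (q[0].len() as f64).sqrt();
    // Steps 1 and 2: "qd,kd->qk", then scale by 1/sqrt(d_k).
    let mut scores: Vec<Vec<f64>> = q
        .iter()
        .map(|qi| {
            k.iter()
                .map(|kj| qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f64>() * scale)
                .collect()
        })
        .collect();
    // Step 3: softmax over the key axis.
    softmax_rows(&mut scores);
    // Step 4: "qk,kv->qv".
    let d_v = v[0].len();
    scores
        .iter()
        .map(|row| {
            (0..d_v)
                .map(|c| row.iter().zip(v).map(|(w, vj)| w * vj[c]).sum::<f64>())
                .collect()
        })
        .collect()
}
```

When all keys are identical, every key receives the same weight and the output is the mean of the value rows, which makes a convenient sanity check.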

### Multi-Head Attention

```
1. Reshape: [B, S, D] -> [B, H, S, D_k] where D_k = D/H
2. Attention per head: einsum("bhqd,bhkd->bhqk", Q, K)
3. Scale and softmax
4. Apply to values: einsum("bhqk,bhkv->bhqv", attn, V)
5. Concatenate heads: [B, H, S, D_k] -> [B, S, D]
```
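
Steps 1 and 5 are pure index rearrangements. A minimal sketch for one sequence (no batch axis), assuming the feature axis is split into contiguous chunks of size `D_k`; `split_heads` and `merge_heads` are hypothetical names used only for illustration:

```rust
/// Split one sequence [S, D] into heads [H, S, D_k], with D_k = D / H.
fn split_heads(x: &[Vec<f32>], n_heads: usize) -> Vec<Vec<Vec<f32>>> {
    let d_model = x[0].len();
    assert_eq!(d_model % n_heads, 0, "d_model must be divisible by n_heads");
    let d_k = d_model / n_heads;
    (0..n_heads)
        .map(|h| {
            x.iter()
                .map(|row| row[h * d_k..(h + 1) * d_k].to_vec())
                .collect()
        })
        .collect()
}

/// Merge heads [H, S, D_k] back into [S, D] by concatenating along the feature axis.
fn merge_heads(heads: &[Vec<Vec<f32>>]) -> Vec<Vec<f32>> {
    let seq_len = heads[0].len();
    (0..seq_len)
        .map(|s| {
            heads
                .iter()
                .flat_map(|head| head[s].iter().copied())
                .collect()
        })
        .collect()
}
```

Merging is the exact inverse of splitting, so `merge_heads(&split_heads(&x, h))` returns `x` unchanged.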

## Configuration

### Attention Configuration

```rust
use tensorlogic_trustformers::AttentionConfig;

let config = AttentionConfig::new(512, 8)?
    .with_causal(true)      // Enable causal masking
    .with_dropout(0.1);      // Set dropout probability

assert_eq!(config.d_model, 512);
assert_eq!(config.n_heads, 8);
assert_eq!(config.d_k, 64);  // Automatically computed
```
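
The fallible constructor suggests that some combinations are rejected up front. The sketch below shows one plausible shape for such a builder, assuming the check is that `d_model` divides evenly by `n_heads`. `SketchAttentionConfig` is a hypothetical stand-in, not the crate's `AttentionConfig`, whose actual validation rules and error type may differ.

```rust
/// Hypothetical stand-in for a validated attention configuration.
#[derive(Debug, Clone, PartialEq)]
struct SketchAttentionConfig {
    d_model: usize,
    n_heads: usize,
    d_k: usize,
    causal: bool,
    dropout: f64,
}

impl SketchAttentionConfig {
    /// Rejects head counts that do not divide the model dimension (assumed rule).
    fn new(d_model: usize, n_heads: usize) -> Result<Self, String> {
        if n_heads == 0 || d_model % n_heads != 0 {
            return Err(format!(
                "d_model ({d_model}) must be divisible by n_heads ({n_heads})"
            ));
        }
        Ok(Self {
            d_model,
            n_heads,
            d_k: d_model / n_heads, // per-head dimension, derived once
            causal: false,
            dropout: 0.0,
        })
    }

    /// Builder-style setter: consumes and returns the config.
    fn with_causal(mut self, causal: bool) -> Self {
        self.causal = causal;
        self
    }

    /// Builder-style setter: consumes and returns the config.
    fn with_dropout(mut self, dropout: f64) -> Self {
        self.dropout = dropout;
        self
    }
}
```

Deriving `d_k` inside the constructor keeps the three fields consistent, so later code never has to recheck the division.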

### Complete Transformer Layer

```rust
use tensorlogic_trustformers::TransformerLayerConfig;

let config = TransformerLayerConfig::new(512, 8, 2048)?
    .with_pre_norm(true);   // Use pre-layer normalization

assert!(config.validate().is_ok());
```

## Position Encodings

Five types of position encodings for sequence modeling:

```rust
use tensorlogic_trustformers::{
    SinusoidalPositionEncoding, PositionEncodingConfig,
    RotaryPositionEncoding, AlibiPositionEncoding,
};

// Sinusoidal (fixed) encoding
let config = PositionEncodingConfig::sinusoidal(512, 2048);
let pe = SinusoidalPositionEncoding::new(config).unwrap();

// Rotary Position Embedding (RoPE) - used in LLaMA
// Attention with Linear Biases (ALiBi) - used in BLOOM
```
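
For reference, the fixed sinusoidal table from "Attention Is All You Need" can be computed in a few lines. This is a standalone sketch of the formula, not the crate's implementation; `sinusoidal_encoding` is a hypothetical name.

```rust
/// Sinusoidal position table of shape [max_len, d_model]:
///   PE(pos, 2k)   = sin(pos / 10000^(2k / d_model))
///   PE(pos, 2k+1) = cos(pos / 10000^(2k / d_model))
fn sinusoidal_encoding(max_len: usize, d_model: usize) -> Vec<Vec<f64>> {
    (0..max_len)
        .map(|pos| {
            (0..d_model)
                .map(|i| {
                    // Dimensions 2k and 2k+1 share one frequency.
                    let exponent = (2 * (i / 2)) as f64 / d_model as f64;
                    let angle = pos as f64 / 10000.0_f64.powf(exponent);
                    if i % 2 == 0 {
                        angle.sin()
                    } else {
                        angle.cos()
                    }
                })
                .collect()
        })
        .collect()
}
```

Position 0 always encodes to alternating `0, 1, 0, 1, ...` because `sin(0) = 0` and `cos(0) = 1`.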

## Flash Attention

Memory-efficient attention with tiled SRAM computation:

```rust
use tensorlogic_trustformers::{FlashAttention, FlashAttentionConfig, FlashAttentionPreset};

// A100 GPU preset
let config = FlashAttentionPreset::a100();
let flash = FlashAttention::new(config)?;

// Custom tiling
let config = FlashAttentionConfig::new(512, 8)
    .with_block_size_q(64)
    .with_block_size_kv(64)
    .with_causal(true);
```
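
The core idea behind tiling is the online softmax: keys and values are visited block by block while a running maximum, a running normaliser, and an output accumulator are updated, so the full score row is never stored. The sketch below shows this for a single query in plain Rust and compares it against a naive reference. It illustrates the technique only; the function names are hypothetical and it says nothing about how this crate schedules tiles.

```rust
/// Dot product of two equal-length vectors.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Reference: materialise all scores, then softmax, then weight the values.
fn naive_attention_row(q: &[f64], k: &[Vec<f64>], v: &[Vec<f64>]) -> Vec<f64> {
    let scale = 1.0 / (q.len() as f64).sqrt();
    let scores: Vec<f64> = k.iter().map(|kj| dot(q, kj) * scale).collect();
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let weights: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let total: f64 = weights.iter().sum();
    (0..v[0].len())
        .map(|c| weights.iter().zip(v).map(|(w, vj)| w * vj[c]).sum::<f64>() / total)
        .collect()
}

/// Tiled version: visits `block` keys at a time (block must be > 0).
fn tiled_attention_row(q: &[f64], k: &[Vec<f64>], v: &[Vec<f64>], block: usize) -> Vec<f64> {
    let scale = 1.0 / (q.len() as f64).sqrt();
    let mut m = f64::NEG_INFINITY; // running maximum of the scores seen so far
    let mut l = 0.0; // running softmax normaliser
    let mut acc = vec![0.0; v[0].len()]; // unnormalised output
    for (k_blk, v_blk) in k.chunks(block).zip(v.chunks(block)) {
        let scores: Vec<f64> = k_blk.iter().map(|kj| dot(q, kj) * scale).collect();
        let m_new = scores.iter().cloned().fold(m, f64::max);
        // Rescale everything accumulated so far to the new maximum.
        let correction = (m - m_new).exp();
        l *= correction;
        for a in acc.iter_mut() {
            *a *= correction;
        }
        for (s, vj) in scores.iter().zip(v_blk) {
            let p = (s - m_new).exp();
            l += p;
            for (a, x) in acc.iter_mut().zip(vj) {
                *a += p * x;
            }
        }
        m = m_new;
    }
    acc.iter().map(|a| a / l).collect()
}
```

Both functions return the same result up to floating-point rounding for any block size, which is why tiled attention is exact rather than an approximation.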

## Grouped-Query Attention (GQA)

Reduce KV cache memory for efficient inference:

```rust
use tensorlogic_trustformers::{GroupedQueryAttention, GQAConfig, GQAPreset};

// LLaMA 2 70B style (8 KV heads, 64 query heads)
let config = GQAPreset::llama2_70b();

// Memory savings compared to MHA (read before `config` is moved into the layer)
println!("KV cache memory: {:.1}x of MHA", config.memory_factor());

let gqa = GroupedQueryAttention::new(config)?;
```
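
The arithmetic behind the savings is simple: the KV cache scales with the number of KV heads, and each KV head is shared by a contiguous group of query heads. A small sketch, assuming contiguous grouping; both function names are hypothetical and are not the crate's API:

```rust
/// KV-cache size of GQA relative to standard multi-head attention.
fn kv_memory_factor(n_heads: usize, n_kv_heads: usize) -> f64 {
    n_kv_heads as f64 / n_heads as f64
}

/// Index of the shared KV head that a given query head reads from,
/// assuming query heads are grouped contiguously.
fn kv_head_for(query_head: usize, n_heads: usize, n_kv_heads: usize) -> usize {
    assert!(n_kv_heads > 0 && n_heads % n_kv_heads == 0);
    assert!(query_head < n_heads);
    let group_size = n_heads / n_kv_heads;
    query_head / group_size
}
```

With 64 query heads and 8 KV heads the factor is 0.125, and query heads 0 to 7 all read KV head 0. Setting `n_kv_heads == n_heads` gives MHA, and `n_kv_heads == 1` gives MQA.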

## Sliding Window Attention

Efficient long-context handling:

```rust
use tensorlogic_trustformers::{SlidingWindowAttention, SlidingWindowPreset};

// Mistral 7B style
let config = SlidingWindowPreset::mistral_7b();

// O(n*w) complexity instead of O(n^2) (read before `config` is moved into the layer)
println!("Complexity reduction: {:.1}x", config.complexity_reduction(4096));

let swa = SlidingWindowAttention::new(config)?;
```
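
To make the O(n*w) claim concrete, the sketch below counts how many (query, key) pairs are evaluated under a causal window of width `w` compared with full causal attention. It assumes each query attends to itself and the `w - 1` positions before it; the crate's exact window convention may differ, and the function names are hypothetical.

```rust
/// True if query position `i` may attend to key position `j`
/// under a causal window of width `w`.
fn in_window(i: usize, j: usize, w: usize) -> bool {
    j <= i && i - j < w
}

/// Number of (query, key) pairs evaluated with the window: at most n * w.
fn windowed_pairs(n: usize, w: usize) -> usize {
    (0..n).map(|i| (i + 1).min(w)).sum()
}

/// Number of pairs evaluated by full causal attention: n * (n + 1) / 2.
fn full_causal_pairs(n: usize) -> usize {
    n * (n + 1) / 2
}
```

For `n = 8` and `w = 2` the window evaluates 15 pairs instead of 36. The ratio grows roughly like `n / (2w)` once the sequence is much longer than the window.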

## LoRA (Low-Rank Adaptation)

Parameter-efficient fine-tuning:

```rust
use tensorlogic_trustformers::{LoRAConfig, LoRAAttention, LoRAPreset};

// Standard LoRA configuration
let config = LoRAPreset::standard(512, 8)?;

// Compression ratio (read before `config` is moved into the layer)
println!("Parameter reduction: {:.0}x", config.compression_ratio());

let lora_attn = LoRAAttention::new(config)?;
```
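
As a worked illustration of where the reduction comes from: LoRA freezes a `d_out x d_in` weight `W` and trains two thin matrices `A` (`r x d_in`) and `B` (`d_out x r`), computing `y = W x + (alpha / r) * B (A x)`. The sketch below is standalone Rust with hypothetical function names; how this crate defines `compression_ratio()` is not shown here and may differ.

```rust
/// Parameters in the frozen dense weight.
fn full_params(d_in: usize, d_out: usize) -> usize {
    d_in * d_out
}

/// Trainable parameters in the low-rank pair A (r x d_in) and B (d_out x r).
fn lora_params(d_in: usize, d_out: usize, rank: usize) -> usize {
    rank * (d_in + d_out)
}

/// Matrix-vector product for a row-major matrix.
fn matvec(m: &[Vec<f64>], x: &[f64]) -> Vec<f64> {
    m.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f64>())
        .collect()
}

/// y = W x + (alpha / r) * B (A x), where r is the number of rows of A.
fn lora_forward(w: &[Vec<f64>], a: &[Vec<f64>], b: &[Vec<f64>], alpha: f64, x: &[f64]) -> Vec<f64> {
    let rank = a.len() as f64;
    let base = matvec(w, x);
    let update = matvec(b, &matvec(a, x));
    base.iter()
        .zip(&update)
        .map(|(y, u)| y + (alpha / rank) * u)
        .collect()
}
```

For a 512 x 512 projection with rank 8, the dense weight has 262,144 parameters and the LoRA pair has 8,192, a 32x reduction for that one matrix.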

## Mixture-of-Experts (MoE)

Sparse conditional computation:

```rust
use tensorlogic_trustformers::{MoeConfig, MoeLayer, MoePreset, RouterType};

// Mixtral 8x7B style
let config = MoePreset::mixtral_8x7b();
let moe = MoeLayer::new(config)?;

// Custom MoE
let config = MoeConfig::new(512, 8, RouterType::TopK(2))?
    .with_load_balancing(0.01);
```
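
Top-k routing can be sketched as: pick the `k` experts with the largest router logits, then renormalise with a softmax over just those logits. The function below is a standalone illustration under that assumption; `top_k_route` is a hypothetical name, and the crate's `RouterType::TopK` may normalise or break ties differently.

```rust
/// Returns (expert index, gate weight) for the k highest-scoring experts.
/// Gate weights are a softmax over the selected logits and sum to 1.
fn top_k_route(logits: &[f64], k: usize) -> Vec<(usize, f64)> {
    assert!(k >= 1 && k <= logits.len(), "k must be in 1..=num_experts");
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    // Stable sort, descending by logit: ties keep the lower expert index first.
    idx.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    idx.truncate(k);
    let max = logits[idx[0]];
    let exps: Vec<f64> = idx.iter().map(|&i| (logits[i] - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    idx.into_iter().zip(exps).map(|(i, e)| (i, e / sum)).collect()
}
```

Only the selected experts run for a given token, so compute per token scales with `k` rather than with the total number of experts.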

## Vision Transformers (ViT)

Image recognition with transformer architecture:

```rust
use tensorlogic_trustformers::{VisionTransformer, ViTPreset};

// ViT-Base/16 configuration
let config = ViTPreset::base();

// Read the parameter count before `config` is moved into the model
println!("Parameters: {:.1}M", config.num_parameters() as f64 / 1e6);

let vit = VisionTransformer::new(config)?;
```

Available presets (approximate parameter counts): Tiny (5.7M), Small (22M), Base (86M), Large (307M), Huge (632M)
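
Patch embedding starts by cutting the image into non-overlapping square patches and flattening each one. The sketch below shows that step for a single-channel image in plain Rust; `num_patches` and `patchify` are hypothetical helpers, and the learned linear projection that follows is omitted.

```rust
/// Number of patches for a square image, e.g. 224 / 16 = 14 per side.
fn num_patches(image_size: usize, patch_size: usize) -> usize {
    assert_eq!(
        image_size % patch_size,
        0,
        "image size must be a multiple of the patch size"
    );
    (image_size / patch_size).pow(2)
}

/// Cut an [H, W] single-channel image into flattened p x p patches,
/// ordered row by row (top-left patch first).
fn patchify(image: &[Vec<f32>], p: usize) -> Vec<Vec<f32>> {
    let (h, w) = (image.len(), image[0].len());
    assert!(h % p == 0 && w % p == 0);
    let mut patches = Vec::with_capacity((h / p) * (w / p));
    for pr in (0..h).step_by(p) {
        for pc in (0..w).step_by(p) {
            let patch: Vec<f32> = image[pr..pr + p]
                .iter()
                .flat_map(|row| row[pc..pc + p].iter().copied())
                .collect();
            patches.push(patch);
        }
    }
    patches
}
```

A 224 x 224 image with 16 x 16 patches yields 196 patches, and each RGB patch flattens to 16 * 16 * 3 = 768 values before projection.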

## Gradient Checkpointing

Memory-efficient training for large models:

```rust
use tensorlogic_trustformers::{CheckpointConfig, EncoderStackConfig};

let config = EncoderStackConfig::new(12, 768, 12, 3072, 512)?;

// Uniform checkpointing: checkpoint every 2 layers
let checkpoint = CheckpointConfig::uniform(2);
println!("Memory savings: {:.1}%", checkpoint.memory_savings(12) * 100.0);
println!("Compute overhead: {:.2}x", checkpoint.compute_overhead(12));

// Selective checkpointing: checkpoint specific layers
let checkpoint = CheckpointConfig::selective(vec![0, 3, 6, 9]);

// Dynamic checkpointing: automatically balance memory vs. compute
let checkpoint = CheckpointConfig::dynamic(12, 0.3)?;
```
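
The trade-off can be reasoned about with a simple cost model. The sketch below assumes every layer's activations cost the same, that a backward pass costs twice a forward pass, and that non-checkpointed layers are recomputed once. These are assumptions for illustration only; the formulas behind `memory_savings` and `compute_overhead` in this crate may be different, and all function names here are hypothetical.

```rust
/// Layers whose activations are kept when checkpointing every `every` layers.
fn stored_checkpoints(n_layers: usize, every: usize) -> usize {
    assert!(every >= 1);
    (n_layers + every - 1) / every
}

/// Peak number of layer activations held at once: the checkpoints plus
/// the rest of the one segment currently being recomputed.
fn peak_activations(n_layers: usize, every: usize) -> usize {
    (stored_checkpoints(n_layers, every) + every - 1).min(n_layers)
}

/// Relative training cost, counting forward = 1 unit and backward = 2 units
/// per layer, plus one extra forward for each recomputed layer.
fn recompute_overhead(n_layers: usize, every: usize) -> f64 {
    let recomputed = n_layers - stored_checkpoints(n_layers, every);
    (3 * n_layers + recomputed) as f64 / (3 * n_layers) as f64
}
```

Under this model a 12-layer stack checkpointed every 2 layers peaks at 7 stored activations instead of 12, and an interval near `sqrt(n_layers)` (3 or 4 here) gives the lowest peak of 6.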

## KV-Cache for Fast Inference

Enable efficient autoregressive generation with dramatic speedups:

```rust
use tensorlogic_trustformers::{KVCache, KVCacheConfig};

// Create cache for 12-layer model (GPT-2 small)
let mut cache = KVCache::new(12, 12, 64);

// Monitor cache usage
let stats = cache.stats();
println!("{}", stats.summary());
```

Benefits:
- **10-1000x speedup** depending on sequence length
- Minimal memory cost: ~2-10 MB for typical models
- Essential for production text generation
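
Both the speedup and the memory cost follow from simple counting. The cache stores one key and one value vector per layer, head, and token; without it, every generation step recomputes keys and values for the whole prefix. The sketch below assumes the three arguments in `KVCache::new(12, 12, 64)` are layers, heads, and head dimension, which is an interpretation rather than something stated above, and the function names are hypothetical.

```rust
/// Bytes held by the cache: 2 tensors (K and V) per layer, head, and token.
fn kv_cache_bytes(
    n_layers: usize,
    n_heads: usize,
    head_dim: usize,
    seq_len: usize,
    bytes_per_elem: usize,
) -> usize {
    2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem
}

/// Per-layer K/V projections computed while generating `n_tokens` tokens
/// when the prefix is re-encoded at every step.
fn kv_projections_without_cache(n_tokens: usize) -> usize {
    n_tokens * (n_tokens + 1) / 2
}

/// With a cache, each token's K/V is projected exactly once.
fn kv_projections_with_cache(n_tokens: usize) -> usize {
    n_tokens
}
```

For a 12-layer, 12-head, 64-dimensional model in `f32`, 64 cached tokens occupy 4,718,592 bytes (about 4.7 MB), and the footprint grows linearly with sequence length. Generating 1,000 tokens needs 1,000 projections per layer with the cache versus 500,500 without it.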

## Rule-Based Attention

Integrate logical rules with attention mechanisms:

```rust
use tensorlogic_trustformers::{AttentionConfig, RuleAttentionConfig, RuleBasedAttention};

// Base attention shared by the three variants.
// `.clone()` assumes `AttentionConfig` implements `Clone`; without it,
// `base_attn` would be moved by the first call and unusable afterwards.
let base_attn = AttentionConfig::new(512, 8)?;

// Hard constraint: only attend where rule is satisfied
let config = RuleAttentionConfig::hard(base_attn.clone());

// Soft constraint: bias attention towards rule-satisfying positions
let config = RuleAttentionConfig::soft(base_attn.clone(), 0.7);

// Gated: interpolate between content and rule attention
let config = RuleAttentionConfig::gated(base_attn, 0.5);
```
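
One way to read the three modes, for a single query row, is sketched below. The semantics are assumptions made for illustration: hard masks out positions that fail the rule, soft adds a constant bias to positions that satisfy it, and gated blends content attention with the hard-masked attention. The crate's actual definitions may differ, and `RuleMode` and `rule_attention` are hypothetical names.

```rust
/// Hypothetical modes mirroring the hard/soft/gated variants.
#[derive(Clone, Copy)]
enum RuleMode {
    Hard,
    Soft(f64),  // bias added to scores where the rule holds
    Gated(f64), // weight given to the rule-masked attention
}

/// Numerically stable softmax; -inf entries receive weight 0.
fn softmax(xs: &[f64]) -> Vec<f64> {
    let max = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = xs.iter().map(|x| (x - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Attention weights for one query given raw scores and a boolean rule mask.
fn rule_attention(scores: &[f64], rule: &[bool], mode: RuleMode) -> Vec<f64> {
    assert_eq!(scores.len(), rule.len());
    assert!(rule.iter().any(|&r| r), "at least one position must satisfy the rule");
    // Scores with rule-violating positions removed.
    let masked: Vec<f64> = scores
        .iter()
        .zip(rule)
        .map(|(&s, &ok)| if ok { s } else { f64::NEG_INFINITY })
        .collect();
    match mode {
        RuleMode::Hard => softmax(&masked),
        RuleMode::Soft(strength) => {
            let biased: Vec<f64> = scores
                .iter()
                .zip(rule)
                .map(|(&s, &ok)| if ok { s + strength } else { s })
                .collect();
            softmax(&biased)
        }
        RuleMode::Gated(gate) => {
            let content = softmax(scores);
            let ruled = softmax(&masked);
            content
                .iter()
                .zip(&ruled)
                .map(|(c, r)| (1.0 - gate) * c + gate * r)
                .collect()
        }
    }
}
```

In every mode the result is still a probability distribution over positions; only the hard mode guarantees exactly zero weight on positions that violate the rule.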

## TrustformeRS Integration

Bidirectional conversion with the TrustformeRS ecosystem:

```rust
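// `TrustformersWeightLoader` is assumed to be exported from the crate root
// alongside the other integration types.
use tensorlogic_trustformers::{
    IntegrationConfig, TrustformersConverter, TrustformersWeightLoader,
};

// Convert TrustformeRS architectures (BERT, GPT, T5) to TLExpr.
// `config` (an `IntegrationConfig`) and `bert_config` are built elsewhere.
let converter = TrustformersConverter::new(config)?;
let tlexpr = converter.convert_bert_encoder(bert_config)?;

// Load pretrained weights
let loader = TrustformersWeightLoader::new();
let weights = loader.load_checkpoint("model.bin")?;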
```

## Model Presets

```rust
use tensorlogic_trustformers::{presets, utils::encoder_stack_stats};

// Standard presets
let gpt2 = presets::gpt2_small();
let bert = presets::bert_base();
let (encoder, decoder) = presets::transformer_base();

// Get model statistics
let stats = encoder_stack_stats(&gpt2);
println!("{}", stats.summary());
// ModelStats:
//   Total params: 117.00M
//   Trainable: 117.00M
//   Layers: 12
//   d_model: 768
//   Memory: 468 MB
```
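
The memory figure in the sample output is consistent with 4 bytes per parameter (`f32`): 117M parameters * 4 bytes = 468 MB. The sketch below reproduces that arithmetic and adds a rough per-layer parameter count for a standard encoder layer with biases and two LayerNorms. Both helpers are hypothetical and are not the crate's `utils` functions, which may count parameters differently.

```rust
/// Weight memory in megabytes (10^6 bytes) for a given parameter count.
fn param_memory_mb(params: u64, bytes_per_param: u64) -> f64 {
    (params * bytes_per_param) as f64 / 1e6
}

/// Parameters in one standard encoder layer:
/// - attention: Q, K, V, and output projections, each d x d plus a bias of d
/// - feed-forward: d x d_ff and d_ff x d weights plus their biases
/// - two LayerNorms, each with a scale and a shift of size d
fn encoder_layer_params(d_model: u64, d_ff: u64) -> u64 {
    let attention = 4 * (d_model * d_model + d_model);
    let ffn = 2 * d_model * d_ff + d_ff + d_model;
    let layer_norms = 2 * 2 * d_model;
    attention + ffn + layer_norms
}
```

For `d_model = 768` and `d_ff = 3072` this gives 7,087,872 parameters per layer, or about 85M across 12 layers; token and position embeddings account for the remainder of a GPT-2-small-sized model.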

## Integration with TensorLogic

The einsum graphs produced by this crate integrate seamlessly with the TensorLogic ecosystem:

```rust
use tensorlogic_compiler::CompilerContext;
use tensorlogic_scirs_backend::Scirs2Executor;

// Compile the transformer graph
let mut ctx = CompilerContext::new();
// ... compile transformer einsum graph

// Execute on SciRS2 backend
let executor = Scirs2Executor::new();
// ... execute the graph
```

## Design Philosophy

1. **Backend Independence**: Same graph works on CPU, GPU, TPU
2. **Einsum-Native**: Clear mathematical semantics
3. **Composability**: Mix transformer layers with logical rules
4. **Type Safety**: Compile-time dimension checking where possible
5. **Zero Cost Abstractions**: No runtime overhead

## Examples

See the [examples directory](examples/) for 10 complete examples:

- `01_basic_encoder.rs` - Basic transformer encoder usage
- `02_trustformers_integration.rs` - TrustformeRS integration
- `03_rule_based_attention.rs` - Rule-based attention patterns
- `04_sparse_attention.rs` - Sparse attention for long sequences
- `05_gradient_checkpointing.rs` - Memory-efficient training strategies
- `06_kv_cache_inference.rs` - Fast autoregressive generation with KV-cache
- `07_vision_transformers.rs` - Vision Transformer (ViT) for image classification
- `08_mixture_of_experts.rs` - Mixture-of-Experts for sparse models
- `09_modern_llm_optimizations.rs` - GQA, Sliding Window, LoRA
- `10_modern_llm_complete.rs` - Complete modern LLM configurations

## Testing

```bash
cargo nextest run -p tensorlogic-trustformers
# 346 tests, all passing, zero warnings
```

## Benchmarking

```bash
cargo bench --bench model_benchmarks
```

This generates HTML reports in `target/criterion/` with detailed performance metrics.

## Performance

The einsum-based approach enables:

- **Operation Fusion**: Compiler can fuse consecutive operations
- **Memory Efficiency**: Minimal intermediate tensors
- **Parallelization**: Natural SIMD/GPU mapping
- **Optimization**: Graph-level optimizations

## References

- [Attention Is All You Need](https://arxiv.org/abs/1706.03762) - Original transformer paper
- [Tensor Logic Paper](https://arxiv.org/abs/2510.12269) - TensorLogic framework
- [Flash Attention](https://arxiv.org/abs/2205.14135) - Memory-efficient attention
- [LoRA](https://arxiv.org/abs/2106.09685) - Low-rank adaptation

## License

Apache-2.0

---

**Status**: Stable (v0.1.0)
**Last Updated**: 2026-04-06
**Tests**: 346/346 passing (100%)
**Examples**: 10 comprehensive examples
**Benchmarks**: Criterion suite with HTML reports
**Features**: Complete transformer implementation with modern LLM optimizations
**Part of**: [TensorLogic Ecosystem](https://github.com/cool-japan/tensorlogic)