# tenrso-exec
Unified execution API for TenRSo tensor operations.
## Overview
tenrso-exec provides the main user-facing API for executing tensor operations:
- `einsum_ex` - Unified einsum contraction interface
- `TenrsoExecutor` trait - Backend abstraction (CPU, GPU)
- Execution hints - Control representation, tiling, masking
- Auto-optimization - Automatic planner integration
All tensor operations (dense, sparse, low-rank) go through this unified interface.
## Features
- Single API for all tensor representations
- Automatic optimization via planner
- Memory pooling and device management
- Parallel execution
- Custom execution hints
## Usage
```toml
[dependencies]
tenrso-exec = "0.1"
```
### Basic Einsum (TODO: M4)
```rust
use tenrso_exec::einsum_ex;

// Simple matrix multiplication
// (API pending M4; the spec string and argument forms are reconstructed
// from the builder shape and may differ from the final interface)
let c = einsum_ex("ij,jk->ik")
    .inputs(&[&a, &b])
    .run()?;
```
### With Hints (TODO: M4)
```rust
use tenrso_exec::{einsum_ex, ExecutionHints};

// Tensor contraction with optimization hints
// (hint construction is a sketch; the concrete `ExecutionHints` API may differ)
let result = einsum_ex("abc,cd->abd")
    .inputs(&[&x, &y])
    .hints(ExecutionHints::default())
    .run()?;
```
### Element-wise & Reductions
```rust
use tenrso_exec::{CpuExecutor, ElemOp, ReduceOp};

let mut exec = CpuExecutor::new();

// Element-wise operation (the `ElemOp` selector shown is an assumption)
let abs_tensor = exec.elem_op(&tensor, ElemOp::Abs)?;

// Reduction along the given axes (argument forms are assumptions)
let sum = exec.reduce(&tensor, ReduceOp::Sum, &[0])?;
```
## Performance Configuration
tenrso-exec includes advanced optimization features that can be configured per executor:
```rust
use tenrso_exec::CpuExecutor;

// Default: all optimizations enabled
let mut exec = CpuExecutor::new();

// Custom configuration with selective optimizations
// (the boolean arguments are an assumption about the builder signatures)
let mut exec = CpuExecutor::new()
    .with_simd(true)                  // SIMD-accelerated operations
    .with_tiled_reductions(true)      // Cache-friendly blocked reductions
    .with_vectorized_broadcast(true); // Optimized broadcasting patterns

// Disable all optimizations (for debugging or baseline comparison)
let mut exec = CpuExecutor::unoptimized();
```
## Optimization Features
- **SIMD Operations** (`enable_simd`):
  - Vectorized element-wise operations (neg, abs, exp, log, sin, cos, etc.)
  - Vectorized binary operations (add, sub, mul, div, etc.)
  - Automatically activated for tensors ≥1024 elements
  - Typical speedup: 2-4× for simple ops, up to 8× for expensive ops (exp, sin)
- **Tiled Reductions** (`enable_tiled_reductions`):
  - Cache-friendly blocked reductions using 4KB tiles
  - Optimizes sum, mean, max, min operations
  - Automatically activated for tensors ≥100K elements
  - Typical speedup: 1.5-3× for large tensors (reduces cache misses)
- **Vectorized Broadcasting** (`enable_vectorized_broadcast`):
  - Pattern-aware broadcasting with specialized kernels
  - Detects common patterns (scalar, same-shape, axis-specific)
  - Parallel execution for large operations
  - Typical speedup: 1.5-2× for broadcast-heavy workloads
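The sketch below ties these features together: one executor with everything enabled, fed inputs sized past the documented activation thresholds. The `DenseTensor` type, the `ElemOp`/`ReduceOp` selectors, and the boolean builder arguments are assumptions about the API surface, not confirmed signatures.

```rust
use tenrso_core::DenseTensor; // tensor type name is an assumption
use tenrso_exec::{CpuExecutor, ElemOp, ReduceOp}; // selector enums are assumptions

// Exercise the SIMD and tiled-reduction paths on inputs large enough
// to cross the documented activation thresholds.
fn run_optimized(
    x: &DenseTensor,   // ≥1024 elements: SIMD path should activate
    big: &DenseTensor, // ≥100K elements: tiled-reduction path should activate
) -> Result<(), Box<dyn std::error::Error>> {
    let mut exec = CpuExecutor::new()
        .with_simd(true)
        .with_tiled_reductions(true)
        .with_vectorized_broadcast(true);

    let _exp = exec.elem_op(x, ElemOp::Exp)?;          // SIMD element-wise
    let _sum = exec.reduce(big, ReduceOp::Sum, &[0])?; // tiled reduction
    Ok(())
}
```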
## When to Use Each Optimization
Enable SIMD when:
- Working with large vectors/tensors (>1K elements)
- Performing many element-wise operations
- Using expensive math functions (exp, log, trigonometric)
Enable Tiled Reductions when:
- Reducing very large tensors (>100K elements)
- Memory bandwidth is a bottleneck
- Working with multi-dimensional reductions
Disable optimizations when:
- Debugging numerical differences
- Profiling baseline performance
- Working with very small tensors (<1K elements)
- Comparing against reference implementations
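For the debugging case in particular, a workable pattern is to run the same operation through a default executor and an unoptimized one, then compare outputs. This is a sketch under the same assumptions as above (`DenseTensor` type, `ElemOp` selector, and an `as_slice` accessor):

```rust
use tenrso_core::DenseTensor; // tensor type name is an assumption
use tenrso_exec::{CpuExecutor, ElemOp};

// Compare the optimized path against the scalar baseline on the same input.
fn verify_exp(x: &DenseTensor) -> Result<(), Box<dyn std::error::Error>> {
    let mut fast = CpuExecutor::new();             // all optimizations on
    let mut baseline = CpuExecutor::unoptimized(); // scalar reference

    let a = fast.elem_op(x, ElemOp::Exp)?;
    let b = baseline.elem_op(x, ElemOp::Exp)?;

    // `as_slice()` is an assumed accessor on the dense tensor type.
    let max_diff = a
        .as_slice()
        .iter()
        .zip(b.as_slice())
        .map(|(p, q)| (p - q).abs())
        .fold(0.0_f64, f64::max);
    println!("max |optimized - baseline| = {max_diff:e}");
    Ok(())
}
```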
## Performance Tuning Guidelines
- Default configuration is optimal for most workloads:

  ```rust
  let mut exec = CpuExecutor::new(); // All optimizations enabled
  ```

- For debugging or numerical verification:

  ```rust
  let mut exec = CpuExecutor::unoptimized();
  ```

- For memory-constrained environments:

  ```rust
  let mut exec = CpuExecutor::new()
      .with_tiled_reductions(true); // Reduce memory footprint
  ```

- For maximum throughput on modern CPUs:

  ```rust
  let mut exec = CpuExecutor::new(); // All optimizations enabled by default
  ```
## Benchmarking
Run comprehensive benchmarks to measure optimization impact:
```bash
# Run all benchmarks
cargo bench

# Run optimization-specific benchmarks (the bench target name was lost; substitute yours)
cargo bench --bench <bench_name>

# Compare optimized vs unoptimized performance (the filter name was lost; substitute yours)
cargo bench --bench <bench_name> -- <filter>
```
Benchmark results include:
- SIMD element-wise operations at various tensor sizes
- Tiled reductions vs standard reductions
- Combined optimization pipeline performance
- Automatic threshold detection verification
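For an ad-hoc comparison outside the bundled benches, a harness along these lines works. It assumes Criterion (whether tenrso-exec's own benchmarks use it is not confirmed here) and a hypothetical `DenseTensor::zeros` constructor:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use tenrso_core::DenseTensor; // `zeros` constructor is an assumption
use tenrso_exec::{CpuExecutor, ReduceOp};

fn bench_reduce(c: &mut Criterion) {
    // 1M elements: above the documented tiled-reduction threshold.
    let t = DenseTensor::zeros(&[1_000, 1_000]);

    let mut fast = CpuExecutor::new();
    let mut base = CpuExecutor::unoptimized();

    c.bench_function("reduce_sum/optimized", |b| {
        b.iter(|| fast.reduce(&t, ReduceOp::Sum, &[0]).unwrap())
    });
    c.bench_function("reduce_sum/unoptimized", |b| {
        b.iter(|| base.reduce(&t, ReduceOp::Sum, &[0]).unwrap())
    });
}

criterion_group!(benches, bench_reduce);
criterion_main!(benches);
```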
## API Reference
### Einsum Builder
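Provisional shape of the builder chain, inferred from the usage examples above (the einsum path is still marked TODO: M4):

```rust
// Provisional: names and order taken from the usage examples in this README.
let out = einsum_ex("ij,jk->ik")      // subscript spec
    .inputs(&[&a, &b])                // operand tensors
    .hints(ExecutionHints::default()) // optional execution hints
    .run()?;                          // execute on the active backend
```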
### Execution Hints
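The overview names representation, tiling, and masking as the knobs hints control; the field names below are hypothetical and shown only to indicate the shape:

```rust
// Hypothetical knobs, named after the controls listed in the overview;
// actual field names may differ.
let hints = ExecutionHints::default();
// - representation: preferred output form (dense / sparse / low-rank)
// - tiling:         tile sizes for blocked execution
// - masking:        structural masks applied during the contraction
```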
### Executor Trait
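A minimal sketch of the backend abstraction, with signatures inferred from the `elem_op`/`reduce` calls shown earlier; the actual trait definition may differ in tensor and error types:

```rust
// Sketch only: `DenseTensor`, `ElemOp`, and `ReduceOp` are the assumed
// types used throughout this README.
pub trait TenrsoExecutor {
    type Error;

    /// Apply an element-wise operation.
    fn elem_op(&mut self, x: &DenseTensor, op: ElemOp) -> Result<DenseTensor, Self::Error>;

    /// Reduce along the given axes.
    fn reduce(
        &mut self,
        x: &DenseTensor,
        op: ReduceOp,
        axes: &[usize],
    ) -> Result<DenseTensor, Self::Error>;
}
```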
## Dependencies
- `tenrso-core` - Tensor types
- `tenrso-kernels` - Tensor kernels
- `tenrso-sparse` - Sparse operations
- `tenrso-decomp` - Decompositions
- `tenrso-planner` - Contraction planning
- `tenrso-ooc` (optional) - Out-of-core support
## License
Apache-2.0