# TenfloweRS Core

The foundational crate of TenfloweRS, providing core tensor operations, device management, and the computational infrastructure for machine learning in Rust.

**Stable** (v0.1.0, 2026-03-20) | 675 tests passing | 0 clippy warnings
## Overview

`tenflowers-core` implements:
- Multi-dimensional tensor operations with CPU and GPU support
- Device abstraction for heterogeneous computing (CPU, WGPU, CUDA, Metal, ROCm)
- Efficient memory management and zero-copy operations where possible
- Integration with the NumRS2/SciRS2 ecosystem
- Operation registry with shape inference and kernel fusion
- Autocast, sparse tensors, fused ops, and advanced math functions
## Features
- Device Management: Seamless CPU/GPU tensor operations with automatic device placement
- Data Types: Support for `f32`, `f64`, `i32`, `i64`, `u8`, and more
- Operations: Comprehensive set of tensor operations, including:
  - Arithmetic: element-wise and broadcasting operations
  - Linear Algebra: matrix multiplication, decompositions, eigenvalues
  - Neural Network: convolutions, pooling, activations
  - Reductions: sum, mean, max, argmax along axes
  - Manipulation: reshape, transpose, concatenate, slice
  - Advanced Math: logsumexp, GELU, Mish, Swish, and more
- GPU Acceleration: WGPU-based compute shaders for cross-platform GPU support
- Operation Registry: Extensible dispatch registry with shape inference
- Kernel Fusion: Automatic fusion of eligible operation sequences
- Autocast: Automatic dtype promotion for mixed-precision workflows
- Sparse Tensors: COO and CSR sparse tensor support
- Fused Ops: Pre-fused compound operations for performance
- BLAS Integration: Optional acceleration via OxiBLAS
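To give a feel for the advanced math primitives listed above, here is a numerically stable `logsumexp` sketched in plain Rust. This is a standalone illustration of the underlying math, not the crate's API:

```rust
/// Numerically stable log-sum-exp: subtracting the maximum before
/// exponentiating prevents overflow for large inputs.
fn logsumexp(xs: &[f64]) -> f64 {
    let m = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    if m.is_infinite() && m < 0.0 {
        return f64::NEG_INFINITY; // empty input, or all elements are -inf
    }
    m + xs.iter().map(|&x| (x - m).exp()).sum::<f64>().ln()
}

fn main() {
    // ln(e^0 + e^0) = ln 2
    println!("{}", logsumexp(&[0.0, 0.0]));
    // Inputs this large would overflow a naive exp()-then-sum approach.
    println!("{}", logsumexp(&[1000.0, 1000.0]));
}
```

A naive `xs.iter().map(|x| x.exp()).sum::<f64>().ln()` returns `inf` for the second call; the max-shifted form stays finite.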
## Usage

### Basic Tensor Operations
```rust
use tenflowers_core::Tensor;

// Create tensors
let a = Tensor::<f32>::from_vec(vec![1.0, 2.0, 3.0, 4.0], &[2, 2])?;
let b = Tensor::<f32>::ones(&[2, 2])?;

// Arithmetic operations
let c = &a + &b;        // Element-wise addition
let d = a.matmul(&b)?;  // Matrix multiplication

// Reductions
let sum = c.sum()?;     // Sum all elements
let mean = c.mean(0)?;  // Mean along axis 0
```
### GPU Operations
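A minimal sketch of moving work onto the GPU, assuming a `Device::gpu` constructor and `to_device`/`to_cpu` methods (names illustrative; check the crate docs for exact signatures):

```rust
use tenflowers_core::{Device, Tensor};

// Select a GPU device (constructor name is an assumption).
let device = Device::gpu(0)?;

// Move tensors onto the device; subsequent operations dispatch
// to WGPU compute shaders.
let a = Tensor::<f32>::ones(&[1024, 1024])?.to_device(&device)?;
let b = Tensor::<f32>::ones(&[1024, 1024])?.to_device(&device)?;

let c = a.matmul(&b)?; // Executes on the GPU

// Copy the result back to host memory.
let host = c.to_cpu()?;
```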
### Computation Graphs
```rust
use tenflowers_core::{Graph, Session, Tensor};

// Build a computation graph
let mut graph = Graph::new();
let x = graph.placeholder("x");
let w = graph.variable("w");
let b = graph.variable("b");
let logits = graph.matmul(x, w)?;
let output = graph.add(logits, b)?;

// Execute with session
let mut session = Session::new(&graph);
let input = Tensor::<f32>::ones(&[1, 4])?;
let result = session.run(&[output], &[("x", &input)])?;
```
## Architecture

### Core Components
- Tensor: The fundamental data structure, wrapping device-specific storage
- Device: Abstraction over CPU and GPU devices with placement strategies
- TensorStorage: Internal storage handling CPU (ndarray) and GPU buffers
- Operations: Modular operation system with device-specific implementations
- Graph/Session: Static graph construction and optimized execution
- DispatchRegistry: Extensible operation dispatch with kernel selection
- ShapeInferenceRegistry: Automatic output shape computation
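To give a feel for how `DispatchRegistry` and `ShapeInferenceRegistry` fit together, here is a toy, self-contained sketch of the registry pattern in plain Rust. It is illustrative only, not the crate's actual types:

```rust
use std::collections::HashMap;

// A shape-inference rule: input shapes in, output shape out.
type ShapeFn = fn(&[Vec<usize>]) -> Vec<usize>;

/// Toy registry mapping operation names to shape-inference rules.
struct ShapeRegistry {
    rules: HashMap<&'static str, ShapeFn>,
}

impl ShapeRegistry {
    fn new() -> Self {
        Self { rules: HashMap::new() }
    }

    fn register(&mut self, op: &'static str, f: ShapeFn) {
        self.rules.insert(op, f);
    }

    /// Returns None for unregistered operations.
    fn infer(&self, op: &str, inputs: &[Vec<usize>]) -> Option<Vec<usize>> {
        self.rules.get(op).map(|f| f(inputs))
    }
}

fn main() {
    let mut reg = ShapeRegistry::new();
    // matmul: [m, k] x [k, n] -> [m, n]
    reg.register("matmul", |ins| vec![ins[0][0], ins[1][1]]);
    println!("{:?}", reg.infer("matmul", &[vec![2, 3], vec![3, 4]])); // Some([2, 4])
}
```

The real registries also carry device-specific kernels and fusion eligibility, but the lookup-by-op-name dispatch shape is the same idea.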
## Integration with NumRS2/SciRS2
This crate is designed to work seamlessly with the broader Rust scientific computing ecosystem:
```rust
use numrs2::Array2;
use tenflowers_core::Tensor;

// Convert from NumRS2 arrays
let array = Array2::from_shape_vec((2, 2), vec![1.0_f32, 2.0, 3.0, 4.0])?;
let tensor = Tensor::from_numrs2(&array)?;

// Convert to NumRS2 arrays
let array_back: Array2<f32> = tensor.to_numrs2()?;
```
## Feature Flags

- `std` (default): Standard library support
- `parallel` (default): Parallel CPU operations via Rayon
- `gpu`: Enable GPU support via WGPU
- `cuda`: CUDA backend support
- `metal`: Metal backend support (macOS)
- `rocm`: ROCm backend support (AMD GPUs)
- `blas-oxiblas`: Use OxiBLAS for accelerated linear algebra
- `simd`: SIMD vectorization optimizations
- `serialize`: Enable serialization support via serde
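Assuming the crate is published as `tenflowers-core`, flags are enabled in `Cargo.toml` in the usual way (feature names from the list above):

```toml
[dependencies]
tenflowers-core = { version = "0.1", features = ["gpu", "serialize"] }
```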
## Performance Considerations
- Tensors use reference counting for efficient memory management
- Operations are lazily evaluated when using computation graphs
- GPU operations are asynchronous and batched for efficiency
- Broadcasting follows NumPy semantics for compatibility
- Zero-copy views are used where possible (slicing, transposition)
- Kernel fusion reduces memory bandwidth pressure for eligible op sequences
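Since broadcasting follows NumPy semantics, the shape rule can be sketched independently of the crate: shapes are aligned from the trailing axis, and a pair of dimensions is compatible when they are equal or one of them is 1. An illustrative helper, not the crate's API:

```rust
/// NumPy-style broadcast shape computation. Returns None when the
/// shapes are incompatible.
fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let n = a.len().max(b.len());
    let mut out = vec![0; n];
    for i in 0..n {
        // Missing leading dimensions are treated as 1.
        let da = if i < n - a.len() { 1 } else { a[i - (n - a.len())] };
        let db = if i < n - b.len() { 1 } else { b[i - (n - b.len())] };
        out[i] = match (da, db) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None, // incompatible dimensions
        };
    }
    Some(out)
}

fn main() {
    println!("{:?}", broadcast_shape(&[2, 1, 4], &[3, 1])); // Some([2, 3, 4])
    println!("{:?}", broadcast_shape(&[2, 3], &[4]));       // None
}
```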
## Dependencies
Core dependencies:
- `ndarray`: CPU tensor storage and operations
- `num-traits`: Numeric trait bounds
- `rayon`: Parallel CPU operations
- `wgpu` (optional): GPU compute support
## License

Licensed under the Apache License, Version 2.0.