tenflowers-core 0.1.0

Core tensor operations and execution engine for TenfloweRS

TenfloweRS Core

The foundational crate of TenfloweRS, providing core tensor operations, device management, and the computational infrastructure for machine learning in Rust.

Stable (v0.1.0 -- 2026-03-20) | 675 tests passing | 0 clippy warnings

Overview

tenflowers-core implements:

  • Multi-dimensional tensor operations with CPU and GPU support
  • Device abstraction for heterogeneous computing (CPU, WGPU, CUDA, Metal, ROCm)
  • Efficient memory management and zero-copy operations where possible
  • Integration with the NumRS2/SciRS2 ecosystem
  • Operation registry with shape inference and kernel fusion
  • Autocast, sparse tensors, fused ops, and advanced math functions

Features

  • Device Management: Seamless CPU/GPU tensor operations with automatic device placement
  • Data Types: Support for f32, f64, i32, i64, u8, and more
  • Operations: Comprehensive set of tensor operations including:
    • Arithmetic: element-wise and broadcasting operations
    • Linear Algebra: matrix multiplication, decompositions, eigenvalues
    • Neural Network: convolutions, pooling, activations
    • Reductions: sum, mean, max, argmax along axes
    • Manipulation: reshape, transpose, concatenate, slice
    • Advanced Math: logsumexp, GELU, Mish, Swish, and more
  • GPU Acceleration: WGPU-based compute shaders for cross-platform GPU support
  • Operation Registry: Extensible dispatch registry with shape inference
  • Kernel Fusion: Automatic fusion of eligible operation sequences
  • Autocast: Automatic dtype promotion for mixed-precision workflows
  • Sparse Tensors: COO and CSR sparse tensor support
  • Fused Ops: Pre-fused compound operations for performance
  • BLAS Integration: Optional acceleration via OxiBLAS
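
Advanced math ops such as logsumexp are conventionally computed with the max-subtraction trick so large inputs do not overflow. A minimal CPU sketch of that technique in plain Rust (illustrative only, not the crate's actual kernel):

```rust
/// Numerically stable log-sum-exp: computes log(Σ exp(x_i)) as
/// m + log(Σ exp(x_i - m)) with m = max(x), so exp never overflows.
fn logsumexp(xs: &[f64]) -> f64 {
    let m = xs.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    if m == f64::NEG_INFINITY {
        // Empty input (or all -inf): the sum of exps is 0.
        return f64::NEG_INFINITY;
    }
    m + xs.iter().map(|&x| (x - m).exp()).sum::<f64>().ln()
}
```

With inputs around 1000.0, a naive `exp` would overflow to infinity; the shifted form still returns a finite result.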

Usage

Basic Tensor Operations

use tenflowers_core::{Tensor, Device, DType};

// Create tensors
let a = Tensor::from_vec(vec![1.0, 2.0, 3.0, 4.0], &[2, 2], Device::Cpu)?;
let b = Tensor::ones(&[2, 2], DType::F32, Device::Cpu)?;

// Arithmetic operations
let c = &a + &b;  // Element-wise addition
let d = a.matmul(&b)?;  // Matrix multiplication

// Reductions
let sum = c.sum(None)?;  // Sum all elements
let mean = c.mean(Some(&[0]))?;  // Mean along axis 0

GPU Operations

#[cfg(feature = "gpu")]
{
    let gpu_device = Device::Gpu(0);
    let a_gpu = a.to_device(&gpu_device)?;
    let b_gpu = b.to_device(&gpu_device)?;

    // Operations automatically dispatch to GPU kernels
    let c_gpu = a_gpu.matmul(&b_gpu)?;

    // Transfer back to CPU if needed
    let c_cpu = c_gpu.to_device(&Device::Cpu)?;
}

Computation Graphs

use tenflowers_core::{Graph, Session, Tensor, DType, Device};

// Build a computation graph
let mut graph = Graph::new();
let x = graph.placeholder("x", DType::F32, Some(&[None, Some(784)]));
let w = graph.variable("w", Tensor::randn(&[784, 10], DType::F32, Device::Cpu)?);
let b = graph.variable("b", Tensor::zeros(&[10], DType::F32, Device::Cpu)?);

let logits = graph.matmul(&x, &w)?;
let output = graph.add(&logits, &b)?;

// Execute with a session; `input_tensor` is a [batch, 784] f32 tensor
// prepared by the caller.
let mut session = Session::new(&graph);
let result = session.run(
    &[output],
    &[("x", input_tensor)],
)?;
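
The Graph/Session split follows the classic deferred-execution pattern: graph construction only records nodes, and evaluation happens in topological order when the session runs. A toy scalar version of that pattern in plain Rust (unrelated to the crate's real internals):

```rust
// Toy deferred-execution graph over scalars: nodes are recorded first,
// then evaluated in a single pass when run() is called.
enum Node {
    Input(usize),      // index into the feed values
    Add(usize, usize), // indices of previously pushed nodes
    Mul(usize, usize),
}

struct ToyGraph {
    nodes: Vec<Node>,
}

impl ToyGraph {
    fn new() -> Self {
        ToyGraph { nodes: Vec::new() }
    }

    /// Record a node and return its id.
    fn push(&mut self, n: Node) -> usize {
        self.nodes.push(n);
        self.nodes.len() - 1
    }

    /// Evaluate all nodes. Nodes can only reference earlier nodes,
    /// so insertion order is already a valid topological order.
    fn run(&self, feeds: &[f64]) -> Vec<f64> {
        let mut vals = Vec::with_capacity(self.nodes.len());
        for n in &self.nodes {
            let v = match *n {
                Node::Input(i) => feeds[i],
                Node::Add(a, b) => vals[a] + vals[b],
                Node::Mul(a, b) => vals[a] * vals[b],
            };
            vals.push(v);
        }
        vals
    }
}
```

The same structure underlies the real design: because nothing executes at build time, the whole graph is visible before the first run, which is what enables shape inference and kernel fusion passes.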

Architecture

Core Components

  • Tensor: The fundamental data structure, wrapping device-specific storage
  • Device: Abstraction over CPU and GPU devices with placement strategies
  • TensorStorage: Internal storage handling CPU (ndarray) and GPU buffers
  • Operations: Modular operation system with device-specific implementations
  • Graph/Session: Static graph construction and optimized execution
  • DispatchRegistry: Extensible operation dispatch with kernel selection
  • ShapeInferenceRegistry: Automatic output shape computation
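
A dispatch registry of this kind is commonly a map from an (operation, device) pair to a kernel function, so new backends can be registered without touching call sites. A simplified sketch with made-up names (not the crate's actual types):

```rust
use std::collections::HashMap;

// Simplified dispatch: look up a kernel by operation name and device kind.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum DeviceKind {
    Cpu,
    Gpu,
}

type Kernel = fn(&[f32], &[f32]) -> Vec<f32>;

struct Registry {
    kernels: HashMap<(String, DeviceKind), Kernel>,
}

impl Registry {
    fn new() -> Self {
        Registry { kernels: HashMap::new() }
    }

    fn register(&mut self, op: &str, dev: DeviceKind, k: Kernel) {
        self.kernels.insert((op.to_string(), dev), k);
    }

    fn dispatch(&self, op: &str, dev: DeviceKind) -> Option<&Kernel> {
        self.kernels.get(&(op.to_string(), dev))
    }
}

// A CPU "add" kernel; a real kernel would also validate shapes and dtypes.
fn add_cpu(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}
```

A lookup miss (e.g. an op with no GPU kernel registered) returns `None`, which is where a fallback policy such as "copy to CPU and retry" would plug in.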

Integration with NumRS2/SciRS2

This crate is designed to work seamlessly with the broader Rust scientific computing ecosystem:

use numrs2::array::Array2;
use tenflowers_core::{Device, Tensor};

// Convert from NumRS2 arrays
let array = Array2::from_shape_vec((3, 3), vec![1.0; 9])?;
let tensor = Tensor::from_numrs2(array, Device::Cpu)?;

// Convert to NumRS2 arrays
let array_back: Array2<f32> = tensor.to_numrs2()?;

Feature Flags

  • std (default): Standard library support
  • parallel (default): Parallel CPU operations via Rayon
  • gpu: Enable GPU support via WGPU
  • cuda: CUDA backend support
  • metal: Metal backend support (macOS)
  • rocm: ROCm backend support (AMD GPUs)
  • blas-oxiblas: Use OxiBLAS for accelerated linear algebra
  • simd: SIMD vectorization optimizations
  • serialize: Enable serialization support via serde
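
Optional backends are selected through Cargo features in the consuming crate's manifest, for example (feature names as listed above):

```toml
[dependencies]
tenflowers-core = { version = "0.1.0", features = ["gpu", "serialize"] }
```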

Performance Considerations

  • Tensors use reference counting for efficient memory management
  • Operations are lazily evaluated when using computation graphs
  • GPU operations are asynchronous and batched for efficiency
  • Broadcasting follows NumPy semantics for compatibility
  • Zero-copy views are used where possible (slicing, transposition)
  • Kernel fusion reduces memory bandwidth pressure for eligible op sequences
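
NumPy broadcasting aligns shapes from the trailing axis: each dimension pair must be equal, or one of them must be 1 (missing leading dimensions count as 1). The shape rule itself can be sketched independently of the crate:

```rust
/// Compute the broadcast shape of two shapes under NumPy rules:
/// align from the trailing axis; dims must match or one must be 1.
fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let n = a.len().max(b.len());
    let mut out = vec![0; n];
    for i in 0..n {
        // Missing leading dimensions are treated as 1.
        let da = if i < a.len() { a[a.len() - 1 - i] } else { 1 };
        let db = if i < b.len() { b[b.len() - 1 - i] } else { 1 };
        out[n - 1 - i] = match (da, db) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None, // incompatible dimensions
        };
    }
    Some(out)
}
```

For example, shapes `[2, 1, 3]` and `[4, 3]` broadcast to `[2, 4, 3]`, while `[2]` and `[3]` are incompatible.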

Dependencies

Core dependencies:

  • ndarray: CPU tensor storage and operations
  • num-traits: Numeric trait bounds
  • rayon: Parallel CPU operations
  • wgpu (optional): GPU compute support

License

Licensed under Apache-2.0