torsh-tensor 0.1.2

Tensor implementation for ToRSh with a PyTorch-compatible API

PyTorch-compatible tensor implementation for ToRSh, built on top of scirs2.

Overview

This crate provides the core Tensor type with a familiar PyTorch-like API, wrapping scirs2's powerful autograd functionality.

Features

  • PyTorch-compatible tensor operations
  • Automatic differentiation support
  • Broadcasting and shape manipulation
  • Comprehensive indexing and slicing
  • Integration with scirs2 for optimized computation
  • simd_ops_f32 module (v0.1.2): zero-allocation SIMD f32 arithmetic (add_into_f32, add_assign_f32, etc.) and activation functions with PyTorch NaN semantics (see the sketch after this list)
  • Real SIMD dispatch (v0.1.2): Tensor::add/sub/mul/div automatically use AVX2/NEON acceleration via scirs2_core for f32 tensors with ≥ 1024 elements
  • Zero-allocation in-place arithmetic (v0.1.2): add_/sub_/mul_/div_ dispatch through simd_*_inplace — no temporary buffers
  • In-place activation SIMD (v0.1.2): relu_/leaky_relu_/clamp_ route to SIMD helpers for maximum throughput
  • True buffer pool reuse (v0.1.2): GlobalMemoryPool::acquire_uninit::<T>() returns ReusedBuffer<T> with zero copy on pool hit
  • simd and parallel features enabled by default — no --features flag required
  • Allocation tracking benchmark (v0.1.2): benches/alloc_tracking.rs (harness=false, dhat) demonstrates a 100% allocation reduction with GlobalMemoryPool: 10,000 blocks on the naive path vs 0 on the pooled path
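
As a quick illustration of the simd_ops_f32 module, here is a minimal sketch. The slice-based signatures shown for add_into_f32 and add_assign_f32 are assumptions; consult the module documentation for the exact API.

use torsh_tensor::simd_ops_f32;

// Assumed slice-based signatures; the actual API may differ.
let a = vec![1.0_f32; 4096];
let b = vec![2.0_f32; 4096];
let mut out = vec![0.0_f32; 4096];

// Writes a + b into `out` with no temporary allocation.
simd_ops_f32::add_into_f32(&a, &b, &mut out);

// In-place variant: accumulates `b` directly into `a`.
let mut a = a;
simd_ops_f32::add_assign_f32(&mut a, &b);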

Usage

Basic Tensor Creation

use torsh_tensor::prelude::*;

// Create tensors using the tensor! macro
let a = tensor![1.0, 2.0, 3.0];
let b = tensor![[1.0, 2.0], [3.0, 4.0]];

// Create tensors with specific shapes
let zeros = zeros::<f32>(&[3, 4]);
let ones = ones::<f32>(&[2, 3]);
let eye = eye::<f32>(5);

// Random tensors
let uniform = rand::<f32>(&[3, 3]);
let normal = randn::<f32>(&[2, 4]);

Tensor Operations

// Element-wise operations (operand shapes must match or broadcast)
let a = tensor![[1.0, 2.0], [3.0, 4.0]];
let b = tensor![[5.0, 6.0], [7.0, 8.0]];
let c = a.add(&b)?;
let d = a.mul(&b)?;

// Matrix multiplication
let e = a.matmul(&b)?;

// Reductions
let sum = a.sum();
let mean = a.mean();
let max = a.max();

// Activation functions
let relu = a.relu();
let sigmoid = a.sigmoid();
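
The in-place variants listed under Features (add_, sub_, mul_, div_, relu_, and friends) mutate the receiver instead of allocating a result. A minimal sketch, assuming add_ returns a Result like its out-of-place counterpart:

// In-place arithmetic: `a` is overwritten with `a + b`, no temporary buffer.
let mut a = ones::<f32>(&[2, 2]);
let b = ones::<f32>(&[2, 2]);
a.add_(&b)?;

// In-place activation: negatives are clamped to zero in `a`'s own storage.
a.relu_();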

Shape Manipulation

// Reshape (element count must be preserved)
let a = rand::<f32>(&[2, 3]);
let reshaped = a.view(&[3, 2])?;

// Transpose
let transposed = a.t()?;

// Squeeze and unsqueeze
let squeezed = a.squeeze();
let unsqueezed = a.unsqueeze(0)?;

Automatic Differentiation

// Enable gradient computation
let x = tensor![2.0].requires_grad_(true);

// Forward pass
let y = x.pow(2.0)?.add(&x.mul(&tensor![3.0])?)?;

// Backward pass
y.backward()?;

// Access the gradient: dy/dx = 2x + 3, i.e. 7.0 at x = 2.0
let grad = x.grad().unwrap();

Indexing and Slicing

// Basic indexing (on a 2-D tensor)
let t = randn::<f32>(&[4, 6]);
let element = t.get(0)?;
let element_2d = t.get_2d(1, 2)?;

// Slicing with macros (on a 3-D tensor)
let t3 = randn::<f32>(&[8, 4, 12]);
let slice = t3.index(&[s![1..5], s![..], s![0..10; 2]])?;

// Boolean masking
let mask = t.gt(&zeros::<f32>(&[4, 6]))?;
let selected = t.masked_select(&mask)?;

Performance

torsh-tensor routes hot arithmetic paths through SIMD automatically when the simd feature is active (default since v0.1.2).

  • Element-wise arithmetic (add, sub, mul, div) on f32 tensors with ≥ 1024 elements dispatches through scirs2_core's AVX2 (x86-64) or NEON (AArch64) kernels.
  • In-place variants (add_, sub_, mul_, div_) use simd_*_inplace — no intermediate allocation occurs at any tensor size.
  • Activation functions (relu_, leaky_relu_, clamp_) take the same in-place SIMD path.
  • The global memory pool (GlobalMemoryPool) returns slabs without copying when the requested size matches an existing free buffer (acquire_uninit::<T>()); see the sketch below.

No special build flags are needed on supported targets; CPU feature detection is performed at runtime by scirs2_core.
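
A minimal sketch of the pool-reuse path described above. The element-count parameter and import path are assumptions, so the real signature may differ:

use torsh_tensor::GlobalMemoryPool;

// First acquisition allocates a fresh slab; dropping the ReusedBuffer
// returns it to the pool instead of freeing it.
let buf = GlobalMemoryPool::acquire_uninit::<f32>(1024);
drop(buf);

// A second request of the same size is a pool hit: the slab comes back
// with zero copy and zero new allocation.
let reused = GlobalMemoryPool::acquire_uninit::<f32>(1024);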

Recent Changes

v0.1.2 — 2026-04-26

  • Added simd_ops_f32 module with zero-allocation SIMD f32 arithmetic and activations (PyTorch NaN semantics).
  • Wired real SIMD dispatch into Tensor::add/sub/mul/div for f32 tensors ≥ 1024 elements (AVX2/NEON via scirs2_core).
  • add_/sub_/mul_/div_ now call simd_*_inplace — zero extra allocations.
  • relu_/leaky_relu_/clamp_ dispatch to SIMD helpers.
  • GlobalMemoryPool::acquire_uninit::<T>() returns ReusedBuffer<T> with no copy on pool hit.
  • simd and parallel features promoted to default features.

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.