simdly 0.1.0

🚀 High-performance Rust library leveraging SIMD and Rayon for fast computations.

Simdly

Simdly is a Rust project focused on exploring and utilizing SIMD (Single Instruction, Multiple Data) operations for high-performance computing. It uses Rayon for data parallelism and Criterion for benchmarking, and aims to provide and demonstrate efficient numerical computations.

Project Status

Note: This project is currently under active development. The API, features, and benchmark results may change as development progresses. It is primarily intended for exploration and demonstration purposes at this stage.

Platform Support

This project is currently being developed and tested on Linux and macOS systems. While Rust is cross-platform, specific SIMD intrinsics or performance characteristics might vary on other operating systems like Windows. The benchmarking and development efforts are focused on Unix-like environments.

SIMD, Alignment, and Rust's Vec<T>/Slices

SIMD (Single Instruction, Multiple Data) allows a single instruction to operate on multiple data elements simultaneously. To do this efficiently, SIMD hardware often prefers or requires data to be loaded into its wide registers from memory locations that are aligned to specific boundaries (e.g., 16-byte, 32-byte, or 64-byte, depending on the SIMD instruction set like SSE, AVX, AVX-512).
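
Whether a given buffer happens to meet one of these boundaries can be checked at runtime from its start address. The following is a minimal sketch; the helper names is_aligned_to and simd_alignment_report are hypothetical and not part of Simdly.

```rust
/// Returns true if `ptr` sits on an `align`-byte boundary.
/// `align` must be a power of two.
pub fn is_aligned_to<T>(ptr: *const T, align: usize) -> bool {
    assert!(align.is_power_of_two(), "alignment must be a power of two");
    (ptr as usize) % align == 0
}

/// Reports whether the start of `data` meets the 16-, 32-, and 64-byte
/// boundaries associated with SSE, AVX, and AVX-512 respectively.
pub fn simd_alignment_report(data: &[f32]) -> [(usize, bool); 3] {
    [16, 32, 64].map(|align| (align, is_aligned_to(data.as_ptr(), align)))
}
```

The result depends on where the allocator placed the buffer, so it can differ from run to run. A sub-slice that starts one f32 later is shifted by 4 bytes, so it cannot share a 16-byte boundary with the slice it was taken from.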

  • Aligned vs. Unaligned Accesses:

    • Aligned accesses are generally faster because the CPU can load/store the entire SIMD vector's worth of data in a single memory operation.
    • Unaligned accesses can incur performance penalties. The CPU might need multiple memory operations or internal fix-ups to handle data that crosses alignment boundaries. Worse, using an aligned-load instruction (such as movaps, which the _mm_load_ps intrinsic maps to) on an unaligned address raises a hardware fault, which typically crashes the program.
  • Rust's Vec<T> and Slices (&[T], &mut [T]):

    • Standard Rust Vec<T> allocates memory with an alignment suitable for the type T itself. For example, a Vec<f32> will have its elements aligned to 4 bytes.
    • However, Vec<T> does not inherently guarantee that the start of its data buffer will be aligned to larger boundaries (like 16 or 32 bytes) that are optimal for SIMD operations.
    • Slices created from a Vec<T> inherit this alignment characteristic.
  • Simdly's Approach:

    • The intention for Simdly, particularly for its custom SIMD functions like simd_add (simdly), is to operate directly on standard Rust Vec<T> and slices without requiring or modifying their alignment.
    • This means Simdly's SIMD code must be written to correctly handle potentially unaligned data. This typically involves:
      1. Processing the leading elements of a slice with scalar operations until the current address is a multiple of the SIMD vector width.
      2. Performing the bulk of the work with SIMD instructions on that aligned middle portion.
      3. Processing the trailing elements, which are too few to fill a full SIMD vector, with scalar operations.
    • Alternatively, SIMD intrinsics that explicitly support unaligned loads and stores (e.g., _mm_loadu_ps in SSE, _mm256_loadu_ps in AVX) can be used throughout, offering safety at a potential performance cost compared to their aligned counterparts (_mm_load_ps, _mm256_load_ps).
    • The goal is to provide SIMD acceleration that is broadly applicable to common Rust data structures without imposing special alignment burdens on the user.
  • Benchmarking Context:

    • The current benchmarks use ndarray. ndarray itself might employ strategies to manage or prefer aligned data for its internal operations, which could influence its performance characteristics.
    • The performance of simd_add (simdly) and par_simd_add (avx2) in the benchmarks reflects their execution on data as prepared by the benchmark setup (which uses ndarray for input generation). For a thorough understanding, specific benchmarks on data explicitly known to be unaligned (e.g., from Vecs with offset slices) would be beneficial.
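
The unaligned-intrinsics alternative described above can be sketched as follows. This is an illustration under stated assumptions, not Simdly's actual simd_add: the names simd_add_unaligned and add_avx2 are hypothetical, the element type is assumed to be f32, and the code falls back to scalar addition when AVX2 is not detected or the target is not x86_64.

```rust
/// Scalar reference: element-by-element addition.
pub fn scalar_add(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

/// 256-bit path: 8 f32 lanes per iteration using unaligned loads and stores,
/// followed by a scalar loop for the leftover elements.
/// (The f32 intrinsics used here only need AVX, which AVX2 implies.)
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_avx2(a: &[f32], b: &[f32], out: &mut [f32]) {
    use std::arch::x86_64::{_mm256_add_ps, _mm256_loadu_ps, _mm256_storeu_ps};
    const LANES: usize = 8; // f32 values per 256-bit register
    let n = a.len();
    let simd_end = n - n % LANES;
    let mut i = 0;
    while i < simd_end {
        // SAFETY: i + LANES <= simd_end <= n, and all three slices have length n.
        unsafe {
            let va = _mm256_loadu_ps(a.as_ptr().add(i));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i));
            _mm256_storeu_ps(out.as_mut_ptr().add(i), _mm256_add_ps(va, vb));
        }
        i += LANES;
    }
    // Scalar tail: fewer than LANES elements remain.
    for j in simd_end..n {
        out[j] = a[j] + b[j];
    }
}

/// Adds two slices of any alignment, using SIMD when it is available.
pub fn simd_add_unaligned(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            let mut out = vec![0.0f32; a.len()];
            // SAFETY: AVX2 support was verified at runtime just above,
            // and `out` has the same length as both inputs.
            unsafe { add_avx2(a, b, &mut out) };
            return out;
        }
    }
    scalar_add(a, b)
}
```

Because only unaligned loads and stores are used, no scalar prologue is needed. Calling simd_add_unaligned(&a[1..], &b[1..]) on slices offset by one element is one way to exercise deliberately misaligned input.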

Features

  • SIMD Exploration: Focuses on demonstrating and benchmarking SIMD-accelerated computations.
  • Parallel Processing: Utilizes Rayon for data parallelism, enabling efficient use of multi-core processors.
  • Numerical Computing: Integrates with Ndarray for n-dimensional array operations, common in scientific computing.
  • Comprehensive Benchmarking: Employs Criterion.rs for detailed performance analysis of different vector addition strategies.
  • Optimized Release Builds: Cargo.toml is configured for "fat" LTO and codegen-units = 1 in release profiles for maximum performance.

Getting Started

Cloning the Repository

  1. Clone the repository to your local machine:
    git clone https://github.com/mtantaoui/simdly.git
    cd simdly
    

Building the Project

Once you are in the project directory, you can build the project using Cargo:

cargo build

Usage

Currently, Simdly primarily serves as a demonstration and benchmarking platform for SIMD and parallel computation techniques in Rust. The core logic for these demonstrations can be found within the benchmark files (e.g., benches/add.rs).

To use any library components (if developed in src/), you would typically add simdly as a dependency in your Cargo.toml and import its modules.
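
As a rough sketch, the dependency entry would look like the fragment below. The version is the release named at the top of this page; which modules to import depends on what src/ ends up exposing, so no import paths are shown.

```toml
[dependencies]
simdly = "0.1.0"
```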

Building

Release Build

For a production-ready, optimized build:

cargo build --release

The project's Cargo.toml specifies aggressive optimization settings for release builds:

[profile.release]
lto = "fat"
codegen-units = 1

This configuration aims to maximize runtime performance. The compiled artifacts will be located in target/release/.

Benchmarking Environment

The performance results detailed in the "Benchmark Reports and Performance" section were obtained on the following system:

  • OS: Ubuntu 22.04 jammy
  • CPU: AMD EPYC 7571 @ 8x 2.2GHz
  • RAM: 31828 MiB (Total)

Performance can vary significantly across different hardware and software configurations. The provided benchmark data should be considered relative to this specific environment.

Running Benchmarks

Overview

This project uses Criterion.rs for robust statistical benchmarking. Benchmarks are crucial for evaluating the performance of SIMD implementations, parallelization strategies, and comparing them against standard library or other crate functionalities.

Available Benchmarks

The Cargo.toml defines one benchmark suite:

  • add:
    • Source File: benches/add.rs (assumed, based on [[bench]] name = "add")
    • Description: This benchmark suite evaluates the performance of vector addition using various approaches. It is parameterized by vector size, allowing comparison across different scales. The benchmarked functions within this suite include:
      • ndarray: Vector addition using ndarray's built-in operators.
      • scalar_add: A naive, element-by-element scalar addition.
      • simd_add (simdly): A custom SIMD implementation for vector addition provided by this simdly project.
      • par_simd_add (avx2): A parallelized SIMD implementation (likely using Rayon and targeting the AVX2 instruction set).
    • Harness: false (Criterion's main macro is used directly in the benchmark source file).
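
Put together, the relevant Cargo.toml entries would look roughly like this. The fragment is reconstructed from the bench name and harness setting described above and from the versions listed under Development Dependencies; the actual file may contain additional keys.

```toml
[dev-dependencies]
criterion = "0.6.0"
ndarray = "0.16.1"

[[bench]]
name = "add"
harness = false
```

Setting harness = false disables Cargo's built-in test harness for this target so that Criterion can supply its own entry point.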

How to Run

You can run benchmarks using Cargo:

  • Run all benchmarks (in this case, the add suite with all its parameterized groups):
    cargo bench
    
  • Run a specific benchmark group or function by name/filter: Criterion allows filtering. For example, to run only the ndarray benchmark for a specific size:
    cargo bench -- "VectorAddition/30000/ndarray"
    
    Or to run all benchmarks for the 30000 element vector size:
    cargo bench -- "VectorAddition/30000"
    

For more options, run with --help:

cargo bench -- --help

Benchmark Reports and Performance

After running the benchmarks, Criterion generates detailed HTML reports. These reports can be found in the target/criterion/ directory.

  • Summary Reports for each Parameterized Group: Located at target/criterion/VectorAddition_<size>/report/index.html (e.g., target/criterion/VectorAddition_30000/report/index.html). These provide an overview and violin plots comparing all functions for that specific vector size.
  • Detailed Reports for each Function: Located at target/criterion/VectorAddition_<size>/<function_name>/report/index.html (e.g., target/criterion/VectorAddition_30000/ndarray/report/index.html). These offer in-depth statistics, probability density function (PDF) plots, and regression analysis for a specific function at a particular size.

Summary of add Benchmark Results

The following tables summarize the performance (typical time per operation) of the add benchmark across vector sizes and implementations, based on the Criterion report data generated in the Benchmarking Environment described above. "Time" is the estimated time per iteration: the Slope estimate from Criterion for linear sampling, or the Mean for flat sampling.

Vector Size: 30,000 elements (VectorAddition/30000)

Function              Time (Mean)
simd_add (simdly)     6.1150 µs
scalar_add            6.4880 µs
ndarray               6.6200 µs
par_simd_add (avx2)   76.642 µs

Vector Size: 150,000 elements (VectorAddition/150000)

Function              Time (Mean)
simd_add (simdly)     32.194 µs
scalar_add            35.082 µs
ndarray               35.384 µs
par_simd_add (avx2)   187.10 µs

Vector Size: 1,048,576 elements (VectorAddition/1048576)

Function              Time (Mean)
par_simd_add (avx2)   550.38 µs
simd_add (simdly)     669.06 µs
scalar_add            867.40 µs
ndarray               886.10 µs

Vector Size: 1,073,741,824 elements (VectorAddition/1073741824) (Note: Time is based on Mean, as the sampling mode was Flat.)

Function              Time (Mean)
par_simd_add (avx2)   1.4298 s
simd_add (simdly)     4.1620 s
ndarray               4.4635 s
scalar_add            4.9750 s
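
Throughput and relative speedup can be derived from the times in the tables above. The sketch below shows the arithmetic for simd_add (simdly) and par_simd_add (avx2); the function names are hypothetical, and the inputs are simply the table values converted to seconds.

```rust
/// Elements processed per second when `elements` items take `seconds`.
pub fn throughput(elements: u64, seconds: f64) -> f64 {
    elements as f64 / seconds
}

/// How many times faster `candidate` is than `baseline` (same time unit).
/// Values below 1.0 mean the candidate is slower.
pub fn speedup(baseline: f64, candidate: f64) -> f64 {
    baseline / candidate
}

/// One line per vector size: throughput of both implementations and the
/// speedup of the parallel version over the sequential SIMD version.
pub fn report() -> String {
    // (label, elements, simd_add seconds, par_simd_add seconds)
    let rows = [
        ("30k", 30_000u64, 6.1150e-6, 76.642e-6),
        ("150k", 150_000, 32.194e-6, 187.10e-6),
        ("1M", 1_048_576, 669.06e-6, 550.38e-6),
        ("1B", 1_073_741_824, 4.1620, 1.4298),
    ];
    rows.iter()
        .map(|(label, n, simd, par)| {
            format!(
                "{label:>4}: simd_add {:.2} Gelem/s, par_simd_add {:.2} Gelem/s, par speedup {:.2}x",
                throughput(*n, *simd) / 1e9,
                throughput(*n, *par) / 1e9,
                speedup(*simd, *par),
            )
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

For example, 30,000 elements in 6.1150 µs works out to roughly 4.9 billion elements per second for simd_add, and at 1,073,741,824 elements the parallel version is roughly 2.9 times faster than the sequential one.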

Observations from Benchmarks:

  • For smaller vector sizes (30k, 150k elements), simd_add (simdly) has the lowest time per operation. The overhead of parallelism makes par_simd_add (avx2) considerably slower for these inputs: about 12.5x slower than simd_add at 30k elements and about 5.8x slower at 150k.
  • ndarray and scalar_add (likely auto-vectorized by the compiler) are competitive at these sizes, within roughly 6% to 10% of simd_add.
  • At 1M elements, par_simd_add (avx2) starts to show its strength due to parallel execution and is about 1.2x faster than simd_add (simdly), which in turn remains about 1.3x faster than scalar_add and ndarray.
  • For very large vectors (1B elements), par_simd_add (avx2) is substantially faster, at roughly 2.9x the speed of simd_add (simdly) (1.4298 s vs. 4.1620 s). The sequential implementations sit much closer together (4.1620 s for simd_add, 4.4635 s for ndarray, 4.9750 s for scalar_add), which might indicate that memory bandwidth or other architectural effects, rather than instruction throughput, limit purely sequential processing at this scale.

These results highlight the trade-offs between different optimization strategies (explicit SIMD, parallelism, library abstractions) and how their effectiveness can vary with input size and the specific hardware capabilities (like AVX2 support and number of cores).
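
One common way to act on this trade-off is to parallelize only above a size threshold. The sketch below illustrates the idea with std::thread::scope as a dependency-free stand-in for Rayon; it is not how par_simd_add is implemented, and add_with_threshold and min_parallel_len are hypothetical names. On the benchmark machine above, the crossover lies somewhere between 150,000 and 1,048,576 elements; a suitable value on other hardware would have to be measured.

```rust
use std::thread;

/// Sequential element-wise addition of `a` and `b` into `out`.
fn add_into(a: &[f32], b: &[f32], out: &mut [f32]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

/// Adds `a` and `b`, splitting the work across threads only when the input
/// has at least `min_parallel_len` elements.
pub fn add_with_threshold(a: &[f32], b: &[f32], min_parallel_len: usize) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    let mut out = vec![0.0f32; a.len()];
    let threads = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);

    // Small (or empty) inputs: thread start-up would cost more than it saves.
    if a.len() < min_parallel_len.max(1) || threads == 1 {
        add_into(a, b, &mut out);
        return out;
    }

    // Ceiling division so every element lands in exactly one chunk.
    let chunk = (a.len() + threads - 1) / threads;
    thread::scope(|s| {
        let chunks = out
            .chunks_mut(chunk)
            .zip(a.chunks(chunk))
            .zip(b.chunks(chunk));
        for ((o, x), y) in chunks {
            s.spawn(move || add_into(x, y, o));
        }
    });
    out
}
```

Passing usize::MAX as the threshold forces the sequential path and 0 forces the parallel path, which makes it straightforward to benchmark both on the same input.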

Dependencies

Simdly relies on several high-quality Rust crates:

Main Dependencies

  • rayon (1.10.0): A data parallelism library for Rust, making it easy to convert sequential computations into parallel ones.

Development Dependencies

These dependencies are used for development tasks like benchmarking:

  • criterion (0.6.0): A statistics-driven benchmarking framework.
  • ndarray (0.16.1): An N-dimensional array type for Rust (an alternative to nested Vec<Vec<T>>), used here for numerical computing and as a benchmark baseline.

For a full list of all direct and transitive dependencies, please refer to the Cargo.lock file.

License

This project is licensed under the MIT License. Copyright (c) 2025 Mahdi Tantaoui.

See the LICENSE file for the full license text.

Contributing

Contributions are welcome! If you'd like to contribute, please feel free to:

  1. Fork the repository.
  2. Create a new branch for your feature or bug fix (e.g., feature/my-new-feature or fix/issue-tracker-bug).
  3. Make your changes. Ensure code is well-formatted using cargo fmt.
  4. Add or update relevant tests or benchmarks for your changes.
  5. Write clear and descriptive commit messages.
  6. Push your branch to your fork and open a Pull Request against the main repository.

Please consider opening an issue first to discuss any significant changes or new features you plan to implement.