Simdly
Simdly is a Rust project that explores SIMD (Single Instruction, Multiple Data) operations for high-performance computing. Using modern Rust features and libraries such as Rayon for parallelism and Criterion for benchmarking, Simdly aims to provide or demonstrate efficient numerical computations.
Table of Contents
- Project Status
- Platform Support
- SIMD, Alignment, and Rust's Vec<T>/Slices
- Features
- Prerequisites
- Getting Started
- Usage
- Building
- Benchmarking Environment
- Running Benchmarks
- Dependencies
- License
- Contributing
Project Status
Note: This project is currently under active development. The API, features, and benchmark results may change as development progresses. It is primarily intended for exploration and demonstration purposes at this stage.
Platform Support
This project is currently being developed and tested on Linux and macOS systems. While Rust is cross-platform, specific SIMD intrinsics or performance characteristics might vary on other operating systems like Windows. The benchmarking and development efforts are focused on Unix-like environments.
SIMD, Alignment, and Rust's Vec<T>/Slices
SIMD (Single Instruction, Multiple Data) allows a single instruction to operate on multiple data elements simultaneously. To do this efficiently, SIMD hardware often prefers or requires data to be loaded into its wide registers from memory locations that are aligned to specific boundaries (e.g., 16-byte, 32-byte, or 64-byte, depending on the SIMD instruction set like SSE, AVX, AVX-512).
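As a minimal sketch of what these boundaries mean in practice, the helper below reports how many bytes a buffer's start address lies past the previous 16-, 32-, or 64-byte boundary. The name `misalignment` is hypothetical and is not part of Simdly.

```rust
/// Number of bytes by which `slice` starts past the previous
/// `boundary`-byte boundary. Returns 0 when the slice starts exactly
/// on a boundary. `boundary` must be a power of two (16, 32, and 64
/// are the boundaries mentioned above).
fn misalignment<T>(slice: &[T], boundary: usize) -> usize {
    assert!(boundary.is_power_of_two());
    // For a power-of-two boundary, masking the address with
    // `boundary - 1` is the same as taking it modulo `boundary`.
    (slice.as_ptr() as usize) & (boundary - 1)
}
```

Calling `misalignment(&v, 32)` on a `Vec<f32>` returns some multiple of 4 below 32. The exact value depends on the allocator, so it can differ between runs and platforms.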
- Aligned vs. Unaligned Accesses:
  - Aligned accesses are generally faster because the CPU can load/store the entire SIMD vector's worth of data in a single memory operation.
  - Unaligned accesses can incur performance penalties. The CPU might need multiple memory operations or internal fix-ups to handle data that crosses alignment boundaries. Instructions that require alignment, such as the SSE aligned load behind _mm_load_ps, raise a fault when given a misaligned address, which typically crashes the program.
- Rust's Vec<T> and Slices (&[T], &mut [T]):
  - Standard Rust Vec<T> allocates memory with an alignment suitable for the type T itself. For example, a Vec<f32> will have its elements aligned to 4 bytes.
  - However, Vec<T> does not inherently guarantee that the start of its data buffer will be aligned to larger boundaries (like 16 or 32 bytes) that are optimal for SIMD operations.
  - Slices created from a Vec<T> inherit this alignment characteristic.
- Simdly's Approach:
  - The intention for Simdly, particularly for its custom SIMD functions like simd_add (simdly), is to operate directly on standard Rust Vec<T> and slices without requiring or modifying their alignment.
  - This means Simdly's SIMD code must be written to correctly handle potentially unaligned data. This typically involves:
    - Processing any initial unaligned elements at the beginning of a slice using scalar operations (or specific unaligned SIMD loads if careful) until a point where the remaining data is aligned relative to the SIMD vector width.
    - Performing the bulk of the operations using SIMD instructions on the aligned portion.
    - Processing any remaining unaligned elements at the end of the slice using scalar operations.
  - Alternatively, SIMD intrinsics that explicitly support unaligned loads and stores (e.g., _mm_loadu_ps in SSE, _mm256_loadu_ps in AVX) can be used throughout, offering safety at a potential performance cost compared to their aligned counterparts (_mm_load_ps, _mm256_load_ps).
  - The goal is to provide SIMD acceleration that is broadly applicable to common Rust data structures without imposing special alignment burdens on the user.
- Benchmarking Context:
  - The current benchmarks use ndarray. ndarray itself might employ strategies to manage or prefer aligned data for its internal operations, which could influence its performance characteristics.
  - The performance of simd_add (simdly) and par_simd_add (avx2) in the benchmarks reflects their execution on data as prepared by the benchmark setup (which uses ndarray for input generation). For a thorough understanding, specific benchmarks on data explicitly known to be unaligned (e.g., from Vecs with offset slices) would be beneficial.
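The "unaligned intrinsics throughout" strategy described above can be sketched as follows. This is an illustrative version under stated assumptions, not Simdly's actual simd_add: the function names `add_unaligned` and `add_avx` are hypothetical, the element type is assumed to be f32, and AVX is detected at runtime with a scalar fallback on other targets.

```rust
/// Element-wise `out[i] = a[i] + b[i]` with no alignment requirement
/// on any of the three slices. (Hypothetical name, for illustration.)
pub fn add_unaligned(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), out.len());

    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was confirmed at runtime just above.
            unsafe { add_avx(a, b, out) };
            return;
        }
    }

    // Portable scalar fallback for CPUs or targets without AVX.
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn add_avx(a: &[f32], b: &[f32], out: &mut [f32]) {
    use std::arch::x86_64::{_mm256_add_ps, _mm256_loadu_ps, _mm256_storeu_ps};

    const LANES: usize = 8; // one 256-bit vector holds eight f32 values
    let n = a.len();
    let body = n - n % LANES; // largest multiple of LANES that fits

    let mut i = 0;
    while i < body {
        // SAFETY: i + LANES <= body <= n, and all slices have length n.
        // The `loadu`/`storeu` variants accept any address.
        unsafe {
            let va = _mm256_loadu_ps(a.as_ptr().add(i));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i));
            _mm256_storeu_ps(out.as_mut_ptr().add(i), _mm256_add_ps(va, vb));
        }
        i += LANES;
    }

    // Scalar tail: the last n % LANES elements.
    for j in body..n {
        out[j] = a[j] + b[j];
    }
}
```

Passing offset slices such as `&data[1..]` is one way to exercise the function on input that is unlikely to start on a 32-byte boundary, which is the kind of case the benchmarking note above asks for.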
Features
- SIMD Exploration: Focuses on demonstrating and benchmarking SIMD-accelerated computations.
- Parallel Processing: Utilizes Rayon for data parallelism, enabling efficient use of multi-core processors.
- Numerical Computing: Integrates with Ndarray for n-dimensional array operations, common in scientific computing.
- Comprehensive Benchmarking: Employs Criterion.rs for detailed performance analysis of different vector addition strategies.
- Optimized Release Builds: Cargo.toml is configured for "fat" LTO and codegen-units = 1 in release profiles for maximum performance.
Getting Started
Cloning the Repository
- Clone the repository to your local machine with git clone, then change into the project directory.
Building the Project
Once you are in the project directory, you can build the project using Cargo by running cargo build.
Usage
Currently, Simdly primarily serves as a demonstration and benchmarking platform for SIMD and parallel computation techniques in Rust. The core logic for these demonstrations can be found within the benchmark files (e.g., benches/add.rs).
To use any library components (if developed in src/), you would typically add simdly as a dependency in your Cargo.toml and import its modules.
Building
Release Build
For a production-ready, optimized build, run cargo build --release.
The project's Cargo.toml specifies aggressive optimization settings for release builds:
[profile.release]
lto = "fat"
codegen-units = 1
This configuration aims to maximize runtime performance. The compiled artifacts will be located in target/release/.
Benchmarking Environment
The performance results detailed in the "Benchmark Reports and Performance" section were obtained on the following system:
- OS: Ubuntu 22.04 jammy
- CPU: AMD EPYC 7571 @ 8x 2.2GHz
- RAM: 31828 MiB (Total)
Performance can vary significantly across different hardware and software configurations. The provided benchmark data should be considered relative to this specific environment.
Running Benchmarks
Overview
This project uses Criterion.rs for robust statistical benchmarking. Benchmarks are crucial for evaluating the performance of SIMD implementations, parallelization strategies, and comparing them against standard library or other crate functionalities.
Available Benchmarks
The Cargo.toml defines one benchmark suite:
- add:
  - Source File: benches/add.rs (assumed, based on [[bench]] name = "add")
  - Description: This benchmark suite evaluates the performance of vector addition using various approaches. It is parameterized by vector size, allowing comparison across different scales. The benchmarked functions within this suite include:
    - ndarray: Vector addition using ndarray's built-in operators.
    - scalar_add: A naive, element-by-element scalar addition.
    - simd_add (simdly): A custom SIMD implementation for vector addition provided by this simdly project.
    - par_simd_add (avx2): A parallelized SIMD implementation (likely using Rayon and targeting AVX2 instruction set).
  - Harness: false (Criterion's main macro is used directly in the benchmark source file).
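To make the difference between the sequential and parallel variants concrete, here is a minimal sketch of a scalar addition and a chunked parallel wrapper around it. It is an illustration only: the benchmark's par_simd_add reportedly uses Rayon, whereas this sketch swaps in std::thread::scope so that it needs no external crate. The `par_add` name and its `workers` parameter are hypothetical.

```rust
use std::thread;

/// Naive element-by-element addition, in the spirit of `scalar_add`.
fn scalar_add(a: &[f32], b: &[f32], out: &mut [f32]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

/// Splits the inputs into roughly `workers` contiguous chunks and adds
/// each chunk on its own scoped thread. (Hypothetical helper.)
fn par_add(a: &[f32], b: &[f32], out: &mut [f32], workers: usize) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), out.len());

    // Chunk length, rounded up so that at most `workers` chunks exist.
    // The final `.max(1)` keeps `chunks` valid for empty inputs.
    let chunk = a.len().div_ceil(workers.max(1)).max(1);

    thread::scope(|s| {
        let chunks = out
            .chunks_mut(chunk)
            .zip(a.chunks(chunk))
            .zip(b.chunks(chunk));
        for ((o, x), y) in chunks {
            // Each thread owns a disjoint mutable chunk of `out`.
            s.spawn(move || scalar_add(x, y, o));
        }
    });
}
```

Spawning threads has a fixed cost per call, which is one plausible reason a parallel version can lose to a sequential one on small inputs.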
How to Run
You can run benchmarks using Cargo:
- Run all benchmarks (in this case, the add suite with all its parameterized groups) with cargo bench.
- Run a specific benchmark group or function by passing a name filter after --, as in cargo bench -- <filter>.
Criterion matches the filter against benchmark names. For example, a filter of VectorAddition/30000/ndarray runs only the ndarray benchmark for that size, and a filter of VectorAddition/30000 runs all benchmarks for the 30000 element vector size.
For more options, run cargo bench -- --help.
Benchmark Reports and Performance
After running the benchmarks, Criterion generates detailed HTML reports. These reports can be found in the target/criterion/ directory.
- Summary Reports for each Parameterized Group: Located at target/criterion/VectorAddition_<size>/report/index.html (e.g., target/criterion/VectorAddition_30000/report/index.html). These provide an overview and violin plots comparing all functions for that specific vector size.
- Detailed Reports for each Function: Located at target/criterion/VectorAddition_<size>/<function_name>/report/index.html (e.g., target/criterion/VectorAddition_30000/ndarray/report/index.html). These offer in-depth statistics, PDF plots, and regression analysis for a specific function at a particular size.
Summary of add Benchmark Results
The following tables summarize the performance (typical time per operation) for the add benchmark across different vector sizes and implementations, based on the Criterion report data generated on the Benchmarking Environment specified above. "Time" refers to the estimated time per iteration (Slope estimate from Criterion for linear sampling, or Mean for flat sampling).
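The two estimates mentioned above can be illustrated with a small sketch. It assumes that each sample records an iteration count and a total elapsed time, and that the slope is the least-squares fit of a line constrained to pass through the origin. The function names are hypothetical and this is not Criterion's own code.

```rust
/// Slope of the least-squares line through the origin:
/// sum(n_i * t_i) / sum(n_i^2), where n_i is the iteration count of
/// sample i and t_i its total elapsed time.
fn slope_estimate(iters: &[f64], times: &[f64]) -> f64 {
    let xy: f64 = iters.iter().zip(times).map(|(n, t)| n * t).sum();
    let xx: f64 = iters.iter().map(|n| n * n).sum();
    xy / xx
}

/// Mean of the per-iteration times t_i / n_i across all samples.
fn mean_estimate(iters: &[f64], times: &[f64]) -> f64 {
    let total: f64 = iters.iter().zip(times).map(|(n, t)| t / n).sum();
    total / iters.len() as f64
}
```

When every sample has the same per-iteration cost, both estimates agree. They diverge when the cost varies between samples, because the slope gives more weight to samples with larger iteration counts.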
Vector Size: 30,000 elements (VectorAddition/30000)
| Function | Time (Mean) |
|---|---|
| simd_add (simdly) | 6.1150 µs |
| scalar_add | 6.4880 µs |
| ndarray | 6.6200 µs |
| par_simd_add (avx2) | 76.642 µs |
Vector Size: 150,000 elements (VectorAddition/150000)
| Function | Time (Mean) |
|---|---|
| simd_add (simdly) | 32.194 µs |
| scalar_add | 35.082 µs |
| ndarray | 35.384 µs |
| par_simd_add (avx2) | 187.10 µs |
Vector Size: 1,048,576 elements (VectorAddition/1048576)
| Function | Time (Mean) |
|---|---|
| par_simd_add (avx2) | 550.38 µs |
| simd_add (simdly) | 669.06 µs |
| scalar_add | 867.40 µs |
| ndarray | 886.10 µs |
Vector Size: 1,073,741,824 elements (VectorAddition/1073741824)
(Note: ndarray results were not present in the provided benchmark files for this size. Time is based on Mean as sampling mode was Flat.)
| Function | Time (Mean) |
|---|---|
| par_simd_add (avx2) | 1.4298 s |
| scalar_add | 4.9750 s |
| simd_add (simdly) | 4.1620 s |
| ndarray | 4.4635 s |
Observations from Benchmarks:
- For smaller vector sizes (30k, 150k elements), the simd_add (simdly) implementation shows the best performance in terms of raw execution time per operation. The overhead of parallelism makes par_simd_add (avx2) about 12.5 times slower than simd_add (simdly) at 30k elements and about 5.8 times slower at 150k.
- ndarray and scalar_add (likely auto-vectorized by the compiler) perform competitively at smaller sizes.
- As the vector size increases significantly (1M elements), par_simd_add (avx2) starts to show its strength due to parallel execution, outperforming the purely sequential SIMD and scalar versions. simd_add (simdly) is still faster than scalar_add and ndarray.
- For very large vectors (1B elements), par_simd_add (avx2) is substantially faster, at roughly 2.9 times the speed of the next-best implementation. The sequential implementations finish within about 20% of each other in this run (simd_add (simdly) at 4.1620 s, ndarray at 4.4635 s, scalar_add at 4.9750 s), which might indicate memory bandwidth limitations or other architectural effects at extreme scales for purely sequential processing.
These results highlight the trade-offs between different optimization strategies (explicit SIMD, parallelism, library abstractions) and how their effectiveness can vary with input size and the specific hardware capabilities (like AVX2 support and number of cores).
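The table times can be converted into element throughput and relative speedups with a couple of one-line helpers. This is a small sketch for reading the numbers above. The function names are hypothetical, and the figures mentioned afterwards are derived from the reported times rather than measured separately.

```rust
/// Throughput in billions of elements per second for `elements`
/// processed in `seconds`.
fn throughput_gelems(elements: u64, seconds: f64) -> f64 {
    elements as f64 / seconds / 1e9
}

/// How many times faster the candidate is than the baseline.
/// Values below 1.0 mean the candidate is slower.
fn speedup(baseline_seconds: f64, candidate_seconds: f64) -> f64 {
    baseline_seconds / candidate_seconds
}
```

For instance, `throughput_gelems(30_000, 6.1150e-6)` gives roughly 4.9 billion elements per second for simd_add (simdly) at 30,000 elements, and `speedup(4.1620, 1.4298)` gives roughly 2.9 for par_simd_add (avx2) over simd_add (simdly) at the largest size.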
Dependencies
Simdly relies on several high-quality Rust crates:
Main Dependencies
- rayon (1.10.0): A data parallelism library for Rust, making it easy to convert sequential computations into parallel ones.
Development Dependencies
These dependencies are used for development tasks like benchmarking:
- criterion (0.6.0): A statistics-driven benchmarking framework.
- ndarray (0.16.1): An N-dimensional array type for Rust (an alternative to Vec<Vec<T>>), essential for numerical computing.
For a full list of all direct and transitive dependencies, please refer to the Cargo.lock file.
License
This project is licensed under the MIT License. Copyright (c) 2025 Mahdi Tantaoui.
See the LICENSE file for the full license text.
Contributing
Contributions are welcome! If you'd like to contribute, please feel free to:
- Fork the repository.
- Create a new branch for your feature or bug fix (e.g., feature/my-new-feature or fix/issue-tracker-bug).
- Make your changes. Ensure code is well-formatted using cargo fmt.
- Add or update relevant tests or benchmarks for your changes.
- Write clear and descriptive commit messages.
- Push your branch to your fork and open a Pull Request against the main repository.
Please consider opening an issue first to discuss any significant changes or new features you plan to implement.