# Simdly

🚀 A high-performance Rust library that leverages SIMD (Single Instruction, Multiple Data) instructions for fast vectorized computations. This library provides efficient implementations of mathematical operations using modern CPU features.
## ✨ Features
- 🚀 SIMD Optimized: Leverages AVX2 (256-bit) and NEON (128-bit) instructions for vector operations
- 🧠 Intelligent Algorithm Selection: Automatic choice between scalar, SIMD, and parallel algorithms based on data size
- 💾 Memory Efficient: Supports both aligned and unaligned memory access patterns with cache-aware chunking
- 🔧 Generic Traits: Provides consistent interfaces across different SIMD implementations
- 🛡️ Safe Abstractions: Wraps unsafe SIMD operations in safe, ergonomic APIs with robust error handling
- 🧮 Rich Math Library: Extensive mathematical functions (trig, exp, log, sqrt, etc.) with SIMD acceleration
- ⚡ Performance: Optimized thresholds prevent overhead while maximizing throughput gains
## 🏗️ Architecture Support

### Currently Supported

- x86/x86_64 with AVX2 (256-bit vectors)
- ARM/AArch64 with NEON (128-bit vectors)

### Planned Support

- SSE (128-bit vectors for older x86 processors)
## 📦 Installation

Add simdly to your `Cargo.toml`:

```toml
[dependencies]
simdly = "0.1.7"
```

For optimal performance, enable AVX2 support (e.g. in `.cargo/config.toml`):

```toml
[build]
rustflags = ["-C", "target-feature=+avx2"]
```
## 🚀 Quick Start

### Simple Vector Addition with Multiple Algorithms

```rust
// Import path assumed from the crate layout; the body of this example
// did not survive in the source.
use simdly::simd::SimdAdd;
```
### Working with SIMD Vectors Directly

```rust
// Import paths assumed; the body of this example did not survive in the source.
use simdly::simd::F32x8; // 8 × f32 lanes (AVX2, x86_64)
use simdly::simd::F32x4; // 4 × f32 lanes (NEON, aarch64)
```
### Working with Partial Data

```rust
// Import paths assumed; the body of this example did not survive in the source.
use simdly::simd::F32x8; // AVX2 (x86_64)
use simdly::simd::F32x4; // NEON (aarch64)
```
### Mathematical Operations

#### High-Level Mathematical Operations

```rust
// Import path assumed; the body of this example did not survive in the source.
use simdly::simd::SimdMath;
```
## 📊 Performance

simdly provides significant performance improvements for numerical computations, with multiple algorithm options:

### Algorithm Selection

The `SimdAdd` trait provides multiple algorithms that you can choose based on your data size:
| Array Size Range | Recommended Method | Algorithm | Rationale |
|---|---|---|---|
| < 128 elements | `scalar_add()` | Scalar | Avoids SIMD setup overhead |
| 128 - 262,143 elements | `simd_add()` | SIMD | Optimal vectorization benefits |
| ≥ 262,144 elements | `par_simd_add()` | Parallel SIMD | Memory bandwidth + multi-core scaling |
### Performance Characteristics

- Mathematical Operations: SIMD shows 4×-13× speedups for complex operations such as cosine
- Simple Operations: Intelligent thresholds prevent performance regression on small arrays
- Memory Hierarchy: Chunk sizes (16 KiB) are tuned for L1 cache efficiency
- Cross-Platform: Thresholds are tuned for both Intel AVX2 and ARM NEON architectures
### Benchmark Results (Addition)

Performance measurements on a modern x86-64 CPU with AVX2:

| Vector Size | Elements | Recommended Method | Performance Benefit |
|---|---|---|---|
| 512 B | 128 | `scalar_add()` | Baseline (no overhead) |
| 20 KiB | 5,000 | `simd_add()` | ~4-8× throughput |
| 1 MiB | 262,144 | `par_simd_add()` | ~4-8×, scaling with core count |
| 4 MiB | 1,048,576 | `par_simd_add()` | Memory-bandwidth limited |
### Mathematical Functions Performance

Complex mathematical operations benefit from SIMD across all sizes:

| Function | Array Size | SIMD Speedup | Notes |
|---|---|---|---|
| `cos()` | 4 KiB | 4.4× | Immediate benefit |
| `cos()` | 64 KiB | 11.7× | Peak efficiency |
| `cos()` | 1 MiB | 13.3× | Best performance |
| `cos()` | 128 MiB | 9.2× | Memory-bound |
### Key Features
- Manual Optimization: Choose the best algorithm for your specific use case
- Zero-Cost Abstraction: Direct method calls with no runtime overhead
- Memory Efficiency: Cache-aware chunking and aligned memory access
- Scalable Performance: Near-linear scaling with available CPU cores
### Compilation Flags

For maximum performance, compile with:

```shell
RUSTFLAGS="-C target-feature=+avx2" cargo build --release
```

Or add to your `Cargo.toml`:

```toml
[profile.release]
lto = "fat"
codegen-units = 1
```
## 🔧 Usage Examples

### Manual Algorithm Selection with SimdAdd

simdly provides multiple algorithms; for fine-grained control, you can select one manually based on your specific needs:

```rust
// Import path assumed; the body of this example did not survive in the source.
use simdly::simd::SimdAdd;
```
### Mathematical Operations with SIMD

```rust
// Import path assumed; the body of this example did not survive in the source.
use simdly::simd::SimdMath;
```
### Processing Large Arrays

```rust
// Import paths assumed; the body of this example did not survive in the source.
use simdly::simd::F32x8;
use simdly::simd::F32x4;
```
### Memory-Aligned Operations

```rust
// Import paths assumed; the body of this example did not survive in the source.
use simdly::simd::F32x8;
use simdly::simd::F32x4;
```
## 📚 Documentation
- 📖 API Documentation - Complete API reference
- 🚀 Getting Started Guide - Detailed usage examples and tutorials
- ⚡ Performance Tips - Optimization strategies and best practices
## 🛠️ Development

### Prerequisites

- Rust 1.77 or later
- An x86/x86_64 processor with AVX2, or an ARM/AArch64 processor with NEON
- Linux, macOS, or Windows

### Building

```shell
cargo build --release
```

### Testing

```shell
cargo test
```
### Performance Benchmarks

The crate includes comprehensive benchmarks showing real-world performance improvements:

```shell
# Run benchmarks to measure performance on your hardware
cargo bench

# View detailed benchmark reports (assuming Criterion's default output path)
open target/criterion/report/index.html
```

Key findings from benchmarks:

- Mathematical operations (`cos`, `sin`, `exp`, etc.) show significant SIMD acceleration
- Parallel methods automatically optimize based on array size using `PARALLEL_SIMD_THRESHOLD`
- Performance varies by CPU architecture; benchmarks show the actual improvements on your hardware
## 🤝 Contributing
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
### Areas for Contribution
- Additional SIMD instruction set support (SSE)
- Advanced mathematical operations implementation
- Performance optimizations and micro-benchmarks
- Documentation improvements and examples
- Testing coverage and edge case validation
- WebAssembly SIMD support
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments
- Built with Rust's excellent SIMD intrinsics
- Inspired by high-performance computing libraries
- Thanks to the Rust community for their valuable feedback
## 📈 Roadmap
- ARM NEON support for ARM/AArch64 - ✅ Complete with full mathematical operations
- Additional mathematical operations - ✅ Power, 2D/3D/4D hypotenuse, and more
- SSE support for older x86 processors
- Automatic SIMD instruction set detection
- WebAssembly SIMD support
- Additional mathematical functions (Bessel, gamma, etc.)
- Complex number SIMD operations
Made with ❤️ and ⚡ by Mahdi Tantaoui