simdly 0.1.7

🚀 High-performance Rust library leveraging SIMD and Rayon for fast computations.
Documentation

Simdly

🚀 A high-performance Rust library that leverages SIMD (Single Instruction, Multiple Data) instructions for fast vectorized computations. This library provides efficient implementations of mathematical operations using modern CPU features.

Crates.io Documentation License: MIT Rust

✨ Features

  • 🚀 SIMD Optimized: Leverages AVX2 (256-bit) and NEON (128-bit) instructions for vector operations
  • 🧠 Intelligent Algorithm Selection: Automatic choice between scalar, SIMD, and parallel algorithms based on data size
  • 💾 Memory Efficient: Supports both aligned and unaligned memory access patterns with cache-aware chunking
  • 🔧 Generic Traits: Provides consistent interfaces across different SIMD implementations
  • 🛡️ Safe Abstractions: Wraps unsafe SIMD operations in safe, ergonomic APIs with robust error handling
  • 🧮 Rich Math Library: Extensive mathematical functions (trig, exp, log, sqrt, etc.) with SIMD acceleration
  • ⚡ Performance: Optimized thresholds prevent overhead while maximizing throughput gains

🏗️ Architecture Support

Currently Supported

  • x86/x86_64 with AVX2 (256-bit vectors)
  • ARM/AArch64 with NEON (128-bit vectors)

Planned Support

  • SSE (128-bit vectors for older x86 processors)

📦 Installation

Add simdly to your Cargo.toml:

[dependencies]
simdly = "0.1.7"

For optimal performance, enable AVX2 support:

[build]
rustflags = ["-C", "target-feature=+avx2"]

🚀 Quick Start

Simple Vector Addition with Multiple Algorithms

use simdly::SimdAdd;

fn main() {
    // Create two vectors
    let a = vec![1.0, 2.0, 3.0, 4.0, 5.0];
    let b = vec![2.0, 3.0, 4.0, 5.0, 6.0];
    
    // Choose the appropriate algorithm based on your needs:
    
    // For small arrays (< 128 elements)
    let result = a.as_slice().scalar_add(b.as_slice());
    
    // For medium arrays (128+ elements) - uses SIMD
    let result = a.as_slice().simd_add(b.as_slice());
    
    // For large arrays (262,144+ elements) - uses parallel SIMD
    let result = a.as_slice().par_simd_add(b.as_slice());
    
    println!("Result: {:?}", result); // [3.0, 5.0, 7.0, 9.0, 11.0]
}

Working with SIMD Vectors Directly

#[cfg(target_arch = "x86_64")]
use simdly::simd::avx2::f32x8::F32x8;
#[cfg(target_arch = "aarch64")]
use simdly::simd::neon::f32x4::F32x4;
use simdly::simd::{SimdLoad, SimdStore};

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        // Load 8 f32 values into AVX2 SIMD vector
        let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
        let vec = F32x8::from(&data[..]);
        
        // Store results using platform-appropriate method
        let mut output = [0.0f32; 8];
        unsafe {
            vec.store_at(output.as_mut_ptr());
        }
        
        println!("Processed {} elements with AVX2 SIMD", vec.size);
    }
    
    #[cfg(target_arch = "aarch64")]
    {
        // Load 4 f32 values into NEON SIMD vector
        let data = [1.0, 2.0, 3.0, 4.0];
        let vec = F32x4::from(&data[..]);
        
        // Store results
        let mut output = [0.0f32; 4];
        unsafe {
            vec.store_at(output.as_mut_ptr());
        }
        
        println!("Processed {} elements with NEON SIMD", vec.size);
    }
}

Working with Partial Data

#[cfg(target_arch = "x86_64")]
use simdly::simd::avx2::f32x8::F32x8;
#[cfg(target_arch = "aarch64")]
use simdly::simd::neon::f32x4::F32x4;
use simdly::simd::{SimdLoad, SimdStore};

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        // Handle arrays smaller than 8 elements
        let data = [1.0, 2.0, 3.0]; // Only 3 elements
        let vec = F32x8::from(&data[..]);

        let mut output = [0.0f32; 8];
        unsafe {
            vec.store_at_partial(output.as_mut_ptr());
        }
        // Only first 3 elements are written
        println!("Partial AVX2: {:?}", &output[..3]);
    }
    
    #[cfg(target_arch = "aarch64")]
    {
        // Handle arrays smaller than 4 elements
        let data = [1.0, 2.0]; // Only 2 elements
        let vec = F32x4::from(&data[..]);

        let mut output = [0.0f32; 4];
        unsafe {
            vec.store_at_partial(output.as_mut_ptr());
        }
        // Only first 2 elements are written
        println!("Partial NEON: {:?}", &output[..2]);
    }
}

Mathematical Operations

#[cfg(target_arch = "x86_64")]
{
    use simdly::simd::avx2::math::{_mm256_sin_ps, _mm256_hypot_ps};
    use std::arch::x86_64::_mm256_set1_ps;

    // 8 parallel sine calculations
    let input = _mm256_set1_ps(1.0);
    let result = unsafe { _mm256_sin_ps(input) };

    // 2D Euclidean distance for 8 point pairs
    let x = _mm256_set1_ps(3.0);
    let y = _mm256_set1_ps(4.0);
    let distance = unsafe { _mm256_hypot_ps(x, y) }; // sqrt(3² + 4²) = 5.0
}

High-Level Mathematical Operations

use simdly::simd::SimdMath;

fn main() {
    let data = vec![1.0, 2.0, 3.0, 4.0];
    
    // All mathematical operations use SIMD automatically
    let cosines = data.cos();              // Vectorized cosine
    let sines = data.sin();                // Vectorized sine
    let exponentials = data.exp();         // Vectorized exponential
    let square_roots = data.sqrt();        // Vectorized square root
    
    // Power and distance operations
    let base = vec![2.0, 3.0, 4.0, 5.0];
    let exp = vec![2.0, 2.0, 2.0, 2.0];
    let powers = base.pow(exp);            // Powers: [4.0, 9.0, 16.0, 25.0]
    
    let x = vec![3.0, 5.0, 8.0, 7.0];
    let y = vec![4.0, 12.0, 15.0, 24.0];
    let distances = x.hypot(y);            // 2D distances: [5.0, 13.0, 17.0, 25.0]
    
    println!("Results computed with SIMD acceleration!");
}

📊 Performance

simdly provides significant performance improvements for numerical computations with multiple algorithm options:

Algorithm Selection

The SimdAdd trait provides multiple algorithms that you can choose based on your data size:

Array Size Range Recommended Method Algorithm Rationale
< 128 elements scalar_add() Scalar Avoids SIMD setup overhead
128 - 262,143 elements simd_add() SIMD Optimal vectorization benefits
≥ 262,144 elements par_simd_add() Parallel SIMD Memory bandwidth + multi-core scaling

Performance Characteristics

  • Mathematical Operations: SIMD shows 4x-13x speedup for complex operations like cosine
  • Simple Operations: Intelligent thresholds prevent performance regression on small arrays
  • Memory Hierarchy: Optimized chunk sizes (16 KiB) for L1 cache efficiency
  • Cross-Platform: Thresholds work optimally on Intel AVX2 and ARM NEON architectures

Benchmark Results (Addition)

Performance measurements on modern x64 with AVX2:

Vector Size Elements Recommended Method Performance Benefit
512 B 128 scalar_add() Baseline (no overhead)
20 KiB 5,000 simd_add() ~4-8x throughput
1 MiB 262,144 par_simd_add() ~4-8x × cores
4 MiB 1,048,576 par_simd_add() Memory bandwidth limited

Mathematical Functions Performance

Complex mathematical operations benefit from SIMD across all sizes:

Function Array Size SIMD Speedup Notes
cos() 4 KiB 4.4x Immediate benefit
cos() 64 KiB 11.7x Peak efficiency
cos() 1 MiB 13.3x Best performance
cos() 128 MiB 9.2x Memory-bound

Key Features

  • Manual Optimization: Choose the best algorithm for your specific use case
  • Zero-Cost Abstraction: Direct method calls with no runtime overhead
  • Memory Efficiency: Cache-aware chunking and aligned memory access
  • Scalable Performance: Near-linear scaling with available CPU cores

Compilation Flags

For maximum performance, compile with:

RUSTFLAGS="-C target-feature=+avx2" cargo build --release

Or add to your Cargo.toml:

[profile.release]
lto = "fat"
codegen-units = 1

🔧 Usage Examples

Manual Algorithm Selection with SimdAdd

simdly provides multiple algorithms that you can choose based on your specific needs:

use simdly::SimdAdd;

fn main() {
    // Small arrays (< 128 elements) - use scalar addition
    let small_a = vec![1.0; 100];
    let small_b = vec![2.0; 100];
    let result = small_a.as_slice().scalar_add(small_b.as_slice());
    
    // Medium arrays (128 - 262,143 elements) - use SIMD
    let medium_a = vec![1.0; 5_000];
    let medium_b = vec![2.0; 5_000];
    let result = medium_a.as_slice().simd_add(medium_b.as_slice());
    
    // Large arrays (≥ 262,144 elements) - use parallel SIMD
    let large_a = vec![1.0; 300_000];
    let large_b = vec![2.0; 300_000];
    let result = large_a.as_slice().par_simd_add(large_b.as_slice());
}

Manual Algorithm Selection

For fine-grained control, you can manually select the algorithm:

use simdly::SimdAdd;

fn main() {
    let a = vec![1.0; 10_000];
    let b = vec![2.0; 10_000];
    
    // Force scalar addition
    let scalar_result = a.as_slice().scalar_add(b.as_slice());
    
    // Force SIMD addition
    let simd_result = a.as_slice().simd_add(b.as_slice());
    
    // Force parallel SIMD addition
    let parallel_result = a.as_slice().par_simd_add(b.as_slice());
}

Mathematical Operations with SIMD

use simdly::simd::SimdMath;

fn main() {
    // Vectorized cosine computation
    let angles = vec![0.0, std::f32::consts::PI / 4.0, std::f32::consts::PI / 2.0];
    let cosines = angles.as_slice().cos(); // Uses SIMD automatically
    
    println!("cos(0) = {}", cosines[0]);        // ≈ 1.0
    println!("cos(π/4) = {}", cosines[1]);      // ≈ 0.707
    println!("cos(π/2) = {}", cosines[2]);      // ≈ 0.0
}

Processing Large Arrays

#[cfg(target_arch = "x86_64")]
use simdly::simd::avx2::f32x8::F32x8;
#[cfg(target_arch = "aarch64")]
use simdly::simd::neon::f32x4::F32x4;
use simdly::simd::{SimdLoad, SimdStore, SimdMath};

fn process_array(input: &[f32]) -> Vec<f32> {
    // For real applications, use high-level SIMD operations
    input.cos() // Vectorized cosine computation
}

#[cfg(target_arch = "x86_64")]
fn manual_avx2_processing(input: &[f32]) -> Vec<f32> {
    let mut output = vec![0.0; input.len()];
    
    // Process full chunks of 8 elements
    for (i, chunk) in input.chunks_exact(8).enumerate() {
        let vec = F32x8::from(chunk);
        
        // Example: compute cosine using SIMD
        let result = vec.cos();
        
        unsafe {
            result.store_at(output[i * 8..].as_mut_ptr());
        }
    }
    
    // Handle remaining elements
    let remainder_start = (input.len() / 8) * 8;
    if remainder_start < input.len() {
        let vec = F32x8::from(&input[remainder_start..]);
        let result = vec.cos();
        
        unsafe {
            result.store_at_partial(output[remainder_start..].as_mut_ptr());
        }
    }
    
    output
}

#[cfg(target_arch = "aarch64")]
fn manual_neon_processing(input: &[f32]) -> Vec<f32> {
    let mut output = vec![0.0; input.len()];
    
    // Process full chunks of 4 elements  
    for (i, chunk) in input.chunks_exact(4).enumerate() {
        let vec = F32x4::from(chunk);
        
        // Example: compute cosine using SIMD
        let result = vec.cos();
        
        unsafe {
            result.store_at(output[i * 4..].as_mut_ptr());
        }
    }
    
    // Handle remaining elements
    let remainder_start = (input.len() / 4) * 4;
    if remainder_start < input.len() {
        let vec = F32x4::from(&input[remainder_start..]);
        let result = vec.cos();
        
        unsafe {
            result.store_at_partial(output[remainder_start..].as_mut_ptr());
        }
    }
    
    output
}

Memory-Aligned Operations

#[cfg(target_arch = "x86_64")]
use simdly::simd::avx2::f32x8::F32x8;
#[cfg(target_arch = "aarch64")]
use simdly::simd::neon::f32x4::F32x4;
use simdly::simd::{Alignment, SimdLoad, SimdStore};
use std::alloc::{alloc, dealloc, Layout};

fn main() {
    #[cfg(target_arch = "x86_64")]
    {
        // Allocate 32-byte aligned memory for AVX2
        let layout = Layout::from_size_align(8 * std::mem::size_of::<f32>(), 32).unwrap();
        let aligned_ptr = unsafe { alloc(layout) as *mut f32 };

        // Verify alignment
        assert!(F32x8::is_aligned(aligned_ptr));

        // Use standard load/store (AVX2 handles alignment automatically)
        let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
        unsafe {
            std::ptr::copy_nonoverlapping(data.as_ptr(), aligned_ptr, 8);
            
            let vec = F32x8::from(std::slice::from_raw_parts(aligned_ptr, 8));
            vec.store_at(aligned_ptr);
        }

        // Clean up
        unsafe { dealloc(aligned_ptr as *mut u8, layout) };
    }
    
    #[cfg(target_arch = "aarch64")]
    {
        // NEON doesn't require special alignment handling
        let data = [1.0, 2.0, 3.0, 4.0];
        let vec = F32x4::from(&data[..]);
        
        let mut output = [0.0f32; 4];
        unsafe {
            vec.store_at(output.as_mut_ptr());
        }
        
        println!("NEON handles alignment automatically");
    }
}

📚 Documentation

🛠️ Development

Prerequisites

  • Rust 1.77 or later
  • x86/x86_64 processor with AVX2 support
  • Linux, macOS, or Windows

Building

git clone https://github.com/mtantaoui/simdly.git
cd simdly
cargo build --release

Testing

cargo test

Performance Benchmarks

The crate includes comprehensive benchmarks showing real-world performance improvements:

# Run benchmarks to measure performance on your hardware
cargo bench

# View detailed benchmark reports
open target/criterion/report/index.html

Key Findings from Benchmarks:

  • Mathematical operations (cos, sin, exp, etc.) show significant SIMD acceleration
  • Parallel methods automatically optimize based on array size using PARALLEL_SIMD_THRESHOLD
  • Performance varies by CPU architecture - benchmarks show actual improvements on your hardware

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Areas for Contribution

  • Additional SIMD instruction set support (SSE)
  • Advanced mathematical operations implementation
  • Performance optimizations and micro-benchmarks
  • Documentation improvements and examples
  • Testing coverage and edge case validation
  • WebAssembly SIMD support

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built with Rust's excellent SIMD intrinsics
  • Inspired by high-performance computing libraries
  • Thanks to the Rust community for their valuable feedback

📈 Roadmap

  • ARM NEON support for ARM/AArch64 - ✅ Complete with full mathematical operations
  • Additional mathematical operations - ✅ Power, 2D/3D/4D hypotenuse, and more
  • SSE support for older x86 processors
  • Automatic SIMD instruction set detection
  • WebAssembly SIMD support
  • Additional mathematical functions (bessel, gamma, etc.)
  • Complex number SIMD operations

Made with ❤️ and ⚡ by Mahdi Tantaoui