
simdly

🚀 A high-performance Rust library that leverages SIMD (Single Instruction, Multiple Data) instructions for fast vectorized computations. This library provides efficient implementations of mathematical operations using modern CPU features.


✨ Features

  • 🚀 SIMD Optimized: Leverages AVX2 instructions for 256-bit vector operations
  • 💾 Memory Efficient: Supports both aligned and unaligned memory access patterns
  • 🔧 Generic Traits: Provides consistent interfaces across different SIMD implementations
  • 🛡️ Safe Abstractions: Wraps unsafe SIMD operations in safe, ergonomic APIs
  • ⚡ Performance: Optimized for high-throughput numerical computations

🏗️ Architecture Support

Currently Supported

  • x86/x86_64 with AVX2 (256-bit vectors)

Planned Support

  • SSE (128-bit vectors for older x86 processors)
  • ARM NEON (128-bit vectors for ARM/AArch64)
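
Until automatic instruction-set detection lands (see the roadmap), callers can gate SIMD paths at runtime themselves with the standard library's feature-detection macro. A minimal sketch, independent of simdly's API:

```rust
fn main() {
    // Runtime CPU feature detection is only available on x86/x86_64;
    // on other architectures this block compiles away entirely.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx2") {
            println!("AVX2 available: take the 256-bit SIMD path");
        } else {
            println!("AVX2 unavailable: fall back to scalar code");
        }
    }
}
```

Because the check happens at runtime, a single binary can serve both AVX2 and non-AVX2 machines.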

📦 Installation

Add simdly to your Cargo.toml:

[dependencies]
simdly = "0.1.6"

For optimal performance, enable AVX2 in your Cargo build configuration (.cargo/config.toml — note this section does not belong in Cargo.toml):

[build]
rustflags = ["-C", "target-feature=+avx2"]

🚀 Quick Start

use simdly::simd::avx2::f32x8::F32x8;
use simdly::simd::{SimdLoad, SimdStore};

fn main() {
    // Load 8 f32 values into SIMD vector
    let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
    let vec = F32x8::from_slice(&data);
    
    // Store results
    let mut output = [0.0f32; 8];
    unsafe {
        vec.store_unaligned_at(output.as_mut_ptr());
    }
    
    println!("Processed {} elements with SIMD", vec.size);
}

Working with Partial Data

use simdly::simd::avx2::f32x8::F32x8;
use simdly::simd::{SimdLoad, SimdStore};

// Handle arrays smaller than 8 elements
let data = [1.0, 2.0, 3.0]; // Only 3 elements
let vec = F32x8::from_slice(&data);

let mut output = [0.0f32; 8];
unsafe {
    vec.store_at_partial(output.as_mut_ptr());
}
// Only first 3 elements are written

📊 Performance

simdly can provide significant performance improvements for numerical computations:

  • Up to 8x speedup on f32 workloads, since each 256-bit AVX2 vector processes eight f32 lanes per instruction
  • Better memory bandwidth utilization through aligned loads and stores
  • Cache-friendly, chunked processing patterns

Compilation Flags

For maximum performance, compile with:

RUSTFLAGS="-C target-feature=+avx2" cargo build --release

For additional gains, enable link-time optimization in your Cargo.toml:

[profile.release]
lto = "fat"
codegen-units = 1

🔧 Usage Examples

Processing Large Arrays

use simdly::simd::avx2::f32x8::F32x8;
use simdly::simd::{SimdLoad, SimdStore};

fn process_array(input: &[f32]) -> Vec<f32> {
    let mut output = vec![0.0; input.len()];
    
    // Process full chunks of 8 elements
    for (i, chunk) in input.chunks_exact(8).enumerate() {
        let vec = F32x8::from_slice(chunk);
        
        // Your SIMD operations here...
        
        unsafe {
            vec.store_unaligned_at(output[i * 8..].as_mut_ptr());
        }
    }
    
    // Handle remaining elements
    let remainder_start = (input.len() / 8) * 8;
    if remainder_start < input.len() {
        let vec = F32x8::from_slice(&input[remainder_start..]);
        
        unsafe {
            vec.store_at_partial(output[remainder_start..].as_mut_ptr());
        }
    }
    
    output
}
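
The placeholder comment above ("Your SIMD operations here...") can be filled with any vectorized kernel. As an illustration of the same full-chunks-plus-scalar-tail structure, here is element-wise addition written directly against std::arch AVX2 intrinsics — a hedged sketch independent of simdly's own API (the function names add_arrays and add_full_chunks_avx2 are ours):

```rust
// Process as many full 8-lane chunks as possible with AVX2,
// returning the number of elements handled.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_full_chunks_avx2(a: &[f32], b: &[f32], out: &mut [f32]) -> usize {
    use std::arch::x86_64::*;
    let mut i = 0;
    while i + 8 <= a.len() {
        let va = _mm256_loadu_ps(a.as_ptr().add(i));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i));
        _mm256_storeu_ps(out.as_mut_ptr().add(i), _mm256_add_ps(va, vb));
        i += 8;
    }
    i
}

fn add_arrays(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), out.len());
    let mut i = 0;
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") {
        // Safety: AVX2 support was verified at runtime above.
        i = unsafe { add_full_chunks_avx2(a, b, out) };
    }
    // Scalar tail — also the full fallback when AVX2 is unavailable.
    while i < a.len() {
        out[i] = a[i] + b[i];
        i += 1;
    }
}

fn main() {
    let a = [1.0f32; 11];
    let b = [2.0f32; 11];
    let mut out = [0.0f32; 11];
    add_arrays(&a, &b, &mut out);
    assert!(out.iter().all(|&x| x == 3.0));
}
```

With 11 elements, one full chunk of 8 goes through the AVX2 path and the remaining 3 through the scalar tail — the same split simdly's store_at_partial is designed to handle.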

Memory-Aligned Operations

use simdly::simd::avx2::f32x8::F32x8;
use simdly::simd::{Alignment, SimdLoad, SimdStore};
use std::alloc::{alloc, dealloc, Layout};

// Allocate 32-byte aligned memory for optimal performance
let layout = Layout::from_size_align(8 * std::mem::size_of::<f32>(), 32).unwrap();
let aligned_ptr = unsafe { alloc(layout) as *mut f32 };
assert!(!aligned_ptr.is_null(), "allocation failed");

// Verify alignment
assert!(F32x8::is_aligned(aligned_ptr));

// Use aligned operations for best performance
let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
unsafe {
    std::ptr::copy_nonoverlapping(data.as_ptr(), aligned_ptr, 8);
    
    let vec = F32x8::load_aligned(aligned_ptr);
    vec.store_aligned_at(aligned_ptr);
}

// Clean up
unsafe { dealloc(aligned_ptr as *mut u8, layout) };
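
For reference, 32-byte alignment can also be checked by hand: a pointer is aligned when its address is a multiple of the alignment. This stdlib-only sketch mirrors what an is_aligned check does (the helper name is ours, not simdly's):

```rust
use std::alloc::{alloc, dealloc, Layout};

// A pointer is 32-byte aligned iff its address is divisible by 32.
fn is_32_byte_aligned<T>(ptr: *const T) -> bool {
    (ptr as usize) % 32 == 0
}

fn main() {
    // Layout::from_size_align guarantees the requested alignment.
    let layout = Layout::from_size_align(8 * std::mem::size_of::<f32>(), 32).unwrap();
    let ptr = unsafe { alloc(layout) as *mut f32 };
    assert!(!ptr.is_null(), "allocation failed");
    assert!(is_32_byte_aligned(ptr));
    unsafe { dealloc(ptr as *mut u8, layout) };
}
```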

📚 Documentation

Full API documentation is available on docs.rs: https://docs.rs/simdly

🛠️ Development

Prerequisites

  • Rust 1.77 or later
  • x86/x86_64 processor with AVX2 support
  • Linux, macOS, or Windows

Building

git clone https://github.com/mtantaoui/simdly.git
cd simdly
cargo build --release

Testing

cargo test

Benchmarking

cargo bench

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Areas for Contribution

  • Additional SIMD instruction set support (SSE, ARM NEON)
  • Mathematical operations implementation
  • Performance optimizations
  • Documentation improvements
  • Testing and benchmarks

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Built with Rust's excellent SIMD intrinsics
  • Inspired by high-performance computing libraries
  • Thanks to the Rust community for their valuable feedback

📈 Roadmap

  • SSE support for older x86 processors
  • ARM NEON support for ARM/AArch64
  • Additional mathematical operations
  • Automatic SIMD instruction set detection
  • WebAssembly SIMD support

Made with ❤️ and ⚡ by Mahdi Tantaoui