simdly

🚀 A high-performance Rust library that leverages SIMD (Single Instruction, Multiple Data) instructions for fast vectorized computations. This library provides efficient implementations of mathematical operations using modern CPU features.

✨ Features

🚀 SIMD Optimized: Leverages AVX2 instructions for 256-bit vector operations
💾 Memory Efficient: Supports both aligned and unaligned memory access patterns
🔧 Generic Traits: Provides consistent interfaces across different SIMD implementations
🛡️ Safe Abstractions: Wraps unsafe SIMD operations in safe, ergonomic APIs
⚡ Performance: Optimized for high-throughput numerical computations

🏗️ Architecture Support

Currently Supported

x86/x86_64 with AVX2 (256-bit vectors)

Planned Support

SSE (128-bit vectors for older x86 processors)
ARM NEON (128-bit vectors for ARM/AArch64)

📦 Installation

Add simdly to your Cargo.toml:

[dependencies]
simdly = "0.1.3"

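For optimal performance, enable AVX2 support in your project's .cargo/config.toml (the [build] table is a Cargo configuration key, not a Cargo.toml key):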

[build]
rustflags = ["-C", "target-feature=+avx2"]
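A binary built with target-feature=+avx2 assumes AVX2 is present and can fail with an illegal-instruction error on CPUs that lack it. The sketch below shows one way to check at runtime before choosing a code path. It uses only the standard library; `avx2_available` and `select_kernel` are hypothetical helper names for illustration and are not part of simdly.

```rust
/// Returns true when the running CPU reports AVX2 support.
/// Hypothetical helper, not a simdly API.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn avx2_available() -> bool {
    // Standard-library macro that queries CPU features at runtime.
    std::arch::is_x86_feature_detected!("avx2")
}

/// On non-x86 targets AVX2 can never be available.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn avx2_available() -> bool {
    false
}

/// Picks a label for the code path to run (hypothetical helper).
pub fn select_kernel() -> &'static str {
    if avx2_available() {
        "avx2"
    } else {
        "scalar"
    }
}
```

An application could call `select_kernel()` once at startup and dispatch to a scalar fallback when it returns "scalar".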

🚀 Quick Start

use simdly::simd::avx2::f32x8::F32x8;
use simdly::simd::{SimdLoad, SimdStore};

fn main() {
    // Load 8 f32 values into SIMD vector
    let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
    let vec = F32x8::from_slice(&data);
    
    // Store results
    let mut output = [0.0f32; 8];
    unsafe {
        vec.store_unaligned_at(output.as_mut_ptr());
    }
    
    println!("Processed {} elements with SIMD", vec.size);
}

Working with Partial Data

use simdly::simd::avx2::f32x8::F32x8;
use simdly::simd::{SimdLoad, SimdStore};

// Handle arrays smaller than 8 elements
let data = [1.0, 2.0, 3.0]; // Only 3 elements
let vec = F32x8::from_slice(&data);

let mut output = [0.0f32; 8];
unsafe {
    vec.store_at_partial(output.as_mut_ptr());
}
// Only first 3 elements are written
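As a mental model for the partial load and store above, the following scalar sketch pads a short slice to eight lanes and writes back only the valid ones. It is an assumption-laden illustration, not simdly's implementation: the names `load_partial` and `store_partial` are hypothetical, and filling unused lanes with zero is an assumption made for this example only.

```rust
/// Number of f32 lanes in a 256-bit vector.
const LANES: usize = 8;

/// Copies up to 8 values from `src` into a lane array and returns
/// the lanes plus the count of valid lanes. Unused lanes are zero
/// (an assumption made for this sketch).
pub fn load_partial(src: &[f32]) -> ([f32; LANES], usize) {
    let len = src.len().min(LANES);
    let mut lanes = [0.0f32; LANES];
    lanes[..len].copy_from_slice(&src[..len]);
    (lanes, len)
}

/// Writes only the first `len` lanes into `dst`; the rest of `dst`
/// is left untouched.
pub fn store_partial(lanes: &[f32; LANES], len: usize, dst: &mut [f32]) {
    let n = len.min(LANES).min(dst.len());
    dst[..n].copy_from_slice(&lanes[..n]);
}
```

The key property is that elements past the valid count in the destination keep their previous values.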

📊 Performance

simdly can provide significant performance improvements for numerical computations:

Up to 8x higher throughput for f32 operations, since each 256-bit AVX2 vector processes eight f32 values per instruction
Memory bandwidth optimization through aligned memory access
Cache-friendly processing patterns

Compilation Flags

For maximum performance, compile with:

RUSTFLAGS="-C target-feature=+avx2" cargo build --release

In addition, enable link-time optimization in your Cargo.toml:

[profile.release]
lto = "fat"
codegen-units = 1

🔧 Usage Examples

Processing Large Arrays

use simdly::simd::avx2::f32x8::F32x8;
use simdly::simd::{SimdLoad, SimdStore};

fn process_array(input: &[f32]) -> Vec<f32> {
    let mut output = vec![0.0; input.len()];
    
    // Process full chunks of 8 elements
    for (i, chunk) in input.chunks_exact(8).enumerate() {
        let vec = F32x8::from_slice(chunk);
        
        // Your SIMD operations here...
        
        unsafe {
            vec.store_unaligned_at(output[i * 8..].as_mut_ptr());
        }
    }
    
    // Handle remaining elements
    let remainder_start = (input.len() / 8) * 8;
    if remainder_start < input.len() {
        let vec = F32x8::from_slice(&input[remainder_start..]);
        
        unsafe {
            vec.store_at_partial(output[remainder_start..].as_mut_ptr());
        }
    }
    
    output
}
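To make the "Your SIMD operations here" placeholder concrete, here is a sketch of the same chunk-plus-remainder pattern for element-wise addition. It deliberately uses the standard library's `std::arch` intrinsics rather than simdly's API, whose arithmetic operations are not shown in this README. The function names `add_arrays`, `add_avx2`, and `add_scalar` are hypothetical.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{_mm256_add_ps, _mm256_loadu_ps, _mm256_storeu_ps};

/// Scalar reference: out[i] = a[i] + b[i].
fn add_scalar(a: &[f32], b: &[f32], out: &mut [f32]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x + y;
    }
}

/// Vector path. These f32 intrinsics only need AVX, which AVX2 implies.
/// Caller must ensure AVX2 is available and all slices have equal length.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn add_avx2(a: &[f32], b: &[f32], out: &mut [f32]) {
    let full = a.len() / 8 * 8; // elements covered by whole 8-lane chunks
    let mut i = 0;
    while i < full {
        // SAFETY: i + 8 <= full <= len for all three slices.
        unsafe {
            let va = _mm256_loadu_ps(a.as_ptr().add(i));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i));
            _mm256_storeu_ps(out.as_mut_ptr().add(i), _mm256_add_ps(va, vb));
        }
        i += 8;
    }
    // Remaining 0..=7 elements are handled with scalar code.
    add_scalar(&a[full..], &b[full..], &mut out[full..]);
}

/// Adds two equal-length arrays, using AVX2 when the CPU supports it.
pub fn add_arrays(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    let mut out = vec![0.0f32; a.len()];
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 was detected and the lengths match.
            unsafe { add_avx2(a, b, &mut out) };
            return out;
        }
    }
    add_scalar(a, b, &mut out);
    out
}
```

Handling the tail with scalar code is one design choice; a partial (masked) store, as in the example above, is another.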

Memory-Aligned Operations

use simdly::simd::avx2::f32x8::F32x8;
use simdly::simd::{Alignment, SimdLoad, SimdStore};
use std::alloc::{alloc, dealloc, Layout};

// Allocate 32-byte aligned memory for optimal performance
let layout = Layout::from_size_align(8 * std::mem::size_of::<f32>(), 32).unwrap();
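let aligned_ptr = unsafe { alloc(layout) as *mut f32 };

// alloc returns a null pointer if the allocation fails
assert!(!aligned_ptr.is_null(), "allocation failed");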

// Verify alignment
assert!(F32x8::is_aligned(aligned_ptr));

// Use aligned operations for best performance
let data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
unsafe {
    std::ptr::copy_nonoverlapping(data.as_ptr(), aligned_ptr, 8);
    
    let vec = F32x8::load_aligned(aligned_ptr);
    vec.store_aligned_at(aligned_ptr);
}

// Clean up
unsafe { dealloc(aligned_ptr as *mut u8, layout) };
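Manual allocation is not the only way to obtain 32-byte aligned storage. As a sketch, a wrapper type with `#[repr(align(32))]` lets the compiler guarantee alignment and avoids the explicit `alloc`/`dealloc` pair. `Aligned32` and `is_aligned_to_32` are hypothetical names for this example, not simdly types or functions.

```rust
/// Eight f32 values forced onto a 32-byte boundary.
/// Hypothetical wrapper, not part of simdly.
#[repr(C, align(32))]
pub struct Aligned32(pub [f32; 8]);

/// Checks whether a pointer sits on a 32-byte boundary.
pub fn is_aligned_to_32(ptr: *const f32) -> bool {
    (ptr as usize) % 32 == 0
}
```

A value of this type, whether on the stack or in a `Box`, yields a pointer via `buf.0.as_ptr()` that should meet the 32-byte requirement the aligned load and store above assume, and it is freed automatically when it goes out of scope.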

📚 Documentation

📖 API Documentation - Complete API reference
🚀 Getting Started Guide - Detailed usage examples and tutorials
⚡ Performance Tips - Optimization strategies and best practices

🛠️ Development

Prerequisites

Rust 1.77 or later
x86/x86_64 processor with AVX2 support
Linux, macOS, or Windows

Building

git clone https://github.com/mtantaoui/simdly.git
cd simdly
cargo build --release

Testing

cargo test

Benchmarking

cargo bench

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

Areas for Contribution

Additional SIMD instruction set support (SSE, ARM NEON)
Mathematical operations implementation
Performance optimizations
Documentation improvements
Testing and benchmarks

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with Rust's excellent SIMD intrinsics
Inspired by high-performance computing libraries
Thanks to the Rust community for their valuable feedback

📈 Roadmap

SSE support for older x86 processors
ARM NEON support for ARM/AArch64
Additional mathematical operations
Automatic SIMD instruction set detection
WebAssembly SIMD support

Made with ❤️ and ⚡ by Mahdi Tantaoui

simdly 0.1.5