avx-parallel
A zero-dependency parallel computation library for Rust with true parallel execution and advanced performance features.
Documentation
- Quick Start - Get started in 5 minutes
- API Documentation - Full API reference
- Optimization Guide - Performance tuning tips
- Contributing - How to contribute
- Changelog - Version history
Features
Core Features
- True Parallel Execution: Real multi-threaded processing using std::thread::scope
- Zero Dependencies: Only uses the Rust standard library (std::thread, std::sync)
- Thread Safe: All operations use proper synchronization primitives
- Order Preservation: Results maintain original element order
- Smart Optimization: Automatically falls back to sequential for small datasets
- Rich API: Familiar iterator-style methods
Advanced Features (v0.3.0)
- Work Stealing Scheduler: Dynamic load balancing across threads
- SIMD Operations: Optimized vectorized operations for numeric types
- Advanced Configuration: Customize thread pools, chunk sizes, and more
- Parallel Sorting: High-performance merge sort with custom comparators
- Element-wise Operations: Zip, chunk, and partition with parallel execution
Revolutionary Features (v0.4.0)
- Lock-Free Operations: Zero-contention atomic algorithms
- Pipeline Processing: Functional composition with MapReduce patterns
- Adaptive Execution: Self-optimizing algorithms that learn optimal parameters
- Memory-Efficient: Zero-copy operations and in-place transformations
Quick Start
Add to your Cargo.toml:

```toml
[dependencies]
avx-parallel = "0.4.0"
```

Basic Usage

```rust
use avx_parallel::*;

// A minimal illustrative example; exact signatures may differ, see the API docs.
let data = vec![1, 2, 3, 4, 5];
let doubled = parallel_map(&data, |&x| x * 2);
```
Available Operations
Basic Operations
- `map` - Transform each element
- `filter` - Keep elements matching predicate
- `cloned` - Clone elements (for reference iterators)
Aggregation
- `sum` - Sum all elements
- `reduce` - Reduce with custom operation
- `fold` - Fold with identity and operation
- `count` - Count elements matching predicate
Search
- `find_any` - Find any element matching predicate
- `all` - Check if all elements match
- `any` - Check if any element matches
Advanced Operations (v0.2.0+)
- `parallel_sort` - Parallel merge sort
- `parallel_sort_by` - Sort with custom comparator
- `parallel_zip` - Combine two slices element-wise
- `parallel_chunks` - Process data in fixed-size chunks
- `partition` - Split into two vectors based on predicate
Work Stealing & SIMD (v0.3.0)
- `work_stealing_map` - Map with dynamic load balancing
- `WorkStealingPool` - Thread pool with work stealing
- `simd_sum_*` - SIMD-accelerated sum operations
- `simd_dot_*` - SIMD dot product
- `ThreadPoolConfig` - Advanced thread pool configuration
Lock-Free & Adaptive (v0.4.0)
- `lockfree_count` - Atomic-based counting without locks
- `lockfree_any` / `lockfree_all` - Lock-free search with early exit
- `AdaptiveExecutor` - Learning executor that optimizes chunk sizes
- `speculative_execute` - Auto-select parallel vs sequential
- `cache_aware_map` - Cache-line optimized transformations
- `parallel_transform_inplace` - Zero-allocation transformations
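To illustrate the idea behind the lock-free helpers, here is a minimal sketch using only std atomics. `atomic_count_matching` is a hypothetical stand-in for this README, not the crate's implementation: each worker counts locally and publishes one atomic add, so no Mutex is ever taken.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Sketch of lock-free counting: count per chunk locally, then do a single
// relaxed atomic add per chunk so contention stays minimal.
fn atomic_count_matching(data: &[i32], pred: fn(&i32) -> bool) -> usize {
    let counter = AtomicUsize::new(0);
    thread::scope(|s| {
        for chunk in data.chunks(1024) {
            let counter = &counter;
            s.spawn(move || {
                let local = chunk.iter().filter(|&x| pred(x)).count();
                counter.fetch_add(local, Ordering::Relaxed);
            });
        }
    });
    counter.load(Ordering::Relaxed)
}
```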
Performance
The library automatically:
- Detects CPU core count (default: all available cores)
- Distributes work efficiently with configurable chunk sizes (default: 1024)
- Falls back to sequential execution for small datasets
- Maintains result order with indexed chunks
- Uses work stealing for dynamic load balancing
- NEW: Adapts chunk sizes based on workload characteristics
- NEW: Zero-lock algorithms for maximum concurrency
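The small-dataset fallback and chunked distribution above can be sketched with the standard library alone. `map_square` is illustrative, not the crate's code, though the 1024 threshold matches the default quoted later in this README:

```rust
use std::thread;

const MIN_CHUNK_SIZE: usize = 1024;

// Square every element, falling back to sequential execution when the input
// is too small for thread spawn/join overhead to pay off.
fn map_square(data: &[i32]) -> Vec<i32> {
    if data.len() < MIN_CHUNK_SIZE {
        return data.iter().map(|&x| x * x).collect();
    }
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let chunk_len = (data.len() / workers).max(MIN_CHUNK_SIZE);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().map(|&x| x * x).collect::<Vec<i32>>()))
            .collect();
        // Joining in spawn order preserves the original element order.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}
```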
Benchmark Results (Updated for v0.4.0)
| Operation | Dataset | Sequential | Parallel (v0.3.0) | Parallel (v0.4.0) | Speedup |
|---|---|---|---|---|---|
| Sum | 1M | 2.5ms | 1.1ms | 0.9ms | 2.78x |
| Filter | 1M | 45ms | 15ms | 12ms | 3.75x |
| Count (lock-free) | 1M | 8ms | 4ms | 2.5ms | 3.20x |
| Sort | 1M | 82ms | 25ms | 25ms | 3.28x |
| Complex Compute | 100K | 230ms | 75ms | 65ms | 3.54x |
Note: For simple operations (<100µs per element), sequential may be faster due to thread overhead.
Advanced Usage
Lock-Free Operations (v0.4.0)
```rust
use avx_parallel::*;

let data = vec![1, 2, 3, 4, 5];

// Lock-free counting with atomics (arguments shown are illustrative)
let count = lockfree_count(&data, |&x| x > 2);

// Lock-free search with early exit
let has_large = lockfree_any(&data, |&x| x > 4);
let all_positive = lockfree_all(&data, |&x| x > 0);
```
Adaptive Execution (v0.4.0)
```rust
use avx_parallel::AdaptiveExecutor;

let data = vec![1, 2, 3, 4, 5];

// Executor learns the optimal chunk size over time
// (constructor and execute signatures are illustrative)
let mut executor = AdaptiveExecutor::new();

// First run: learns optimal parameters
let result1 = executor.execute(&data, |&x| x * 2);

// Subsequent runs: uses the learned optimal chunk size
let result2 = executor.execute(&data, |&x| x * 2);
```
Memory-Efficient Operations (v0.4.0)
```rust
use avx_parallel::parallel_transform_inplace;

// Zero-allocation in-place transformation (closure form is illustrative)
let mut data = vec![1, 2, 3, 4, 5];
parallel_transform_inplace(&mut data, |x| *x *= 2);
// data is now [2, 4, 6, 8, 10] without any allocations
```
Work Stealing (v0.3.0)
```rust
use avx_parallel::{work_stealing_map, WorkStealingPool};

// Dynamic load balancing (signatures illustrative)
let data = vec![1, 2, 3, 4, 5];
let results = work_stealing_map(&data, |&x| x * 2);

// Custom work-stealing pool
let pool = WorkStealingPool::new(4);
pool.execute(|| {
    // work submitted to the pool
});
```
SIMD Operations (v0.3.0)
```rust
use avx_parallel::simd::*;

// SIMD helpers (module path and signatures are illustrative)
let data: Vec<i32> = (0..1024).collect();
let sum = parallel_simd_sum_i32(&data);

let a: Vec<f32> = vec![1.0, 2.0, 3.0];
let b: Vec<f32> = vec![4.0, 5.0, 6.0];
let dot = simd_dot_f32(&a, &b);
```
Thread Pool Configuration (v0.3.0)
```rust
use avx_parallel::{set_global_config, ThreadPoolConfig};

// Builder methods match the names above; argument values are illustrative.
let config = ThreadPoolConfig::new()
    .num_threads(8)
    .min_chunk_size(2048)
    .thread_name("avx-worker");
set_global_config(config);
```
Parallel Sorting (v0.2.0+)
```rust
use avx_parallel::parallel_sort;

let mut data = vec![5, 2, 8, 1, 9];
parallel_sort(&mut data);
// data is now [1, 2, 5, 8, 9]
```
Using Executor Functions Directly
```rust
use avx_parallel::*;

// Signatures are illustrative; see the API docs for exact forms.
let data = vec![1, 2, 3, 4, 5];

// Parallel map
let results = parallel_map(&data, |&x| x * 2);

// Parallel filter
let evens = parallel_filter(&data, |&x| x % 2 == 0);

// Parallel reduce
let sum = parallel_reduce(&data, 0, |a, b| a + b);

// Parallel partition
let (evens, odds) = parallel_partition(&data, |&x| x % 2 == 0);

// Find first matching
let found = parallel_find(&data, |&x| x > 3);

// Count matching
let count = parallel_count(&data, |&x| x > 2);
```
Mutable Iteration
```rust
use avx_parallel::*;

let mut data = vec![1, 2, 3, 4, 5];
data.par_iter_mut()
    .for_each(|x| *x *= 2);
println!("{:?}", data); // [2, 4, 6, 8, 10]
```
Architecture
Thread Management
- Uses std::thread::scope for lifetime-safe thread spawning
- Automatic CPU detection via std::thread::available_parallelism()
- Chunk-based work distribution with adaptive sizing
Synchronization
- Arc<Mutex<...>> for safe result collection
- No unsafe code in the public API
- Order preservation through indexed chunks
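The indexed-chunk pattern can be sketched as follows. This is an assumed illustration, not the crate's code; note that inside `std::thread::scope` a plain borrow of the `Mutex` suffices, so no `Arc` is strictly needed:

```rust
use std::sync::Mutex;
use std::thread;

// Each worker locks the shared Vec exactly once to push its
// (chunk_index, results) pair; a final sort by index restores
// the original element order.
fn indexed_parallel_double(data: &[i32]) -> Vec<i32> {
    let collected: Mutex<Vec<(usize, Vec<i32>)>> = Mutex::new(Vec::new());
    thread::scope(|s| {
        for (idx, chunk) in data.chunks(2).enumerate() {
            let collected = &collected;
            s.spawn(move || {
                let mapped: Vec<i32> = chunk.iter().map(|&x| x * 2).collect();
                collected.lock().unwrap().push((idx, mapped));
            });
        }
    });
    let mut parts = collected.into_inner().unwrap();
    parts.sort_by_key(|&(idx, _)| idx);
    parts.into_iter().flat_map(|(_, v)| v).collect()
}
```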
Performance Tuning
Default Configuration:
```rust
const MIN_CHUNK_SIZE: usize = 1024; // Optimized based on benchmarks
const MAX_CHUNKS_PER_THREAD: usize = 8;
```
Environment Variables:
```sh
# Customize minimum chunk size (useful for tuning specific workloads)
# Run your program
```
When to Adjust:
- Increase (2048+): Very expensive operations (>1ms per element)
- Decrease (512): Light operations but large datasets
- Keep default (1024): Most use cases
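The environment variable name below is hypothetical (the real name is not preserved in this README; check the crate's docs), but the parsing pattern shows one way such a chunk-size override could be read:

```rust
use std::env;

// Hypothetical variable name for illustration only.
const VAR: &str = "AVX_PARALLEL_MIN_CHUNK_SIZE";
const DEFAULT_MIN_CHUNK_SIZE: usize = 1024;

// Parse an override value, falling back to the default when the variable
// is unset or not a valid usize.
fn chunk_size_from(raw: Option<&str>) -> usize {
    raw.and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(DEFAULT_MIN_CHUNK_SIZE)
}

fn min_chunk_size() -> usize {
    let raw = env::var(VAR).ok();
    chunk_size_from(raw.as_deref())
}
```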
Examples
CPU-Intensive Computation
```rust
use avx_parallel::*;

let data: Vec<u64> = (0..100_000).collect();

// Perform an expensive computation in parallel (closure is illustrative)
let results: Vec<u64> = data.par_vec()
    .map(|x| x * x % 1_000_007)
    .collect();
```
Data Analysis
```rust
use avx_parallel::*;

let data: Vec<f64> = vec![1.0, 2.0, 3.0, 4.0, 5.0];

// Calculate statistics in parallel (method chain is illustrative)
let sum: f64 = data.par_iter().sum();
let count = data.len();
let mean = sum / count as f64;
let variance = data.par_vec()
    .map(|x| (x - mean).powi(2))
    .into_iter()
    .sum::<f64>() / count as f64;
```
When to Use
Good Use Cases
- CPU-bound operations (image processing, calculations, etc.)
- Large datasets (>10,000 elements)
- Independent computations per element
- Expensive operations (>100µs per element)
Not Ideal For
- I/O-bound operations (use async instead)
- Very small datasets (<1,000 elements)
- Simple operations (<10µs per element)
- Operations requiring shared mutable state
Building from Source
License
MIT License - see the LICENSE file for details
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Documentation
Full API documentation is available at docs.rs/avx-parallel
Related Projects
Star History
If you find this project useful, consider giving it a star!