avx-parallel 0.4.0

Zero-dependency parallel library with work stealing, SIMD, lock-free operations, adaptive execution, and memory-efficient algorithms
Documentation
  • Coverage
  • 100%
    158 out of 158 items documented1 out of 131 items with examples
  • Size
  • Source code size: 255.04 kB This is the summed size of all the files inside the crates.io package for this release.
  • Documentation size: 9.92 MB This is the summed size of all files generated by rustdoc for all configured targets
  • Links
  • Homepage
  • Repository
  • crates.io
  • Dependencies
  • Versions
  • Owners
  • avilaops

avx-parallel

Crates.io Documentation License Rust CI

A zero-dependency parallel computation library for Rust with true parallel execution and advanced performance features.

๐Ÿ“š Documentation

โœจ Features

Core Features

  • ๐Ÿš€ True Parallel Execution: Real multi-threaded processing using std::thread::scope
  • ๐Ÿ“ฆ Zero Dependencies: Only uses Rust standard library (std::thread, std::sync)
  • ๐Ÿ”’ Thread Safe: All operations use proper synchronization primitives
  • ๐Ÿ“Š Order Preservation: Results maintain original element order
  • โšก Smart Optimization: Automatically falls back to sequential for small datasets
  • ๐ŸŽฏ Rich API: Familiar iterator-style methods

Advanced Features (v0.3.0)

  • โš™๏ธ Work Stealing Scheduler: Dynamic load balancing across threads
  • ๐Ÿ”ข SIMD Operations: Optimized vectorized operations for numeric types
  • ๐ŸŽ›๏ธ Advanced Configuration: Customize thread pools, chunk sizes, and more
  • ๐Ÿ”„ Parallel Sorting: High-performance merge sort with custom comparators
  • ๐Ÿงฉ Element-wise Operations: Zip, chunk, and partition with parallel execution

๐Ÿ†• Revolutionary Features (v0.4.0)

  • ๐Ÿ”“ Lock-Free Operations: Zero-contention atomic algorithms
  • ๐Ÿ”„ Pipeline Processing: Functional composition with MapReduce patterns
  • ๐Ÿง  Adaptive Execution: Self-optimizing algorithms that learn optimal parameters
  • ๐Ÿ’พ Memory-Efficient: Zero-copy operations and in-place transformations

๐Ÿ“‹ Quick Start

Add to your Cargo.toml:

[dependencies]

avx-parallel = "0.4.0"

Basic Usage

use avx_parallel::prelude::*;

fn main() {
    // Parallel iteration
    let data = vec![1, 2, 3, 4, 5];
    let sum: i32 = data.par_iter()
        .map(|x| x * 2)
        .sum();
    println!("Sum: {}", sum); // Sum: 30

    // High-performance par_vec API
    let results: Vec<i32> = data.par_vec()
        .map(|&x| x * x)
        .collect();
    println!("{:?}", results); // [1, 4, 9, 16, 25]

    // Lock-free counting (v0.4.0)
    let count = lockfree_count(&data, |x| x > &2);
    println!("Count: {}", count); // Count: 3
}

๐ŸŽฏ Available Operations

Basic Operations

  • map - Transform each element
  • filter - Keep elements matching predicate
  • cloned - Clone elements (for reference iterators)

Aggregation

  • sum - Sum all elements
  • reduce - Reduce with custom operation
  • fold - Fold with identity and operation
  • count - Count elements matching predicate

Search

  • find_any - Find any element matching predicate
  • all - Check if all elements match
  • any - Check if any element matches

Advanced Operations (v0.2.0+)

  • parallel_sort - Parallel merge sort
  • parallel_sort_by - Sort with custom comparator
  • parallel_zip - Combine two slices element-wise
  • parallel_chunks - Process data in fixed-size chunks
  • partition - Split into two vectors based on predicate

Work Stealing & SIMD (v0.3.0)

  • work_stealing_map - Map with dynamic load balancing
  • WorkStealingPool - Thread pool with work stealing
  • simd_sum_* - SIMD-accelerated sum operations
  • simd_dot_* - SIMD dot product
  • ThreadPoolConfig - Advanced thread pool configuration

๐Ÿ†• Lock-Free & Adaptive (v0.4.0)

  • lockfree_count - Atomic-based counting without locks
  • lockfree_any / lockfree_all - Lock-free search with early exit
  • AdaptiveExecutor - Learning executor that optimizes chunk sizes
  • speculative_execute - Auto-select parallel vs sequential
  • cache_aware_map - Cache-line optimized transformations
  • parallel_transform_inplace - Zero-allocation transformations

๐Ÿ“Š Performance

The library automatically:

  • Detects CPU core count (default: all available cores)
  • Distributes work efficiently with configurable chunk sizes (default: 1024)
  • Falls back to sequential execution for small datasets
  • Maintains result order with indexed chunks
  • Uses work stealing for dynamic load balancing
  • NEW: Adapts chunk sizes based on workload characteristics
  • NEW: Zero-lock algorithms for maximum concurrency

Benchmark Results (Updated for v0.4.0)

Operation Dataset Sequential Parallel (v0.3.0) Parallel (v0.4.0) Speedup
Sum 1M 2.5ms 1.1ms 0.9ms 2.78x
Filter 1M 45ms 15ms 12ms 3.75x
Count (lock-free) 1M 8ms 4ms 2.5ms 3.20x
Sort 1M 82ms 25ms 25ms 3.28x
Complex Compute 100K 230ms 75ms 65ms 3.54x

Note: For simple operations (<100ยตs per element), sequential may be faster due to thread overhead.

๐Ÿ”ง Advanced Usage

Lock-Free Operations (v0.4.0)

use avx_parallel::prelude::*;

let data = vec![1, 2, 3, 4, 5, 6, 7, 8, 9, 10];

// Lock-free counting with atomics
let count = lockfree_count(&data, |x| x > &5);

// Lock-free search with early exit
let has_large = lockfree_any(&data, |x| x > &100);
let all_positive = lockfree_all(&data, |x| x > &0);

Adaptive Execution (v0.4.0)

use avx_parallel::adaptive::AdaptiveExecutor;

// Executor learns optimal chunk size over time
let mut executor = AdaptiveExecutor::new();

// First run: learns optimal parameters
let result1 = executor.execute(&data, |x| expensive_op(x));

// Subsequent runs: uses learned optimal chunk size
let result2 = executor.execute(&data, |x| expensive_op(x));

Memory-Efficient Operations (v0.4.0)

use avx_parallel::memory::parallel_transform_inplace;

// Zero-allocation in-place transformation
let mut data = vec![1, 2, 3, 4, 5];
parallel_transform_inplace(&mut data, |x| *x *= 2);
// data is now [2, 4, 6, 8, 10] without any allocations

Work Stealing (v0.3.0)

use avx_parallel::{work_stealing_map, WorkStealingPool};

// Dynamic load balancing
let data = vec![1, 2, 3, 4, 5];
let results = work_stealing_map(&data, |x| expensive_computation(x));

// Custom work stealing pool
let pool = WorkStealingPool::new(4);
pool.execute(tasks);

SIMD Operations (v0.3.0)

use avx_parallel::simd;

let data: Vec<i32> = (1..=1_000_000).collect();
let sum = simd::parallel_simd_sum_i32(&data);

let a: Vec<f32> = vec![1.0, 2.0, 3.0];
let b: Vec<f32> = vec![4.0, 5.0, 6.0];
let dot = simd::simd_dot_f32(&a, &b);

Thread Pool Configuration (v0.3.0)

use avx_parallel::{ThreadPoolConfig, set_global_config};

let config = ThreadPoolConfig::new()
    .num_threads(8)
    .min_chunk_size(2048)
    .thread_name("my-worker");

set_global_config(config);

Parallel Sorting (v0.2.0+)

use avx_parallel::parallel_sort;

let mut data = vec![5, 2, 8, 1, 9];
parallel_sort(&mut data);
// data is now [1, 2, 5, 8, 9]

Using Executor Functions Directly

use avx_parallel::executor::*;

let data = vec![1, 2, 3, 4, 5];

// Parallel map
let results = parallel_map(&data, |x| x * 2);

// Parallel filter
let evens = parallel_filter(&data, |x| *x % 2 == 0);

// Parallel reduce
let sum = parallel_reduce(&data, |a, b| a + b);

// Parallel partition
let (evens, odds) = parallel_partition(&data, |x| *x % 2 == 0);

// Find first matching
let found = parallel_find(&data, |x| *x > 3);

// Count matching
let count = parallel_count(&data, |x| *x % 2 == 0);

Mutable Iteration

use avx_parallel::prelude::*;

let mut data = vec![1, 2, 3, 4, 5];
data.par_iter_mut()
    .for_each(|x| *x *= 2);
println!("{:?}", data); // [2, 4, 6, 8, 10]

๐Ÿ—๏ธ Architecture

Thread Management

  • Uses std::thread::scope for lifetime-safe thread spawning
  • Automatic CPU detection via std::thread::available_parallelism()
  • Chunk-based work distribution with adaptive sizing

Synchronization

  • Arc<Mutex<>> for safe result collection
  • No unsafe code in public API
  • Order preservation through indexed chunks

Performance Tuning

Default Configuration:

const MIN_CHUNK_SIZE: usize = 1024;  // Optimized based on benchmarks
const MAX_CHUNKS_PER_THREAD: usize = 8;

Environment Variables:

# Customize minimum chunk size (useful for tuning specific workloads)

export avx_MIN_CHUNK_SIZE=2048


# Run your program

cargo run --release

When to Adjust:

  • Increase (2048+): Very expensive operations (>1ms per element)
  • Decrease (512): Light operations but large datasets
  • Keep default (1024): Most use cases

๐Ÿงช Examples

CPU-Intensive Computation

use avx_parallel::prelude::*;

let data: Vec<i32> = (0..10_000_000).collect();

// Perform expensive computation in parallel
let results = data.par_vec()
    .map(|&x| {
        // Simulate expensive operation
        let mut result = x;
        for _ in 0..100 {
            result = (result * 13 + 7) % 1_000_000;
        }
        result
    })
    .collect();

Data Analysis

use avx_parallel::prelude::*;

let data: Vec<f64> = vec![1.0, 2.0, 3.0, 4.0, 5.0];

// Calculate statistics in parallel
let sum: f64 = data.par_iter().sum();
let count = data.len();
let mean = sum / count as f64;

let variance = data.par_vec()
    .map(|&x| (x - mean).powi(2))
    .into_iter()
    .sum::<f64>() / count as f64;

๐Ÿ” When to Use

โœ… Good Use Cases

  • CPU-bound operations (image processing, calculations, etc.)
  • Large datasets (>10,000 elements)
  • Independent computations per element
  • Expensive operations (>100ยตs per element)

โŒ Not Ideal For

  • I/O-bound operations (use async instead)
  • Very small datasets (<1,000 elements)
  • Simple operations (<10ยตs per element)
  • Operations requiring shared mutable state

๐Ÿ› ๏ธ Building from Source

git clone https://github.com/your-org/avx-parallel

cd avx-parallel

cargo build --release

cargo test

๐Ÿ“ License

MIT License - see LICENSE file for details

๐Ÿค Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

๐Ÿ“š Documentation

Full API documentation is available at docs.rs/avx-parallel

๐Ÿ”— Related Projects

  • Rayon - Full-featured data parallelism library
  • crossbeam - Concurrent programming tools

โญ Star History

If you find this project useful, consider giving it a star!