avx-parallel
A zero-dependency parallel computation library for Rust with true parallel execution and advanced performance features.
Documentation
- Quick Start - Get started in 5 minutes
- API Documentation - Full API reference
- Optimization Guide - Performance tuning tips
- Contributing - How to contribute
- Changelog - Version history
Features
Core Features
- True Parallel Execution: Real multi-threaded processing using std::thread::scope
- Zero Dependencies: Only uses the Rust standard library (std::thread, std::sync)
- Thread Safe: All operations use proper synchronization primitives
- Order Preservation: Results maintain original element order
- Smart Optimization: Automatically falls back to sequential for small datasets
- Rich API: Familiar iterator-style methods
Advanced Features (v0.3.0)
- Work Stealing Scheduler: Dynamic load balancing across threads
- SIMD Operations: Optimized vectorized operations for numeric types
- Advanced Configuration: Customize thread pools, chunk sizes, and more
- Parallel Sorting: High-performance merge sort with custom comparators
- Element-wise Operations: Zip, chunk, and partition with parallel execution
Revolutionary Features (v0.4.0)
- Lock-Free Operations: Zero-contention atomic algorithms
- Pipeline Processing: Functional composition with MapReduce patterns
- Adaptive Execution: Self-optimizing algorithms that learn optimal parameters
- Memory-Efficient: Zero-copy operations and in-place transformations
Quick Start
Add to your Cargo.toml:

```toml
[dependencies]
avx-parallel = "0.4.0"
```

Basic Usage

```rust
use avx_parallel::*;

// A minimal illustrative example; exact signatures may differ, see the API docs.
let data = vec![1, 2, 3, 4, 5];
let doubled = parallel_map(&data, |&x| x * 2);
```
Available Operations
Basic Operations
- `map` - Transform each element
- `filter` - Keep elements matching predicate
- `cloned` - Clone elements (for reference iterators)
Aggregation
- `sum` - Sum all elements
- `reduce` - Reduce with custom operation
- `fold` - Fold with identity and operation
- `count` - Count elements matching predicate
Search
- `find_any` - Find any element matching predicate
- `all` - Check if all elements match
- `any` - Check if any element matches
Advanced Operations (v0.2.0+)
- `parallel_sort` - Parallel merge sort
- `parallel_sort_by` - Sort with custom comparator
- `parallel_zip` - Combine two slices element-wise
- `parallel_chunks` - Process data in fixed-size chunks
- `partition` - Split into two vectors based on predicate
Work Stealing & SIMD (v0.3.0)
- `work_stealing_map` - Map with dynamic load balancing
- `WorkStealingPool` - Thread pool with work stealing
- `simd_sum_*` - SIMD-accelerated sum operations
- `simd_dot_*` - SIMD dot product
- `ThreadPoolConfig` - Advanced thread pool configuration
Lock-Free & Adaptive (v0.4.0)
- `lockfree_count` - Atomic-based counting without locks
- `lockfree_any` / `lockfree_all` - Lock-free search with early exit
- `AdaptiveExecutor` - Learning executor that optimizes chunk sizes
- `speculative_execute` - Auto-select parallel vs sequential
- `cache_aware_map` - Cache-line optimized transformations
- `parallel_transform_inplace` - Zero-allocation transformations
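To illustrate the idea behind the lock-free helpers, here is a minimal sketch using only std atomics. `atomic_count_matching` is a hypothetical stand-in for this README, not the crate's implementation: each worker counts locally and publishes one atomic add, so no Mutex is ever taken.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Sketch of lock-free counting: count per chunk locally, then do a single
// relaxed atomic add per chunk so contention stays minimal.
fn atomic_count_matching(data: &[i32], pred: fn(&i32) -> bool) -> usize {
    let counter = AtomicUsize::new(0);
    thread::scope(|s| {
        for chunk in data.chunks(1024) {
            let counter = &counter;
            s.spawn(move || {
                let local = chunk.iter().filter(|&x| pred(x)).count();
                counter.fetch_add(local, Ordering::Relaxed);
            });
        }
    });
    counter.load(Ordering::Relaxed)
}
```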
Performance
The library automatically:
- Detects CPU core count (default: all available cores)
- Distributes work efficiently with configurable chunk sizes (default: 1024)
- Falls back to sequential execution for small datasets
- Maintains result order with indexed chunks
- Uses work stealing for dynamic load balancing
- NEW: Adapts chunk sizes based on workload characteristics
- NEW: Zero-lock algorithms for maximum concurrency
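The small-dataset fallback and chunked distribution above can be sketched with the standard library alone. `map_square` is illustrative, not the crate's code, though the 1024 threshold matches the default quoted later in this README:

```rust
use std::thread;

const MIN_CHUNK_SIZE: usize = 1024;

// Square every element, falling back to sequential execution when the input
// is too small for thread spawn/join overhead to pay off.
fn map_square(data: &[i32]) -> Vec<i32> {
    if data.len() < MIN_CHUNK_SIZE {
        return data.iter().map(|&x| x * x).collect();
    }
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let chunk_len = (data.len() / workers).max(MIN_CHUNK_SIZE);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk_len)
            .map(|chunk| s.spawn(move || chunk.iter().map(|&x| x * x).collect::<Vec<i32>>()))
            .collect();
        // Joining in spawn order preserves the original element order.
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}
```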
Benchmark Results (Updated for v0.4.0)
| Operation | Dataset | Sequential | Parallel (v0.3.0) | Parallel (v0.4.0) | Speedup |
|---|---|---|---|---|---|
| Sum | 1M | 2.5ms | 1.1ms | 0.9ms | 2.78x |
| Filter | 1M | 45ms | 15ms | 12ms | 3.75x |
| Count (lock-free) | 1M | 8ms | 4ms | 2.5ms | 3.20x |
| Sort | 1M | 82ms | 25ms | 25ms | 3.28x |
| Complex Compute | 100K | 230ms | 75ms | 65ms | 3.54x |
Note: For simple operations (<100µs per element), sequential may be faster due to thread overhead.
Advanced Usage
Lock-Free Operations (v0.4.0)
```rust
use avx_parallel::*;

let data = vec![1, 2, 3, 4, 5];

// Lock-free counting with atomics (arguments shown are illustrative)
let count = lockfree_count(&data, |&x| x > 2);

// Lock-free search with early exit
let has_large = lockfree_any(&data, |&x| x > 4);
let all_positive = lockfree_all(&data, |&x| x > 0);
```
Adaptive Execution (v0.4.0)
```rust
use avx_parallel::AdaptiveExecutor;

let data = vec![1, 2, 3, 4, 5];

// Executor learns the optimal chunk size over time
// (constructor and execute signatures are illustrative)
let mut executor = AdaptiveExecutor::new();

// First run: learns optimal parameters
let result1 = executor.execute(&data, |&x| x * 2);

// Subsequent runs: uses the learned optimal chunk size
let result2 = executor.execute(&data, |&x| x * 2);
```
Memory-Efficient Operations (v0.4.0)
```rust
use avx_parallel::parallel_transform_inplace;

// Zero-allocation in-place transformation (closure form is illustrative)
let mut data = vec![1, 2, 3, 4, 5];
parallel_transform_inplace(&mut data, |x| *x *= 2);
// data is now [2, 4, 6, 8, 10] without any allocations
```
Work Stealing (v0.3.0)
```rust
use avx_parallel::{work_stealing_map, WorkStealingPool};

// Dynamic load balancing (signatures illustrative)
let data = vec![1, 2, 3, 4, 5];
let results = work_stealing_map(&data, |&x| x * 2);

// Custom work-stealing pool
let pool = WorkStealingPool::new(4);
pool.execute(|| {
    // work submitted to the pool
});
```
SIMD Operations (v0.3.0)
```rust
use avx_parallel::simd::*;

// SIMD helpers (module path and signatures are illustrative)
let data: Vec<i32> = (0..1024).collect();
let sum = parallel_simd_sum_i32(&data);

let a: Vec<f32> = vec![1.0, 2.0, 3.0];
let b: Vec<f32> = vec![4.0, 5.0, 6.0];
let dot = simd_dot_f32(&a, &b);
```
Thread Pool Configuration (v0.3.0)
```rust
use avx_parallel::{set_global_config, ThreadPoolConfig};

// Builder methods match the names above; argument values are illustrative.
let config = ThreadPoolConfig::new()
    .num_threads(8)
    .min_chunk_size(2048)
    .thread_name("avx-worker");
set_global_config(config);
```
Parallel Sorting (v0.2.0+)
```rust
use avx_parallel::parallel_sort;

let mut data = vec![5, 2, 8, 1, 9];
parallel_sort(&mut data);
// data is now [1, 2, 5, 8, 9]
```
Using Executor Functions Directly
```rust
use avx_parallel::*;

// Signatures are illustrative; see the API docs for exact forms.
let data = vec![1, 2, 3, 4, 5];

// Parallel map
let results = parallel_map(&data, |&x| x * 2);

// Parallel filter
let evens = parallel_filter(&data, |&x| x % 2 == 0);

// Parallel reduce
let sum = parallel_reduce(&data, 0, |a, b| a + b);

// Parallel partition
let (evens, odds) = parallel_partition(&data, |&x| x % 2 == 0);

// Find first matching
let found = parallel_find(&data, |&x| x > 3);

// Count matching
let count = parallel_count(&data, |&x| x > 2);
```
Mutable Iteration
```rust
use avx_parallel::*;

let mut data = vec![1, 2, 3, 4, 5];
data.par_iter_mut()
    .for_each(|x| *x *= 2);
println!("{:?}", data); // [2, 4, 6, 8, 10]
```
Architecture
Thread Management
- Uses std::thread::scope for lifetime-safe thread spawning
- Automatic CPU detection via std::thread::available_parallelism()
- Chunk-based work distribution with adaptive sizing
Synchronization
- Arc<Mutex<...>> for safe result collection
- No unsafe code in the public API
- Order preservation through indexed chunks
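The indexed-chunk pattern can be sketched as follows. This is an assumed illustration, not the crate's code; note that inside `std::thread::scope` a plain borrow of the `Mutex` suffices, so no `Arc` is strictly needed:

```rust
use std::sync::Mutex;
use std::thread;

// Each worker locks the shared Vec exactly once to push its
// (chunk_index, results) pair; a final sort by index restores
// the original element order.
fn indexed_parallel_double(data: &[i32]) -> Vec<i32> {
    let collected: Mutex<Vec<(usize, Vec<i32>)>> = Mutex::new(Vec::new());
    thread::scope(|s| {
        for (idx, chunk) in data.chunks(2).enumerate() {
            let collected = &collected;
            s.spawn(move || {
                let mapped: Vec<i32> = chunk.iter().map(|&x| x * 2).collect();
                collected.lock().unwrap().push((idx, mapped));
            });
        }
    });
    let mut parts = collected.into_inner().unwrap();
    parts.sort_by_key(|&(idx, _)| idx);
    parts.into_iter().flat_map(|(_, v)| v).collect()
}
```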
Performance Tuning
Default Configuration:
```rust
const MIN_CHUNK_SIZE: usize = 1024; // Optimized based on benchmarks
const MAX_CHUNKS_PER_THREAD: usize = 8;
```
Environment Variables:
```sh
# Customize minimum chunk size (useful for tuning specific workloads)
# Run your program
```
When to Adjust:
- Increase (2048+): Very expensive operations (>1ms per element)
- Decrease (512): Light operations but large datasets
- Keep default (1024): Most use cases
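The environment variable name below is hypothetical (the real name is not preserved in this README; check the crate's docs), but the parsing pattern shows one way such a chunk-size override could be read:

```rust
use std::env;

// Hypothetical variable name for illustration only.
const VAR: &str = "AVX_PARALLEL_MIN_CHUNK_SIZE";
const DEFAULT_MIN_CHUNK_SIZE: usize = 1024;

// Parse an override value, falling back to the default when the variable
// is unset or not a valid usize.
fn chunk_size_from(raw: Option<&str>) -> usize {
    raw.and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(DEFAULT_MIN_CHUNK_SIZE)
}

fn min_chunk_size() -> usize {
    let raw = env::var(VAR).ok();
    chunk_size_from(raw.as_deref())
}
```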
Examples
CPU-Intensive Computation
```rust
use avx_parallel::*;

let data: Vec<u64> = (0..100_000).collect();

// Perform an expensive computation in parallel (closure is illustrative)
let results: Vec<u64> = data.par_vec()
    .map(|x| x * x % 1_000_007)
    .collect();
```
Data Analysis
```rust
use avx_parallel::*;

let data: Vec<f64> = vec![1.0, 2.0, 3.0, 4.0, 5.0];

// Calculate statistics in parallel (method chain is illustrative)
let sum: f64 = data.par_iter().sum();
let count = data.len();
let mean = sum / count as f64;
let variance = data.par_vec()
    .map(|x| (x - mean).powi(2))
    .into_iter()
    .sum::<f64>() / count as f64;
```
When to Use
Good Use Cases
- CPU-bound operations (image processing, calculations, etc.)
- Large datasets (>10,000 elements)
- Independent computations per element
- Expensive operations (>100µs per element)
Not Ideal For
- I/O-bound operations (use async instead)
- Very small datasets (<1,000 elements)
- Simple operations (<10µs per element)
- Operations requiring shared mutable state
Building from Source
License
MIT License - see the LICENSE file for details
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Documentation
Full API documentation is available at docs.rs/avx-parallel
Related Projects
Star History
If you find this project useful, consider giving it a star!