# key-paths-iter

Query builder for iterating over `Vec<Item>` collections accessed via a `rust-key-paths` `KpType` keypath.
## Usage

Add to `Cargo.toml` (the keypath dependency name below is inferred from the description above; adjust to your workspace layout):

```toml
[dependencies]
rust-key-paths = "2"
key-paths-iter = { path = "../key-paths-iter" } # or from crates.io when published
```

For parallel iteration and Rayon tuning helpers, enable the `rayon` feature:

```toml
[dependencies]
key-paths-iter = { path = "../key-paths-iter", features = ["rayon"] }
```
Rayon version: the crate declares an optional dependency on Rayon 1.10 and is expected to work with 1.10.x and newer 1.x releases, since Rayon 1.x is stable and API-compatible across minor updates. If you need a different Rayon version, use a `[patch]` in your workspace or a fork.
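A minimal workspace override, if you ever need to pin Rayon to a local fork (the path is hypothetical):

```toml
[patch.crates-io]
rayon = { path = "../rayon-fork" } # hypothetical local checkout
```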
Use it with a keypath whose value type is `Vec<Item>`. The constructor, trait import, and closures below are illustrative (exact names depend on your keypath setup):

```rust
use key_paths_iter::*; // brings the query extension trait into scope
use rust_key_paths::Kp;

// A keypath from the root type into its Vec<User> field.
let users_kp: KpType<'_, Company, Vec<User>> = Kp::new(|c: &Company| &c.users);

// Chain filters, limit, offset, then execute.
let results = users_kp
    .query(&company)
    .filter(|u| u.active)
    .filter(|u| u.age >= 18)
    .limit(10)
    .offset(5)
    .execute();

// Or use count / exists / first.
let n = users_kp.query(&company).filter(|u| u.active).count();
let any = users_kp.query(&company).filter(|u| u.active).exists();
let first = users_kp.query(&company).filter(|u| u.active).first();
```

The keypath and the root reference share the same lifetime; a type annotation like `KpType<'_, Root, Vec<Item>>` helps the compiler infer the scope correctly.
## Rayon performance tuning
With the `rayon` feature, the crate exposes a Rayon optimization module: thread-pool presets, chunk sizing, cache-friendly patterns, profiling helpers, and workload-specific guides. Use these with parallel keypath collection ops (e.g. `query_par`).

Examples (run from the workspace root with `cargo run --example <name>`; requires `key-paths-iter` with the `rayon` feature in dev-dependencies): `rayon_config_example`, `adaptive_pool_example`, `chunk_size_example`, `memory_optimized_example`, `rayon_profiler_example`, `rayon_patterns_example`, `rayon_env_example`, `optimization_guide_example`, `performance_monitor_example`.
### Performance benefits of parallel (`par`)

- Throughput: on multi-core machines, parallel iteration spreads work across cores, so total time can drop by roughly a factor of the core count (for CPU-bound work with good load balance).
- When you gain the most: large collections (e.g. > 10k items), CPU-heavy per-item work (math, encoding, parsing), and batch operations (map, filter, count, sort, fold). Typical speedups are ~2–8× on 2–8 cores when the workload is uniform and not memory-bound.
- When `par` may not help (or can hurt): very small collections (overhead dominates), very cheap per-item work (< ~1 μs), or when the bottleneck is memory bandwidth or a single shared resource. Use `RayonProfiler::compare_parallel_vs_sequential` to measure.
### Where you can use `par`

In this crate (keypath collections), use the `query_par` module and the `ParallelCollectionKeyPath` trait on `KpType<'static, Root, Vec<Item>>` (e.g. from `#[derive(Kp)]`):
| Category | Methods |
|---|---|
| Map / transform | `par_map`, `par_filter`, `par_filter_map`, `par_flat_map`, `par_map_with_index` |
| Reduce / aggregate | `par_fold`, `par_reduce`, `par_count`, `par_count_by` |
| Search | `par_find`, `par_find_any`, `par_any`, `par_all`, `par_contains` |
| Min / max | `par_min`, `par_max`, `par_min_by_key`, `par_max_by_key` |
| Partition / group | `par_partition`, `par_group_by` |
| Ordering | `par_sort`, `par_sort_by_key` |
| Side effects | `par_for_each` |
Example: `employees_kp.par_map(&company, |e| e.salary)`, `employees_kp.par_count_by(&company, |e| e.active)`.

With raw slices and Rayon, any `&[T]` or `Vec<T>` supports Rayon's `par_iter()`, `par_chunks()`, `par_chunks_mut()`, and the rest of the `rayon::prelude` API. The `rayon_optimizations` helpers (chunk sizing, pool config, profiling) work with both keypath-based and raw-slice parallel code.
### Thread count rules of thumb

- CPU-bound: use all cores → `RAYON_NUM_THREADS = num_cpus::get()`
- I/O-bound: oversubscribe 2× → `RAYON_NUM_THREADS = num_cpus::get() * 2`
- Memory-intensive: use half → `RAYON_NUM_THREADS = num_cpus::get() / 2`
- Latency-sensitive: physical cores only → `RAYON_NUM_THREADS = num_cpus::get_physical()`
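These rules reduce to a small lookup. The sketch below uses std's `available_parallelism()` as a stand-in for `num_cpus::get()` (physical-core counting needs the `num_cpus` crate and is omitted here):

```rust
use std::thread;

// Rules of thumb above, keyed by workload shape.
fn suggested_threads(workload: &str) -> usize {
    // std analogue of num_cpus::get(); falls back to 1 if detection fails.
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    match workload {
        "cpu-bound" => cores,                     // use all logical cores
        "io-bound" => cores * 2,                  // oversubscribe 2x to hide I/O waits
        "memory-intensive" => (cores / 2).max(1), // halve to ease bandwidth pressure
        _ => cores,
    }
}

fn main() {
    assert_eq!(suggested_threads("io-bound"), suggested_threads("cpu-bound") * 2);
    println!("cpu-bound on this machine: {} threads", suggested_threads("cpu-bound"));
}
```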
### Chunk size formulas

- Uniform work: ~8 chunks per thread → `chunk_size = total_items / (num_threads * 8)`
- Variable work: ~16 chunks per thread → `chunk_size = total_items / (num_threads * 16)`
- Expensive work: ~32 chunks per thread → `chunk_size = total_items / (num_threads * 32)`
- Cheap work: ~2 chunks per thread → `chunk_size = total_items / (num_threads * 2)`

Helpers: `ChunkSizeOptimizer::uniform`, `variable`, `expensive`, `cheap`, and `auto_detect(items, sample_size, work_fn)`.
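The four formulas collapse into one function; this is a sketch of the arithmetic, not the crate's `ChunkSizeOptimizer` API:

```rust
/// `chunks_per_thread`: 8 for uniform, 16 for variable,
/// 32 for expensive, 2 for cheap per-item work.
fn chunk_size(total_items: usize, num_threads: usize, chunks_per_thread: usize) -> usize {
    // Never round down to zero-sized chunks on tiny inputs.
    (total_items / (num_threads * chunks_per_thread)).max(1)
}

fn main() {
    // 1_000_000 items across 8 threads, uniform work (~8 chunks per thread).
    assert_eq!(chunk_size(1_000_000, 8, 8), 15_625);
    // Tiny input still yields a usable chunk size.
    assert_eq!(chunk_size(10, 8, 8), 1);
    println!("ok");
}
```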
### When to use parallel

Use parallel iteration when `items.len() > 1000` and cost per item is non-trivial (e.g. > ~1 μs). Otherwise prefer sequential to avoid overhead. Use `RayonPatterns::small_collection_optimization(items, min_len, f)` to switch automatically.
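The threshold pair can be expressed as a tiny predicate (a sketch; the numbers are the rules of thumb above, not measured constants):

```rust
/// Heuristic: worth parallelizing only above ~1000 items
/// and at roughly >= 1 µs of work per item.
fn should_parallelize(len: usize, est_cost_ns_per_item: u64) -> bool {
    len > 1000 && est_cost_ns_per_item >= 1_000
}

fn main() {
    assert!(should_parallelize(10_000, 5_000)); // big and expensive: go parallel
    assert!(!should_parallelize(100, 5_000));   // too few items
    assert!(!should_parallelize(10_000, 50));   // too cheap per item
}
```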
### Cache-friendly chunk sizes

- L1 (~32 KB): `chunk_size = 32KB / sizeof(T)` → `MemoryOptimizedConfig::l1_cache_friendly`
- L2 (~256 KB): `chunk_size = 256KB / sizeof(T)` → `MemoryOptimizedConfig::l2_cache_friendly`
- L3 (~8 MB shared): `chunk_size = (8MB / num_threads) / sizeof(T)` → `MemoryOptimizedConfig::l3_cache_friendly`
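The arithmetic behind these presets, sketched as a generic function (the cache sizes are the nominal values above; real sizes vary by CPU):

```rust
use std::mem::size_of;

/// Items per chunk so one chunk's data fits in a cache of `cache_bytes`.
fn cache_friendly_chunk<T>(cache_bytes: usize) -> usize {
    (cache_bytes / size_of::<T>()).max(1)
}

fn main() {
    // L1 (~32 KB) with f64 items: 32768 / 8 = 4096 items per chunk.
    assert_eq!(cache_friendly_chunk::<f64>(32 * 1024), 4096);
    // L3 (~8 MB, shared): divide the cache across threads first.
    let l3_per_thread = 8 * 1024 * 1024 / 8; // 8 threads
    assert_eq!(cache_friendly_chunk::<f64>(l3_per_thread), 131_072);
}
```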
### Anti-patterns to avoid

- Multiple collects: avoid `let a = data.par_iter().map(...).collect(); let b = a.par_iter().filter(...).collect();`. Prefer chaining: `data.par_iter().map(...).filter(...).collect()`.
- Shared mutex: avoid a single `Mutex<Vec<_>>` with `par_iter().for_each(|x| results.lock().unwrap().push(...))`. Prefer local accumulation then combine, e.g. `par_chunks(...).map(|chunk| ...).collect()` or fold/reduce. See `RayonPatterns::reduce_lock_contention`.
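The "local accumulation then combine" pattern can be shown with std scoped threads alone (a sketch; with Rayon you would write `data.par_chunks(n).map(|c| c.iter().sum::<u64>()).sum()`):

```rust
use std::thread;

/// Each thread sums its own chunk locally; partial sums are combined
/// at the end. No shared Mutex<Vec<_>> on the hot path.
fn parallel_sum(data: &[u64], num_threads: usize) -> u64 {
    let chunk = (data.len() / num_threads).max(1);
    thread::scope(|s| {
        // Spawn one scoped thread per chunk; each returns a local sum.
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().sum::<u64>()))
            .collect();
        // Combine step: fold the per-thread partials.
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let data: Vec<u64> = (1..=1000).collect();
    assert_eq!(parallel_sum(&data, 4), 500_500);
    println!("sum = {}", parallel_sum(&data, 4));
}
```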
### Configuration file

Create `rayon.conf`:

```text
RAYON_NUM_THREADS=16
RAYON_STACK_SIZE=2097152
```

Load in code:

```rust
// Receiver and binding reconstructed from the save_to_file example.
let config = RayonEnvConfig::load_from_file("rayon.conf")?;
```

Save the current suggested config: `RayonEnvConfig::save_to_file("rayon.conf")?`.
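For illustration, the KEY=VALUE format above is simple enough to parse by hand (a sketch; `RayonEnvConfig::load_from_file` is the crate's real loader):

```rust
/// Minimal KEY=VALUE parser for rayon.conf-style files.
fn parse_conf(contents: &str) -> Vec<(String, String)> {
    contents
        .lines()
        .filter_map(|line| {
            let line = line.trim();
            if line.is_empty() || line.starts_with('#') {
                return None; // skip blanks and comments
            }
            let (key, value) = line.split_once('=')?;
            Some((key.trim().to_string(), value.trim().to_string()))
        })
        .collect()
}

fn main() {
    let conf = "RAYON_NUM_THREADS=16\nRAYON_STACK_SIZE=2097152\n";
    let pairs = parse_conf(conf);
    assert_eq!(pairs.len(), 2);
    assert_eq!(pairs[0].0, "RAYON_NUM_THREADS");
    assert_eq!(pairs[0].1, "16");
}
```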
### Quick benchmark (parallel vs sequential)

The timing loop below is a sketch; `data` and `work` stand in for your collection and per-item function:

```rust
use std::time::Instant;

let start = Instant::now();
data.par_iter().for_each(|x| work(x));
println!("parallel:   {:?}", start.elapsed());

let start = Instant::now();
data.iter().for_each(|x| work(x));
println!("sequential: {:?}", start.elapsed());
```

Or use `RayonProfiler::compare_parallel_vs_sequential(sequential_fn, parallel_fn, iterations)` for averaged timings and speedup.
### Optimal settings by workload
| Workload | Threads | Stack size | Breadth-first | Chunk size |
|---|---|---|---|---|
| CPU-bound | All cores | 2 MB | No | Medium (8×) |
| I/O-bound | 2× cores | 1 MB | Yes | Small (16×) |
| Memory-heavy | Half cores | 4 MB | No | Large (2×) |
| Latency | Physical only | 2 MB | Yes | Very small (32×) |
| Real-time | Half cores | 2 MB | Yes | Adaptive |
Preset pools: `OptimizationGuide::data_pipeline()`, `web_server()`, `scientific_computing()`, `real_time()`, `machine_learning()`. Config builder: `RayonConfig::cpu_bound()`, `io_bound()`, `memory_intensive()`, `latency_sensitive()`, `physical_cores_only()`, then `.build()`.