key-paths-iter

Query builder for iterating over Vec<Item> collections accessed via rust-key-paths KpType.

Usage

Add to Cargo.toml:

[dependencies]
rust-key-paths = "2"
key-paths-iter = { path = "../key-paths-iter" }  # or from crates.io when published

For parallel iteration and Rayon tuning helpers, enable the rayon feature:

key-paths-iter = { path = "../key-paths-iter", features = ["rayon"] }

Rayon version: the crate uses Rayon 1.10 (optional dependency). It is expected to work with Rayon 1.10.x and newer 1.x; if you need a different Rayon version, use a patch in your workspace or fork. Rayon 1.x is stable and API-compatible across minor updates.

Use with a keypath whose value type is Vec<Item>:

use key_paths_iter::{CollectionQuery, QueryableCollection};
use rust_key_paths::Kp;

let users_kp: rust_key_paths::KpType<'_, Database, Vec<User>> = Kp::new(
    |db: &Database| Some(&db.users),
    |db: &mut Database| Some(&mut db.users),
);

// Chain filters, limit, offset, then execute
let results = users_kp
    .query()
    .filter(|u| u.active)
    .filter(|u| u.age > 26)
    .limit(2)
    .offset(0)
    .execute(&db);

// Or use count / exists / first
let n = users_kp.query().filter(|u| u.active).count(&db);
let any = users_kp.query().filter(|u| u.active).exists(&db);
let first = users_kp.query().filter(|u| u.active).first(&db);

The keypath and the root reference share the same lifetime; use a type annotation like KpType<'_, Root, Vec<Item>> so the compiler infers the scope correctly.

GPU-scale parallel validation and calculation (`scale_par`)

With the rayon feature, the scale_par module provides types and helpers for GPU-style pipelines: parallel validation, data extraction for transfer, pre/post processing, and adaptive dispatch — all driven by keypaths.

Benefits of using scale_par with keypaths:

Benefit	Description
Zero-cost + composable	Keypaths use `fn` pointers that inline to direct field access; no boxed closures. Compose keypaths (e.g. into nested `reduction_net.nodes`) with no runtime overhead.
Parallel-first	`par_all`, `par_count_by`, `par_flat_map`, etc. run over the collection at the keypath. Validate large node/pair arrays in parallel before GPU transfer.
GPU data pipeline	Extract via keypath → validate in parallel → transfer; then write back results via `get_mut` and keypath. Same pattern works for CPU-only “GPU-style” batches.
First-class field access	Pass keypaths as arguments (e.g. `validate_for_gpu(..., nodes_kp, pairs_kp)`). Generic code can work over any root/collection type.
Adaptive dispatch	Use keypath queries (`par_count`, `par_count_by`) to choose workgroup size, batch size, or local memory usage from data shape.
No boilerplate	One keypath per field; validation and calculation functions take keypath references. Works with `#[derive(Kp)]` or manual `Kp::new`.

Example: validate a compute pipeline, extract data for GPU, run parallel buffer scaling, then write back results:

use key_paths_iter::scale_par::{validate_for_gpu, extract_gpu_data, par_scale_buffers, ...};
use rust_key_paths::Kp;

let nodes_kp = Kp::new(|p: &GpuComputePipeline| Some(&p.reduction_net.nodes), ...);
let pairs_kp = Kp::new(|p: &GpuComputePipeline| Some(&p.reduction_net.active_pairs), ...);

validate_for_gpu(&pipeline, &nodes_kp, &pairs_kp, 1000)?;
let gpu_data = extract_gpu_data(&pipeline, &nodes_kp, &pairs_kp);
par_scale_buffers(&buffers_kp, &mut pipeline, 2.0);

Run the full example: cargo run --example scale_par_gpu_validation (from the workspace root, with key-paths-iter and rayon enabled).

GPU-compatible types: NetNode is #[repr(C)] with kind, port0, port1, port2 (use NetNode::new(NodeKind, [u32;3]) and .kind() / .ports()). RedexPair is #[repr(C)] with left, right; convert with RedexPair::from((u32, u32)). With the gpu feature, both derive bytemuck::Pod and Zeroable for safe GPU buffer casting. Mutable access uses the keypath get_mut(root) API from rust_key_paths (returns Option<&mut V>).

Optional: GPU compute (feature `gpu`)

Enable the gpu feature for wgpu-based HVM2-style reduction:

key-paths-iter = { path = "../key-paths-iter", features = ["rayon", "gpu"] }

This adds GpuCompute::new() and execute_reduction(), and run_gpu_reduction_pipeline(pipeline, nodes_kp, pairs_kp) for a full validate → extract → dispatch → read-back flow. Requires a GPU adapter (Vulkan/Metal/DX12). The WGSL shader is in key-paths-iter/shaders/hvm_reduce.wgsl.

kp_gpu: GPU-aware KpType (no AKp/PKp)

With the gpu feature, the kp_gpu module provides GPU extensions on KpType only:

.map_gpu(wgsl) — attach a WGSL element-wise transform; returns GpuKp<R, V>.
.par_gpu(wgsl, roots, ctx) — attach kernel and run over a slice in one GPU dispatch.
.and_then_gpu(next_wgsl) — chain a second kernel (one dispatch).
GpuKpRunner — run a heterogeneous list of GpuKps (f32, u32, i32) on one root or many roots in one dispatch.
Vec value — for KpType<R, Vec<V>> (e.g. a field that is Vec<f32>): .map_gpu_vec(wgsl) returns GpuKpVec<R, V> with run_one(root, ctx) and apply_one(root, ctx) for one GPU dispatch over the whole vector.

Examples: cargo run --example kp_gpu_example, cargo run --example kp_gpu_vec_example (from the workspace root).

Benchmark (scale_par: parallel vs sequential)

From the workspace root (rust-key-paths), run:

cargo bench --bench scale_par_bench

This compares:

Buffer scaling: sequential nested loops vs keypath par_scale_buffers (multiple buffer × length sizes).
Validation (all non-empty): sequential iter().all(...) vs par_validate_buffers_non_empty.
Count by predicate: sequential filter().count() vs keypath par_count_by (nodes by kind at 5k–100k nodes).

On multi-core machines, parallel validation and par_count_by typically show speedups for large collections; buffer scaling may favor sequential for small inputs due to Rayon overhead. Use the benchmark to tune for your workload.

Enabling GPU access (using keypaths with a GPU backend)

This crate does not ship a GPU driver or runtime. It gives you CPU-side parallel validation, extraction, and preprocessing via keypaths. To run work on an actual GPU, add a GPU backend to your project and use keypaths to prepare and consume data.

1. System and driver requirements

Backend	Typical use	What you need
Vulkan	Cross-platform (Windows, Linux, macOS)	Vulkan SDK, GPU drivers with Vulkan support
Metal	macOS, iOS	Xcode / Metal (usually already installed on Mac)
CUDA	NVIDIA GPUs only	NVIDIA drivers, CUDA Toolkit, `nvcc` in path for Rust CUDA builds
ROCm	AMD GPUs (Linux)	ROCm stack, compatible AMD GPU

Windows: Install Vulkan SDK and/or CUDA Toolkit (NVIDIA). GPU drivers from your vendor must be up to date.
macOS: Metal is built in; for Vulkan you can use MoltenVK (often bundled by wgpu).
Linux: Install Vulkan (e.g. vulkan-tools, libvulkan-dev), or NVIDIA drivers + CUDA for NVIDIA, or ROCm for AMD.

2. Rust crates that enable GPU

wgpu — Cross-platform (Vulkan / Metal / DX12 / WebGPU). Good default for “run on GPU” without tying to one vendor. Use keypaths to build buffers (e.g. extract_gpu_data, par_flat_map_buffer_data) and validate before queue.write_buffer / dispatch.
cudarc / rust-cuda — NVIDIA CUDA from Rust. Use keypaths to prepare host data, then copy to device and launch kernels.
vulkano — Vulkan bindings. Keypaths can feed validated/extracted data into Vulkan buffers and compute dispatches.

3. How to “enable GPU” in your project

Add a GPU dependency to your Cargo.toml, e.g. wgpu or cudarc.
Use keypaths + scale_par on the CPU to validate and prepare data (e.g. validate_for_gpu, extract_gpu_data, par_scale_buffers).
Transfer to the GPU using the backend’s API (e.g. wgpu::Queue::write_buffer with data produced via keypath helpers).
Run your compute or render work on the device; then read back results and, if needed, write them back into your structures via keypaths (e.g. process_gpu_results).

Example pattern (pseudo-code):

let data = extract_gpu_data(&pipeline, &nodes_kp, &pairs_kp);  // keypath-based
validate_for_gpu(&pipeline, &nodes_kp, &pairs_kp, MAX_PORT)?;    // parallel validation
queue.write_buffer(&buffer, 0, bytemuck::cast_slice(&data.nodes));  // wgpu/CUDA/...
// ... dispatch ...
// read back and write into pipeline via keypath
process_gpu_results(&mut pipeline, &nodes_kp, gpu_results);

No feature flag in key-paths-iter is required for GPU: enable the rayon feature for parallel CPU prep; the GPU itself is enabled by your choice of backend and system drivers.

Rayon performance tuning

With the rayon feature, the crate exposes a Rayon optimization module: thread pool presets, chunk sizing, cache-friendly patterns, profiling helpers, and workload-specific guides. Use these with parallel keypath collection ops (e.g. query_par).

Examples (from the workspace root): rayon_config_example, adaptive_pool_example, chunk_size_example, memory_optimized_example, rayon_profiler_example, rayon_patterns_example, rayon_env_example, optimization_guide_example, performance_monitor_example. Run with cargo run --example <name> (requires key-paths-iter with rayon in dev-dependencies).

Performance benefits of parallel (`par`)

Throughput: On multi-core machines, parallel iteration spreads work across cores, so total time can drop by roughly a factor of the number of cores (for CPU-bound work with good load balance).
When you gain the most: Large collections (e.g. > 10k items), CPU-heavy per-item work (math, encoding, parsing), and batch operations (map, filter, count, sort, fold). Typical speedups are ~2–8× on 2–8 cores when the workload is uniform and not memory-bound.
When par may not help (or can hurt): Very small collections (overhead dominates), very cheap per-item work (< ~1 μs), or when the bottleneck is memory bandwidth or a single shared resource. Use RayonProfiler::compare_parallel_vs_sequential to measure.

Where you can use `par`

In this crate (keypath collections) — use the query_par module and the ParallelCollectionKeyPath trait on KpType<'static, Root, Vec<Item>> (e.g. from #[derive(Kp)]):

Category	Methods
Map / transform	`par_map`, `par_filter`, `par_filter_map`, `par_flat_map`, `par_map_with_index`
Reduce / aggregate	`par_fold`, `par_reduce`, `par_count`, `par_count_by`
Search	`par_find`, `par_find_any`, `par_any`, `par_all`, `par_contains`
Min / max	`par_min`, `par_max`, `par_min_by_key`, `par_max_by_key`
Partition / group	`par_partition`, `par_group_by`
Ordering	`par_sort`, `par_sort_by_key`
Side effects	`par_for_each`

Example: employees_kp.par_map(&company, |e| e.salary), employees_kp.par_count_by(&company, |e| e.active).

With raw slices and Rayon — on any &[T] or Vec<T> you can use Rayon’s par_iter(), par_chunks(), par_chunks_mut(), and the rest of the rayon::prelude API. The rayon_optimizations helpers (chunk sizing, pool config, profiling) work with both keypath-based and raw-slice parallel code.

Thread count rules of thumb

CPU-bound: use all cores → RAYON_NUM_THREADS = num_cpus::get()
I/O-bound: oversubscribe 2× → RAYON_NUM_THREADS = num_cpus::get() * 2
Memory-intensive: use half → RAYON_NUM_THREADS = num_cpus::get() / 2
Latency-sensitive: physical cores only → RAYON_NUM_THREADS = num_cpus::get_physical()

Chunk size formulas

Uniform work: ~8 chunks per thread → chunk_size = total_items / (num_threads * 8)
Variable work: ~16 chunks per thread → chunk_size = total_items / (num_threads * 16)
Expensive work: ~32 chunks per thread → chunk_size = total_items / (num_threads * 32)
Cheap work: ~2 chunks per thread → chunk_size = total_items / (num_threads * 2)

Helpers: ChunkSizeOptimizer::uniform, variable, expensive, cheap, and auto_detect(items, sample_size, work_fn).

When to use parallel

Use parallel iteration when:

items.len() > 1000 and
cost per item is non-trivial (e.g. > ~1 μs).

Otherwise prefer sequential to avoid overhead. Use RayonPatterns::small_collection_optimization(items, min_len, f) to switch automatically.

Cache-friendly chunk sizes

L1 (~32 KB): chunk_size = 32KB / sizeof(T) → MemoryOptimizedConfig::l1_cache_friendly
L2 (~256 KB): chunk_size = 256KB / sizeof(T) → MemoryOptimizedConfig::l2_cache_friendly
L3 (~8 MB shared): chunk_size = (8MB / num_threads) / sizeof(T) → MemoryOptimizedConfig::l3_cache_friendly

Anti-patterns to avoid

Multiple collects: avoid let a = data.par_iter().map(...).collect(); let b = a.par_iter().filter(...).collect();. Prefer chaining: data.par_iter().map(...).filter(...).collect().
Shared mutex: avoid a single Mutex<Vec<_>> with par_iter().for_each(|x| results.lock().unwrap().push(...)). Prefer local accumulation then combine, e.g. par_chunks(...).map(|chunk| ...).collect() or fold/reduce. See RayonPatterns::reduce_lock_contention.

Configuration file

Create rayon.conf:

RAYON_NUM_THREADS=16
RAYON_STACK_SIZE=2097152

Load in code:

key_paths_iter::rayon_optimizations::RayonEnvConfig::load_from_file("rayon.conf")?;

Save current suggested config: RayonEnvConfig::save_to_file("rayon.conf")?.

Quick benchmark (parallel vs sequential)

use std::time::Instant;

let start = Instant::now();
data.par_iter().for_each(|x| expensive_work(x));
println!("Parallel: {:?}", start.elapsed());

let start = Instant::now();
data.iter().for_each(|x| expensive_work(x));
println!("Sequential: {:?}", start.elapsed());

Or use RayonProfiler::compare_parallel_vs_sequential(sequential_fn, parallel_fn, iterations) for averaged timings and speedup.

Optimal settings by workload

Workload	Threads	Stack size	Breadth-first	Chunk size
CPU-bound	All cores	2 MB	No	Medium (8×)
I/O-bound	2× cores	1 MB	Yes	Small (16×)
Memory-heavy	Half cores	4 MB	No	Large (2×)
Latency	Physical only	2 MB	Yes	Very small (32×)
Real-time	Half cores	2 MB	Yes	Adaptive

Preset pools: OptimizationGuide::data_pipeline(), web_server(), scientific_computing(), real_time(), machine_learning(). Config builder: RayonConfig::cpu_bound(), io_bound(), memory_intensive(), latency_sensitive(), physical_cores_only(), then .build().

GPU / wgpu (key-paths-iter, optional)

The key-paths-iter crate can run numeric keypaths (e.g. f32) on the GPU via wgpu and arbitrary keypaths on the CPU. Use the Kp derive and the functional API (reference-based; no unnecessary copy/clone of the root).

Dependency: add key-paths-iter with the gpu feature (and rayon if you use CPU parallelism elsewhere):

[dependencies]
rust-key-paths = "2"
key-paths-derive = "2"
key-paths-iter = { version = "0.1", features = ["gpu"] }

Numeric keypath from Kp: use [IntoNumericAKp] so the keypath’s getter is used by reference; only the numeric value (e.g. one f32) is copied into the GPU path:

use key_paths_derive::{Kp, Pkp, Akp};
use key_paths_iter::wgpu::{IntoNumericAKp, AKpRunner, AKpTier, WgpuContext};
use rust_key_paths::{AKp, KpType};

#[derive(Kp, Pkp, Akp)]
struct User { name: String, score: f32 }

let score_tier = AKpTier::Numeric(User::score().into_numeric_akp("input * 2.0 + 1.0"));
let name_tier = AKpTier::Arbitrary(AKp::new(User::name()));
let runner = AKpRunner::new(vec![score_tier, name_tier], WgpuContext::new().ok());
let results = runner.run(&user as &dyn std::any::Any);

Functional API: use [Kp::map] for a derived keypath (takes a reference, no copy of the value in the get path), e.g. User::score().map(|s: &f32| *s * 2.0).
Examples and benchmarks:
- cargo run --example kp_pkp_wgpu — Kp/Pkp + derive + wgpu runner
- cargo run --example akp_wgpu_runner — AKp-only runner
- cargo bench --bench akp_cpu_bench — sequential vs Rayon vs GPU; includes a group using IntoNumericAKp from a derived Kp

key-paths-iter 0.1.2