Module grid_reduce

Expand description

GPU-style grid-based reduction and aggregation kernels.

Provides tiled parallel reductions, segmented scans, work-group min/max/sum operations, warp-level primitives (simulated on CPU), stream compaction (filter_compact), sparse-to-dense scatter/gather, and occupancy estimation helpers. All algorithms are CPU-side mocks that mimic GPU execution semantics using Rayon.

Structs§

GridReduceStats: Aggregate statistics computed over a 3-D grid of f64 values.
Histogram: Fixed-bin histogram over a f64 slice.
RunningMinMax: Streaming min/max tracker suitable for GPU readback values.
Tile: A single work-group tile holding up to CAPACITY f64 values.
TiledReducer: Parallel tiled reducer.
TwoLevelHistogram: Two-level histogram: first pass per-tile, second pass merge.
WelfordStats: Welford online statistics: accumulates count, mean, and variance in O(1) per sample. Suitable for streaming GPU readback values.

Constants§

WARP_SIZE: Simulated warp size: number of lanes in one warp.

Functions§

atomic_scatter_add: Atomic-add scatter (simulated serially): dst[idx] += value.
blelloch_exclusive_scan: Blelloch work-efficient parallel prefix scan (exclusive, in-place).
blelloch_inclusive_scan: Blelloch inclusive scan: built on top of the exclusive scan.
blelloch_segmented_exclusive_scan: Segmented exclusive scan using a parallel Blelloch-style approach.
compact_scatter: Compact scatter: given a predicate mask, scatter src elements into a destination buffer at compacted positions.
compaction_offsets: Build a compaction offset table from a boolean mask.
convolve1d: 1-D convolution of signal with kernel (full output, length = signal+kernel-1).
correlate1d_valid: 1-D cross-correlation of signal with pattern (valid region only). Output length = signal.len() - pattern.len() + 1.
covariance_matrix: Compute the d×d (population) covariance matrix for n observations of dimension d, stored row-major in data (shape n × d).
dist_l2: L2 distance between two equal-length vectors.
dist_sq_l2: Squared L2 distance between two equal-length vectors.
estimate_occupancy: Compute the theoretical SM occupancy given resource usage.
exclusive_scan_u64: Exclusive prefix sum on a u64 slice. Returns a new vec.
filter_compact: Stream compaction: collect elements satisfying predicate into a new vec.
filter_compact_counted: Stream compaction with count: returns (compacted, n_removed).
filter_compact_indexed: Stream compaction with index tracking.
gather: Gather: collect src[indices[i\]] into a new vec.
inclusive_scan_u64: Inclusive prefix sum on a u64 slice.
matmul: Compute C = A * B where A is m × k and B is k × n (all row-major). Returns a flat m*n vector.
matrix_diagonal: Extract the diagonal of a d×d matrix stored as a flat d*d slice.
matvec: Compute y = A * x where A is m × n (row-major), x has n elements, result y has m elements.
norm_l1: L1 norm: sum of absolute values.
norm_l2: L2 (Euclidean) norm.
norm_linf: L∞ (Chebyshev) norm: maximum absolute value.
normalise_by_sum: Normalise: divide each element by the total sum.
parallel_histogram: Parallel histogram reduce: split data into n_workers chunks, compute a partial histogram per chunk (in parallel), then merge all partial histograms serially. Mirrors the GPU pattern of per-work-group private histograms followed by a reduction pass.
parallel_segmented_reduce_sum: Segmented reduce: parallel version using Rayon.
partition_stable: Partition data into two groups: (passing, failing) — stable order.
radix_sort_f64: Radix sort for f64 values (sorts by bit representation, handles sign bit).
radix_sort_pass_u64: Single-pass radix sort step: sort data by a single bit_offset-wide digit extracted at bit position bit_pos with radix radix (must be a power of two, e.g. 256 for 8-bit digits).
radix_sort_u64: Full 64-bit radix sort (8 passes of 8-bit digits).
reduce_broadcast: Reduce data to a scalar and broadcast the result back to all positions.
scatter: Scatter: write src[i] to dst[indices[i\]].
segmented_exclusive_scan: Segmented exclusive prefix scan.
segmented_inclusive_scan: Segmented inclusive prefix scan.
segmented_reduce_sum: Segmented reduce: sum within each segment, returns one value per segment.
tree_reduce_max: Work-efficient tree reduction for max.
tree_reduce_min: Work-efficient tree reduction for min.
tree_reduce_sum: Work-efficient tree reduction: sums data using a binary tree pattern.
warp_broadcast: Simulate a warp-level broadcast: every lane gets lane_val[leader].
warp_exclusive_scan: Simulate a warp-level exclusive scan.
warp_reduce_sum: Simulate a warp-level reduce-sum: all lanes get the total sum.
warp_vote_all: Simulate warp vote: all — returns true if all lanes pass pred.
warp_vote_any: Simulate warp vote: any — returns true if any lane passes pred.

Module grid_reduce

Module grid_reduce Copy item path

Structs§

Constants§

Functions§

Module grid_reduce