Module grid_reduce

GPU-style grid-based reduction and aggregation kernels.

Provides tiled parallel reductions, segmented scans, work-group min/max/sum operations, warp-level primitives (simulated on CPU), stream compaction (filter_compact), sparse-to-dense scatter/gather, and occupancy estimation helpers. All algorithms are CPU-side mocks that mimic GPU execution semantics using Rayon.
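The two-pass shape behind a tiled reduction can be sketched in plain Rust as follows. This is only an illustration under stated assumptions: `TILE` and `tiled_sum` are hypothetical names, the tile width is arbitrary, and the tiles are reduced serially here, whereas the module is described as using Rayon to process them in parallel.

```rust
/// Hypothetical tile width; the module's real tile capacity is not stated here.
const TILE: usize = 4;

/// Two-pass reduction: each tile is first reduced to a partial sum
/// (the "work-group" pass), then the partial sums are reduced to one value.
/// The tiles are independent, so the first pass could run in parallel.
fn tiled_sum(data: &[f64]) -> f64 {
    // Pass 1: one partial sum per tile; the last tile may be shorter.
    let partials: Vec<f64> = data
        .chunks(TILE)
        .map(|tile| tile.iter().sum::<f64>())
        .collect();
    // Pass 2: combine the per-tile partials.
    partials.iter().sum()
}
```

Because floating-point addition is not associative, a tiled sum can differ from a left-to-right sum in the last few bits; the two agree exactly only for inputs such as small integers.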

Structs§

GridReduceStats
Aggregate statistics computed over a 3-D grid of f64 values.
Histogram
Fixed-bin histogram over a f64 slice.
RunningMinMax
Streaming min/max tracker suitable for GPU readback values.
Tile
A single work-group tile holding up to CAPACITY f64 values.
TiledReducer
Parallel tiled reducer.
TwoLevelHistogram
Two-level histogram: first pass per-tile, second pass merge.
WelfordStats
Welford online statistics: accumulates count, mean, and variance in O(1) per sample. Suitable for streaming GPU readback values.
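For orientation, the Welford update can be sketched as below. The struct name `Welford`, its fields, and its methods are assumptions made for this illustration and are not the actual `WelfordStats` API.

```rust
/// Illustrative accumulator (hypothetical, not the module's `WelfordStats`).
#[derive(Debug, Default, Clone, Copy)]
struct Welford {
    count: u64,
    mean: f64,
    /// Running sum of squared deviations from the current mean.
    m2: f64,
}

impl Welford {
    /// O(1) update per sample; no samples are stored.
    fn push(&mut self, x: f64) {
        self.count += 1;
        let delta = x - self.mean; // deviation from the old mean
        self.mean += delta / self.count as f64;
        self.m2 += delta * (x - self.mean); // uses the updated mean
    }

    /// Population variance (divides by n); `None` before any sample arrives.
    fn population_variance(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.m2 / self.count as f64)
        }
    }
}
```

Updating `m2` with one deviation from the old mean and one from the new mean avoids the cancellation that the naive "sum of squares minus square of sum" formula suffers from.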

Constants§

WARP_SIZE
Simulated warp size: number of lanes in one warp.
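A warp-level all-reduce can be simulated on the CPU with a butterfly (XOR) exchange over a fixed-size lane array. The sketch below assumes 32 lanes, which is a common hardware warp width; the value of `WARP_SIZE` is not stated on this page, and `LANES` and `butterfly_reduce_sum` are hypothetical names.

```rust
/// Assumed lane count for this sketch; the real WARP_SIZE value is not shown here.
const LANES: usize = 32;

/// Butterfly all-reduce: in each round, every lane adds the value held by the
/// lane whose index differs in exactly one bit. After log2(LANES) rounds,
/// every lane holds the total sum.
fn butterfly_reduce_sum(lanes: [f64; LANES]) -> [f64; LANES] {
    let mut cur = lanes;
    let mut offset = LANES / 2;
    while offset > 0 {
        // Snapshot so all lanes read their partner's value from the previous
        // round, as lanes executing in lockstep would.
        let prev = cur;
        for lane in 0..LANES {
            cur[lane] = prev[lane] + prev[lane ^ offset];
        }
        offset /= 2;
    }
    cur
}
```

The snapshot copy stands in for lockstep execution: without it, later lanes in the loop would observe values already updated in the same round.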

Functions§

atomic_scatter_add
Atomic-add scatter (simulated serially): dst[idx] += value.
blelloch_exclusive_scan
Blelloch work-efficient parallel prefix scan (exclusive, in-place).
blelloch_inclusive_scan
Blelloch inclusive scan: built on top of the exclusive scan.
blelloch_segmented_exclusive_scan
Segmented exclusive scan using a parallel Blelloch-style approach.
compact_scatter
Compact scatter: given a predicate mask, scatter src elements into a destination buffer at compacted positions.
compaction_offsets
Build a compaction offset table from a boolean mask.
convolve1d
1-D convolution of signal with kernel (full output). Output length = signal.len() + kernel.len() - 1.
correlate1d_valid
1-D cross-correlation of signal with pattern (valid region only). Output length = signal.len() - pattern.len() + 1.
covariance_matrix
Compute the d×d (population) covariance matrix for n observations of dimension d, stored row-major in data (shape n × d).
dist_l2
L2 distance between two equal-length vectors.
dist_sq_l2
Squared L2 distance between two equal-length vectors.
estimate_occupancy
Compute the theoretical SM occupancy given resource usage.
exclusive_scan_u64
Exclusive prefix sum on a u64 slice. Returns a new vec.
filter_compact
Stream compaction: collect elements satisfying predicate into a new vec.
filter_compact_counted
Stream compaction with count: returns (compacted, n_removed).
filter_compact_indexed
Stream compaction with index tracking.
gather
Gather: collect src[indices[i]] into a new vec.
inclusive_scan_u64
Inclusive prefix sum on a u64 slice.
matmul
Compute C = A * B where A is m × k and B is k × n (all row-major). Returns a flat m*n vector.
matrix_diagonal
Extract the diagonal of a d×d matrix stored as a flat d*d slice.
matvec
Compute y = A * x where A is m × n (row-major), x has n elements, result y has m elements.
norm_l1
L1 norm: sum of absolute values.
norm_l2
L2 (Euclidean) norm.
norm_linf
L∞ (Chebyshev) norm: maximum absolute value.
normalise_by_sum
Normalise: divide each element by the total sum.
parallel_histogram
Parallel histogram reduce: split data into n_workers chunks, compute a partial histogram per chunk (in parallel), then merge all partial histograms serially. Mirrors the GPU pattern of per-work-group private histograms followed by a reduction pass.
parallel_segmented_reduce_sum
Segmented reduce: parallel version using Rayon.
partition_stable
Partition data into two groups: (passing, failing) — stable order.
radix_sort_f64
Radix sort for f64 values (sorts by bit representation, handles sign bit).
radix_sort_pass_u64
Single radix-sort pass: sorts data by one bit_offset-wide digit extracted at bit position bit_pos, using radix radix (which must be a power of two, e.g. 256 for 8-bit digits).
radix_sort_u64
Full 64-bit radix sort (8 passes of 8-bit digits).
reduce_broadcast
Reduce data to a scalar and broadcast the result back to all positions.
scatter
Scatter: write src[i] to dst[indices[i]].
segmented_exclusive_scan
Segmented exclusive prefix scan.
segmented_inclusive_scan
Segmented inclusive prefix scan.
segmented_reduce_sum
Segmented reduce: sum within each segment, returns one value per segment.
tree_reduce_max
Work-efficient tree reduction for max.
tree_reduce_min
Work-efficient tree reduction for min.
tree_reduce_sum
Work-efficient tree reduction: sums data using a binary tree pattern.
warp_broadcast
Simulate a warp-level broadcast: every lane gets lane_val[leader].
warp_exclusive_scan
Simulate a warp-level exclusive scan.
warp_reduce_sum
Simulate a warp-level reduce-sum: all lanes get the total sum.
warp_vote_all
Simulate warp vote: all — returns true if all lanes pass pred.
warp_vote_any
Simulate warp vote: any — returns true if any lane passes pred.
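Several of the functions above combine into the classic scan-then-scatter compaction pipeline: turn the mask into 0/1 flags, take an exclusive prefix sum to get each survivor's output slot, then scatter. The sketch below is a serial illustration under stated assumptions; `blelloch_exclusive` and `compact` are hypothetical names with hypothetical signatures, not the module's functions.

```rust
/// Exclusive prefix sum using the Blelloch up-sweep / down-sweep pattern,
/// simulated serially. The input is zero-padded to a power of two.
fn blelloch_exclusive(input: &[u64]) -> Vec<u64> {
    let n = input.len();
    if n == 0 {
        return Vec::new();
    }
    let padded = n.next_power_of_two();
    let mut buf = vec![0u64; padded];
    buf[..n].copy_from_slice(input);

    // Up-sweep: build partial sums up a binary tree.
    let mut stride = 1;
    while stride < padded {
        let step = stride * 2;
        let mut i = step - 1;
        while i < padded {
            buf[i] += buf[i - stride];
            i += step;
        }
        stride = step;
    }

    // Down-sweep: clear the root, then push prefixes back down the tree.
    buf[padded - 1] = 0;
    let mut stride = padded / 2;
    while stride > 0 {
        let step = stride * 2;
        let mut i = step - 1;
        while i < padded {
            let left = buf[i - stride];
            buf[i - stride] = buf[i];
            buf[i] += left;
            i += step;
        }
        stride /= 2;
    }
    buf.truncate(n);
    buf
}

/// Stream compaction: keep `src[i]` where `keep[i]` is true, preserving order.
fn compact<T: Copy + Default>(src: &[T], keep: &[bool]) -> Vec<T> {
    assert_eq!(src.len(), keep.len(), "mask must match source length");
    let flags: Vec<u64> = keep.iter().map(|&k| k as u64).collect();
    // offsets[i] = number of kept elements before position i.
    let offsets = blelloch_exclusive(&flags);
    let total = match (offsets.last(), flags.last()) {
        (Some(&o), Some(&f)) => (o + f) as usize,
        _ => 0,
    };
    let mut out = vec![T::default(); total];
    for i in 0..src.len() {
        if keep[i] {
            out[offsets[i] as usize] = src[i]; // scatter to the compacted slot
        }
    }
    out
}
```

Within each sweep level the inner-loop iterations touch disjoint elements, which is what makes the pattern suitable for one-thread-per-element execution on a GPU.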