GPU-style grid-based reduction and aggregation kernels.
Provides tiled parallel reductions, segmented scans, work-group min/max/sum operations, warp-level primitives (simulated on CPU), stream compaction (filter_compact), sparse-to-dense scatter/gather, and occupancy estimation helpers. All algorithms are CPU-side mocks that mimic GPU execution semantics using Rayon.
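The tiled-reduction pattern described above can be sketched roughly as follows. This is a minimal illustration, not the module's implementation: `tiled_sum` and its `tile_size` parameter are hypothetical names, and it uses `std::thread::scope` in place of Rayon so the block builds with the standard library alone.

```rust
use std::thread;

/// Hypothetical helper (not part of this module's API): sums `data` by
/// splitting it into fixed-size tiles, reducing each tile on its own scoped
/// thread, then merging the per-tile partial sums serially.
fn tiled_sum(data: &[f64], tile_size: usize) -> f64 {
    assert!(tile_size > 0, "tile_size must be non-zero");

    // First pass: one partial sum per tile, computed in parallel.
    let partials: Vec<f64> = thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(tile_size)
            .map(|tile| s.spawn(move || tile.iter().sum::<f64>()))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // Second pass: merge the partial sums (the "reduction pass").
    partials.iter().sum()
}

fn main() {
    let data: Vec<f64> = (1..=100).map(f64::from).collect();
    // 100 elements in tiles of 16 gives 7 tiles (the last one is partial).
    assert_eq!(tiled_sum(&data, 16), 5050.0);
}
```

Spawning one thread per tile keeps the sketch short; a work-stealing pool such as Rayon's would be the more realistic choice for many small tiles.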
Structs
- GridReduceStats - Aggregate statistics computed over a 3-D grid of f64 values.
- Histogram - Fixed-bin histogram over an f64 slice.
- RunningMinMax - Streaming min/max tracker suitable for GPU readback values.
- Tile - A single work-group tile holding up to `CAPACITY` `f64` values.
- TiledReducer - Parallel tiled reducer.
- TwoLevelHistogram - Two-level histogram: first pass per-tile, second pass merge.
- WelfordStats - Welford online statistics: accumulates count, mean, and variance in O(1) per sample. Suitable for streaming GPU readback values.
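For readers unfamiliar with Welford's method, the update rule behind an accumulator of this kind can be sketched as below. The type `OnlineStats`, its fields, and its method names are assumptions made for illustration; the actual `WelfordStats` API is not shown on this page and may differ.

```rust
/// Hypothetical stand-in for a Welford-style accumulator. Field and method
/// names are illustrative assumptions, not the crate's `WelfordStats` API.
#[derive(Debug, Default, Clone, Copy)]
struct OnlineStats {
    count: u64,
    mean: f64,
    /// Running sum of squared deviations from the current mean.
    m2: f64,
}

impl OnlineStats {
    /// Fold one sample into the accumulator in O(1) time and space.
    fn push(&mut self, x: f64) {
        self.count += 1;
        let delta = x - self.mean;
        self.mean += delta / self.count as f64;
        // Uses the deviation from the *updated* mean, which is what keeps
        // the recurrence numerically stable.
        self.m2 += delta * (x - self.mean);
    }

    /// Population variance, or `None` before any sample has been seen.
    fn population_variance(&self) -> Option<f64> {
        if self.count == 0 {
            None
        } else {
            Some(self.m2 / self.count as f64)
        }
    }
}

fn main() {
    let mut stats = OnlineStats::default();
    for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0] {
        stats.push(x);
    }
    // This data set has mean 5 and population variance 4.
    assert!((stats.mean - 5.0).abs() < 1e-12);
    assert!((stats.population_variance().unwrap() - 4.0).abs() < 1e-12);
}
```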
Constants
- WARP_SIZE - Simulated warp size: number of lanes in one warp.
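To illustrate how a fixed lane count drives warp-level primitives, here is a rough CPU sketch of a butterfly (XOR-shuffle) reduction. The constant `LANES = 32` is an assumption, since the value of `WARP_SIZE` is not stated here, and `butterfly_reduce_sum` is a hypothetical name rather than one of this module's functions.

```rust
/// Assumed lane count for this sketch; the crate's `WARP_SIZE` value is not
/// given on this page. Must be a power of two for the XOR pairing to work.
const LANES: usize = 32;

/// Butterfly (XOR-shuffle) reduction: after log2(LANES) rounds, every lane
/// holds the sum of all lanes.
fn butterfly_reduce_sum(mut lanes: [f64; LANES]) -> [f64; LANES] {
    let mut offset = LANES / 2;
    while offset > 0 {
        // Snapshot the "registers" so each lane reads its partner's value
        // from before this round, as a hardware shuffle would.
        let prev = lanes;
        for lane in 0..LANES {
            lanes[lane] = prev[lane] + prev[lane ^ offset];
        }
        offset /= 2;
    }
    lanes
}

fn main() {
    // Lane i starts with the value i, so the total is 0 + 1 + ... + 31 = 496.
    let input: [f64; LANES] = std::array::from_fn(|i| i as f64);
    let out = butterfly_reduce_sum(input);
    assert!(out.iter().all(|&v| v == 496.0));
}
```

The snapshot copy stands in for the lock-step execution a real warp provides; without it, later lanes in the loop would read values already updated in the same round.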
Functions
- atomic_scatter_add - Atomic-add scatter (simulated serially): `dst[idx] += value`.
- blelloch_exclusive_scan - Blelloch work-efficient parallel prefix scan (exclusive, in-place).
- blelloch_inclusive_scan - Blelloch inclusive scan: built on top of the exclusive scan.
- blelloch_segmented_exclusive_scan - Segmented exclusive scan using a parallel Blelloch-style approach.
- compact_scatter - Compact scatter: given a predicate mask, scatter `src` elements into a destination buffer at compacted positions.
- compaction_offsets - Build a compaction offset table from a boolean mask.
- convolve1d - 1-D convolution of `signal` with `kernel` (full output, length = signal + kernel - 1).
- correlate1d_valid - 1-D cross-correlation of `signal` with `pattern` (valid region only). Output length = `signal.len() - pattern.len() + 1`.
- covariance_matrix - Compute the `d×d` (population) covariance matrix for `n` observations of dimension `d`, stored row-major in `data` (shape `n × d`).
- dist_l2 - L2 distance between two equal-length vectors.
- dist_sq_l2 - Squared L2 distance between two equal-length vectors.
- estimate_occupancy - Compute the theoretical SM occupancy given resource usage.
- exclusive_scan_u64 - Exclusive prefix sum on a `u64` slice. Returns a new vec.
- filter_compact - Stream compaction: collect elements satisfying `predicate` into a new vec.
- filter_compact_counted - Stream compaction with count: returns (compacted, n_removed).
- filter_compact_indexed - Stream compaction with index tracking.
- gather - Gather: collect `src[indices[i]]` into a new vec.
- inclusive_scan_u64 - Inclusive prefix sum on a `u64` slice.
- matmul - Compute `C = A * B` where `A` is `m × k` and `B` is `k × n` (all row-major). Returns a flat `m*n` vector.
- matrix_diagonal - Extract the diagonal of a `d×d` matrix stored as a flat `d*d` slice.
- matvec - Compute `y = A * x` where `A` is `m × n` (row-major), `x` has `n` elements, result `y` has `m` elements.
- norm_l1 - L1 norm: sum of absolute values.
- norm_l2 - L2 (Euclidean) norm.
- norm_linf - L∞ (Chebyshev) norm: maximum absolute value.
- normalise_by_sum - Normalise: divide each element by the total sum.
- parallel_histogram - Parallel histogram reduce: split `data` into `n_workers` chunks, compute a partial histogram per chunk (in parallel), then merge all partial histograms serially. Mirrors the GPU pattern of per-work-group private histograms followed by a reduction pass.
- parallel_segmented_reduce_sum - Segmented reduce: parallel version using Rayon.
- partition_stable - Partition `data` into two groups (passing, failing), in stable order.
- radix_sort_f64 - Radix sort for f64 values (sorts by bit representation, handles sign bit).
- radix_sort_pass_u64 - Single-pass radix sort step: sort `data` by a single `bit_offset`-wide digit extracted at bit position `bit_pos` with radix `radix` (must be a power of two, e.g. 256 for 8-bit digits).
- radix_sort_u64 - Full 64-bit radix sort (8 passes of 8-bit digits).
- reduce_broadcast - Reduce `data` to a scalar and broadcast the result back to all positions.
- scatter - Scatter: write `src[i]` to `dst[indices[i]]`.
- segmented_exclusive_scan - Segmented exclusive prefix scan.
- segmented_inclusive_scan - Segmented inclusive prefix scan.
- segmented_reduce_sum - Segmented reduce: sum within each segment, returns one value per segment.
- tree_reduce_max - Work-efficient tree reduction for max.
- tree_reduce_min - Work-efficient tree reduction for min.
- tree_reduce_sum - Work-efficient tree reduction: sums `data` using a binary tree pattern.
- warp_broadcast - Simulate a warp-level broadcast: every lane gets `lane_val[leader]`.
- warp_exclusive_scan - Simulate a warp-level exclusive scan.
- warp_reduce_sum - Simulate a warp-level reduce-sum: all lanes get the total sum.
- warp_vote_all - Simulate warp vote `all`: returns true if all lanes pass `pred`.
- warp_vote_any - Simulate warp vote `any`: returns true if any lane passes `pred`.
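Several of the functions above (the Blelloch scans, `compaction_offsets`, `compact_scatter`) build on one idea: an exclusive prefix sum over keep-flags yields each surviving element's output position. The sketch below shows that relationship under stated assumptions. The names `exclusive_scan` and `compact` are hypothetical, the signatures are not taken from this module, and the scan runs serially while following the up-sweep/down-sweep structure of the Blelloch algorithm.

```rust
/// Blelloch-style exclusive prefix sum, executed serially. The input is
/// padded to a power of two internally. Hypothetical helper, not the
/// crate's `blelloch_exclusive_scan`.
fn exclusive_scan(input: &[usize]) -> Vec<usize> {
    let n = input.len();
    if n == 0 {
        return Vec::new();
    }
    let m = n.next_power_of_two();
    let mut buf = vec![0usize; m];
    buf[..n].copy_from_slice(input);

    // Up-sweep: build partial sums up a binary tree.
    let mut stride = 1;
    while stride < m {
        let step = stride * 2;
        for i in (step - 1..m).step_by(step) {
            buf[i] += buf[i - stride];
        }
        stride = step;
    }

    // Down-sweep: clear the root, then push prefixes back down the tree.
    buf[m - 1] = 0;
    let mut stride = m / 2;
    while stride > 0 {
        let step = stride * 2;
        for i in (step - 1..m).step_by(step) {
            let left = buf[i - stride];
            buf[i - stride] = buf[i];
            buf[i] += left;
        }
        stride /= 2;
    }

    buf.truncate(n);
    buf
}

/// Stream compaction via scan + scatter: keep `src[i]` where `mask[i]` is true.
fn compact<T: Copy + Default>(src: &[T], mask: &[bool]) -> Vec<T> {
    assert_eq!(src.len(), mask.len(), "src and mask must have equal length");
    let flags: Vec<usize> = mask.iter().map(|&keep| keep as usize).collect();
    let offsets = exclusive_scan(&flags);
    // Number of survivors = last offset + last flag.
    let kept = match (offsets.last(), flags.last()) {
        (Some(&o), Some(&f)) => o + f,
        _ => 0,
    };
    let mut dst = vec![T::default(); kept];
    for i in 0..src.len() {
        if mask[i] {
            dst[offsets[i]] = src[i]; // scatter to the compacted position
        }
    }
    dst
}

fn main() {
    assert_eq!(exclusive_scan(&[1, 0, 1, 1]), vec![0, 1, 1, 2]);
    let out = compact(&[10, 20, 30, 40, 50], &[true, false, true, true, false]);
    assert_eq!(out, vec![10, 30, 40]);
}
```

On a GPU, each inner `for` loop would run as one parallel step; the serial loops here only preserve the data-access pattern.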