Module streaming_border

Streaming, deterministic, out-of-core border-Gram accumulation (#973).

Corpus-scale joint fits cannot hold the activation row set in memory: the Schur border Gram G = Σ_n x_n x_nᵀ (with x_n ∈ ℝ^k the row’s border coordinates) must be accumulated over fixed-size row chunks streamed from disk shards. Because the methodological program (replicate nulls, resumable workflows) rests on determinism, the accumulation here is bit-reproducible by construction, not by luck:

1. The chunk partition is a pure function of (n_rows, chunk_size): chunk j covers rows [j·chunk_size, min((j+1)·chunk_size, n_rows)).
2. Each within-chunk Gram entry is a pairwise_sum over the chunk’s rows (the already-landed deterministic pairwise tree of gam_linalg::pairwise_reduce).
3. Cross-chunk reduction follows the same fixed pairwise tree (the StreamingPairwise cascade, applied entry-wise to whole chunk Grams): sequential base blocks of CROSS_CHUNK_BASE chunk partials, then power-of-two cascade merges. The tree shape depends only on the chunk count, never on values, device timing, or thread scheduling. A unit test pins the cross-chunk association bit-for-bit to pairwise_sum over the per-chunk entries.
4. Chunks may be submitted in any order (e.g. shards finishing on different devices at different times): every chunk is keyed by its chunk index, the in-order fold frontier advances eagerly, and out-of-order arrivals wait in a pending buffer. The final Gram is a pure function of the row content alone: identical bits for any submission order.
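The partition rule and the order-insensitive fold described above can be sketched in a few lines. This is a toy model under stated assumptions, not this module's implementation: `chunk_range`, `n_chunks`, `ToyAccumulator`, and the midpoint-split `pairwise_sum` are hypothetical names written for illustration, the chunk partials are scalars rather than whole Grams, and the in-order partials are kept in a plain `Vec` instead of a base-block cascade.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Number of chunks in the partition (ceiling division; `chunk_size` must be > 0).
fn n_chunks(n_rows: usize, chunk_size: usize) -> usize {
    (n_rows + chunk_size - 1) / chunk_size
}

/// Half-open row range of chunk `j`: depends only on (n_rows, chunk_size).
fn chunk_range(j: usize, n_rows: usize, chunk_size: usize) -> Range<usize> {
    (j * chunk_size).min(n_rows)..((j + 1) * chunk_size).min(n_rows)
}

/// Toy fixed-shape pairwise sum: split at the midpoint and recurse.
/// The association depends only on `xs.len()`, never on the values.
fn pairwise_sum(xs: &[f64]) -> f64 {
    match xs.len() {
        0 => 0.0,
        1 => xs[0],
        n => {
            let (lo, hi) = xs.split_at(n / 2);
            pairwise_sum(lo) + pairwise_sum(hi)
        }
    }
}

/// Toy accumulator: chunk partials may arrive in any order.
struct ToyAccumulator {
    next: usize,                   // in-order frontier: next chunk index to fold
    folded: Vec<f64>,              // partials accepted so far, in chunk-index order
    pending: BTreeMap<usize, f64>, // out-of-order arrivals waiting for the frontier
}

impl ToyAccumulator {
    fn new() -> Self {
        Self { next: 0, folded: Vec::new(), pending: BTreeMap::new() }
    }

    /// Park the partial under its chunk index, then advance the frontier
    /// as far as the pending buffer allows.
    fn submit(&mut self, chunk_index: usize, partial: f64) {
        self.pending.insert(chunk_index, partial);
        while let Some(p) = self.pending.remove(&self.next) {
            self.folded.push(p);
            self.next += 1;
        }
    }

    /// Reduce the in-order partials with the fixed pairwise tree.
    fn finish(&self) -> f64 {
        pairwise_sum(&self.folded)
    }
}
```

Because the partials are always folded in chunk-index order, two accumulators fed the same chunks in different orders finish with the same bit pattern, which can be checked with `f64::to_bits`.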

All accumulation buffers are f64. This follows the mixed-precision policy of #973: per-row kernels may run in f32 upstream, but everything feeding evidence accumulates in f64, and this module exposes no f32 accumulation path at all.
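A small illustration of why the accumulator precision matters, assuming rows arrive as f32 from an upstream kernel. Both helper functions are hypothetical and exist only for this comparison; the f32 variant shows the failure mode, not anything this module offers.

```rust
/// Sum of squares with each f32 input widened to f64 before multiplying
/// and accumulating (the policy described above).
fn sum_sq_f64(xs: &[f32]) -> f64 {
    xs.iter()
        .map(|&x| {
            let x = f64::from(x);
            x * x
        })
        .sum()
}

/// Same reduction carried out entirely in f32, for contrast only.
fn sum_sq_f32(xs: &[f32]) -> f32 {
    xs.iter().map(|&x| x * x).sum()
}
```

For the input `[4096.0, 1.0]` the first term is 2^24, the point past which f32 can no longer represent every integer. The f32 accumulator therefore drops the second row's contribution entirely, while the f64 accumulator keeps it.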

The accumulation state — partial Grams (in-order fold forest + pending out-of-order chunk partials) plus the chunk cursor — serializes to a BorderGramCheckpoint and resumes via StreamingBorderGram::resume, with resume-equals-straight-through guaranteed (and unit-tested) at the bit level.
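The resume-equals-straight-through property can be illustrated with a toy checkpoint. Everything below is an assumption made for the sketch: `ToyCheckpoint`, its byte layout, and `accept` are hypothetical and do not reflect the real `BorderGramCheckpoint` format or the `StreamingBorderGram::resume` signature. The point it shows is that storing the raw IEEE-754 bit patterns (rather than, say, decimal text) lets a resumed run continue from exactly the same state.

```rust
/// Toy checkpoint: the chunk cursor plus the scalar partials accepted so far.
#[derive(Clone, Debug, PartialEq)]
struct ToyCheckpoint {
    next_chunk: u64,
    partials: Vec<f64>,
}

impl ToyCheckpoint {
    /// Accept the next in-order chunk partial and advance the cursor.
    fn accept(&mut self, partial: f64) {
        self.partials.push(partial);
        self.next_chunk += 1;
    }

    /// Encode as little-endian 8-byte words: cursor first, then each partial's
    /// exact bit pattern, so no value is rounded on the way to disk.
    fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(8 + 8 * self.partials.len());
        out.extend_from_slice(&self.next_chunk.to_le_bytes());
        for p in &self.partials {
            out.extend_from_slice(&p.to_le_bytes());
        }
        out
    }

    /// Decode; returns None unless the input is a non-empty run of 8-byte words.
    fn from_bytes(bytes: &[u8]) -> Option<Self> {
        if bytes.len() < 8 || bytes.len() % 8 != 0 {
            return None;
        }
        let mut words = bytes.chunks_exact(8).map(|c| {
            let mut w = [0u8; 8];
            w.copy_from_slice(c);
            w
        });
        let next_chunk = u64::from_le_bytes(words.next()?);
        let partials = words.map(f64::from_le_bytes).collect();
        Some(Self { next_chunk, partials })
    }
}

/// Toy fixed-shape pairwise sum used as the final reduction.
fn pairwise_sum(xs: &[f64]) -> f64 {
    match xs.len() {
        0 => 0.0,
        1 => xs[0],
        n => {
            let (lo, hi) = xs.split_at(n / 2);
            pairwise_sum(lo) + pairwise_sum(hi)
        }
    }
}
```

A run that checkpoints after some chunks, round-trips through `to_bytes` and `from_bytes`, and then accepts the remaining chunks ends with the same partials, and so the same reduced bits, as a run that never stopped.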

Pure library: no SAE coupling, no flags, no environment variables. Drivers that also need a right-hand side Σ_n x_n y_n stack the response columns onto the border coordinates ([X | Y]) and read the cross block of the returned Gram; per-row weights w_n are pre-scaled into the rows as √w_n · x_n by the caller.
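The stacking and pre-scaling conventions can be made concrete with a minimal sketch. The helpers `stacked_row`, `gram_flat`, and `cross_block` are hypothetical names for illustration, and the Gram here is a naive left-to-right sum rather than the deterministic pairwise reduction this module uses.

```rust
/// Build one stacked, weight-scaled row: sqrt(w) * [x | y].
fn stacked_row(x: &[f64], y: &[f64], w: f64) -> Vec<f64> {
    let s = w.sqrt();
    x.iter().chain(y.iter()).map(|v| s * v).collect()
}

/// Naive row-major k-by-k Gram of the given rows (toy reduction order).
fn gram_flat(rows: &[Vec<f64>], k: usize) -> Vec<f64> {
    let mut g = vec![0.0; k * k];
    for r in rows {
        for a in 0..k {
            for b in 0..k {
                g[a * k + b] += r[a] * r[b];
            }
        }
    }
    g
}

/// Read the kx-by-ky cross block (rows 0..kx, columns kx..kx+ky) of the
/// stacked Gram; with scaled rows this is the weighted sum of x_n y_n^T.
fn cross_block(g: &[f64], kx: usize, ky: usize) -> Vec<f64> {
    let k = kx + ky;
    let mut out = Vec::with_capacity(kx * ky);
    for a in 0..kx {
        for b in 0..ky {
            out.push(g[a * k + kx + b]);
        }
    }
    out
}
```

Scaling each row by √w_n makes every product in the Gram carry a factor of w_n, so the top-left block is the weighted border Gram and the cross block is the weighted right-hand side, both from a single accumulation pass.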

Structs§

BorderGramCheckpoint: Serializable accumulation state of a StreamingBorderGram: the partial Grams plus the chunk cursor. Writing this to disk after every accepted chunk lets a preempted multi-hour pass resume from the last accepted chunk instead of restarting from scratch; StreamingBorderGram::resume reconstructs the accumulator with bit-identical future behavior (resume-equals-straight-through).
ChunkAssembler: Bridges arbitrary-length row batches onto the fixed chunk partition.
StreamingBorderGram: Chunked, out-of-core, bit-reproducible border-Gram accumulator.

Constants§

CROSS_CHUNK_BASE: Base-block size of the cross-chunk pairwise tree, in chunk partials.

Functions§

chunk_gram_flat: Deterministic per-chunk Gram contribution, flattened k·k row-major, with k = rows.ncols(). Entry (a, b) is the pairwise_sum of x_i[a]·x_i[b] over the chunk’s rows in row order; the symmetric mirror entry reuses the same products in the same order, so the matrix is bitwise symmetric.
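A rough sketch of the per-chunk contribution described for chunk_gram_flat, under simplifying assumptions: `toy_chunk_gram_flat` is a hypothetical stand-in that takes rows as slices of `Vec<f64>` with an explicit `k` (the real function reads `k` from `rows.ncols()`), and the midpoint-split `pairwise_sum` below is a toy, not the tree from gam_linalg::pairwise_reduce.

```rust
/// Toy fixed-shape pairwise sum: the association depends only on the length.
fn pairwise_sum(xs: &[f64]) -> f64 {
    match xs.len() {
        0 => 0.0,
        1 => xs[0],
        n => {
            let (lo, hi) = xs.split_at(n / 2);
            pairwise_sum(lo) + pairwise_sum(hi)
        }
    }
}

/// Per-chunk Gram, flattened k*k row-major. Entry (a, b) reduces the
/// products x_i[a] * x_i[b] in row order with the fixed pairwise tree.
fn toy_chunk_gram_flat(rows: &[Vec<f64>], k: usize) -> Vec<f64> {
    let mut g = vec![0.0; k * k];
    for a in 0..k {
        for b in a..k {
            let products: Vec<f64> = rows.iter().map(|x| x[a] * x[b]).collect();
            let s = pairwise_sum(&products);
            g[a * k + b] = s;
            // Mirror entry: the same products reduced in the same order give
            // the same bits, so this sketch copies the value directly.
            g[b * k + a] = s;
        }
    }
    g
}
```

Copying the upper-triangle value into the mirror slot is one way to get bitwise symmetry; recomputing it from the same products in the same order, as the description states, yields the same result because IEEE-754 multiplication is commutative.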