Streaming, deterministic, out-of-core border-Gram accumulation (#973).
Corpus-scale joint fits cannot hold the activation row set in memory: the
Schur border Gram G = Σ_n x_n x_nᵀ (with x_n ∈ ℝ^k the row’s border
coordinates) must be accumulated over fixed-size row chunks streamed
from disk shards. Because the methodological program (replicate nulls,
resumable workflows) rests on determinism, the accumulation here is
bit-reproducible by construction, not by luck:
- The chunk partition is a pure function of (n_rows, chunk_size): chunk j covers rows [j·chunk_size, min((j+1)·chunk_size, n_rows)).
- Each within-chunk Gram entry is a pairwise_sum over the chunk’s rows (the already-landed deterministic pairwise tree of gam_linalg::pairwise_reduce).
- Cross-chunk reduction follows the same fixed pairwise tree (the StreamingPairwise cascade, applied entry-wise to whole chunk Grams): sequential base blocks of CROSS_CHUNK_BASE chunk partials, then power-of-two cascade merges. The tree shape depends only on the chunk count, never on values, device timing, or thread scheduling. A unit test pins the cross-chunk association bit-for-bit to pairwise_sum over the per-chunk entries.
- Chunks may be submitted in any order (e.g. shards finishing on different devices at different times): every chunk is keyed by its chunk index, the in-order fold frontier advances eagerly, and out-of-order arrivals wait in a pending buffer. The final Gram is a pure function of the row content alone: identical bits for any submission order.
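
The partition, the fixed summation tree, and the index-keyed reduction described in the list above can be sketched in a few lines. This is an illustration under stated assumptions, not the crate's code: `chunk_range`, `pairwise_sum_sketch`, and `reduce_keyed` are hypothetical names, the midpoint split is only one possible fixed tree (the real shape is whatever gam_linalg::pairwise_reduce defines), partials are scalars rather than whole chunk Grams, and the reducer buffers every arrival instead of advancing a frontier eagerly.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Half-open row range of chunk `j`; depends only on (n_rows, chunk_size).
fn chunk_range(j: usize, n_rows: usize, chunk_size: usize) -> Range<usize> {
    let start = (j * chunk_size).min(n_rows);
    let end = ((j + 1) * chunk_size).min(n_rows);
    start..end
}

/// Stand-in pairwise summation: split at the midpoint and recurse.
/// The association is determined by the slice length alone, never by values.
fn pairwise_sum_sketch(xs: &[f64]) -> f64 {
    match xs.len() {
        0 => 0.0,
        1 => xs[0],
        n => {
            let (lo, hi) = xs.split_at(n / 2);
            pairwise_sum_sketch(lo) + pairwise_sum_sketch(hi)
        }
    }
}

/// Toy order-independent reducer: partials are keyed by chunk index, so the
/// reduction always sees them in index order, whatever the arrival order was.
fn reduce_keyed(arrivals: &[(usize, f64)]) -> f64 {
    let by_index: BTreeMap<usize, f64> = arrivals.iter().copied().collect();
    let ordered: Vec<f64> = by_index.into_values().collect();
    pairwise_sum_sketch(&ordered)
}
```

Because floating-point addition is not associative, fixing the tree shape (rather than the arrival order) is what makes the result comparable at the bit level, for example via `f64::to_bits`.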
All accumulation buffers are f64 (the mixed-precision policy of #973: per-row kernels may run f32 upstream, but everything feeding evidence accumulates in f64 — this module exposes no f32 accumulation path at all).
The accumulation state — partial Grams (in-order fold forest + pending
out-of-order chunk partials) plus the chunk cursor — serializes to a
BorderGramCheckpoint and resumes via StreamingBorderGram::resume,
with resume-equals-straight-through guaranteed (and unit-tested) at the
bit level.
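
To make the resume-equals-straight-through property concrete, here is a toy scalar accumulator. It is a sketch under stated assumptions and does not reproduce the real BorderGramCheckpoint layout or the StreamingBorderGram API: `ToyAccumulator` and its methods are hypothetical, partials are single f64 values rather than Grams, the CROSS_CHUNK_BASE base blocks are omitted, and a plain `clone()` stands in for serialization.

```rust
use std::collections::BTreeMap;

/// Hypothetical toy state: chunk cursor, in-order fold forest, pending buffer.
#[derive(Clone)]
struct ToyAccumulator {
    next_chunk: usize,             // next chunk index the fold expects
    forest: Vec<(u32, f64)>,       // (tree level, partial sum), oldest first
    pending: BTreeMap<usize, f64>, // out-of-order arrivals, keyed by index
}

impl ToyAccumulator {
    fn new() -> Self {
        Self { next_chunk: 0, forest: Vec::new(), pending: BTreeMap::new() }
    }

    /// Accept a chunk partial in any order; fold as far as the frontier allows.
    fn submit(&mut self, index: usize, partial: f64) {
        self.pending.insert(index, partial);
        while let Some(p) = self.pending.remove(&self.next_chunk) {
            self.fold(p);
            self.next_chunk += 1;
        }
    }

    /// Binary-counter cascade: merge equal-level trees, older operand on the left.
    fn fold(&mut self, partial: f64) {
        let mut level = 0u32;
        let mut value = partial;
        while let Some(&(top_level, top_value)) = self.forest.last() {
            if top_level != level {
                break;
            }
            self.forest.pop();
            value = top_value + value;
            level += 1;
        }
        self.forest.push((level, value));
    }

    /// Collapse the remaining trees from newest to oldest.
    fn finish(&self) -> f64 {
        let mut trees = self.forest.iter().rev();
        match trees.next() {
            None => 0.0,
            Some(&(_, newest)) => trees.fold(newest, |acc, &(_, v)| v + acc),
        }
    }
}
```

In this toy, cloning the state after any accepted chunk and continuing from the clone performs the same additions in the same order as an uninterrupted run, which is why the final bits agree.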
Pure library: no SAE coupling, no flags, no environment variables. Drivers
that also need a right-hand side Σ_n x_n y_n stack the response columns
onto the border coordinates ([X | Y]) and read the cross block of the
returned Gram; per-row weights w_n are pre-scaled into the rows as
√w_n · x_n by the caller.
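
The stacking and pre-scaling convention can be illustrated as follows. This is a sketch with hypothetical helper names (`stacked_row`, `naive_gram`, `cross_block`), and it assumes the response columns are scaled by √w_n together with the border coordinates, so that the cross block equals Σ_n w_n x_n y_nᵀ. The Gram here is accumulated naively rather than pairwise, purely to show where the cross block sits.

```rust
/// Hypothetical helper: the stacked, weight-scaled row sqrt(w) * [x | y].
fn stacked_row(x: &[f64], y: &[f64], w: f64) -> Vec<f64> {
    let s = w.sqrt();
    x.iter().chain(y.iter()).map(|v| s * v).collect()
}

/// Naive (non-pairwise) Gram of equal-length rows, flattened row-major.
/// For illustration only; it makes no reproducibility guarantee.
fn naive_gram(rows: &[Vec<f64>]) -> Vec<f64> {
    let d = rows.first().map_or(0, |r| r.len());
    let mut g = vec![0.0; d * d];
    for r in rows {
        for a in 0..d {
            for b in 0..d {
                g[a * d + b] += r[a] * r[b];
            }
        }
    }
    g
}

/// Read the k x m cross block (rows 0..k, columns k..k+m) of a
/// (k+m) x (k+m) row-major Gram.
fn cross_block(g: &[f64], k: usize, m: usize) -> Vec<f64> {
    let d = k + m;
    let mut out = Vec::with_capacity(k * m);
    for a in 0..k {
        for b in 0..m {
            out.push(g[a * d + (k + b)]);
        }
    }
    out
}
```

With k border coordinates and m response columns, the top-left k×k block of the stacked Gram is the weighted border Gram and the top-right k×m block is the weighted right-hand side.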
Structs
- BorderGramCheckpoint
  Serializable accumulation state of a StreamingBorderGram: the partial Grams plus the chunk cursor. Writing this to disk after every accepted chunk makes a preempted multi-hour pass resumable instead of restartable; StreamingBorderGram::resume reconstructs the accumulator with bit-identical future behavior (resume-equals-straight-through).
- ChunkAssembler
  Bridges arbitrary-length row batches onto the fixed chunk partition.
- StreamingBorderGram
  Chunked, out-of-core, bit-reproducible border-Gram accumulator.
Constants
- CROSS_CHUNK_BASE
  Base-block size of the cross-chunk pairwise tree, in chunk partials.
Functions
- chunk_gram_flat
  Deterministic per-chunk Gram contribution, flattened k·k row-major, with k = rows.ncols(). Entry (a, b) is the pairwise_sum of x_i[a]·x_i[b] over the chunk’s rows in row order; the symmetric mirror entry reuses the same products in the same order, so the matrix is bitwise symmetric.
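
A minimal sketch of what such a per-chunk Gram computation might look like is given below. It is not the crate's implementation: `chunk_gram_flat_sketch` and `tree_sum` are hypothetical names, rows are plain `Vec<f64>` values rather than the matrix type behind rows.ncols(), the midpoint-split `tree_sum` only stands in for pairwise_sum, and the mirror entry is copied rather than recomputed (which yields the same bits, since the products and their order are identical).

```rust
/// Stand-in for a fixed-shape pairwise sum: midpoint split, then recurse.
fn tree_sum(xs: &[f64]) -> f64 {
    match xs.len() {
        0 => 0.0,
        1 => xs[0],
        n => {
            let (lo, hi) = xs.split_at(n / 2);
            tree_sum(lo) + tree_sum(hi)
        }
    }
}

/// Hypothetical per-chunk Gram: k*k entries, row-major.
/// Each row of `rows` is assumed to hold at least `k` coordinates.
fn chunk_gram_flat_sketch(rows: &[Vec<f64>], k: usize) -> Vec<f64> {
    let mut g = vec![0.0; k * k];
    let mut products: Vec<f64> = Vec::with_capacity(rows.len());
    for a in 0..k {
        for b in a..k {
            // Products x_i[a] * x_i[b], collected in row order.
            products.clear();
            products.extend(rows.iter().map(|x| x[a] * x[b]));
            let s = tree_sum(&products);
            g[a * k + b] = s;
            // Mirror entry: same products in the same order, so same bits.
            g[b * k + a] = s;
        }
    }
    g
}
```

Entry (a, b) of the result lives at index a·k + b of the returned vector.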