Module encode_throughput

Expand description

Measured device-resident encode throughput for the SAE/LLM batched-solve shape (#1412, #988, #1017 Phase-3).

§Why this module exists

The historical throughput “decision gate” (#1412) asserted a 100_000 rows/sec/GPU deployment target without ever measuring a device. Its successor still keyed the deployment decision on a CPU measurement scaled by a hardcoded CPU_TO_GPU_SCALING = 100.0 fudge factor — so passing the gate established nothing about real GPU throughput. #988 closed COMPLETED while the maintainer’s own follow-up confirmed the GPU steady-state encode rate had never been measured.

This module makes the measurement real and testable as a library function (the prior real benchmark lived only in examples/throughput_1412.rs, which nothing in CI ran or asserted). measure_resident_solve_throughput runs the production IRLS inner step — upload X once, then repeatedly solve the penalized normal equations (XᵀWX + ridge·I)β = rhs with the p×p Gram and its Cholesky factor kept DEVICE-RESIDENT, downloading only the p-vector β — on the real device, and reports the measured design-rows/sec.

§Fail-loud, never false-route

The single recurring failure mode this guards against is false GPU routing: claiming a device measurement while the work silently ran on the CPU. ResidentSolveThroughput::engaged is true only when ResidentDesignGram::try_new actually staged X on the device AND every timed solve returned a device result. If the device path declines or fails mid-measurement, engaged is false and measured_rows_per_sec is left at 0.0 — a non-measurement that GpuThroughputVerdict can never report as meeting the target. There is no CPU fallback inside the measurement: a caller that wants the CPU oracle runs it separately for parity.

Structs§

EncodeShape: A representative LLM/SAE batched-solve work cell: n design rows, p wide decoder border. (d, the per-atom reduced-Schur block size, is fixed by the term and does not enter the resident-solve throughput.)
ResidentSolveThroughput: Outcome of measuring the device-resident penalized-solve throughput for one EncodeShape.

Constants§

CANONICAL_ENCODE_SHAPES: The canonical qwen/olmo-scale SAE residual-block shapes (matches the examples/throughput_1412.rs workload so the library measurement and the example agree).
DEPLOYMENT_TARGET_ROWS_PER_SEC: The deployment target, re-exported so callers measuring throughput do not have to import the policy module directly.

Functions§

cpu_oracle_normal_equations_solve: CPU oracle for the same penalized normal-equations solve, used for parity: (XᵀWX + ridge·I)β = rhs solved by a host Cholesky. This is the definition of truth the device solve must match (up to IEEE-754 reduction order).
measure_resident_solve_throughput: Measure the device-resident penalized-normal-equations solve throughput for one shape: upload X once, then time reps solves that cross only w (H2D), rhs (H2D, fixed), and β (D2H) — the production IRLS inner step.