Measured device-resident encode throughput for the SAE/LLM batched-solve shape (#1412, #988, #1017 Phase-3).
§Why this module exists
The historical throughput “decision gate” (#1412) asserted a 100_000
rows/sec/GPU deployment target without ever measuring a device. Its
successor still keyed the deployment decision on a CPU measurement scaled
by a hardcoded CPU_TO_GPU_SCALING = 100.0 fudge factor — so passing the
gate established nothing about real GPU throughput. #988 was closed as
COMPLETED even though the maintainer’s own follow-up confirmed that the GPU
steady-state encode rate had never been measured.
This module makes the measurement real and testable as a library function
(the prior real benchmark lived only in examples/throughput_1412.rs, which
nothing in CI ran or asserted). measure_resident_solve_throughput runs
the production IRLS inner step — upload X once, then repeatedly solve the
penalized normal equations (XᵀWX + ridge·I)β = rhs with the p×p Gram and
its Cholesky factor kept DEVICE-RESIDENT, downloading only the p-vector
β — on the real device, and reports the measured design-rows/sec.
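To make the linear algebra concrete, here is a minimal host-only sketch of one such solve: form the penalized Gram XᵀWX + ridge·I, factor it by Cholesky, and back-substitute for β. The function name `solve_penalized_normal_equations`, the row-major layout, and the dense loops are assumptions made for illustration. This is not the module's device code path, which keeps the Gram and its factor on the GPU.

```rust
/// Solve (XᵀWX + ridge·I)β = rhs on the host with a dense Cholesky factorization.
/// `x` is row-major n×p, `w` holds n weights, `rhs` has length p.
/// Returns None when the penalized Gram is not positive definite.
/// Hypothetical helper for illustration only.
fn solve_penalized_normal_equations(
    x: &[f64],
    n: usize,
    p: usize,
    w: &[f64],
    ridge: f64,
    rhs: &[f64],
) -> Option<Vec<f64>> {
    assert_eq!(x.len(), n * p);
    assert_eq!(w.len(), n);
    assert_eq!(rhs.len(), p);

    // Lower triangle of G = XᵀWX + ridge·I, stored row-major in a p×p buffer.
    let mut g = vec![0.0_f64; p * p];
    for i in 0..n {
        let row = &x[i * p..(i + 1) * p];
        for a in 0..p {
            let wa = w[i] * row[a];
            for b in 0..=a {
                g[a * p + b] += wa * row[b];
            }
        }
    }
    for a in 0..p {
        g[a * p + a] += ridge;
    }

    // In-place Cholesky G = LLᵀ, overwriting the lower triangle with L.
    for j in 0..p {
        let mut d = g[j * p + j];
        for k in 0..j {
            d -= g[j * p + k] * g[j * p + k];
        }
        if !(d > 0.0) {
            return None; // not positive definite (also catches NaN)
        }
        let ljj = d.sqrt();
        g[j * p + j] = ljj;
        for i in (j + 1)..p {
            let mut s = g[i * p + j];
            for k in 0..j {
                s -= g[i * p + k] * g[j * p + k];
            }
            g[i * p + j] = s / ljj;
        }
    }

    // Forward substitution: L y = rhs.
    let mut beta = rhs.to_vec();
    for i in 0..p {
        for k in 0..i {
            beta[i] -= g[i * p + k] * beta[k];
        }
        beta[i] /= g[i * p + i];
    }
    // Backward substitution: Lᵀ β = y.
    for i in (0..p).rev() {
        for k in (i + 1)..p {
            beta[i] -= g[k * p + i] * beta[k];
        }
        beta[i] /= g[i * p + i];
    }
    Some(beta)
}
```

Only the weights change between IRLS iterations, which is why the resident variant can upload X once and move just w, rhs, and β across the bus per solve.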
§Fail-loud, never false-route
The single recurring failure mode this guards against is false GPU
routing: claiming a device measurement while the work silently ran on the
CPU. ResidentSolveThroughput::engaged is true only when
ResidentDesignGram::try_new actually staged X on the device AND every
timed solve returned a device result. If the device path declines or fails
mid-measurement, engaged is false and measured_rows_per_sec is left at
0.0 — a non-measurement that GpuThroughputVerdict can never report as
meeting the target. There is no CPU fallback inside the measurement: a
caller that wants the CPU oracle runs it separately for parity.
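The fail-loud contract can be sketched as follows. `ThroughputSketch`, `measure_sketch`, and `meets_target` are hypothetical stand-ins, not the real ResidentSolveThroughput or GpuThroughputVerdict types, and the rate formula (design rows times repetitions, divided by elapsed seconds) is an assumption about how design-rows/sec is computed. The closure returns None whenever the device path declines or fails.

```rust
use std::time::Instant;

/// Hypothetical stand-in for the measurement outcome described above.
#[derive(Debug, Clone, PartialEq)]
struct ThroughputSketch {
    engaged: bool,
    measured_rows_per_sec: f64,
}

impl ThroughputSketch {
    /// A non-measurement: the device path declined or failed.
    fn not_engaged() -> Self {
        Self { engaged: false, measured_rows_per_sec: 0.0 }
    }

    /// Only an engaged measurement at or above the target can pass.
    fn meets_target(&self, target_rows_per_sec: f64) -> bool {
        self.engaged && self.measured_rows_per_sec >= target_rows_per_sec
    }
}

/// Time `reps` solves of an `n`-row design. `device_solve` returns None when
/// the device declines or fails; there is deliberately no CPU fallback here.
fn measure_sketch<F>(n: usize, reps: usize, mut device_solve: F) -> ThroughputSketch
where
    F: FnMut() -> Option<Vec<f64>>,
{
    if reps == 0 {
        return ThroughputSketch::not_engaged();
    }
    let start = Instant::now();
    for _ in 0..reps {
        if device_solve().is_none() {
            // One failed solve invalidates the whole measurement.
            return ThroughputSketch::not_engaged();
        }
    }
    let secs = start.elapsed().as_secs_f64();
    if secs > 0.0 {
        ThroughputSketch {
            engaged: true,
            // Assumed definition: design rows processed per second.
            measured_rows_per_sec: (n as f64) * (reps as f64) / secs,
        }
    } else {
        // No measurable elapsed time means no rate can be reported.
        ThroughputSketch::not_engaged()
    }
}
```

Because `meets_target` requires `engaged`, a failed measurement cannot pass even against a target of 0.0, so a silent CPU run can never be reported as a device result.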
Structs§
- EncodeShape - A representative LLM/SAE batched-solve work cell: n design rows, p wide decoder border. (d, the per-atom reduced-Schur block size, is fixed by the term and does not enter the resident-solve throughput.)
- ResidentSolveThroughput - Outcome of measuring the device-resident penalized-solve throughput for one EncodeShape.
Constants§
- CANONICAL_ENCODE_SHAPES - The canonical qwen/olmo-scale SAE residual-block shapes (matches the examples/throughput_1412.rs workload so the library measurement and the example agree).
- DEPLOYMENT_TARGET_ROWS_PER_SEC - The deployment target, re-exported so callers measuring throughput do not have to import the policy module directly.
Functions§
- cpu_oracle_normal_equations_solve - CPU oracle for the same penalized normal-equations solve, used for parity: (XᵀWX + ridge·I)β = rhs solved by a host Cholesky. This is the definition of truth the device solve must match (up to IEEE-754 reduction order).
- measure_resident_solve_throughput - Measure the device-resident penalized-normal-equations solve throughput for one shape: upload X once, then time reps solves that cross only w (H2D), rhs (H2D, fixed), and β (D2H), which is the production IRLS inner step.
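Parity "up to IEEE-754 reduction order" means the device and host β vectors are compared with a tolerance, not bit for bit, because the two sides may sum floating-point terms in different orders. A minimal sketch of such a check follows. The helper names `max_relative_error` and `within_parity`, and whatever tolerance a caller passes, are assumptions and not part of this module's API.

```rust
/// Largest elementwise difference between `device` and `oracle`, scaled by the
/// oracle's largest magnitude. Returns None on a length mismatch or when any
/// difference is NaN or infinite. Hypothetical helper for illustration only.
fn max_relative_error(device: &[f64], oracle: &[f64]) -> Option<f64> {
    if device.len() != oracle.len() {
        return None;
    }
    // Guard against division by zero when the oracle is all zeros.
    let scale = oracle
        .iter()
        .fold(0.0_f64, |m, v| m.max(v.abs()))
        .max(f64::MIN_POSITIVE);
    let mut worst = 0.0_f64;
    for (d, o) in device.iter().zip(oracle) {
        let diff = (d - o).abs();
        if !diff.is_finite() {
            return None; // a NaN or infinite entry is never parity
        }
        worst = worst.max(diff / scale);
    }
    Some(worst)
}

/// True when the device result matches the oracle within `rel_tol`.
fn within_parity(device: &[f64], oracle: &[f64], rel_tol: f64) -> bool {
    matches!(max_relative_error(device, oracle), Some(e) if e <= rel_tol)
}
```

Scaling by the oracle's largest entry keeps near-zero coefficients from dominating the comparison. A suitable `rel_tol` depends on the conditioning of the penalized Gram and is left to the caller.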