Module encode_throughput


Measured device-resident throughput of the SAE/LLM batched-solve COMPONENT — the resident penalized normal-equations inner solve, NOT the full exact SAE encode (see the SCOPE section below) (#1412, #988, #1017 Phase-3).

§Why this module exists

The historical throughput “decision gate” (#1412) asserted a 100_000 rows/sec/GPU deployment target without ever measuring a device. Its successor still keyed the deployment decision on a CPU measurement scaled by a hardcoded CPU_TO_GPU_SCALING = 100.0 fudge factor, so passing the gate established nothing about real GPU throughput. #988 was closed as COMPLETED even though the maintainer’s own follow-up confirmed that the GPU steady-state encode rate had never been measured.

This module makes the measurement real and testable as a library function. The prior real benchmark lived only in examples/throughput_1412.rs, which nothing in CI ran or asserted. measure_resident_solve_throughput runs the production IRLS inner step on the real device and reports the measured design-rows/sec: it uploads X once, then repeatedly solves the penalized normal equations (XᵀWX + ridge·I)β = rhs with the p×p Gram and its Cholesky factor kept DEVICE-RESIDENT, downloading only the p-vector β.
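To make the timed operation concrete, here is a minimal host-side sketch of the solve (XᵀWX + ridge·I)β = rhs using a dense Cholesky factorization. It is an illustration only, not this module's code: the function name `solve_penalized_normal_equations`, the row-major layout of X, and the `Option` return are all assumptions made for the example. The device path described above keeps the Gram and its factor resident instead of rebuilding them on the host.

```rust
/// Hypothetical helper (not part of this module): solve
/// (XᵀWX + ridge·I) β = rhs with a dense host Cholesky.
/// `x` is row-major n×p, `w` holds the n row weights.
/// Returns None on a shape mismatch or a non-positive-definite system.
fn solve_penalized_normal_equations(
    x: &[f64],
    n: usize,
    p: usize,
    w: &[f64],
    ridge: f64,
    rhs: &[f64],
) -> Option<Vec<f64>> {
    if x.len() != n * p || w.len() != n || rhs.len() != p {
        return None;
    }
    // Accumulate the lower triangle of G = XᵀWX, then add ridge on the diagonal.
    let mut g = vec![0.0_f64; p * p];
    for r in 0..n {
        let row = &x[r * p..(r + 1) * p];
        for i in 0..p {
            let wi = w[r] * row[i];
            for j in 0..=i {
                g[i * p + j] += wi * row[j];
            }
        }
    }
    for i in 0..p {
        g[i * p + i] += ridge;
    }
    // In-place Cholesky G = L·Lᵀ, with L stored in the lower triangle of `g`.
    for j in 0..p {
        let mut d = g[j * p + j];
        for k in 0..j {
            d -= g[j * p + k] * g[j * p + k];
        }
        if d <= 0.0 || !d.is_finite() {
            return None; // not positive definite (or not finite)
        }
        let ljj = d.sqrt();
        g[j * p + j] = ljj;
        for i in (j + 1)..p {
            let mut s = g[i * p + j];
            for k in 0..j {
                s -= g[i * p + k] * g[j * p + k];
            }
            g[i * p + j] = s / ljj;
        }
    }
    // Forward substitution: L·y = rhs.
    let mut beta = rhs.to_vec();
    for i in 0..p {
        let mut s = beta[i];
        for k in 0..i {
            s -= g[i * p + k] * beta[k];
        }
        beta[i] = s / g[i * p + i];
    }
    // Back substitution: Lᵀ·β = y.
    for i in (0..p).rev() {
        let mut s = beta[i];
        for k in (i + 1)..p {
            s -= g[k * p + i] * beta[k];
        }
        beta[i] = s / g[i * p + i];
    }
    Some(beta)
}
```

For example, with rows (1,0), (0,1), (1,1), unit weights, ridge = 1 and rhs = (4,4), the system matrix is [[3,1],[1,3]] and the solution is β = (1,1).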

§SCOPE — this is a COMPONENT benchmark, not the full exact SAE encode

What is timed here is the resident penalized normal-equations inner solve (XᵀWX + ridge·I)β = rhs ONLY. That is one component of the SAE encode, NOT the full exact per-row SAE encode, and the measured rate is therefore NOT evidence for a “batched exact per-row GPU encode” title claim. The full exact encode would additionally require, per row: active-set routing (which atoms are live), the per-row latent-coordinate Newton refinement on the manifold, the assignment/gate (softmax/IBP) solve, and the certificate/fallback + reconstruction-validation path. None of those are exercised or timed by this function. Establishing the end-to-end encode-throughput claim requires a separate benchmark that times the production encode path itself (routing + latent-coordinate Newton + assignment/gate solve + fallback/certificate), not this inner-solve cell. Treat the number below strictly as the resident normal-equations inner-solve throughput.

§Fail-loud, never false-route

The single recurring failure mode this guards against is false GPU routing: claiming a device measurement while the work silently ran on the CPU. ResidentSolveThroughput::engaged is true only when ResidentDesignGram::try_new actually staged X on the device AND every timed solve returned a device result. If the device path declines or fails mid-measurement, engaged is false and measured_rows_per_sec is left at 0.0 — a non-measurement that GpuThroughputVerdict can never report as meeting the target. There is no CPU fallback inside the measurement: a caller that wants the CPU oracle runs it separately for parity.
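The fail-loud rule above can be sketched as follows. Only the field names `engaged` and `measured_rows_per_sec` come from this documentation; the struct `SolveThroughputSketch`, the functions `summarize` and `meets_target`, the per-repetition `Option<f64>` timing representation, and the rate formula (rows × reps / total seconds) are assumptions made for illustration and may differ from the real ResidentSolveThroughput and GpuThroughputVerdict.

```rust
/// Hypothetical stand-in for the module's result type. Only the two field
/// names are taken from the documentation; everything else is illustrative.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SolveThroughputSketch {
    engaged: bool,
    measured_rows_per_sec: f64,
}

/// Fold per-repetition outcomes into one result.
/// Each entry is Some(seconds) for a solve that returned a device result,
/// or None if the device declined or failed on that repetition.
fn summarize(
    staged_on_device: bool,
    n_rows: usize,
    rep_seconds: &[Option<f64>],
) -> SolveThroughputSketch {
    let not_measured = SolveThroughputSketch {
        engaged: false,
        measured_rows_per_sec: 0.0,
    };
    if !staged_on_device || rep_seconds.is_empty() {
        return not_measured;
    }
    let mut total = 0.0_f64;
    for rep in rep_seconds {
        match rep {
            Some(s) if s.is_finite() && *s > 0.0 => total += *s,
            // A single non-device (or invalid) repetition voids the whole
            // measurement; there is no partial credit and no CPU fallback.
            _ => return not_measured,
        }
    }
    // Assumed rate definition: design rows processed per wall-clock second.
    let rows = (n_rows * rep_seconds.len()) as f64;
    SolveThroughputSketch {
        engaged: true,
        measured_rows_per_sec: rows / total,
    }
}

/// A non-engaged result can never meet any target, even a target of zero.
fn meets_target(result: &SolveThroughputSketch, target_rows_per_sec: f64) -> bool {
    result.engaged && result.measured_rows_per_sec >= target_rows_per_sec
}
```

Checking `engaged` before comparing rates is the design point: a 0.0 rate on its own would still pass a target of zero, whereas the flag makes a non-measurement unambiguous.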

Structs§

EncodeQualityMetrics: Correctness of an encode result, measured against the production CPU encode (a per-row reference) and the reconstruction it implies.
EncodeShape: A representative LLM/SAE batched-solve work cell: n design rows, p-wide decoder border. (d, the per-atom reduced-Schur block size, is fixed by the term and does not enter the resident-solve throughput.)
FullEncodeThroughput: End-to-end throughput of the FULL exact per-row encode for one batch.
ResidentSolveThroughput: Outcome of measuring the device-resident penalized-solve throughput for one EncodeShape.

Constants§

CANONICAL_ENCODE_SHAPES: The canonical qwen/olmo-scale SAE residual-block shapes (matches the examples/throughput_1412.rs workload so the library measurement and the example agree).
DEPLOYMENT_TARGET_ROWS_PER_SEC: The deployment target, re-exported so callers measuring throughput do not have to import the policy module directly.

Functions§

cpu_oracle_normal_equations_solve: CPU oracle for the same penalized normal-equations solve, used for parity: (XᵀWX + ridge·I)β = rhs solved by a host Cholesky. This is the definition of truth the device solve must match (up to IEEE-754 reduction order).
encode_quality_metrics: Compute EncodeQualityMetrics for an encode result.
measure_resident_solve_throughput: Measure the device-resident penalized-normal-equations solve throughput for one shape: upload X once, then time reps solves that cross only w (H2D), rhs (H2D, fixed), and β (D2H) — the production IRLS inner step.
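The parity requirement on cpu_oracle_normal_equations_solve (the device β must match the host β up to reduction order) implies a tolerance-based comparison rather than bitwise equality. A minimal sketch of such a check is below; the function names `scaled_max_error` and `within_parity`, the max-norm scaling, and any particular tolerance value are assumptions for illustration, not this module's actual parity test.

```rust
/// Hypothetical parity metric: largest absolute difference between the two
/// vectors, scaled by the largest magnitude in the reference.
/// Returns None for mismatched or empty inputs, or if any entry is not finite.
fn scaled_max_error(reference: &[f64], candidate: &[f64]) -> Option<f64> {
    if reference.len() != candidate.len() || reference.is_empty() {
        return None;
    }
    let mut scale = 0.0_f64;
    let mut worst = 0.0_f64;
    for (r, c) in reference.iter().zip(candidate) {
        if !r.is_finite() || !c.is_finite() {
            return None;
        }
        scale = scale.max(r.abs());
        worst = worst.max((r - c).abs());
    }
    // Guard the division when the reference is identically zero.
    Some(worst / scale.max(f64::MIN_POSITIVE))
}

/// True only when a valid error was computed and it is within `tol`.
fn within_parity(reference: &[f64], candidate: &[f64], tol: f64) -> bool {
    matches!(scaled_max_error(reference, candidate), Some(e) if e <= tol)
}
```

A caller would pass the host oracle's β as `reference` and the downloaded device β as `candidate`; a NaN or a length mismatch counts as a parity failure rather than being silently ignored.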