pub struct ArrowBorderSolvePlan {
pub n: usize,
pub k: usize,
pub d: usize,
pub cg_iters: usize,
pub data_fit_rank: usize,
pub dense_border_rank_deficient: bool,
pub dense_direct_flops: u128,
pub reduced_iterative_flops: u128,
pub recommended: ArrowBorderStrategy,
pub device_favorable: bool,
}
Cost model + recommendation for the arrow-Schur border solve, a pure function of the joint-system shape (unit-testable, no device required).
This operationalises the measured #1017 finding that the full arrow-Schur
Newton solve is dominated by the dense k × k border Cholesky. The on-device
dense Direct solve was measured at ~0.94× (a slowdown) because the k³/3
factorization, not the GPU-favourable batched per-row work, is the bottleneck
at LLM/SAE border widths. The lever the issue calls for is to shrink or
factor the dense border so that the batched n-row work dominates; the plan
makes that decision inspectable and honest.
§Flop model (deliberate, documented approximations)
- Dense Direct ≈ 2·n·d·k² (assemble the reduced Schur: per row a rank-d symmetric update H_βt (H_tt)⁻¹ H_tβ to the k × k border, ≈ 2·d·k² flops) + k³/3 (Cholesky of the dense k × k Schur).
- Reduced iterative ≈ cg_iters · n·(4·d·k + d²) (matrix-free PCG: per matvec a forward + transpose cross-block GEMV 4·d·k plus the per-row d × d solve d², summed over n row blocks, over cg_iters applies).
Both are dispatch-grade estimates, not exact operation counts; they omit preconditioner setup and lower-order terms symmetrically, so their ratio (the only thing the recommendation consumes) is meaningful while neither figure should be reused for speedup accounting.
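The two estimates can be written out directly. The following is a minimal sketch of the formulas as stated above, not the crate's implementation: `dense_direct_flops` and `reduced_iterative_flops` are hypothetical free functions named after the corresponding fields, and the truncating integer division for k³/3 is an assumption.

```rust
/// Dense Direct estimate: 2·n·d·k² (reduced-Schur assembly) + k³/3 (border Cholesky).
/// Hypothetical helper; the k³/3 term uses truncating integer division (assumption).
fn dense_direct_flops(n: usize, d: usize, k: usize) -> u128 {
    // Widen to u128 first so k³ cannot overflow at large border widths.
    let (n, d, k) = (n as u128, d as u128, k as u128);
    let assembly = 2 * n * d * k * k;
    let cholesky = k * k * k / 3;
    assembly + cholesky
}

/// Reduced iterative estimate: cg_iters · n · (4·d·k + d²).
/// Hypothetical helper mirroring the matrix-free PCG matvec count.
fn reduced_iterative_flops(n: usize, d: usize, k: usize, cg_iters: usize) -> u128 {
    let (n, d, k, iters) = (n as u128, d as u128, k as u128, cg_iters as u128);
    // Per row block: forward + transpose cross-block GEMV (4·d·k) plus the d × d solve (d²).
    let per_row = 4 * d * k + d * d;
    iters * n * per_row
}
```

For a small shape such as n = 2, d = 3, k = 6, cg_iters = 5, the dense estimate is 2·2·3·36 + 216/3 = 504 and the iterative estimate is 5·2·(72 + 9) = 810. Only the ratio of the two is meant to be consumed.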
§Status
Advisory / diagnostic. It is not wired into the live
ArrowSolverMode::automatic selector: replacing the fixed DIRECT_SOLVE_MAX_K
cut with this shape-driven crossover changes which production fits take the
Direct vs PCG path and must be validated on GPU hardware (#1017 Phase 2–4)
before it can change numerics. Today it is consumed by the honest
examples/full_color_fit_1017.rs measurement harness (modeled-vs-measured)
and by the unit tests below.
Fields§
n: usize
Number of per-row blocks (SAE observations / latent rows).
k: usize
Border β width (the SAE decoder atom count K × basis width).
d: usize
Per-row latent / active-frame depth (the M dimension).
cg_iters: usize
CG iteration budget assumed for the iterative estimate.
data_fit_rank: usize
Effective rank of the data-fit contribution to the k × k border,
bounded by Σ_i d_i ≈ n·d and never more than k.
dense_border_rank_deficient: bool
True when n·d < k: the dense k × k Cholesky spends O(k³) factorising
a border whose data information is only rank n·d. This is the pathological
wide-sparse-border regime (color arm: n·d = 360 ≪ k = 15360).
dense_direct_flops: u128
≈ 2·n·d·k² + k³/3: reduced-Schur assembly plus dense border Cholesky.
reduced_iterative_flops: u128
≈ cg_iters · n·(4·d·k + d²): matrix-free PCG matvecs.
recommended: ArrowBorderStrategy
The recommended strategy: ReducedIterative iff the dense factorization
path costs strictly more arithmetic than the iterative path at
cg_iters.
device_favorable: bool
Whether running the recommended strategy on the device is expected to
pay off. For ReducedIterative this is reduced_schur_matvec_should_offload;
for DenseDirect the device wins only when the batched per-row assembly
work (2·n·d·k², GPU-favourable batched GEMM/POTRF) at least matches the
border Cholesky (k³/3) and clears the dense flop floor. This is the honest
encoding of the measured 0.94× dense-Direct-on-device slowdown.
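The derived fields follow mechanically from the shape and the two flop estimates. The sketch below restates those rules under stated assumptions and is not the crate's code: `Strategy` is a hypothetical stand-in for `ArrowBorderStrategy`, `dense_flop_floor` is a hypothetical parameter (the real threshold is not given here), and comparing the total dense estimate against that floor is one reading of "clears the dense flop floor". The `ReducedIterative` device check is left out because it is delegated to `reduced_schur_matvec_should_offload`.

```rust
/// Hypothetical stand-in for `ArrowBorderStrategy`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Strategy {
    DenseDirect,
    ReducedIterative,
}

/// Effective data-fit rank: about n·d, never more than k.
fn data_fit_rank(n: usize, d: usize, k: usize) -> usize {
    n.saturating_mul(d).min(k)
}

/// True when n·d < k (the wide-sparse-border regime).
fn dense_border_rank_deficient(n: usize, d: usize, k: usize) -> bool {
    n.saturating_mul(d) < k
}

/// ReducedIterative iff the dense path costs strictly more arithmetic.
fn recommend(dense_direct_flops: u128, reduced_iterative_flops: u128) -> Strategy {
    if dense_direct_flops > reduced_iterative_flops {
        Strategy::ReducedIterative
    } else {
        Strategy::DenseDirect
    }
}

/// DenseDirect device check: the batched assembly work (2·n·d·k²) must at
/// least match the border Cholesky (k³/3), and the total must reach the
/// floor. `dense_flop_floor` is a hypothetical threshold (assumption).
fn dense_direct_device_favorable(n: usize, d: usize, k: usize, dense_flop_floor: u128) -> bool {
    let (n, d, k) = (n as u128, d as u128, k as u128);
    let assembly = 2 * n * d * k * k;
    let cholesky = k * k * k / 3;
    assembly >= cholesky && assembly + cholesky >= dense_flop_floor
}
```

With the color-arm figures quoted above (n·d = 360, k = 15360; the split n = 120, d = 3 is assumed for illustration), the rank is 360, the border is rank deficient, and the assembly term is far smaller than k³/3, so the DenseDirect device check fails regardless of the floor.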
Trait Implementations§
impl Clone for ArrowBorderSolvePlan
fn clone(&self) -> ArrowBorderSolvePlan
fn clone_from(&mut self, source: &Self)
Performs copy-assignment from source.
impl Copy for ArrowBorderSolvePlan
impl Debug for ArrowBorderSolvePlan
impl Eq for ArrowBorderSolvePlan
impl PartialEq for ArrowBorderSolvePlan
fn eq(&self, other: &ArrowBorderSolvePlan) -> bool
Tests for self and other values to be equal, and is used by ==.
impl StructuralPartialEq for ArrowBorderSolvePlan
Auto Trait Implementations§
impl Freeze for ArrowBorderSolvePlan
impl RefUnwindSafe for ArrowBorderSolvePlan
impl Send for ArrowBorderSolvePlan
impl Sync for ArrowBorderSolvePlan
impl Unpin for ArrowBorderSolvePlan
impl UnsafeUnpin for ArrowBorderSolvePlan
impl UnwindSafe for ArrowBorderSolvePlan
Blanket Implementations§
impl<T> Boilerplate for T
impl<T> BorrowMut<T> for T where T: ?Sized
fn borrow_mut(&mut self) -> &mut T
impl<ST, DT> CastableFrom<ST, Initialized, Initialized> for DT
impl<ST, DT> CastableFrom<ST, Uninit, Uninit> for DT
impl<T> CloneToUninit for T where T: Clone
impl<T> DistributionExt for T where T: ?Sized
impl<T, U> Imply<T> for U
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.