//! Dense linear algebra op family — Phase 6 (Category Linalg).
//!
//! Wraps cuSOLVER's dense API plus a few bespoke kernels for batched-QR
//! variants that cuSOLVER does not surface. The family covers:
//!
//! ### Factorizations
//! - [`CholeskyPlan`] — `A = L · L^T` (SPD), non-batched + batched.
//! - [`LuPlan`] — `P · A = L · U` (partial pivoting); non-batched today
//! (`batch_size == 1` only: cuSOLVER's dense `getrf` is non-batched, and
//! batched LU via cuBLAS is a deferred follow-up).
//! - [`QrPlan`] — `A = Q · R`; 2-D only (cuSOLVER has no batched `geqrf`).
//! - [`BatchedQrPlan`] — batched-QR via **cuBLAS** `geqrfBatched`,
//! packed output, `f32` / `f64` / `Complex32` / `Complex64`.
//! - [`BatchedQrMaterializePlan`] — bespoke kernel that unpacks
//! [`BatchedQrPlan`]'s output into dense `Q [B, M, M]` + `R [B, K, N]`.
//! - [`SvdPlan`] — `A = U · diag(S) · V^T`. 2-D only (`gesvd`,
//! bidiag-QR). `full_matrices` toggles full vs thin shapes.
//! - [`BatchedSvdPlan`] — Jacobi-batched (`gesvdjBatched`), square-only.
//! - [`BatchedSvdaPlan`] — rectangular approximate-SVD
//! (`gesvdaStridedBatched`) with rank-truncation.
//!
//! ### Eigendecompositions
//! - [`EighPlan`] — `A · v = λ · v` (symmetric / Hermitian), real eigvals.
//! - [`EigPlan`] — general non-symmetric `A · v = λ · v` via `Xgeev`;
//! real input → real packed eigvals (`wr` / `wi`), complex input →
//! complex eigvals.
//!
//! ### Solvers / inverse / least-squares
//! - [`SolvePlan`] — `A · X = B` via `getrf` + `getrs`.
//! - [`InversePlan`] — `A^{-1}` via `getrf` + `getrs` over identity RHS.
//! - [`LstSqPlan`] — `min ‖A·x - b‖²` via `_gels` (iterative refinement)
//! with an optional QR (`geqrf` + `ormqr` + `trsm`) fallback.
//!
//! ### Householder application
//! - [`BatchedOrmqrPlan`] — reflector-by-reflector apply (GEMV-rates;
//! wins for tiny matrices). Real `op ∈ {N, T}`, complex `op ∈ {N, C}`.
//! `side ∈ {Left, Right}`.
//! - [`BatchedOrmqrWyPlan`] — WY-blocked apply via cuBLAS strided-batched
//! GEMM (GEMM-rates; wins for `M, N > ~16`). `side = Left` only.
//! Real `op ∈ {N, T}`, complex `op ∈ {N, C}` — same gate as the
//! reflector-by-reflector plan.
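//!
//! As a rough illustration of the reflector-by-reflector strategy (a CPU
//! sketch, not the kernel these plans launch), applying one real
//! Householder reflector `H = I - tau · v · v^T` costs one dot product
//! plus one AXPY, and `Q = H_0 · H_1 ⋯ H_{k-1}` is applied by looping
//! over the reflectors. The function names, the `transpose` flag, and the
//! explicit (unpacked) `v` vectors are assumptions made for the example.
//!
//! ```rust
//! /// Applies `H = I - tau * v * v^T` to `x` in place (real case).
//! pub fn apply_reflector(v: &[f64], tau: f64, x: &mut [f64]) {
//!     assert_eq!(v.len(), x.len());
//!     // w = tau * (v . x): one dot product.
//!     let w = tau * v.iter().zip(x.iter()).map(|(vi, xi)| vi * xi).sum::<f64>();
//!     // x <- x - w * v: one AXPY.
//!     for (xi, vi) in x.iter_mut().zip(v) {
//!         *xi -= w * vi;
//!     }
//! }
//!
//! /// Applies `Q` (`transpose == false`) or `Q^T` (`transpose == true`) to `x`,
//! /// where `Q = H_0 * H_1 * ... * H_{k-1}` and each entry is `(v_i, tau_i)`.
//! pub fn apply_q(reflectors: &[(Vec<f64>, f64)], transpose: bool, x: &mut [f64]) {
//!     if transpose {
//!         // Q^T x = H_{k-1} ... H_0 x, so H_0 acts first.
//!         for (v, tau) in reflectors {
//!             apply_reflector(v, *tau, x);
//!         }
//!     } else {
//!         // Q x = H_0 ... H_{k-1} x, so H_{k-1} acts first.
//!         for (v, tau) in reflectors.iter().rev() {
//!             apply_reflector(v, *tau, x);
//!         }
//!     }
//! }
//! ```
//!
//! Each real `H_i` is symmetric, so `op = T` only reverses the order in
//! which the reflectors are visited.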
//!
//! ## Dtype coverage
//!
//! Most plans support only `f32` and `f64`: cuSOLVER's dense API does
//! **not** expose `f16` / `bf16` for these factorizations. Complex
//! (`Complex32` / `Complex64`) is wired for [`EighPlan`], [`EigPlan`],
//! [`BatchedQrPlan`], [`BatchedOrmqrPlan`], and [`BatchedOrmqrWyPlan`].
//! See per-plan docs for the authoritative dtype list.
//!
//! ## Row-major / column-major adapter
//!
//! cuSOLVER is column-major (LAPACK convention). PyTorch and the rest
//! of baracuda are row-major. The plan layer handles the bridge:
//!
//! - **Symmetric ops (Cholesky)**: reading row-major storage `S` as
//! column-major yields the transpose, so a row-major lower-triangular
//! factor `L` and a column-major upper-triangular factor `U = L^T`
//! occupy `S` bit-identically. So `CholeskyDescriptor { lower: true }`
//! (row-major input) maps to `uplo = CUBLAS_FILL_MODE_UPPER` when
//! handing the matrix to cuSOLVER.
//!
//! - **Non-symmetric ops (LU / QR / SVD)**: the row-major `[M, N]`
//! matrix `A` is interpreted as the column-major `[N, M]` matrix
//! `A^T`. cuSOLVER therefore factors `A^T = L' · U'` (LU) or
//! `A^T = Q' · R'` (QR). For plans that surface separate output tensors
//! (`Q`, `R`, `U`, `V^T`), the caller-facing docs spell out these
//! transpose semantics: the product `Q · R`, read row-major, reproduces
//! the row-major input only after the appropriate transpose. LU's
//! in-place output likewise leaves the column-major factors of `A^T`
//! in the caller's row-major buffer.
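//!
//! The reinterpretation behind both bullets can be checked on the CPU.
//! The sketch below is illustrative only: the helper names are
//! hypothetical and nothing here calls cuSOLVER. It shows that one
//! buffer read as row-major `[M, N]` and as column-major `[N, M]` gives
//! transposed views, and that a row-major lower triangle is a
//! column-major upper triangle.
//!
//! ```rust
//! /// Element `(i, j)` of a row-major matrix with `cols` columns.
//! pub fn at_row_major(s: &[f64], cols: usize, i: usize, j: usize) -> f64 {
//!     s[i * cols + j]
//! }
//!
//! /// Element `(i, j)` of a column-major matrix with leading dimension `rows`.
//! pub fn at_col_major(s: &[f64], rows: usize, i: usize, j: usize) -> f64 {
//!     s[i + j * rows]
//! }
//!
//! /// Checks that row-major `[m, n]` storage, read as column-major `[n, m]`,
//! /// is the transpose: `A[i, j] == A^T[j, i]` for every element.
//! pub fn reads_as_transpose(s: &[f64], m: usize, n: usize) -> bool {
//!     (0..m).all(|i| (0..n).all(|j| at_row_major(s, n, i, j) == at_col_major(s, n, j, i)))
//! }
//!
//! /// True if the `n x n` buffer is lower-triangular when read row-major.
//! pub fn is_lower_row_major(s: &[f64], n: usize) -> bool {
//!     (0..n).all(|i| (i + 1..n).all(|j| at_row_major(s, n, i, j) == 0.0))
//! }
//!
//! /// True if the `n x n` buffer is upper-triangular when read column-major.
//! pub fn is_upper_col_major(s: &[f64], n: usize) -> bool {
//!     (0..n).all(|j| (j + 1..n).all(|i| at_col_major(s, n, i, j) == 0.0))
//! }
//! ```
//!
//! For the buffer `[1, 0, 0, 2, 3, 0, 4, 5, 6]` with `n = 3`, both
//! triangle checks hold, which is why `lower: true` on the row-major
//! side pairs with the upper fill mode on the column-major side.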
//!
//! The smoke tests anchor the convention by reconstructing the input
//! matrix from the factors using the *same* row-major / column-major
//! interpretation throughout — the algebra works out regardless of
//! which storage convention is on the wire, as long as it's consistent
//! end-to-end.
//!
//! ## Handle + workspace ownership
//!
//! Each plan lazily owns one `cusolverDnHandle_t` in a `Cell<>` (created
//! on first `run`; bound to the caller's stream on every launch so the
//! plan is reusable across streams). The handle is destroyed in `Drop`.
//! cuSOLVER handles are not thread-safe — the plan is `!Sync` / `!Send`
//! by virtue of the `Cell<cusolverDnHandle_t>` it holds.
//!
//! Workspace is **caller-provided** (`Workspace::Borrowed`). The plan
//! reports the required byte count through `workspace_size()`, which
//! reflects the upper bound from the cuSOLVER `_bufferSize` queries.
//! Because `_bufferSize` requires a live handle (which the plan does
//! not own at `select` time), the query is performed lazily on first
//! `run` and cached in a `Cell<usize>`. As a result, `workspace_size()`
//! returns 0 before the first `run` and the cached size afterwards.
//! Callers that need the size before launching, for example to allocate
//! the borrowed workspace, should call the `query_workspace_size(stream)`
//! helper instead.
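//!
//! The caching behaviour described above can be modelled without a GPU.
//! The following is a minimal sketch, not the plan implementation:
//! `MockPlan`, `buffer_size_query`, the fixed 4096-byte answer, and the
//! stream-less signatures are all assumptions made for illustration.
//!
//! ```rust
//! use std::cell::Cell;
//!
//! /// Hypothetical stand-in for a plan; models only the workspace-size cache.
//! pub struct MockPlan {
//!     /// 0 means "not queried yet", mirroring the accessor's pre-run value.
//!     cached_bytes: Cell<usize>,
//!     /// Counts how often the (mock) `_bufferSize` query ran.
//!     queries: Cell<u32>,
//! }
//!
//! impl MockPlan {
//!     pub fn new() -> Self {
//!         MockPlan { cached_bytes: Cell::new(0), queries: Cell::new(0) }
//!     }
//!
//!     /// Stand-in for the `_bufferSize` call that needs a live handle.
//!     fn buffer_size_query(&self) -> usize {
//!         self.queries.set(self.queries.get() + 1);
//!         4096
//!     }
//!
//!     /// Returns the cached size, or 0 if no query has happened yet.
//!     pub fn workspace_size(&self) -> usize {
//!         self.cached_bytes.get()
//!     }
//!
//!     /// Forces the query if needed and returns the required byte count.
//!     pub fn query_workspace_size(&self) -> usize {
//!         if self.cached_bytes.get() == 0 {
//!             self.cached_bytes.set(self.buffer_size_query());
//!         }
//!         self.cached_bytes.get()
//!     }
//!
//!     /// Checks the borrowed workspace against the (lazily queried) size.
//!     pub fn run(&self, workspace: &mut [u8]) -> Result<(), String> {
//!         let need = self.query_workspace_size();
//!         if workspace.len() < need {
//!             return Err(format!("need {} bytes, got {}", need, workspace.len()));
//!         }
//!         Ok(())
//!     }
//!
//!     /// Number of times the mock `_bufferSize` query was issued.
//!     pub fn query_count(&self) -> u32 {
//!         self.queries.get()
//!     }
//! }
//! ```
//!
//! In this model a caller invokes `query_workspace_size()` once, allocates
//! that many bytes, and passes the buffer to `run`; later calls reuse the
//! cached value rather than repeating the query.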
//!
//! Batched routines (`*potrfBatched`, `*getrfBatched`) take no workspace
//! argument because the library allocates internally, so the plan
//! reports `0` for batched-only configurations.
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;