//! AVX2/FMA-accelerated CPU backend for the Poulpy lattice cryptography library.
//!
//! This crate provides `FFT64Avx`, a high-performance backend implementation for [`poulpy_hal`]
//! that leverages x86-64 SIMD instruction sets (AVX2 and FMA) to accelerate cryptographic operations
//! in fully homomorphic encryption (FHE) schemes based on Module-LWE.
//!
//! # Architecture
//!
//! `poulpy_hal` defines a hardware abstraction layer (HAL) via the [`Backend`](poulpy_hal::layouts::Backend)
//! trait and a family of _open extension point_ (OEP) traits in [`poulpy_hal::oep`]. This crate
//! implements every OEP trait for the `FFT64Avx` backend using hand-optimized AVX2/FMA intrinsics
//! and assembly kernels where profiling demonstrates performance benefits over compiler-generated code.
//!
//! The internal modules are organized by operation domain:
//!
//! | Module          | Domain                                                    |
//! |-----------------|-----------------------------------------------------------|
//! | `module`        | Backend handle lifecycle, FFT table management            |
//! | `scratch`       | Temporary memory allocation and arena-style sub-allocation|
//! | `znx_avx`       | Single ring element (`Z[X]/(X^n+1)`) SIMD arithmetic      |
//! | `vec_znx`       | Vectors of ring elements (limb decomposition)             |
//! | `vec_znx_big`   | Large-coefficient (multi-word) ring element vectors       |
//! | `vec_znx_dft`   | Fourier-domain ring element vectors (forward/inverse DFT) |
//! | `reim`          | Real/imaginary interleaved FFT primitives                 |
//! | `convolution`   | Polynomial convolution via FFT, by-constant, and pairwise |
//! | `svp`           | Scalar-vector product in frequency domain                 |
//!
//! # Scalar types
//!
//! For the `FFT64Avx` backend:
//!
//! - `ScalarPrep = f64`: coefficients in the DFT / frequency domain.
//! - `ScalarBig = i64`: coefficients in the large-integer (multi-word) domain;
//!   each coefficient occupies exactly one scalar word.
//!
//! # CPU requirements
//!
//! This backend **requires** x86-64 CPUs with:
//! - **AVX2**: 256-bit SIMD registers and operations
//! - **FMA**: Fused multiply-add for reduced rounding error in FFT
//!
//! Runtime CPU feature detection is performed in [`Module::new()`](poulpy_hal::api::ModuleNew::new).
//! If the required features are not present, the constructor panics with a descriptive error message.
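//!
//! A minimal sketch of such a check, built on the standard library's
//! `is_x86_feature_detected!` macro. The helper names `missing_features`,
//! `detect_missing_features` and `assert_cpu_supported` are hypothetical and are not
//! part of `poulpy_hal`; the actual constructor may perform the check differently.
//!
//! ```rust
//! // Names of the required features that are absent, given the detection results.
//! // Hypothetical helper, kept pure so it can be exercised on any host.
//! fn missing_features(avx2: bool, fma: bool) -> Vec<&'static str> {
//!     let mut missing = Vec::new();
//!     if !avx2 {
//!         missing.push("avx2");
//!     }
//!     if !fma {
//!         missing.push("fma");
//!     }
//!     missing
//! }
//!
//! // Queries the running CPU; this variant is compiled only on x86-64.
//! #[cfg(target_arch = "x86_64")]
//! fn detect_missing_features() -> Vec<&'static str> {
//!     missing_features(
//!         std::is_x86_feature_detected!("avx2"),
//!         std::is_x86_feature_detected!("fma"),
//!     )
//! }
//!
//! // On any other architecture both features are reported as missing.
//! #[cfg(not(target_arch = "x86_64"))]
//! fn detect_missing_features() -> Vec<&'static str> {
//!     missing_features(false, false)
//! }
//!
//! // Panics with a descriptive message when a required feature is absent.
//! fn assert_cpu_supported() {
//!     let missing = detect_missing_features();
//!     assert!(
//!         missing.is_empty(),
//!         "CPU lacks required features: {}",
//!         missing.join(", ")
//!     );
//! }
//! ```
//!
//! Calling `assert_cpu_supported()` once, before any SIMD kernel runs, mirrors the
//! "check at construction, panic early" behavior described above.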
//!
//! # Compile-time requirements
//!
//! To compile this crate, you must enable AVX2 and FMA target features:
//!
//! ```text
//! RUSTFLAGS="-C target-cpu=native" cargo build --release
//! ```
//!
//! Or explicitly:
//!
//! ```text
//! RUSTFLAGS="-C target-feature=+avx2,+fma" cargo build --release
//! ```
//!
//! Failure to enable these features at compile time will result in a compilation error.
//!
//! # Correctness guarantees
//!
//! ## Determinism
//!
//! All operations produce **bit-identical results** across runs and across backends
//! (when compared to `poulpy-cpu-ref`). Floating-point error in the FFT is constrained to
//! stay below 0.5 on each integer-valued coefficient, so rounding back to integers is exact.
//!
//! ## Overflow handling
//!
//! Integer overflow is **intentional** and managed through bivariate polynomial representation.
//! The normalization functions (`znx_normalize_*`) use wrapping arithmetic to propagate carries
//! correctly across limbs in base-2^k representation.
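//!
//! The carry propagation described above can be sketched in scalar form as follows.
//! This is an illustration only: `split_digit` and `normalize_limbs` are hypothetical
//! names, the sketch assumes `limbs[0]` is the least significant limb, and the real
//! `znx_normalize_*` functions may order limbs and handle carries differently.
//!
//! ```rust
//! // Splits `x` into a balanced base-2^k digit and the carry for the next limb,
//! // so that x == digit + carry * 2^k (mod 2^64).
//! fn split_digit(x: i64, k: u32) -> (i64, i64) {
//!     assert!((1..=63).contains(&k), "k must be in 1..=63");
//!     let shift = 64 - k;
//!     // Sign-extend the low k bits: the digit lies in [-2^(k-1), 2^(k-1)).
//!     let digit = (x << shift) >> shift;
//!     // The subtraction can wrap for extreme `x`; wrapping keeps the identity mod 2^64.
//!     let carry = x.wrapping_sub(digit) >> k;
//!     (digit, carry)
//! }
//!
//! // Normalizes the limbs in place and returns the carry out of the last limb.
//! // Assumption of this sketch: `limbs[0]` is the least significant limb.
//! fn normalize_limbs(limbs: &mut [i64], k: u32) -> i64 {
//!     let mut carry = 0i64;
//!     for limb in limbs.iter_mut() {
//!         let (digit, next) = split_digit(limb.wrapping_add(carry), k);
//!         *limb = digit;
//!         carry = next;
//!     }
//!     carry
//! }
//! ```
//!
//! With `k = 2`, the limbs `[5, 0]` normalize to `[1, 1]` (since 5 = 1 + 1·4) and
//! `[3, 0]` normalizes to `[-1, 1]` (since 3 = -1 + 1·4).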
//!
//! ## Memory alignment
//!
//! All data layouts enforce 64-byte alignment (matching cache line size) as specified by
//! `poulpy_hal::DEFAULTALIGN`. This alignment is verified at buffer allocation and enables
//! the use of aligned SIMD loads/stores for maximum performance.
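//!
//! As a rough illustration of what 64-byte alignment means in Rust terms, the sketch
//! below uses `#[repr(align(64))]` and a pointer check. The names `ALIGN`,
//! `AlignedBlock`, `is_aligned` and `alloc_aligned` are hypothetical; the HAL
//! allocators are not shown here and may work differently.
//!
//! ```rust
//! // Alignment assumed by this sketch; the crate takes its value from `poulpy_hal`.
//! const ALIGN: usize = 64;
//!
//! // One cache line worth of `i64` coefficients. `repr(align(64))` makes every
//! // value of this type start on a 64-byte boundary.
//! #[derive(Clone, Copy)]
//! #[repr(C, align(64))]
//! struct AlignedBlock([i64; 8]);
//!
//! // Returns true when `ptr` is a multiple of `align` bytes.
//! fn is_aligned<T>(ptr: *const T, align: usize) -> bool {
//!     (ptr as usize) % align == 0
//! }
//!
//! // Allocates `blocks` zeroed blocks and verifies the alignment at allocation time.
//! fn alloc_aligned(blocks: usize) -> Vec<AlignedBlock> {
//!     let buf = vec![AlignedBlock([0; 8]); blocks];
//!     assert!(is_aligned(buf.as_ptr(), ALIGN), "misaligned buffer");
//!     buf
//! }
//! ```
//!
//! Because each block is exactly 64 bytes and 64-byte aligned, a 32-byte vector load
//! at offset 0 or 32 within a block never straddles two cache lines.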
//!
//! ## Safety invariants
//!
//! Many functions are marked `unsafe` and require:
//! - CPU features (AVX2/FMA) are present (verified at module creation)
//! - Input slices have matching lengths where documented
//! - Input values satisfy documented bounds (e.g., `|x| < 2^50` for IEEE 754 conversions)
//! - Buffers are properly aligned (enforced by HAL allocators)
//!
//! Violating these invariants may result in:
//! - Undefined behavior (e.g., invalid SIMD instructions, out-of-bounds memory access)
//! - Silent incorrect results (e.g., exceeding numeric bounds in FP conversion)
//! - Panics (in debug mode via assertions, or unconditionally for critical invariants)
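//!
//! To see why a numeric bound matters for floating-point conversions, consider one
//! common branch-free rounding technique. Whether this crate uses exactly this trick is
//! an assumption; `MAGIC` and `round_to_i64` are hypothetical names for illustration.
//!
//! ```rust
//! // 2^52 + 2^51: adding it moves any |x| < 2^51 into [2^52, 2^53), where
//! // consecutive doubles are exactly 1 apart.
//! const MAGIC: f64 = 6_755_399_441_055_744.0;
//!
//! // Rounds `x` to the nearest integer (ties to even) by reading the mantissa bits.
//! // This sketch only accepts |x| < 2^50; outside the valid range the result is
//! // silently wrong in release builds, which is why the bound is an invariant.
//! fn round_to_i64(x: f64) -> i64 {
//!     debug_assert!(x.abs() < (1u64 << 50) as f64, "input outside the bound");
//!     let shifted = x + MAGIC;
//!     // Both values share sign and exponent, so their bit patterns differ by round(x).
//!     (shifted.to_bits() as i64).wrapping_sub(MAGIC.to_bits() as i64)
//! }
//! ```
//!
//! For example, `round_to_i64(-7.0)` yields `-7` and `round_to_i64(2.4)` yields `2`.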
//!
//! # Performance characteristics
//!
//! ## Asymptotic complexity
//!
//! - **FFT/IFFT**: O(n log n) for polynomial degree n
//! - **Convolution**: O(n log n) via FFT-based approach
//! - **Normalization**: O(n) per limb with vectorized digit extraction
//!
//! ## Speedup over reference backend
//!
//! Speedups depend on the host micro-architecture and on the operation profile of the
//! workload. Run the benches in `poulpy-bench` on the target host for representative
//! numbers. Qualitative trends:
//!
//! - **Ring element arithmetic** (add/sub/negate): bandwidth-bound, modest gains.
//! - **FFT16 kernels** (hand-written assembly): noticeably ahead of compiler-generated
//! intrinsics on supported micro-architectures.
//! - **Convolution** (large degree): the largest gains, scaling with coefficient size.
//!
//! ## Memory layout
//!
//! - **Vectorized storage**: Elements packed in groups of 4 (matching AVX2 register width for `i64`)
//! - **Tail handling**: Scalar fallback for lengths not divisible by 4
//! - **Cache-friendly**: 64-byte alignment ensures single cache line per vector load
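//!
//! The "groups of 4 plus scalar tail" pattern can be sketched portably, without
//! intrinsics, as below. `LANES` and `add_wrapping` are hypothetical names; the actual
//! kernels in this crate use AVX2 intrinsics rather than this scalar stand-in.
//!
//! ```rust
//! // Number of `i64` lanes in one 256-bit register.
//! const LANES: usize = 4;
//!
//! // Computes out[i] = a[i] + b[i] with wrapping arithmetic.
//! fn add_wrapping(a: &[i64], b: &[i64], out: &mut [i64]) {
//!     assert_eq!(a.len(), b.len());
//!     assert_eq!(a.len(), out.len());
//!     let main = a.len() - a.len() % LANES;
//!
//!     // Full groups of four: each iteration corresponds to one vector add.
//!     for ((o, x), y) in out[..main]
//!         .chunks_exact_mut(LANES)
//!         .zip(a[..main].chunks_exact(LANES))
//!         .zip(b[..main].chunks_exact(LANES))
//!     {
//!         for lane in 0..LANES {
//!             o[lane] = x[lane].wrapping_add(y[lane]);
//!         }
//!     }
//!
//!     // Scalar tail for the remaining 0..=3 elements.
//!     for i in main..a.len() {
//!         out[i] = a[i].wrapping_add(b[i]);
//!     }
//! }
//! ```
//!
//! With six elements, the first four go through the grouped loop and the last two
//! through the scalar tail.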
//!
//! # Threading and concurrency
//!
//! - **`FFT64Avx` is `Send + Sync`**: Zero-sized marker type, no internal state.
//! - **`Module<FFT64Avx>` is `Send + Sync`**: FFT tables are immutable after construction.
//! - **Operations require `&mut` for outputs**: Prevents data races at the API level.
//! - **No internal locking**: All synchronization is caller's responsibility.
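//!
//! The sharing pattern these properties allow can be sketched with scoped threads.
//! `Tables` and `scale_rows` are hypothetical stand-ins, not types from this crate:
//! read-only tables are shared by `&`, while each thread receives its own `&mut` output.
//!
//! ```rust
//! use std::thread;
//!
//! // Hypothetical stand-in for precomputed tables that are read-only after construction.
//! struct Tables {
//!     weights: Vec<f64>,
//! }
//!
//! // Multiplies every row by the shared weights, one scoped thread per row.
//! fn scale_rows(tables: &Tables, rows: &mut [Vec<f64>]) {
//!     thread::scope(|s| {
//!         for row in rows.iter_mut() {
//!             // Each thread owns a distinct `&mut` row and shares `tables` by `&`.
//!             s.spawn(move || {
//!                 for (x, w) in row.iter_mut().zip(&tables.weights) {
//!                     *x *= *w;
//!                 }
//!             });
//!         }
//!     });
//! }
//! ```
//!
//! No lock is needed because the borrow checker guarantees that no two threads hold
//! the same output row; any coordination beyond that is left to the caller.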
//!
//! # Feature flags
//!
//! - `enable-avx` (optional): Historically used for conditional compilation, currently inactive.
//!
//! # Platform support
//!
//! - **Required**: x86-64 architecture with AVX2 and FMA
//! - **Tested on**: Linux (x86_64), macOS (Intel), Windows (x86_64)
//! - **Not supported**: ARM, RISC-V, or other architectures
//!
//! # Threat model
//!
//! This library assumes an **"honest but curious"** adversary model:
//! - **No malicious inputs**: Callers are trusted to provide well-formed data within documented bounds.
//! - **No timing attack mitigation**: Operations are not constant-time (performance is prioritized).
//! - **Memory safety**: Bounds are validated to prevent crashes and corruption, but not for security.
//!
//! # Usage
//!
//! This crate exports a single public type, `FFT64Avx`, which is used as a type parameter
//! to the HAL generic types. Application code typically does not import this crate directly,
//! but instead uses it via `poulpy_core` or `poulpy_bin_fhe` with runtime backend selection.
//!
//! # Versioning and stability
//!
//! This crate follows semantic versioning. The public API consists solely of the `FFT64Avx`
//! marker type and its trait implementations from `poulpy_hal::oep`. All other items are
//! implementation details subject to change without notice.
// ─────────────────────────────────────────────────────────────
// Build the backend **only when ALL conditions are satisfied**
// ─────────────────────────────────────────────────────────────
//#![cfg(all(feature = "enable-avx", target_arch = "x86_64", target_feature = "avx2", target_feature = "fma"))]
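// If the user enables this backend but targets a non-x86_64 CPU → abort
#[cfg(all(feature = "enable-avx", not(target_arch = "x86_64")))]
compile_error!("feature `enable-avx` requires target_arch = \"x86_64\".");
// If the user enables this backend but AVX2 isn't enabled in the target → abort
#[cfg(all(feature = "enable-avx", target_arch = "x86_64", not(target_feature = "avx2")))]
compile_error!("feature `enable-avx` requires the `avx2` target feature.");
// If the user enables this backend but FMA isn't enabled in the target → abort
#[cfg(all(feature = "enable-avx", target_arch = "x86_64", not(target_feature = "fma")))]
compile_error!("feature `enable-avx` requires the `fma` target feature.");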
// Keep the crate as a true opt-in backend: without `enable-avx`, none of the
// AVX modules or their unit tests are compiled.
pub use ;
pub use NTT120Avx;
// --- TransferFrom impls ---