//! SIMD abstraction layer for high-performance FFT computation.
//!
//! Provides a unified interface for SIMD operations across different architectures,
//! enabling vectorized FFT butterflies and complex arithmetic.
//!
//! # Overview
//!
//! OxiFFT's SIMD layer provides:
//! - **Automatic runtime detection** via [`detect_simd_level()`]
//! - **Unified traits** ([`SimdVector`], [`SimdComplex`]) for portable code
//! - **Architecture-specific implementations** for maximum performance
//!
//! # Available Backends
//!
//! | Backend | Architecture | Vector Width | Lanes (f64) | Lanes (f32) | Features |
//! |---------|-------------|--------------|-------------|-------------|----------|
//! | [`Scalar`] | All | 64/32-bit | 1 | 1 | Always available |
//! | `Sse2F64`/`Sse2F32` | x86_64 | 128-bit | 2 | 4 | SSE2 (baseline x86_64) |
//! | `AvxF64`/`AvxF32` | x86_64 | 256-bit | 4 | 8 | AVX |
//! | `Avx2F64`/`Avx2F32` | x86_64 | 256-bit | 4 | 8 | AVX2 + FMA3 |
//! | `Avx512F64`/`Avx512F32` | x86_64 | 512-bit | 8 | 16 | AVX-512F |
//! | `NeonF64`/`NeonF32` | aarch64 | 128-bit | 2 | 4 | NEON (mandatory) |
//! | Portable* | All | Variable | 2-8 | 4-16 | Nightly + `portable_simd` |
//!
//! *Portable SIMD requires nightly Rust and the `portable_simd` feature flag.
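//!
//! The lane counts in the table follow from dividing the vector width by the
//! element width. A minimal sketch of that arithmetic, where `lanes_for` is a
//! hypothetical helper used only for illustration and not part of OxiFFT's API:
//!
//! ```rust
//! /// Number of `T` elements that fit in a vector of `vector_bits` bits.
//! /// Hypothetical helper; not an OxiFFT function.
//! fn lanes_for<T>(vector_bits: usize) -> usize {
//!     vector_bits / (8 * std::mem::size_of::<T>())
//! }
//! ```
//!
//! For example, `lanes_for::<f64>(256)` gives 4 and `lanes_for::<f32>(512)`
//! gives 16, matching the AVX and AVX-512 rows.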
//!
//! # CPU Requirements
//!
//! ## x86_64
//!
//! - **SSE2**: Part of the x86_64 baseline, so present on every x86_64 CPU (the architecture debuted in 2003)
//! - **AVX**: Intel Sandy Bridge (2011+), AMD Bulldozer (2011+)
//! - **AVX2 + FMA**: Intel Haswell (2013+), AMD Excavator (2015+)
//! - **AVX-512**: Intel Skylake-X (2017+), AMD Zen 4 (2022+); mostly found on server and workstation CPUs, absent from many consumer parts
//!
//! ## aarch64 (ARM64)
//!
//! - **NEON**: Mandatory on aarch64, always available (Apple M1/M2/M3, AWS Graviton, Ampere)
//!
//! # Runtime Detection
//!
//! Use [`detect_simd_level()`] to query the highest available SIMD level at runtime:
//!
//! ```
//! use oxifft::simd::{detect_simd_level, SimdLevel};
//!
//! let level = detect_simd_level();
//! match level {
//!     SimdLevel::Avx512 => println!("Using AVX-512 (512-bit vectors)"),
//!     SimdLevel::Avx2 => println!("Using AVX2 with FMA (256-bit vectors)"),
//!     SimdLevel::Avx => println!("Using AVX (256-bit vectors)"),
//!     SimdLevel::Sse2 => println!("Using SSE2 (128-bit vectors)"),
//!     SimdLevel::Neon => println!("Using NEON (128-bit vectors)"),
//!     SimdLevel::Sve => println!("Using ARM SVE (scalable vectors)"),
//!     SimdLevel::Scalar => println!("No SIMD, using scalar fallback"),
//! }
//! ```
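//!
//! One way detection of this kind can be built on the standard library is
//! sketched here. The `Level` enum and `detect` function are hypothetical
//! stand-ins, not OxiFFT's `SimdLevel` or `detect_simd_level()`; only the
//! `std::arch` feature-detection macros are real.
//!
//! ```rust
//! /// Hypothetical level enum for illustration only.
//! #[allow(dead_code)]
//! #[derive(Debug, Clone, Copy, PartialEq, Eq)]
//! enum Level { Scalar, Sse2, Avx, Avx2Fma, Avx512F, Neon }
//!
//! /// Probe from the widest instruction set down to the scalar fallback.
//! fn detect() -> Level {
//!     #[cfg(target_arch = "x86_64")]
//!     {
//!         if std::arch::is_x86_feature_detected!("avx512f") {
//!             return Level::Avx512F;
//!         }
//!         // AVX2 and FMA are separate CPU features, so check both.
//!         if std::arch::is_x86_feature_detected!("avx2")
//!             && std::arch::is_x86_feature_detected!("fma")
//!         {
//!             return Level::Avx2Fma;
//!         }
//!         if std::arch::is_x86_feature_detected!("avx") {
//!             return Level::Avx;
//!         }
//!         if std::arch::is_x86_feature_detected!("sse2") {
//!             return Level::Sse2;
//!         }
//!     }
//!     #[cfg(target_arch = "aarch64")]
//!     {
//!         if std::arch::is_aarch64_feature_detected!("neon") {
//!             return Level::Neon;
//!         }
//!     }
//!     Level::Scalar
//! }
//! ```
//!
//! Checking the widest level first means the first successful probe is also
//! the best one available on the running CPU.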
//!
//! # Performance Guidelines
//!
//! ## Memory Alignment
//!
//! For optimal SIMD performance, data should be aligned:
//! - SSE2/NEON: 16-byte alignment
//! - AVX/AVX2: 32-byte alignment
//! - AVX-512: 64-byte alignment
//!
//! Use [`alloc_complex_aligned`](crate::alloc_complex_aligned) or [`AlignedBuffer`](crate::AlignedBuffer) for aligned memory.
//! Unaligned loads/stores work but may be slower on some architectures.
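//!
//! To make the alignment figures concrete, the sketch below over-aligns a
//! block type with `#[repr(align)]` and checks a pointer against a required
//! alignment. `Aligned64`, `aligned_storage`, and `is_aligned_to` are
//! hypothetical names for illustration; they are not how `AlignedBuffer` is
//! necessarily implemented.
//!
//! ```rust
//! /// Hypothetical block of eight `f64` values aligned to 64 bytes,
//! /// which also satisfies the 16- and 32-byte requirements.
//! #[allow(dead_code)]
//! #[repr(C, align(64))]
//! #[derive(Clone, Copy)]
//! struct Aligned64([f64; 8]);
//!
//! /// Allocate `blocks` zeroed blocks; `Vec` honors the type's alignment.
//! fn aligned_storage(blocks: usize) -> Vec<Aligned64> {
//!     vec![Aligned64([0.0; 8]); blocks]
//! }
//!
//! /// True if `ptr` is a multiple of `bytes`.
//! fn is_aligned_to<T>(ptr: *const T, bytes: usize) -> bool {
//!     (ptr as usize) % bytes == 0
//! }
//! ```
//!
//! A check such as `is_aligned_to(buf.as_ptr(), 32)` can guard the choice
//! between an aligned and an unaligned load path.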
//!
//! ## Expected Speedups
//!
//! Typical speedups over scalar code for FFT operations:
//! - **SSE2/NEON**: 1.5-2x for f64, 2-3x for f32
//! - **AVX/AVX2**: 2-3x for f64, 3-5x for f32
//! - **AVX-512**: 3-5x for f64, 5-8x for f32
//!
//! Actual speedups depend on problem size, memory bandwidth, and cache behavior.
//!
//! ## FMA (Fused Multiply-Add)
//!
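//! The AVX2 and AVX-512 backends use FMA instructions. FMA3 is a separate CPU
//! feature from AVX2, though both are present on the CPUs listed above. FMA:
//! - Computes `a * b + c` in a single instruction
//! - Improves precision (one rounding step instead of two)
//! - Reduces the instruction count of complex multiplication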
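//!
//! The single-rounding property can be observed in scalar code with the
//! standard library's `f64::mul_add`. This is an illustration of the
//! arithmetic only, not OxiFFT's SIMD code path; `fused_vs_unfused` is a
//! hypothetical name.
//!
//! ```rust
//! /// Returns `(fused, unfused)` for `a * b - 1`
//! /// with `a = 1 + 2^-27` and `b = 1 - 2^-27`.
//! fn fused_vs_unfused() -> (f64, f64) {
//!     let e = 1.0 / ((1u64 << 27) as f64); // exactly 2^-27
//!     let (a, b) = (1.0 + e, 1.0 - e);
//!     // The exact product is 1 - 2^-54, which f64 cannot represent.
//!     let fused = a.mul_add(b, -1.0); // rounded once, after the addition
//!     let unfused = a * b - 1.0; // product rounded before the subtraction
//!     (fused, unfused)
//! }
//! ```
//!
//! Here the fused result is `-2^-54`, while the unfused result is `0.0`
//! because the separately rounded product becomes exactly `1.0`.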
//!
//! # Feature Flags
//!
//! - `portable_simd`: Enable experimental portable SIMD backend (requires nightly)
//!
//! # Example: Using SIMD Traits
//!
//! ```ignore
//! use oxifft::simd::{SimdVector, SimdComplex};
//!
//! fn complex_butterfly<V: SimdComplex>(a: V, b: V) -> (V, V) {
//!     V::butterfly(a, b) // Returns (a+b, a-b)
//! }
//! ```
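//!
//! For readers without the crate at hand, the same `(a+b, a-b)` butterfly can
//! be sketched on a scalar complex type. The `Butterfly` trait and `Complex64`
//! struct below are hypothetical and only mirror the shape of the example
//! above; they are not OxiFFT's `SimdComplex` trait.
//!
//! ```rust
//! use std::ops::{Add, Sub};
//!
//! /// Hypothetical scalar complex number.
//! #[derive(Debug, Clone, Copy, PartialEq)]
//! struct Complex64 { re: f64, im: f64 }
//!
//! impl Add for Complex64 {
//!     type Output = Self;
//!     fn add(self, o: Self) -> Self {
//!         Complex64 { re: self.re + o.re, im: self.im + o.im }
//!     }
//! }
//!
//! impl Sub for Complex64 {
//!     type Output = Self;
//!     fn sub(self, o: Self) -> Self {
//!         Complex64 { re: self.re - o.re, im: self.im - o.im }
//!     }
//! }
//!
//! /// Hypothetical trait: a radix-2 butterfly without a twiddle factor.
//! trait Butterfly: Sized {
//!     fn butterfly(a: Self, b: Self) -> (Self, Self);
//! }
//!
//! /// Any copyable type with `+` and `-` gets the butterfly for free.
//! impl<T: Copy + Add<Output = T> + Sub<Output = T>> Butterfly for T {
//!     fn butterfly(a: T, b: T) -> (T, T) {
//!         (a + b, a - b)
//!     }
//! }
//! ```
//!
//! Writing the operation against a trait lets the same generic code run on a
//! scalar type or on a vector type holding several complex values per register.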
//!
//! # Safety
//!
//! All SIMD types use `unsafe` internally and wrap it in a safe API wherever
//! possible. The operations that remain `unsafe` for the caller are:
//! - `load_aligned`/`store_aligned`: the pointer must be valid for `LANES` elements and properly aligned
//! - `load_unaligned`/`store_unaligned`: the pointer must be valid for `LANES` elements
//!
//! OxiFFT's internal code handles alignment automatically.
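//!
//! The pattern of wrapping a raw-pointer load behind a checked, safe function
//! can be sketched as follows. `load_lanes` and the fixed `LANES` constant are
//! hypothetical and use a plain array in place of a SIMD register; this is
//! not OxiFFT's internal code.
//!
//! ```rust
//! /// Hypothetical lane count for the sketch.
//! const LANES: usize = 4;
//!
//! /// Safe wrapper: returns `None` instead of reading out of bounds.
//! fn load_lanes(data: &[f64]) -> Option<[f64; LANES]> {
//!     if data.len() < LANES {
//!         return None;
//!     }
//!     // SAFETY: the length check guarantees `LANES` readable `f64` values,
//!     // and `read_unaligned` places no alignment requirement on the pointer.
//!     Some(unsafe { std::ptr::read_unaligned(data.as_ptr().cast::<[f64; LANES]>()) })
//! }
//! ```
//!
//! The bounds check is what discharges the "valid pointer for `LANES`
//! elements" requirement, so callers never need an `unsafe` block.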
pub use ;
pub use Scalar;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;
pub use ;