1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
//! PCM sample decode → normalized f32 widen.
//!
//! Tracking: [#146](https://github.com/Findit-AI/mlxrs/issues/146).
//!
//! # The defect class
//!
//! The original `crate::audio::io::push_samples` inner loops were
//! per-sample `f32::from(s) / divisor` (or `s as f32 / divisor`)
//! `Vec::push`es — each push has a bounds check, a `len` update, and
//! a non-vectorizable iterator shape. Symphonia hands us decoded PCM
//! samples one buffer at a time as `&[T]` where `T` is the typed
//! sample width (i8 / i16 / i32; u8 / u16 / u32 offset-binary;
//! i24/u24 packed-in-i32). The decode boils down to:
//!
//! - **Signed arms**: divide by `2^(bits-1)` (= 128 / 32768 / 2^23 /
//! 2^31). The reference's `mlx-audio` exact divisors are
//! `8 / 16 / 24 / 32` widths.
//! - **Unsigned (offset-binary) arms**: subtract midpoint
//! `2^(bits-1)`, then divide by `2^(bits-1)`. Equivalent to first
//! reinterpreting as `i*` then applying the signed divisor — but
//! the wraparound semantics on u8 → i16 / u16 → i32 / u24 → i32 /
//! u32 → i32 require a typed cast chain, not a blind reinterpret.
//!
//! # The fix — per-dtype NEON kernels for the hot widths
//!
//! This module ships NEON kernels for the **two hot signed widths**
//! that dominate real-world WAV/FLAC:
//!
//! - [`s16_to_f32_normalize`] — 8-lane (int16x8_t → two float32x4_t)
//! widen + multiply by `1.0 / 32768.0`. Hottest path: 16-bit PCM is
//! the default for WAV and CD-quality FLAC.
//! - [`s32_to_f32_normalize`] — 4-lane (int32x4_t → float32x4_t) widen
//! + multiply by `1.0 / 2^31`. Used for 24-bit and 32-bit PCM
//! (symphonia's `S24` `inner()` returns `i32` in `[-2^23, 2^23)`;
//! we apply the appropriate divisor at the call site).
//!
//! The remaining variants (s8, offset-binary u8/u16/u24/u32) keep
//! their scalar `Vec::push` loops — they are cold (very rare 8-bit
//! PCM; offset-binary is essentially WAV-only legacy), and bundling
//! more NEON kernels for those would be dead code on the hot path.
//! The scalar reference for s16 / s32 lives in this module too so a
//! call site that needs the auditable scalar version (e.g. a
//! force-scalar build) has a clean entry point.
//!
//! # Correctness class — `Exact` (integer arms)
//!
//! The integer-to-fp widens are **lossless** for the source widths in
//! question — an i16 / i32 is exactly representable in f32's 24-bit
//! mantissa for s16, and within rounding for s32 (full i32 range
//! exceeds f32's exact-representation window of `[-2^24, 2^24]`, but
//! the rounding is identical between `vcvtq_f32_s32` and the scalar
//! `s as f32` cast — both use the current rounding mode, default
//! round-to-nearest-even).
//!
//! The multiply by `1.0 / 2^(bits-1)` is a single `f32 * f32` —
//! identical between NEON `vmulq_n_f32` and scalar `*`. The dispatcher
//! produces bit-identical output to the scalar reference for every
//! input across the s16 and s32 arms.
//!
//! Differential tests in this module use
//! [`crate::simd::diff::assert_eq_over_lane_sweep`] (Exact class).
//!
//! # `MaybeUninit<f32>` API
//!
//! Matches the widen/quantize kernels: each kernel takes `&mut [MaybeUninit<f32>]` so the
//! call site can pass `Vec::spare_capacity_mut()` directly.
//!
//! # Bench
//!
//! `mlxrs/benches/simd_pcm_decode.rs`.
use MaybeUninit;
use ;
/// f32 divisor for 16-bit signed PCM — `1.0 / 2^15 = 1.0 / 32768.0`.
/// Matches `mlx_audio.audio_io.read`'s `int16 / 32768.0` convention.
pub const S16_INV_SCALE: f32 = 1.0 / 32_768.0;
/// f32 divisor for 32-bit signed PCM — `1.0 / 2^31 = 1.0 / 2_147_483_648.0`.
/// Matches `mlx_audio.audio_io.read`'s `int32 / 2^31` convention.
pub const S32_INV_SCALE: f32 = 1.0 / 2_147_483_648.0;
// ─── 16-bit signed PCM → f32 ──────────────────────────────────────────
/// Convert `src` 16-bit signed PCM samples to f32 normalized to
/// `[-1.0, 1.0)`, writing to `out`. Scalar reference — bit-exact
/// oracle for the NEON dispatcher and fallback on non-`aarch64`.
///
/// # Preconditions
///
/// - `out.len() == src.len()` (one f32 per input i16).
///
/// Asserted **unconditionally** (release-too).
///
/// # Initialization contract
///
/// Every f32 of `out` is written via `MaybeUninit::write` before this
/// returns.
/// Convert `src` 16-bit signed PCM samples to f32. NEON 8-lane
/// (int16x8_t → two float32x4_t) tile.
///
/// # Algorithm
///
/// Per 8-lane tile:
/// 1. Load 8 i16 via `vld1q_s16`.
/// 2. Widen to two `int32x4_t` halves via `vmovl_s16` (low) +
/// `vmovl_high_s16` (high) — sign-extending widen.
/// 3. Convert each to `float32x4_t` via `vcvtq_f32_s32` (lossless —
/// the i16 range `[-32768, 32767]` is exactly representable in
/// f32's 24-bit mantissa).
/// 4. Multiply by `1.0 / 32768.0` via `vmulq_n_f32`.
/// 5. Two `vst1q_f32` stores (4 f32 each = 8 f32 total per tile).
///
/// Tail (`src.len() % 8` samples ≤ 7) is delegated to the scalar arm.
///
/// # Safety
///
/// 1. NEON must be available on the executing CPU. Caller obligation
/// (the dispatcher discharges it).
/// 2. `out.len() == src.len()` — asserted unconditionally here.
pub unsafe
/// Convert `src` 16-bit signed PCM samples to f32 in `[-1.0, 1.0)`,
/// writing to `out`. Routes to NEON on `aarch64` (when NEON is
/// reported), else to [`s16_to_f32_normalize_scalar`].
///
/// # Preconditions
///
/// - `out.len() == src.len()` — asserted **unconditionally**.
///
/// # Correctness class
///
/// `Exact`.
// ─── 32-bit signed PCM (and i24-as-i32) → f32 ─────────────────────────
/// Convert `src` 32-bit signed PCM samples to f32 by multiplying by
/// `inv_scale` (typically `1.0 / 2^(bits-1)` for the source's bit
/// depth — `1/2^23` for 24-bit-packed-in-i32, `1/2^31` for 32-bit).
/// Scalar reference — bit-exact oracle for the NEON dispatcher.
///
/// # Preconditions
///
/// - `out.len() == src.len()` (one f32 per input i32).
///
/// Asserted **unconditionally** (release-too).
/// NEON 4-lane (int32x4_t → float32x4_t) widen + multiply.
///
/// # Safety
///
/// NEON must be available; `out.len() == src.len()` (asserted).
pub unsafe
/// Convert `src` 32-bit signed PCM samples to f32 by multiplying by
/// `inv_scale`. Routes to NEON on `aarch64`, else to scalar.
///
/// # Preconditions
///
/// - `out.len() == src.len()` — asserted **unconditionally**.