//! `image_to_array` BGR R↔B swap widen: de-interleave a
//! `&[u8]` of packed BGR pixels into a `&mut [MaybeUninit<f32>]` of
//! channel-last `[R, G, B]` f32 triples (R and B swapped from the
//! input order).
//!
//! Tracking: [#149](https://github.com/Findit-AI/mlxrs/issues/149).
//! The BGR arm is the one LLVM most likely fails to auto-vectorize
//! because of the 3-element shuffle on the destination side.
//!
//! # The defect class
//!
//! The original [`crate::vlm::image::image_to_array`] BGR arm was:
//!
//! ```rust,ignore
//! ColorOrder::Bgr => {
//!     for px in raw.chunks_exact(3) {
//!         buf.push(f32::from(px[2])); // R from B-slot
//!         buf.push(f32::from(px[1])); // G
//!         buf.push(f32::from(px[0])); // B from R-slot
//!     }
//! }
//! ```
//!
//! Three independent `push`es per pixel on a `Vec<f32>`: each push
//! performs a capacity check and a `len` update, and the three pushes
//! read the source pixel in permuted order. LLVM cannot easily reason
//! about a `Vec::push` loop of this shape, so it emits plain scalar
//! code. The first improvement is to switch to a pre-reserved
//! `&mut [MaybeUninit<f32>]` slice and write through
//! `chunks_exact_mut(3)`: the fixed-stride writes into a sized slice
//! give LLVM a pattern it auto-vectorizes cleanly on aarch64.
//!
//! # Two layered fixes — the scalar restructure + the NEON kernel
//!
//! 1. **Scalar restructure** — replace the per-pixel `Vec::push` triple
//! with a single `chunks_exact_mut(3) + chunks_exact(3)` pair-
//! iteration into a pre-reserved buffer's spare capacity. Each
//! iteration writes three `MaybeUninit::write` calls with the R↔B
//! swap encoded in the read indices (`src_px[2], src_px[1],
//! src_px[0]`). LLVM auto-vectorizes this shape cleanly on aarch64
//! once the destination is a sized slice.
//! 2. **Hand-rolled NEON kernel** — `vld3q_u8` 3-way de-interleave +
//! permuted `vst3q_f32` 3-way interleave, 16 pixels per tile. The
//! R↔B swap is encoded structurally by the **plane order at the
//! store**: `vld3q_u8` on a BGR source yields `(planes.0, planes.1,
//! planes.2) = (B-values, G-values, R-values)`; the store then
//! feeds `(R-plane-widened, G-plane-widened, B-plane-widened)` to
//! `vst3q_f32`, which interleaves them lane-by-lane, producing
//! output `[R_value, G_value, B_value]` per pixel — i.e. RGB-
//! ordered channels containing exactly the same per-channel values
//! the scalar reference emits.
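//!
//! A minimal sketch of the scalar restructure from step 1, assuming a
//! hypothetical helper `widen_bgr_sketch` (not the crate's actual
//! function). `widen_bgr_push` reproduces the old per-push shape for
//! comparison; both names are illustrative only.
//!
//! ```rust
//! use core::mem::MaybeUninit;
//!
//! // Old shape: three `Vec::push` calls per pixel.
//! pub fn widen_bgr_push(src: &[u8]) -> Vec<f32> {
//!     let mut buf = Vec::with_capacity(src.len());
//!     for px in src.chunks_exact(3) {
//!         buf.push(f32::from(px[2]));
//!         buf.push(f32::from(px[1]));
//!         buf.push(f32::from(px[0]));
//!     }
//!     buf
//! }
//!
//! // New shape: paired chunk iteration into a sized destination slice
//! // that may still be uninitialized on entry.
//! pub fn widen_bgr_sketch(src: &[u8], out: &mut [MaybeUninit<f32>]) {
//!     assert!(src.len() % 3 == 0, "src must hold whole 3-byte pixels");
//!     assert_eq!(out.len(), src.len(), "one output f32 per input byte");
//!     for (dst_px, src_px) in out.chunks_exact_mut(3).zip(src.chunks_exact(3)) {
//!         // The swap lives in the read indices: 2, 1, 0.
//!         dst_px[0].write(f32::from(src_px[2]));
//!         dst_px[1].write(f32::from(src_px[1]));
//!         dst_px[2].write(f32::from(src_px[0]));
//!     }
//! }
//! ```
//!
//! For the input bytes `[10, 20, 30]` both shapes yield
//! `[30.0, 20.0, 10.0]`.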
//!
//! # Benchmark
//!
//! We benchmarked three implementations at 256² / 1024² / 4096² pixel
//! counts (the same shape as the canvas-fill bench):
//!
//! | impl | 256² (≈196k B src) | 1024² (≈3.1M B src) | 4096² (≈50M B src) |
//! | ----------------------------------------------------- | ------------------:| -------------------:| ------------------:|
//! | OLD `chunks_exact(3) + Vec::push * 3` (per-push) | ~82.8 µs | ~1.66 ms | ~26.6 ms |
//! | NEW scalar `chunks_exact_mut(3) + MaybeUninit::write` | ~11.4 µs | ~171 µs | ~2.78 ms |
//! | NEW NEON `vld3q_u8` + permuted `vst3q_f32` (shipped) | ~13.1 µs | ~200 µs | ~3.25 ms |
//!
//! Throughput (criterion `Throughput::Bytes` over input bytes): NEW
//! scalar ≈ 16.0 / 17.1 / 16.8 GiB/s, NEW NEON ≈ 14.0 / 14.6 / 14.4
//! GiB/s. The OLD per-push loop runs at ≈ 1.76–2.21 GiB/s, so both NEW
//! arms beat it by ~6–10× across the sweep. Between the two NEW arms,
//! the scalar's auto-vectorized output is ~13–17 % faster than the
//! hand-rolled NEON tile at every benched size.
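//!
//! The throughput figures follow from the table by dividing input
//! bytes by elapsed time. A small sketch of that conversion, assuming a
//! hypothetical helper `gib_per_s` and 3 source bytes per pixel:
//!
//! ```rust
//! // Convert a timing for a `side` x `side` BGR image into GiB/s over
//! // input bytes.
//! pub fn gib_per_s(side: u64, seconds: f64) -> f64 {
//!     let input_bytes = (side * side * 3) as f64; // 3 bytes per BGR pixel
//!     input_bytes / seconds / (1u64 << 30) as f64
//! }
//! ```
//!
//! `gib_per_s(256, 11.4e-6)` comes out near 16.1 and
//! `gib_per_s(256, 82.8e-6)` near 2.2, in line with the figures quoted
//! for the NEW scalar and OLD rows at 256².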
//!
//! # Why the NEON kernel ships unconditionally
//!
//! The NEON kernel ships even though it is ~13–15 % *slower* than the
//! auto-vectorized scalar on the benched sizes (M-series Apple
//! silicon). Rationale:
//!
//! 1. **Auto-vectorization is compiler-version-dependent.** The scalar
//! path's speed comes from LLVM's auto-vectorizer recognising the
//! `chunks_exact_mut(3) + MaybeUninit::write` shape and emitting a
//! NEON loop. A rustc / LLVM upgrade, an inlining-heuristic change,
//! a stylistic refactor of the caller, or a future
//! `MaybeUninit::write` codegen tweak can silently de-vectorize the
//! scalar path — and the regression would not show up as a test
//! failure (the output is still bit-identical), only as a hidden
//! runtime cliff that we would catch only if someone re-ran the
//! bench. The default rule's "scalar is fast enough" reasoning
//! silently depends on LLVM heuristics that the SIMD module's other
//! contracts deliberately do **not** depend on.
//! 2. **The SIMD module's contract is to provide a guaranteed arch-
//! specific kernel.** Every other kernel in `simd::*` ships a hand-
//! rolled `#[target_feature(enable = "neon")]` NEON arm whose
//! behaviour does not depend on auto-vectorization. Dropping the
//! NEON arm was an unprincipled exception — the auto-vec scalar
//! cannot be relied on across toolchains the way an `unsafe fn`
//! annotated with the target feature can.
//! 3. **Other targets / sizes / surrounding code may not auto-vectorize
//! cleanly.** The 256²/1024²/4096² bench points and the M-series
//! cores we measured on are not the whole shipping matrix — on a
//! different aarch64 microarchitecture (Cortex-A series, future
//! Apple silicon revisions, a non-Apple aarch64 chip), with a
//! different surrounding call site that perturbs inlining, or at a
//! pixel count outside the bench grid, the auto-vec scalar's win
//! margin can flip. The hand-rolled NEON kernel is the only durable
//! arch-specific contract.
//! 4. **The scalar fallback path remains** as the differential-test
//! oracle and as the dispatcher's only routing target on non-
//! aarch64 targets — none of (1)/(2)/(3) costs us its presence.
//!
//! Why the NEON kernel "loses" on the bench: the `vld3q_u8` 3-way de-
//! interleave + the `vst3q_f32` permuted 3-way interleave have higher
//! per-iteration ALU cost than the scalar `MaybeUninit::write` triple's
//! auto-vectorized output, and the widen chain (`vmovl_u8` →
//! `vmovl_u16` → `vcvtq_f32_u32` × 12 per tile) adds enough latency
//! that the 16-pixel tile does not amortize. The kernel is memory-
//! bandwidth-bound on the output side (16 pixels = 48 f32 = 192 bytes
//! written per body iter) and the scalar auto-vectorized loop already
//! saturates that bandwidth on M-series silicon. None of that
//! invalidates the durability argument above.
//!
//! Concrete bench numbers live in the bench file
//! (`mlxrs/benches/simd_bgr_widen.rs` — kept in-tree as a
//! regression guard against both a future scalar regression and a
//! future NEON regression).
//!
//! # Correctness class — `Exact`
//!
//! This kernel is pure data movement plus a lossless u8 → f32 widen
//! (every u8 value in `[0, 255]` is exactly representable in f32).
//! The scalar arm and the NEON arm produce **bit-identical** output for
//! every input: the NEON kernel performs the same `f32::from(u8)`
//! widen via `vcvtq_f32_u32` and writes the same per-pixel
//! R↔B-swapped triple. The differential tests in this module are
//! therefore byte-identical assertions:
//!
//! - [`crate::simd::diff::assert_eq_over_lane_sweep`] drives both
//! scalar and dispatcher across the canonical lane sweep — on
//! `aarch64`, the dispatcher routes to the NEON arm, so this is
//! simultaneously a NEON-vs-scalar test.
//! - An explicit `bgr_widen_neon_matches_scalar_bit_identical` test
//! calls the NEON kernel **directly** under an `is_neon_available()`
//! guard, so the NEON-vs-scalar contract is asserted without
//! indirection through the dispatcher.
//!
//! # `MaybeUninit<f32>` API — type-encoded uninit safety
//!
//! The kernel API takes `&mut [MaybeUninit<f32>]` (not `&mut [f32]`)
//! so the call site in [`crate::vlm::image::image_to_array`] can pass
//! `Vec::spare_capacity_mut()` **directly** — no `from_raw_parts_mut`
//! cast over uninit backing memory (which would be UB regardless of
//! the subsequent writes, per the `from_raw_parts_mut` safety contract
//! requiring "properly initialized" elements). The scalar kernel
//! writes every f32 of `out` via `MaybeUninit::write`; the NEON kernel
//! writes every f32 via raw-pointer `vst3q_f32` stores (sound on
//! `MaybeUninit<f32>` backing memory — `MaybeUninit<f32>` has no
//! validity invariants beyond size + alignment, and any bit pattern
//! including a valid `f32` is acceptable). The function-level contract
//! on [`bgr_widen`] is "every f32 of `out` is written before this
//! returns", so the caller may safely `set_len` over the covered
//! region.
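//!
//! A sketch of the call-site pattern this API enables, assuming a
//! hypothetical helper `extend_widened` that takes the kernel as a
//! closure. The real call site lives in `image_to_array` and is not
//! reproduced here.
//!
//! ```rust
//! use core::mem::MaybeUninit;
//! use std::collections::TryReserveError;
//!
//! // Hypothetical helper: append `src.len()` f32s to `buf` by letting
//! // `kernel` initialize the vector's spare capacity in place.
//! //
//! // Safety: `kernel` must write every element of the slice it receives.
//! pub unsafe fn extend_widened(
//!     buf: &mut Vec<f32>,
//!     src: &[u8],
//!     kernel: impl FnOnce(&[u8], &mut [MaybeUninit<f32>]),
//! ) -> Result<(), TryReserveError> {
//!     let n = src.len();
//!     buf.try_reserve_exact(n)?;
//!     // The spare capacity may be longer than requested; pass exactly `n` slots.
//!     kernel(src, &mut buf.spare_capacity_mut()[..n]);
//!     let new_len = buf.len() + n;
//!     // SAFETY: capacity covers `new_len` after the reserve above, and the
//!     // caller guarantees `kernel` initialized all `n` new slots.
//!     unsafe { buf.set_len(new_len) };
//!     Ok(())
//! }
//! ```
//!
//! The slice handed to the kernel is typed `MaybeUninit<f32>` the whole
//! way, so no `&mut [f32]` ever points at uninitialized memory; only
//! the final `set_len` relies on the kernel's "every slot written"
//! contract.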
//!
//! # No new dependencies
//!
//! Pure `core::slice` + `core::arch::aarch64` + `core::mem::MaybeUninit`
//! (all `core`, no crate dep). The dispatcher routes through
//! [`crate::simd::is_neon_available`].
use core::mem::MaybeUninit;
use ;
/// Widen a packed BGR `&[u8]` pixel buffer to a channel-last RGB
/// `&mut [MaybeUninit<f32>]` (R and B swapped from input order).
/// Scalar reference: the bit-exact oracle for the differential tests
/// and the fallback path on every non-`aarch64` target.
///
/// **Always compiled**, independent of `target_arch`. Anchors the
/// math contract: each input pixel `src[i*3..i*3+3]` produces
/// `out[i*3..i*3+3] = [f32(src[i*3+2]), f32(src[i*3+1]),
/// f32(src[i*3])]`.
///
/// # Preconditions
///
/// - `src.len()` must be a multiple of 3 (each input pixel is 3 bytes).
/// - `out.len()` must equal `src.len()` (one output f32 per input
/// byte). The call site [`crate::vlm::image::image_to_array`]
/// reserves exactly `H*W*3` f32s and slices the input to exactly
/// `H*W*3` bytes, so both preconditions hold there.
///
/// Both preconditions are asserted **unconditionally** (release-too).
/// The function is `pub`, reachable through `simd::vlm::bgr_widen`,
/// and its initialization contract ("every f32 of `out` is written
/// before return") is load-bearing for callers that then call
/// `Vec::set_len` over the covered region — a release-build size
/// mismatch would leave some `MaybeUninit<f32>` slots unwritten and
/// the caller's `set_len` would expose uninitialized memory. The
/// dispatcher [`bgr_widen`] also asserts these unconditionally at its
/// entry point; this kernel re-asserts them so direct callers (the
/// bench, the tests, any future caller bypassing the dispatcher) are
/// equally protected.
///
/// # Initialization contract
///
/// Every f32 of `out` is written via `MaybeUninit::write` before this
/// returns. On return the entire slice is fully initialized; the
/// caller may treat the backing memory as `[f32]` (via
/// `Vec::set_len`, `MaybeUninit::slice_assume_init_ref`, etc.).
///
/// # Implementation choice
///
/// `chunks_exact(3)` over `src` paired with `chunks_exact_mut(3)`
/// over `out` — one input/output pixel triple per loop iteration,
/// three `MaybeUninit::write` calls per iteration with the R↔B swap
/// encoded in the read indices (`src_px[2], src_px[1], src_px[0]`).
/// The alternative — `copy_from_slice` between two `&mut [f32]` arms
/// after initializing all of `out` — would require a zero-fill first
/// (defeating the uninit-safe API) or an `assume_init_mut` cast over
/// uninit memory (UB). LLVM auto-vectorizes this shape cleanly on
/// aarch64; the NEON kernel ships anyway for the durability reasons
/// in the module-level doc's "Why the NEON kernel ships unconditionally"
/// section.
/// Widen a packed BGR `&[u8]` pixel buffer to a channel-last RGB
/// `&mut [MaybeUninit<f32>]` (R↔B swap on widen). NEON 16-pixel
/// `vld3q_u8` + permuted `vst3q_f32` tile.
///
/// # Algorithm
///
/// 1. Load 16 BGR pixels (48 bytes) per iteration via `vld3q_u8`,
/// which performs a 3-way de-interleave into three `uint8x16_t`
/// planes (`b`, `g`, `r` — the source layout is BGR, so the first
/// plane carries B, the second G, the third R).
/// 2. Widen each plane to four `float32x4_t` vectors via the chain
/// `vmovl_u8` (low 8 lanes → `uint16x8_t`) and `vmovl_high_u8`
/// (high 8 lanes → `uint16x8_t`), then `vmovl_u16` /
/// `vmovl_high_u16` to `uint32x4_t`, then `vcvtq_f32_u32` to
/// `float32x4_t`. 12 widens per 16-pixel tile (3 planes × 4 quarter
/// widens per plane).
/// 3. Store the four `float32x4x3_t` groups (4 pixels each) via
/// `vst3q_f32`, feeding the widened planes in `(r, g, b)` order
/// (source planes 2, 1, 0) so the 3-way interleave-store writes
/// `[R, G, B]` per output pixel. The R↔B swap is encoded
/// **structurally** by the plane order at the store, not by an extra
/// shuffle in the body.
/// 4. Tail (`pixel_count % 16` pixels) is delegated to
/// [`bgr_widen_scalar`] on the trailing input + output slices —
/// bounded above by 15 pixels (= 45 bytes input + 45 f32 output).
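///
/// A rough sketch of steps 1 to 3 for a single 16-pixel tile, assuming
/// hypothetical names (`neon_sketch::widen_tile`, `widen_plane`,
/// `widen_tile_scalar`) and fixed-size array arguments in place of the
/// real slice API. The planes go to `vst3q_f32` in R, G, B order, as in
/// the module-level description. This is not the shipped kernel, and
/// the tail handling from step 4 is left out.
///
/// ```rust
/// // Portable reference for one tile, used to check the NEON sketch.
/// pub fn widen_tile_scalar(src: &[u8; 48]) -> [f32; 48] {
///     let mut out = [0.0f32; 48];
///     for (d, s) in out.chunks_exact_mut(3).zip(src.chunks_exact(3)) {
///         d[0] = f32::from(s[2]);
///         d[1] = f32::from(s[1]);
///         d[2] = f32::from(s[0]);
///     }
///     out
/// }
///
/// #[cfg(target_arch = "aarch64")]
/// pub mod neon_sketch {
///     use core::arch::aarch64::*;
///     use core::mem::MaybeUninit;
///
///     // Safety: NEON must be available on the executing CPU.
///     #[target_feature(enable = "neon")]
///     pub unsafe fn widen_tile(src: &[u8; 48], out: &mut [MaybeUninit<f32>; 48]) {
///         unsafe {
///             // Step 1: 3-way de-interleave; a BGR source yields (B, G, R).
///             let planes = vld3q_u8(src.as_ptr());
///             let (b, g, r) = (planes.0, planes.1, planes.2);
///             // Step 2: widen each 16-lane u8 plane to four f32x4 vectors.
///             let (b4, g4, r4) = (widen_plane(b), widen_plane(g), widen_plane(r));
///             // Step 3: interleave-store in (R, G, B) order, 4 pixels
///             // (12 f32) per store.
///             let dst = out.as_mut_ptr().cast::<f32>();
///             for q in 0..4 {
///                 vst3q_f32(dst.add(q * 12), float32x4x3_t(r4[q], g4[q], b4[q]));
///             }
///         }
///     }
///
///     // Safety: NEON must be available on the executing CPU.
///     #[target_feature(enable = "neon")]
///     unsafe fn widen_plane(p: uint8x16_t) -> [float32x4_t; 4] {
///         unsafe {
///             let lo = vmovl_u8(vget_low_u8(p)); // lanes 0..8 as u16
///             let hi = vmovl_high_u8(p); // lanes 8..16 as u16
///             [
///                 vcvtq_f32_u32(vmovl_u16(vget_low_u16(lo))),
///                 vcvtq_f32_u32(vmovl_high_u16(lo)),
///                 vcvtq_f32_u32(vmovl_u16(vget_low_u16(hi))),
///                 vcvtq_f32_u32(vmovl_high_u16(hi)),
///             ]
///         }
///     }
/// }
/// ```
///
/// Taking `&[u8; 48]` and `&mut [MaybeUninit<f32>; 48]` keeps every
/// load and store in bounds by construction; a slice-based kernel has
/// to establish the same bounds with its length assertions and loop
/// condition instead.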
///
/// # Initialization contract
///
/// Every f32 of `out` is written before this returns — the body loop
/// covers `out[0..body_len * 3]` via raw `vst3q_f32` stores (each
/// store writes 12 contiguous f32 = 48 bytes), and the scalar arm
/// covers the trailing `out[body_len * 3..]` via `MaybeUninit::write`.
/// On return the entire slice is fully initialized.
///
/// # Safety
///
/// 1. NEON must be available on the executing CPU. This is the
/// caller's obligation — the public dispatcher [`bgr_widen`]
/// discharges it via [`crate::simd::is_neon_available`].
/// 2. `src.len()` must be a multiple of 3 and `out.len()` must equal
/// `src.len()`. Both are asserted **unconditionally** here
/// (release-too — a release mismatch would OOB-write `out` or
/// OOB-read `src` in the tile body, and the kernel's init
/// contract is load-bearing for a caller that then calls
/// `Vec::set_len`). The dispatcher also asserts them at its
/// entry point.
///
/// There is no input alignment requirement: `vld3q_u8` and
/// `vst3q_f32` accept unaligned addresses at full throughput on
/// aarch64 (no faulting on misalignment, no perf cliff). The kernel
/// reads `src.as_ptr().add(pixel_idx * 3)` and writes
/// `out.as_mut_ptr().cast::<f32>().add(pixel_idx * 3)` per 16-pixel
/// tile — both within the slices by the bounded `pixel_idx + 16 <=
/// body_len` loop condition.
pub unsafe
/// Widen a packed BGR `&[u8]` pixel buffer to a channel-last RGB
/// `&mut [MaybeUninit<f32>]` (R↔B swap on widen). Routes to NEON on
/// `aarch64` (when the CPU reports NEON), else to
/// [`bgr_widen_scalar`].
///
/// # Preconditions
///
/// - `src.len() % 3 == 0` — each input pixel is 3 bytes.
/// - `out.len() == src.len()` — one output f32 per input byte.
///
/// Both are asserted **unconditionally** (release-too — keeping the
/// assertion shape consistent with the canvas-fill dispatcher and with the
/// "dispatcher asserts unconditionally" rule the SIMD kernels follow).
/// Both internal kernels ([`bgr_widen_scalar`] and
/// [`bgr_widen_neon`]) also assert these preconditions unconditionally
/// at their own entry points so direct callers (the bench, the tests,
/// any future caller bypassing the dispatcher) are equally protected
/// from a release-build size mismatch leaving `MaybeUninit<f32>` slots
/// unwritten and a follow-up `Vec::set_len` exposing uninit memory.
///
/// # Initialization contract
///
/// **Every f32 of `out` is written before this returns.** On return
/// the entire `&mut [MaybeUninit<f32>]` slice is fully initialized;
/// the caller may treat the backing memory as `[f32]` (e.g. via
/// `Vec::set_len` over the covered region after passing
/// `spare_capacity_mut()`).
///
/// Tracking: [#149](https://github.com/Findit-AI/mlxrs/issues/149).
/// This is the BGR arm that LLVM
/// originally failed to auto-vectorize (the destination-side 3-element
/// shuffle was opaque to the iterator-level loop analysis the
/// auto-vectorizer ran on `Vec::push`). The fix is both a restructure
/// of the loop shape (pre-reserve via `try_reserve_exact` + write
/// through `chunks_exact_mut(3) + MaybeUninit::write` instead of
/// three `Vec::push`es per pixel — gives LLVM a shape it can
/// auto-vectorize) **and** a hand-rolled NEON kernel ([`bgr_widen_neon`])
/// that ships unconditionally on `aarch64`.
///
/// Why ship the NEON arm despite the bench showing the auto-vec
/// scalar is faster on the measured M-series sizes: see the module-
/// level doc's "Why the NEON kernel ships unconditionally" section.
/// In short: auto-vectorization is compiler-version-dependent, whereas
/// the SIMD module's contract is to provide a guaranteed arch-specific
/// kernel that does not depend on LLVM heuristics.
///
/// # Correctness class
///
/// `Exact` — the output is the same bit-pattern across the scalar arm
/// and the NEON arm (and bit-identical to the OLD per-push loop). Pure
/// data movement: a 3-way de-interleave + permuted 3-way interleave
/// (R↔B swap) over a lossless u8 → f32 widen. Differential tests in
/// [`mod@self`]'s `tests` module assert this via
/// [`crate::simd::diff::assert_eq_over_lane_sweep`] (scalar vs
/// dispatcher — on `aarch64` the dispatcher routes to NEON, so this
/// is a NEON-vs-scalar identity) and via the explicit
/// `bgr_widen_neon_matches_scalar_bit_identical` test that calls the
/// NEON arm directly.
///
/// # Call site
///
/// [`crate::vlm::image::image_to_array`] — widens the
/// `as_rgb8().as_raw()` `&[u8]` BGR slice into a pre-reserved
/// `Vec<f32>` spare capacity before the `Array::from_slice` FFI call.
/// Passes `buf.spare_capacity_mut()` directly (no `from_raw_parts_mut`
/// cast).