1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
//! VLM-side SIMD kernels.
//!
//! This module hosts the `aarch64` NEON + scalar dispatcher triples
//! for the VLM image-preprocessing kernels. Each ships as its own
//! dispatcher + `arch::neon` kernel + scalar kernel triple (see
//! [`crate::simd`] for the worked example with `dot`), plus a
//! differential test using the helpers in [`crate::simd::diff`].
//!
//! # Shipped kernels
//!
//! - `pad_to_square` canvas fill ([#151](https://github.com/Findit-AI/mlxrs/issues/151),
//! `vlm/image.rs`). Class: `Exact`. The `pad_canvas_fill` submodule
//! holds the dispatcher, the scalar reference, and the 48-byte
//! LCM(3, 16) NEON kernel. The submodule is `#[doc(hidden)]` so the
//! only public surface is the [`crate::vlm::image::pad_to_square`]
//! call site that consumes it.
//! - `image_to_array` BGR R↔B swap widen ([#149](https://github.com/Findit-AI/mlxrs/issues/149),
//! `vlm/image.rs`). Class: `Exact`. The `bgr_widen` submodule holds
//! the dispatcher, the scalar reference (`chunks_exact_mut(3) +
//! MaybeUninit::write` — LLVM auto-vectorizes this shape on aarch64
//! once the destination is a sized slice rather than `Vec::push`),
//! and the hand-rolled NEON `vld3q_u8` + permuted `vst3q_f32`
//! 16-pixel-tile kernel (R↔B swap encoded structurally by feeding
//! the de-interleaved planes to the interleave-store in `(B, G, R)`
//! order). The NEON kernel **ships unconditionally on aarch64** even
//! though the auto-vectorized scalar is ~13–15 % faster at 4096² on
//! M-series silicon: auto-vectorization is compiler-version-dependent
//! and can silently regress on a rustc / LLVM upgrade or a stylistic
//! refactor, whereas the hand-rolled kernel is a guaranteed
//! arch-specific contract that does not depend on auto-vec
//! heuristics; other targets / sizes / surrounding call-site contexts
//! may not auto-vectorize as cleanly as the benched M-series shape.
//! The scalar arm remains as the differential-test oracle and the
//! non-aarch64 fallback. The submodule is `#[doc(hidden)]` so the
//! only public surface is the [`crate::vlm::image::image_to_array`]
//! call site.
//!
//! # Additional shipped kernels
//!
//! - `image_to_array` u8 → f32 RGB widening ([#148](https://github.com/Findit-AI/mlxrs/issues/148),
//! `vlm/image.rs`). Class: `Exact`. The `rgb_widen` submodule holds
//! the dispatcher, the scalar reference, and the 16-byte tile NEON
//! kernel, which ships unconditionally on aarch64.
//! - `rotate_buf` pixel permutation ([#150](https://github.com/Findit-AI/mlxrs/issues/150),
//! `vlm/image.rs`). Class: `Exact`. The `rotate_buf` submodule
//! specialises the **u8 + channels=4** (Rgba8) hot path with a
//! 4-pixel-tile `vld1q_u8` + per-pixel u32 scattered store; all other
//! type / channel combinations fall back to the scalar arm. Per-pixel
//! destination scatter is gather-bound (NEON has no scatter), so the
//! SIMD win is bounded by the per-tile load width. Ships
//! unconditionally on aarch64.
pub use bgr_widen;
pub use pad_canvas_fill;
pub use rgb_widen;
// `rotate_buf::rotate_buf_u8` is re-exported via the public submodule
// (no `pub(crate) use` here yet — the caller wiring lands separately,
// and re-exporting an unused symbol triggers the workspace `-D
// warnings` gate).