//! Core image preprocessing primitives for vision-language models.
//!
//! Ported 1:1 from
//! [`mlx-swift-lm/Libraries/MLXVLM/MediaProcessing.swift`](https://github.com/ml-explore/mlx-swift-lm/blob/main/Libraries/MLXVLM/MediaProcessing.swift)
//! (the focused swift reference for VLM image preprocessing, 567 lines)
//! and cross-checked against
//! [`mlx-vlm/mlx_vlm/utils.py`](https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/utils.py)
//! (`load_image`, `resize_image`, `process_image`).
//!
//! ## Scope
//! - **In scope:** the *cross-model* preprocessing surface every
//! ViT-class VLM encoder shares — image load, resize, channel layout,
//! `[0, 255]` u8 → f32 conversion, `1/255` rescale, per-channel
//! ImageNet-style normalization, uniform-grid patchify, and the
//! end-to-end [`preprocess`] composer.
//! - **Out of scope:** per-model image processors (CLIP / SigLIP / Idefics
//! / Qwen-VL / etc. specialized cropping, dynamic aspect-ratio patching,
//! anyres tiling). Those are per-use-case under the project's
//! no-per-model-arch rule; they live in user code that depends on these
//! primitives. Video frame preprocessing is also out of scope (the
//! `MediaProcessing.asProcessedSequence` family on lines 288-526 of the
//! swift reference) — VLM video support is a sibling concern.
//! The swift
//! [`inSRGBToneCurveSpace`](https://github.com/ml-explore/mlx-swift-lm/blob/main/Libraries/MLXVLM/MediaProcessing.swift#L50-L54)
//! sRGB gamma conversion is also deferred: it is a piecewise nonlinear
//! transform (`x <= 0.0031308 ? 12.92*x : 1.055*x^(1/2.4) - 0.055`),
//! not a color-matrix multiply, and requires a `where`/conditional-select
//! op that is not yet exposed in `mlxrs::ops`. The swift reference uses
//! CoreImage's `CIFilter.linearToSRGBToneCurve()` (an Apple-only
//! primitive); python `mlx-vlm` processors operate on already-decoded
//! sRGB-tagged inputs and do not perform an explicit linear→sRGB step.
//! When sRGB gamma is needed it can be added as a follow-up once the
//! `where_` op is exposed.
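//!
//! For reference, the piecewise curve quoted above can be written as a
//! scalar function. This is a minimal sketch in plain Rust, not part of
//! this module: `linear_to_srgb` is a hypothetical name, and an actual
//! follow-up would presumably apply the same select element-wise on an
//! `Array` once a conditional-select op is available.
//!
//! ```rust
//! // Hypothetical scalar helper (illustration only): maps a linear-light
//! // value to its sRGB-encoded value using the piecewise tone curve.
//! fn linear_to_srgb(x: f32) -> f32 {
//!     if x <= 0.003_130_8 {
//!         // Linear segment near black.
//!         12.92 * x
//!     } else {
//!         // Power-law segment for the rest of the range.
//!         1.055 * x.powf(1.0 / 2.4) - 0.055
//!     }
//! }
//! ```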
//!
//! ## Conventions
//! - **Channel layout (opt-in planar via [`ImageProcessorConfig::layout`]):**
//! [`image_to_array`] always emits channel-last `[H, W, 3]` (so the
//! `(3,)` mean/std broadcast cleanly across the trailing axis without
//! a layout-specific reshape in [`normalize`] / [`rescale`]). Alpha is
//! intentionally dropped — see [`image_to_array`]'s doc for the swift
//! `array[..., :3]` parity citation.
//!
//! [`preprocess`] then applies a **trailing layout post-step** picked
//! via [`ImageProcessorConfig::layout`] (a [`Layout`] enum defaulting
//! to `Hwc` — pre-existing callers see no change):
//! - `Hwc` (default): identity. The historical `[H, W, 3]` output.
//! - `Chw`: one `transpose_axes(&[2, 0, 1])` → `[3, H, W]`. The
//! torchvision / timm / classical-CV planar layout without a batch
//! axis.
//! - `Bchw`: same transpose + one `expand_dims_axes(&[0])` →
//! `[1, 3, H, W]`. Matches swift `MediaProcessing.asMLXArray`
//! (`MediaProcessing.swift:190` —
//! `array.reshape((1, h, w, 3)).transposed(0, 3, 1, 2)`), the layout
//! every `MLXVLM/Models/*.swift` ViT-class encoder consumes
//! directly.
//!
//! Both `Chw` / `Bchw` compose lazily on the [`Array`] (metadata
//! update only — `transpose_axes` and `expand_dims_axes` do not copy
//! the underlying buffer), so the planar arms have zero memory cost
//! beyond the shape/stride bookkeeping. Per-model encoders that need
//! a layout NOT in the three-arm enum (e.g. patchified `[N, P, P, C]`)
//! keep the default `Hwc` output and compose their own trailing step;
//! the [`Layout`] enum only enumerates the layouts that at least one
//! per-model encoder consumes verbatim today.
//! - **Dtype:** [`image_to_array`] returns `f32` in `[0.0, 255.0]` *before*
//! [`rescale`] — exactly mirroring the swift `CIFormat.RGBAf` render
//! (`MediaProcessing.swift:171`) which produces f32 in `[0, 255]`.
//! - **No implicit eval:** every primitive composes lazily on `Array`;
//! callers must `eval()` (or use a data accessor) to materialize.
//! - **No hot-path allocations beyond unavoidable decode/resize:** the
//! `image` crate's `decode` / `imageops::resize` themselves allocate
//! (CPU pixel buffers), and the f32 conversion of the resized buffer is
//! one unavoidable `Vec<f32>` before [`Array::from_slice`] copies it
//! into MLX. All other steps stay on the FFI-owned arrays.
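//!
//! The index mapping behind the `Chw` arm can be sketched eagerly on a
//! flat buffer. This is an illustration under stated assumptions only:
//! `hwc_to_chw` is a hypothetical helper that copies (and allocates
//! infallibly via `vec!`), whereas [`preprocess`] applies the same
//! permutation lazily through `transpose_axes`.
//!
//! ```rust
//! // Hypothetical eager copy from channel-last `[H, W, 3]` to planar
//! // `[3, H, W]`; shown only to make the axis permutation concrete.
//! fn hwc_to_chw(src: &[f32], h: usize, w: usize) -> Vec<f32> {
//!     const C: usize = 3;
//!     assert_eq!(src.len(), h * w * C, "src must hold H*W*3 elements");
//!     let mut dst = vec![0.0_f32; src.len()];
//!     for y in 0..h {
//!         for x in 0..w {
//!             for c in 0..C {
//!                 // HWC offset (y, x, c) maps to CHW offset (c, y, x).
//!                 dst[c * h * w + y * w + x] = src[(y * w + x) * C + c];
//!             }
//!         }
//!     }
//!     dst
//! }
//! ```
//!
//! For a 1×2 image with pixels `(1, 2, 3)` and `(4, 5, 6)`, the planar
//! result groups each channel together: `[1, 4, 2, 5, 3, 6]`.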
//!
//! ## Pipeline
//! ```text
//! load_image → resize → image_to_array → rescale → normalize_imagenet
//! ```
//! [`preprocess`] composes the full chain off a decoded
//! [`image::DynamicImage`] + an [`ImageProcessorConfig`].
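//!
//! The arithmetic of the last two stages can be sketched per pixel,
//! assuming the conventional `(x * rescale_factor - mean) / std` form.
//! `rescale_normalize_pixel` is a hypothetical scalar helper; the real
//! [`rescale`] and [`normalize_imagenet`] operate lazily on an `Array`.
//!
//! ```rust
//! // Hypothetical per-pixel version of rescale followed by normalize.
//! fn rescale_normalize_pixel(
//!     px: [u8; 3],
//!     rescale_factor: f32,
//!     mean: [f32; 3],
//!     stddev: [f32; 3],
//! ) -> [f32; 3] {
//!     let mut out = [0.0_f32; 3];
//!     for c in 0..3 {
//!         // u8 in [0, 255] widened to f32, then scaled (typically by 1/255).
//!         let scaled = f32::from(px[c]) * rescale_factor;
//!         // Per-channel normalization.
//!         out[c] = (scaled - mean[c]) / stddev[c];
//!     }
//!     out
//! }
//! ```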
//!
//! ## Related issues (#120 / #121 / #122 / #123 / #124 / #125 / #126)
//!
//! - **#120** — opt-in planar layout: addressed via
//! [`ImageProcessorConfig::layout`] (a [`Layout`] enum defaulting to
//! `Hwc`, with `Chw` / `Bchw` for planar). Pre-existing callers see
//! no change (`Hwc` is the identity arm); per-model encoders that
//! want `[1, C, H, W]` (the swift
//! `MediaProcessing.asMLXArray` shape) request `Bchw` and the
//! composer applies one lazy `transpose_axes` + `expand_dims_axes` at
//! zero memory cost. See [`Layout`] for the per-arm rationale +
//! parity citations.
//! - **#121** — bulk-fill widen in [`image_to_array`]:
//! addressed via the `rgb_widen` / `bgr_widen` NEON
//! dispatchers wired into BOTH the `as_rgb8()` fast path AND the
//! non-`Rgb8` per-pixel-projection path (the latter builds a
//! contiguous `Vec<u8>` of length `H*W*3` then hands it to the same
//! SIMD dispatcher — the per-pixel `push` overhead the issue called
//! out is gone). See [`image_to_array`] for the per-branch wiring.
//! - **#122** — SIMD resize: closed via the crate's own
//! `vlm::resize` kernel (not `fast_image_resize`).
//! The issue's original ask was to adopt `fast_image_resize`; the
//! subsequent allocation-discipline audit showed `fast_image_resize`
//! allocates its internal scratch (coefficient tables, per-row work
//! buffers) infallibly inside the crate — so a near-budget hostile
//! target could abort the process despite our `Result`. Our own
//! kernel runs the same SIMD shape on aarch64 AND routes every
//! internal buffer through `try_reserve_exact`, so the
//! recoverable-OOM contract is honest end-to-end. See [`resize`] for
//! the per-buffer fallibility breakdown.
//! - **#123** — hand-written SIMD u8→f32 + BGR-swap:
//! addressed via the `rgb_widen` (16-byte tile `vld1q_u8` + 4×
//! `vst1q_f32`) and `bgr_widen` (16-pixel tile `vld3q_u8` +
//! permuted `vst3q_f32`) NEON kernels. Both ship unconditionally on
//! aarch64 ("SIMD ship NEON regardless").
//! - **#124** — fallible-allocation discipline in
//! [`crate::vlm::prompt`]: every production allocation in the splice
//! / mask / placeholder paths uses the crate-private
//! `error::try_with_capacity` helper (i.e. `try_reserve_exact` under
//! the hood); allocator failure surfaces as
//! [`crate::error::Error::OutOfMemory`] instead of aborting. Grep
//! `Vec::with_capacity` in `vlm/prompt.rs` to verify: the only
//! matches are inline comments documenting what the new shape
//! replaces.
//! - **#125** — byte-budget validation in [`resize`]:
//! addressed via the target-dimension guard (`height * width * 4 <=
//! MAX_DECODED_IMAGE_BYTES`) that fires BEFORE any allocation. The
//! signature is `-> Result<DynamicImage>` — over-budget / zero /
//! overflow targets surface as typed errors (`Error::OutOfRange`,
//! `Error::CapExceeded`, `Error::ArithmeticOverflow`) with the offending
//! dims and the cap. See [`resize`]'s "Return type"
//! doc paragraph for the full rationale.
//! - **#126** — `to_rgb8()` clone elision in
//! [`image_to_array`]: addressed via the `as_rgb8()` fast path
//! (borrows the source's backing `&[u8]` directly) + a per-pixel
//! `dynamic_image_rgb_pixel` projection on the non-`Rgb8` branch
//! (NO intermediate source-sized RGB image allocation). Grep
//! `to_rgb8\|to_rgba8` in `vlm/image.rs` to verify: every match is
//! in an inline comment documenting what the new path replaces.
//!
//! ## Allocation-fallibility audit
//!
//! Every source-pixel-scaled allocation in this module is classified
//! below. The table is exhaustive: it covers every hit from a `grep` for
//! `to_rgb*` / `to_rgba*` / `to_luma*` / `clone` / `rotate*` / `flip*` /
//! `apply_orientation` / `crop*` / `ImageBuffer::new` / `RgbImage::*`,
//! plus the resize kernel's own buffers. The audit splits the guarantee
//! into TWO honest columns rather than collapsing them into one
//! ambiguous "fallible" flag:
//!
//! - **Bounded-memory:** an upper bound on the allocation's byte
//! count is enforced before it runs — by [`MAX_DECODED_IMAGE_BYTES`]
//! (load-time decoder cap, 512 MiB), applied to BOTH source-scale
//! allocations and the [`resize`] destination ([`ImageProcessorConfig::size`]
//! is now LOADED from an untrusted on-disk config, so [`resize`]
//! validates its target against the same 512 MiB ceiling before
//! allocating — see [`resize`]'s target-dimension guard).
//! "Y" guarantees the call cannot trigger quadratic / unbounded
//! allocation from hostile input.
//! - **Recoverable-OOM:** an allocator failure surfaces as a typed
//! [`Error::OutOfMemory`] (or [`Error::CapExceeded`] /
//! [`Error::ArithmeticOverflow`] on the pre-allocation overflow gate)
//! rather than aborting the process.
//! "Y" requires every backing allocation under our direct control
//! to route through `try_reserve_exact` (or an equivalent fallible
//! primitive). "N" means image-crate-internal allocations
//! (`rotate90` / `ImageBuffer::new`) may panic-abort on allocator
//! pressure even though the byte count itself is bounded. (The
//! former `fast_image_resize`-internal scratch, i.e. coefficient tables
//! and per-row work buffers, was such an "N"; it is now a "Y" because
//! the resize is owned by `vlm::resize` and every kernel
//! buffer is `try_reserve_exact`-backed.)
//!
//! | Site | Scale | Caller fn | Bounded-memory | Recoverable-OOM | Notes |
//! |-----------------------------------------------|---------------|--------------------------|----------------|-----------------|---------------------------------------------|
//! | `apply_orientation_fallible` u8 variants (Rotate90/270/+FlipH on Luma8/LumaA8/Rgb8/Rgba8) | source pixels | `load_image` →`Result` | Y (`MAX_DECODED_IMAGE_BYTES`) | **Y** | **FIXED:** manual rotate over `try_reserve_exact`-backed buffer — no second alloc, no probe race |
//! | `apply_orientation_fallible` non-u8 variants (Luma16/LumaA16/Rgb16/Rgba16/Rgb32F/Rgba32F + rotates) | source pixels | `load_image` →`Result` | Y (`MAX_DECODED_IMAGE_BYTES`) | **Y** | **FIXED:** covered by manual generic rotate via `try_reserve_exact` over `T: Copy` — 16-bit PNGs (`Luma16`/`LumaA16`/`Rgb16`/`Rgba16` per image-rs PNG decoder) and 32-bit-float `DynamicImage` inputs (`Rgb32F`/`Rgba32F`) now route through the same fallible per-element-type buffer path as u8 |
//! | `apply_orientation_fallible` (NoTransforms/Flip/Rot180, all variants) | in-place | `load_image` →`Result` | Y (in-place) | Y (no alloc) | Upstream `*_in_place` path — zero allocation |
//! | `resize` source RGBA buffer (`try_reserve_exact` + `as_rgba8` fast path / per-pixel `dynamic_image_rgba_pixel`) | source pixels | `resize` →`Result<DynamicImage>` | Y (`MAX_DECODED_IMAGE_BYTES` via `load_image`) | **Y** | **FIXED:** replaced the infallible `img.to_rgba8()` clone with a `try_reserve_exact`-backed buffer filled manually (borrowed RGBA8 fast path or per-pixel projection); handed to the crate's own `vlm::resize` kernel; allocator failure → `Error::OutOfMemory` |
//! | own resize kernel buffers — h+v coefficient tables, inter-pass intermediate, destination (`vlm::resize::resize_rgba8`) | target pixels (untrusted loaded config) | `resize` →`Result<DynamicImage>` | Y (`MAX_DECODED_IMAGE_BYTES` via `resize`'s target guard) | **Y** | **FIXED (own NEON resize, drop `fast_image_resize`):** every kernel buffer routes through `try_reserve_exact`; the target guard rejects zero/overflow/>512 MiB BEFORE any reservation; allocator failure → `Error::OutOfMemory`. Replaces `fast_image_resize`'s infallible internal scratch which could abort despite our `Result`. Output bit-exact with PIL `Image.resize`. |
//! | `img.clone()` (early-return in `center_crop`) | source pixels | `center_crop` →`DynamicImage` | Y (via `load_image` cap) | N (image-rs infallible `Vec::clone`) | OUT-OF-SCOPE: `-> DynamicImage` by reference parity |
//! | `img.crop_imm(...)` (in `center_crop`) | min(source, target) | `center_crop` →`DynamicImage` | Y (≤ source bound) | N | OUT-OF-SCOPE: same parity rationale |
//! | `Vec::<u8>::try_reserve_exact` canvas (in `pad_to_square`) | target square (bounded) | `pad_to_square` →`Result` | Y (`MAX_DECODED_IMAGE_BYTES`) | Y (`try_reserve_exact` + `Error::OutOfMemory`) | FALLIBLE |
//! | `Vec::<f32>::try_reserve_exact` buf (in `image_to_array`) | source pixels (bounded by `load_image` cap) | `image_to_array` →`Result` | Y (via `load_image` cap) | Y (`try_reserve_exact` + `Error::OutOfMemory`) | FALLIBLE |
//! | `dynamic_image_rgb_pixel` / `dynamic_image_rgba_pixel` per-pixel `get_pixel` | none (stack `Rgb`/`Rgba<u8>` only) | shared helpers | Y (zero alloc) | Y (no alloc) | OK — no full-image intermediate alloc |
//! | mlx `Array` ops (rescale/normalize/patchify/preprocess) | output array | each `-> Result<Array>` | Y (output-shape) | Y (mlx backend `Result`) | OK — mlx backend allocator errors surface via `Array::*` `Result` |
//!
//! **Class closure invariant.** This module guarantees
//! **bounded-memory** end-to-end — every source-scale allocation is
//! capped by [`MAX_DECODED_IMAGE_BYTES`] (512 MiB), and the [`resize`]
//! destination (driven by the now-untrusted loaded
//! [`ImageProcessorConfig::size`]) is capped against the same ceiling
//! by [`resize`]'s target-dimension guard; no quadratic or unbounded
//! growth is possible from hostile input OR hostile config.
//! **Recoverable-OOM** is now guaranteed for every allocation under our
//! direct control: [`pad_to_square`]'s canvas, [`image_to_array`]'s
//! f32 buffer, [`resize`]'s source RGBA buffer AND every buffer our own
//! resize kernel allocates (the horizontal + vertical coefficient
//! tables, the inter-pass intermediate, and the destination — see
//! `vlm::resize`), and the rotate buffer for every
//! `DynamicImage` element type (u8 / u16 / f32) in
//! `apply_orientation_fallible` (private helper called by [`load_image`])
//! — the generic-rotate path covers 16-bit-PNG-derived
//! `Luma16`/`LumaA16`/`Rgb16`/`Rgba16` (image-rs 0.25's PNG decoder
//! emits 16-bit variants for `BitDepth::Sixteen` PNG inputs) and
//! caller-supplied `Rgb32F`/`Rgba32F` inputs. [`resize`] also rejects an
//! over-budget / zero / overflowing target as a typed
//! [`Error::OutOfRange`] / [`Error::CapExceeded`] /
//! [`Error::ArithmeticOverflow`] BEFORE allocating (its target now flows
//! from an untrusted on-disk config), then materializes its source RGBA
//! buffer (formerly an infallible `img.to_rgba8()` clone) via
//! `try_reserve_exact` and hands it to our own `vlm::resize`
//! kernel — which dropped the `fast_image_resize` dependency precisely
//! because that crate allocated its internal scratch infallibly and
//! could abort despite our `Result`. So a just-under-cap hostile target
//! can no longer force a ~512 MiB infallible alloc anywhere in the
//! resize path, and `resize`'s `Result` signature is honest. The
//! ONLY remaining sites that keep an infallible image-crate `Vec`
//! allocator are the `-> DynamicImage` by-reference-parity helpers
//! (`center_crop`'s early-return `img.clone()` / `crop_imm`), whose byte
//! counts are still bounded ≤512 MiB (source by the decoder cap) so the
//! only abort path is a genuine system-wide OOM at a ≤512 MiB request;
//! their public docstrings document this residual, and their signatures
//! stay `-> DynamicImage` per `feedback_match_official_binding_design`.
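//!
//! The guard-then-reserve pattern the audit relies on can be sketched
//! with the standard library alone. Everything named here
//! (`BUDGET_BYTES`, `SketchError`, `alloc_rgba_target`) is hypothetical
//! and only mirrors the shape of the `height * width * 4` target guard;
//! the module's real error variants carry payloads that are not
//! reproduced in this sketch.
//!
//! ```rust
//! use std::collections::TryReserveError;
//!
//! // Hypothetical stand-in for the 512 MiB ceiling.
//! const BUDGET_BYTES: u64 = 512 * 1024 * 1024;
//!
//! // Hypothetical error type; the crate's own `Error` is richer.
//! #[derive(Debug)]
//! enum SketchError {
//!     ZeroDim,
//!     Overflow,
//!     OverBudget { bytes: u64 },
//!     OutOfMemory(TryReserveError),
//! }
//!
//! // Validate the RGBA target size BEFORE reserving, then reserve fallibly.
//! fn alloc_rgba_target(height: u32, width: u32) -> Result<Vec<u8>, SketchError> {
//!     if height == 0 || width == 0 {
//!         return Err(SketchError::ZeroDim);
//!     }
//!     // Checked arithmetic: u32::MAX * u32::MAX * 4 does not fit in u64.
//!     let bytes = u64::from(height)
//!         .checked_mul(u64::from(width))
//!         .and_then(|pixels| pixels.checked_mul(4))
//!         .ok_or(SketchError::Overflow)?;
//!     if bytes > BUDGET_BYTES {
//!         return Err(SketchError::OverBudget { bytes });
//!     }
//!     let len = usize::try_from(bytes).map_err(|_| SketchError::Overflow)?;
//!     let mut buf: Vec<u8> = Vec::new();
//!     // Allocator failure becomes an `Err` instead of aborting the process.
//!     buf.try_reserve_exact(len).map_err(SketchError::OutOfMemory)?;
//!     // Capacity is already reserved, so this fill does not reallocate.
//!     buf.resize(len, 0);
//!     Ok(buf)
//! }
//! ```
//!
//! The ordering is the point of the sketch: zero, overflow, and budget
//! checks all run before the first reservation, so a hostile target
//! never reaches the allocator.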
use crate::{
Dtype,
array::Array,
error::{
ArithmeticOverflowPayload, CapExceededPayload, DivisibilityConstraintPayload, Error,
FileIoPayload, FileOp, LengthMismatchPayload, OutOfRangePayload, ParsePayload,
RankMismatchPayload, Result, UnknownEnumValuePayload, UnsupportedDtypePayload,
},
ops::{
arithmetic::{divide, multiply, subtract},
misc::astype,
shape::{reshape, transpose_axes},
},
};
/// Upper bound on decoded RGB pixel-buffer size accepted by host-side
/// allocators in this module (e.g. [`pad_to_square`]'s `size × size × 3`
/// canvas). Matches `image::Limits::default().max_alloc = 512 * 1024 *
/// 1024` (the same 512 MiB ceiling [`load_image`] enforces via
/// `Limits::default().reserve(decoder.total_bytes())?` — see the
/// `Allocation guard` block in [`load_image`]'s doc). Exposing a single
/// shared constant here keeps the per-step caps consistent: a
/// `DynamicImage` that fit through `load_image` still has to clear this
/// gate before any quadratic-canvas builder allocates.
pub const MAX_DECODED_IMAGE_BYTES: u64 = 512 * 1024 * 1024;
/// Interpolation filter for [`resize`], mirroring swift
/// `MediaProcessing.swift`'s resampler choices (lines 81-132):
/// [`resampleLanczos`](https://github.com/ml-explore/mlx-swift-lm/blob/main/Libraries/MLXVLM/MediaProcessing.swift#L81-L103)
/// and
/// [`resampleBicubic`](https://github.com/ml-explore/mlx-swift-lm/blob/main/Libraries/MLXVLM/MediaProcessing.swift#L110-L132).
///
/// Backed by mlxrs's own fully-fallible, PIL-matching resize kernel
/// (`vlm::resize`) — bit-exact with PIL `Image.resize` (the
/// reference mlx-vlm preprocessing targets). Filter names match PIL's
/// `Image.NEAREST` / `BILINEAR` / `BICUBIC` / `LANCZOS` resampling
/// filters, so existing call-site usage of `Bilinear` / `Bicubic` /
/// `Lanczos3` / `Nearest` is unchanged.
///
/// The swift reference exposes `bicubic` (default) and `lanczos`; we add
/// `Nearest` and `Bilinear` because they appear in the python VLM ecosystem
/// (PIL's `Image.resize` `resample=` argument that `mlx-vlm`'s
/// `resize_image` uses transitively at `mlx_vlm/utils.py:835-839`).
#[derive(Debug, Clone, Copy, PartialEq, Eq, derive_more::Display, derive_more::IsVariant)]
#[display("{}", self.as_str())]
pub enum ResizeFilter {
/// Nearest-neighbor (no smoothing). Cheapest; rarely used for VLM.
/// PIL `Image.NEAREST`.
Nearest,
/// Bilinear interpolation (triangle kernel, support 1.0). PIL
/// `Image.BILINEAR`.
Bilinear,
/// Bicubic interpolation (Keys cubic `a = -0.5`, support 2.0) — matches
/// PIL's `Image.BICUBIC`. Mirrors the swift `resampleBicubic` default
/// (`MediaProcessing.swift:110-132`); the recommended choice for most
/// ViT-class encoders.
Bicubic,
/// Lanczos3 interpolation (window=3 sinc-windowed sinc, support 3.0) —
/// matches PIL's `Image.LANCZOS`. Mirrors the swift `resampleLanczos`
/// (`MediaProcessing.swift:81-103`).
Lanczos3,
}
impl ResizeFilter {
/// Lowercase string tag, following PIL's resampling filter names
/// (`Lanczos3` is tagged `"lanczos3"`; PIL calls that filter `LANCZOS`).
pub const fn as_str(&self) -> &'static str {
match self {
Self::Nearest => "nearest",
Self::Bilinear => "bilinear",
Self::Bicubic => "bicubic",
Self::Lanczos3 => "lanczos3",
}
}
}
impl ResizeFilter {
/// Map to the crate-internal `vlm::resize::Filter` consumed by
/// the crate's own fallible resize kernel. Kept private; the internal filter
/// type does not leak into the public surface. The mapping is 1:1 and
/// each variant matches the identically-named PIL resampling filter
/// (see `vlm::resize` for the per-filter PIL parity notes).
fn to_internal(self) -> crate::vlm::resize::Filter {
use crate::vlm::resize::Filter;
match self {
Self::Nearest => Filter::Nearest,
Self::Bilinear => Filter::Bilinear,
Self::Bicubic => Filter::Bicubic,
Self::Lanczos3 => Filter::Lanczos3,
}
}
}
/// Channel order for [`image_to_array`]. `RGB` is the swift default
/// (`MediaProcessing.swift:171` — `CIFormat.RGBAf`'s RGBA channel order);
/// `BGR` is exposed for parity with python image-processor configs that
/// use OpenCV-style BGR (e.g. some older CLIP variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq, derive_more::Display, derive_more::IsVariant)]
#[display("{}", self.as_str())]
pub enum ColorOrder {
/// Red-Green-Blue (the default; matches PIL / swift CoreImage).
Rgb,
/// Blue-Green-Red (OpenCV-style; swap R↔B).
Bgr,
}
impl ColorOrder {
/// Lowercase string tag matching Python color-order convention.
pub const fn as_str(&self) -> &'static str {
match self {
Self::Rgb => "rgb",
Self::Bgr => "bgr",
}
}
}
/// Trailing tensor layout applied by [`preprocess`] AFTER the
/// channel-last `[H, W, 3]` ImageNet pipeline (resize → widen → rescale
/// → normalize) completes.
///
/// **Why an enum, not a per-model post-step.** The cross-model primitive
/// always emits `[H, W, 3]` from [`image_to_array`] (so the `(3,)`
/// mean/std broadcasts cleanly across the trailing axis without a
/// layout-specific reshape — see the module-doc `Conventions > Channel
/// layout` block for the rationale). The downstream per-model encoder
/// then needs one of three concrete layouts depending on its
/// architecture:
///
/// - `Hwc` (`[H, W, 3]`): identity. The historical [`preprocess`]
/// output; the natural ImageNet-pipeline layout. Default for source
/// compatibility — pre-existing callers see no change.
/// - `Chw` (`[3, H, W]`): the planar layout torchvision / timm /
/// classical-CV stack uses without a batch axis. One
/// `transpose_axes(&[2, 0, 1])` over the lazy `Array` — metadata
/// update only, no buffer copy.
/// - `Bchw` (`[1, 3, H, W]`): the planar+batch layout swift
/// `MediaProcessing.asMLXArray` produces (`MLXVLM/MediaProcessing.swift:190`,
/// `array.reshape(1, h, w, 3).transposed(0, 3, 1, 2)`) for direct
/// ViT-encoder feed. One `transpose_axes` plus one
/// `expand_dims_axes(&[0])` over the lazy `Array`.
///
/// The per-model encoder picks the layout via
/// [`ImageProcessorConfig::layout`]; the cross-model primitive owns the
/// trailing post-step so the per-model code does NOT manually compose
/// the transpose + expand at every call site (and so a future variant —
/// e.g. a `[H, W, 3, 1]` patch grid — only adds one new arm here).
///
/// **Cost.** The post-step composes lazily on the [`Array`] — both
/// [`transpose_axes`] and `expand_dims_axes` update strides/shape
/// metadata without copying the underlying buffer (mlx's standard
/// no-op view semantics). The `Bchw` and `Chw` arms therefore have
/// **zero memory cost** beyond the metadata update.
///
/// **Parity citations.**
/// - swift `MediaProcessing.asMLXArray` line 190 — planar `[1, C, H, W]`
/// via `reshape((1, h, w, 3))` + `transposed(0, 3, 1, 2)`;
/// this is the `Bchw` arm verbatim.
/// - python `mlx_vlm` per-model processors (`siglip` /
/// `clip_image_processor`) call `np.transpose(arr, (2, 0, 1))` after
/// the ImageNet pipeline for the `Chw` arm.
///
/// Tracking issue: [#120](https://github.com/Findit-AI/mlxrs/issues/120).
#[derive(Debug, Clone, Copy, PartialEq, Eq, derive_more::Display, derive_more::IsVariant)]
#[display("{}", self.as_str())]
pub enum Layout {
/// Channel-last `[H, W, 3]`. Identity post-step — the historical
/// [`preprocess`] output, kept as default for source compatibility.
Hwc,
/// Planar `[3, H, W]`. One `transpose_axes(&[2, 0, 1])`; no batch axis.
Chw,
/// Planar batched `[1, 3, H, W]`. Matches swift's
/// `MediaProcessing.asMLXArray` (`MediaProcessing.swift:190`).
Bchw,
}
impl Layout {
/// Lowercase string tag.
pub const fn as_str(&self) -> &'static str {
match self {
Self::Hwc => "hwc",
Self::Chw => "chw",
Self::Bchw => "bchw",
}
}
}
/// Image preprocessing config — the *union* of fields common across VLM
/// image processors.
///
/// Mirrors the swift `MediaProcessing` pipeline configuration (no single
/// struct in the swift source — the swift pipeline composes call-site
/// arguments at `MediaProcessing.swift:30-39` in the module-doc example,
/// and the python `BaseImageProcessor` HF subclasses expose the same
/// fields). [`Default`] is the ImageNet baseline that matches the values
/// hardcoded in nearly every HF image-processor JSON:
/// `mean = [0.485, 0.456, 0.406]`, `std = [0.229, 0.224, 0.225]`,
/// `rescale_factor = 1/255`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ImageProcessorConfig {
/// Target image size `(height, width)`.
size: (u32, u32),
/// Per-channel mean for [`normalize_imagenet`].
mean: [f32; 3],
/// Per-channel std for [`normalize_imagenet`].
std: [f32; 3],
/// Multiplier applied by [`rescale`] (typically `1.0 / 255.0`).
rescale_factor: f32,
/// Whether the [`preprocess`] composer applies [`resize`].
do_resize: bool,
/// Whether the [`preprocess`] composer applies [`rescale`].
do_rescale: bool,
/// Whether the [`preprocess`] composer applies [`normalize_imagenet`].
do_normalize: bool,
/// Interpolation filter forwarded to [`resize`].
resample: ResizeFilter,
/// Channel order forwarded to [`image_to_array`].
color_order: ColorOrder,
/// Trailing tensor layout applied by [`preprocess`] after the
/// ImageNet pipeline. Default [`Layout::Hwc`] preserves the
/// historical channel-last `[H, W, 3]` output for source
/// compatibility; per-model encoders that consume planar
/// `[1, C, H, W]` (the swift `MediaProcessing.asMLXArray` shape) can
/// opt in via [`Layout::Bchw`] without a manual call-site transpose.
/// See [`Layout`] for the full per-arm rationale + cost analysis
/// (zero copy — lazy `Array` metadata update only).
///
/// Tracking issue: [#120](https://github.com/Findit-AI/mlxrs/issues/120).
layout: Layout,
}
impl Default for ImageProcessorConfig {
fn default() -> Self {
Self::new()
}
}
impl ImageProcessorConfig {
/// ImageNet defaults: `size = (224, 224)`, `mean = [0.485, 0.456,
/// 0.406]`, `std = [0.229, 0.224, 0.225]`, `rescale_factor = 1/255`,
/// `resample = Bicubic`, `color_order = Rgb`, `layout = Hwc`, all
/// `do_*` flags `true`. These are the values nearly every CLIP /
/// SigLIP / DINO / ViT preprocessing config ships with. `Hwc` is the
/// historical [`preprocess`] output — pre-existing callers see no
/// change; opt into [`Layout::Bchw`] / [`Layout::Chw`] explicitly when
/// the per-model encoder wants planar layout.
pub fn new() -> Self {
Self {
size: (224, 224),
mean: [0.485, 0.456, 0.406],
std: [0.229, 0.224, 0.225],
rescale_factor: 1.0 / 255.0,
do_resize: true,
do_rescale: true,
do_normalize: true,
resample: ResizeFilter::Bicubic,
color_order: ColorOrder::Rgb,
layout: Layout::Hwc,
}
}
// ── builders ──────────────────────────────────────────────────────────────
/// Set the target image size `(height, width)`.
#[must_use]
pub fn with_size(mut self, v: (u32, u32)) -> Self {
self.size = v;
self
}
/// Set the per-channel mean for normalization.
#[must_use]
pub fn with_mean(mut self, v: [f32; 3]) -> Self {
self.mean = v;
self
}
/// Set the per-channel std for normalization.
#[must_use]
pub fn with_std(mut self, v: [f32; 3]) -> Self {
self.std = v;
self
}
/// Set the rescale factor (typically `1.0 / 255.0`).
#[must_use]
pub fn with_rescale_factor(mut self, v: f32) -> Self {
self.rescale_factor = v;
self
}
/// Set the `do_resize` flag.
#[must_use]
pub fn with_do_resize(mut self, v: bool) -> Self {
self.do_resize = v;
self
}
/// Set the `do_rescale` flag.
#[must_use]
pub fn with_do_rescale(mut self, v: bool) -> Self {
self.do_rescale = v;
self
}
/// Set the `do_normalize` flag.
#[must_use]
pub fn with_do_normalize(mut self, v: bool) -> Self {
self.do_normalize = v;
self
}
/// Set the interpolation filter.
#[must_use]
pub fn with_resample(mut self, v: ResizeFilter) -> Self {
self.resample = v;
self
}
/// Set the channel order.
#[must_use]
pub fn with_color_order(mut self, v: ColorOrder) -> Self {
self.color_order = v;
self
}
/// Set the trailing tensor layout.
#[must_use]
pub fn with_layout(mut self, v: Layout) -> Self {
self.layout = v;
self
}
// ── accessors ─────────────────────────────────────────────────────────────
/// Target image size `(height, width)`.
#[inline(always)]
pub fn size(&self) -> (u32, u32) {
self.size
}
/// Per-channel mean for normalization.
#[inline(always)]
pub fn mean(&self) -> [f32; 3] {
self.mean
}
/// Per-channel std for normalization.
#[inline(always)]
pub fn std(&self) -> [f32; 3] {
self.std
}
/// Rescale factor (typically `1.0 / 255.0`).
#[inline(always)]
pub fn rescale_factor(&self) -> f32 {
self.rescale_factor
}
/// Whether [`preprocess`] applies [`resize`].
#[inline(always)]
pub fn do_resize(&self) -> bool {
self.do_resize
}
/// Whether [`preprocess`] applies [`rescale`].
#[inline(always)]
pub fn do_rescale(&self) -> bool {
self.do_rescale
}
/// Whether [`preprocess`] applies [`normalize_imagenet`].
#[inline(always)]
pub fn do_normalize(&self) -> bool {
self.do_normalize
}
/// Interpolation filter for [`resize`].
#[inline(always)]
pub fn resample(&self) -> ResizeFilter {
self.resample
}
/// Channel order.
#[inline(always)]
pub fn color_order(&self) -> ColorOrder {
self.color_order
}
/// Trailing tensor layout.
#[inline(always)]
pub fn layout(&self) -> Layout {
self.layout
}
}
/// Load and decode an image from disk, applying EXIF orientation.
///
/// Mirrors the swift `CIImage(contentsOf: url)` /
/// `CIImage(cgImage:)` entry points implied by
/// `MediaProcessing.swift:288-330` (the video frame loader uses
/// `CIImage(cgImage:)` per line 321-322 — Apple's `CIImage` honors
/// EXIF orientation transparently on macOS), and the python
/// [`load_image`](https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/utils.py#L801-L832)
/// which explicitly applies `ImageOps.exif_transpose(image)` before
/// returning. To match that parity, we route through `ImageReader`,
/// read the decoder's `orientation()` (which inspects EXIF metadata),
/// decode, then `DynamicImage::apply_orientation` so common
/// phone-camera JPEGs come back upright.
///
/// **Scope:** local files only. HTTP / data-URI / `BytesIO` sources are
/// handled by the python reference but are out of scope here; callers
/// that need them can construct an `image::DynamicImage` themselves and
/// hand it to [`preprocess`] / [`resize`] directly.
///
/// **Allocation guard:** this function honors the `image` crate's
/// default `Limits` (512 MiB `max_alloc`) — we explicitly call
/// `Limits::default().reserve(decoder.total_bytes())?` before consuming
/// the decoder, mirroring what `ImageReader::decode()` does internally
/// (image 0.25 `io::image_reader_type::ImageReader::decode`). Using
/// `into_decoder` so we can read EXIF orientation does NOT bypass this
/// check. Callers that need a different ceiling can pre-validate
/// dimensions with `image::ImageReader::open(path)?.into_dimensions()?`
/// (O(1) header probe), then either accept the image or build a custom
/// reader with their own `image::Limits` before handing the result to
/// [`preprocess`]. Oversized images are rejected with
/// [`Error::Backend`] (the underlying `image::ImageError::Limits`).
///
/// **EXIF rotate gate.** The EXIF orientation step
/// is routed through the private `apply_orientation_fallible` helper.
/// The decoder-side `Limits` cap above does NOT protect the rotate
/// variants (`Rotate90` / `Rotate270` / `Rotate90FlipH` /
/// `Rotate270FlipH`) which each need a NEW source-sized buffer — the
/// decoder has already been consumed by `from_decoder` at that point.
/// The fallible helper allocates that buffer directly via
/// `Vec::try_reserve_exact` and performs a manual rotation into it via
/// the generic `rotate_buf<T: Copy + Default>` helper, which covers
/// every `image` 0.25 pixel variant: u8 (`Luma8` / `LumaA8` / `Rgb8` /
/// `Rgba8`), u16 (`Luma16` / `LumaA16` / `Rgb16` / `Rgba16` — image-rs
/// 0.25's PNG decoder emits these for 16-bit-per-channel PNGs per
/// `codecs/png.rs:64-71`, and PNG decoders expose EXIF orientation
/// via `ImageDecoder::orientation`), and f32 (`Rgb32F` / `Rgba32F`).
/// Allocator failure surfaces as [`Error::OutOfMemory`] before the
/// pixel-copy loop runs — no probe-then-delegate race, no infallible
/// `apply_orientation` fallback. In-place variants (`NoTransforms` /
/// `FlipHorizontal` / `FlipVertical` / `Rotate180`) pass through with
/// zero allocation. See the helper's doc for the full rationale and
/// the module-level audit table for the per-variant bounded-memory vs
/// recoverable-OOM contract.
pub fn load_image(path: &std::path::Path) -> Result<::image::DynamicImage> {
// `ImageDecoder` is the trait that provides `.orientation()`; pull it
// into local scope so the method resolves on the opaque decoder type
// returned by `into_decoder`.
use ::image::ImageDecoder as _;
let parse_err =
|e: ::image::ImageError| Error::Parse(ParsePayload::new("vlm::image::load_image", "image", e));
let io_err = |e: std::io::Error| {
Error::FileIo(FileIoPayload::new(
"vlm::image::load_image",
FileOp::Read,
path.to_path_buf(),
e,
))
};
// `ImageReader::open` guesses the format from the path extension;
// `with_guessed_format` then sniffs the file header and overrides that
// guess when the content is recognized, so extension-less or misnamed
// paths still decode (mirroring python `Image.open`, which sniffs the
// file header).
let reader = ::image::ImageReader::open(path)
.map_err(io_err)?
.with_guessed_format()
.map_err(io_err)?;
let mut decoder = reader.into_decoder().map_err(parse_err)?;
// NOTE: `into_decoder()` calls `JpegDecoder::new()` which does
// `r.read_to_end(&mut input)?` *before* any `Limits` check fires
// (image 0.25 `codecs/jpeg/decoder.rs:30-33`), so a very large JPEG
// file allocates the compressed bytes uncapped before `total_bytes()`
// gates the decoded buffer. Adding a cap on the compressed read was
// considered and rejected on faithful-parity grounds: the
// upstream canonical `ImageReader::decode()` flow has the *identical*
// ordering (`image_reader_type.rs:311-322` — `make_decoder` runs the
// jpeg `read_to_end` before `limits.reserve(decoder.total_bytes())`),
// and python `PIL.Image.open` likewise does not cap compressed input
// (only the post-decode `MAX_IMAGE_PIXELS` warning). The function's
// documented scope is *local files only* — callers that need to
// bound untrusted input should pre-validate with
// `std::fs::metadata(path).len()` or use a `Take`-wrapped reader, the
// same as for any `std::fs::read`. This primitive mirrors the
// references' behavior and does not add divergent hardening.
// `decoder.orientation()` returns `Orientation::NoTransforms` for
// formats that don't carry orientation metadata. With the current
// `image` features (`png` + `jpeg`; TIFF/WebP NOT in the build,
// `mlxrs/Cargo.toml`) BOTH formats may expose EXIF orientation:
// JpegDecoder parses APP1/Exif, and image 0.25 PngDecoder exposes
// `exif_metadata` which the default `ImageDecoder::orientation`
// parses — so 16-bit PNGs with EXIF `Rotate90`/`Rotate270` reach
// the rotate path here too (this is NOT JPEG-only as of
// `image` 0.25.10). All rotate
// orientations are handled by `apply_orientation_fallible` over
// `rotate_buf<T>` covering u8/u16/f32 pixel variants. We read
// orientation here while we still have a `&mut` borrow on the
// decoder; once it's consumed by `from_decoder` below, the
// metadata can no longer be queried.
let orientation = decoder.orientation().map_err(parse_err)?;
// Preserve the 512 MiB default allocation guard that
// `ImageReader::decode()` enforces. Our use of `into_decoder` (so
// we can read orientation) skips the `limits.reserve(total_bytes)`
// check `decode()` does internally — see image 0.25
// `io::image_reader_type::ImageReader::decode`. Mirror that check
// explicitly so an oversized image is rejected with a clean
// `Error::Parse` (wrapping `image::ImageError::Limits`) instead of
// running through the decoder and aborting on a downstream allocation.
let mut limits = ::image::Limits::default();
limits.reserve(decoder.total_bytes()).map_err(parse_err)?;
decoder.set_limits(limits).map_err(parse_err)?;
let img = ::image::DynamicImage::from_decoder(decoder).map_err(parse_err)?;
apply_orientation_fallible(img, orientation)
}
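/*
The NOTE inside `load_image` leaves bounding the compressed input to the
caller. A minimal std-only sketch of the two options it names, a
`metadata().len()` pre-check and a `Take`-wrapped read, might look like the
following. `check_compressed_size` and `read_capped` are hypothetical helper
names, not part of this module, and the ceiling is whatever the caller picks.

```rust
use std::fs::{self, File};
use std::io::{self, Read};
use std::path::Path;

/// Build the error both helpers return for an oversized file.
fn too_large(path: &Path, max_bytes: u64) -> io::Error {
    io::Error::new(
        io::ErrorKind::InvalidData,
        format!("{} exceeds the {}-byte ceiling", path.display(), max_bytes),
    )
}

/// Hypothetical pre-check: compare the on-disk size against `max_bytes`
/// without reading the file contents.
pub fn check_compressed_size(path: &Path, max_bytes: u64) -> io::Result<u64> {
    let len = fs::metadata(path)?.len();
    if len > max_bytes {
        return Err(too_large(path, max_bytes));
    }
    Ok(len)
}

/// Hypothetical capped read: never pulls more than `max_bytes + 1` bytes
/// into memory, so a file that grows after a metadata check is still caught.
pub fn read_capped(path: &Path, max_bytes: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // One byte past the ceiling is enough to detect an oversized file.
    let limit = max_bytes.saturating_add(1);
    File::open(path)?.take(limit).read_to_end(&mut buf)?;
    if buf.len() as u64 > max_bytes {
        return Err(too_large(path, max_bytes));
    }
    Ok(buf)
}
```

The metadata check is cheap but racy if the file can change between the check
and the open; the capped read closes that window at the cost of buffering the
compressed bytes, which would then have to be decoded from memory rather than
through `load_image`'s path-based entry point.
*/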
/// Apply EXIF `orientation` to `img` with a *truly recoverable*
/// allocator gate on the four allocating rotate variants
/// (`Rotate90` / `Rotate270` / `Rotate90FlipH` / `Rotate270FlipH`),
/// for **every** `DynamicImage` element type — `u8`
/// (Luma8 / LumaA8 / Rgb8 / Rgba8), `u16`
/// (Luma16 / LumaA16 / Rgb16 / Rgba16), and `f32`
/// (Rgb32F / Rgba32F).
///
/// **The defect class this closes.** A "probe-then-delegate" pattern
/// (reserve a throwaway `Vec<u8>`, drop it, then call image-rs's
/// `apply_orientation`) is race-prone — allocator pressure between the
/// probe drop and the real `rotate90` / `rotate270` alloc can flip the
/// result from "would succeed" to "aborts". Instead, the rotated buffer
/// is built INSIDE a `try_reserve_exact`-backed `Vec<T>` via a *manual*
/// rotation (no second alloc, no probe drop, no race window),
/// generalized over the element type `T: Copy + Default` (see [`rotate_buf`]) so
/// the u16 and f32 variants share the same `try_reserve_exact` gate as
/// u8. Generalizing over all element types is required, not optional:
/// image-rs 0.25's PNG decoder emits `Luma16` / `LumaA16` / `Rgb16` /
/// `Rgba16` for 16-bit PNG inputs (`codecs/png.rs:64-71`), so a 16-bit
/// PNG carrying an EXIF `Rotate90` / `Rotate270` orientation could
/// otherwise reach an infallible fallback from [`load_image`] and abort
/// on allocator pressure. Allocator failure now surfaces as
/// [`Error::OutOfMemory`] exactly once for every `DynamicImage` variant,
/// before the pixel-copy loop runs.
///
/// **Dispatch.**
/// - **All rotate orientations × all element types** (u8 / u16 / f32):
/// manual per-pixel rotation into a `Vec<T>` whose backing buffer
/// was reserved via `try_reserve_exact`. Rebuild a `DynamicImage`
/// via `ImageBuffer::from_raw`. Recoverable-OOM guaranteed.
/// - **No-op / in-place variants** (NoTransforms, FlipHorizontal,
/// FlipVertical, Rotate180) on any variant: zero source-sized
/// alloc — upstream `apply_orientation` dispatches to
/// `*_in_place` helpers. Recoverable-OOM trivially (no alloc).
///
/// **Rotation conventions** (cross-checked against
/// `image-0.25.10/src/imageops/affine.rs`):
/// - `rotate90` (clockwise 90°): for each `(x, y)` in source dims
/// `(w0, h0)`, writes to `(h0 - 1 - y, x)` in output dims
/// `(h0, w0)`.
/// - `rotate270` (clockwise 270°): writes `(x, y)` → `(y, w0 - 1 - x)`
/// in output dims `(h0, w0)`.
/// - `Rotate90FlipH`: rotate90 then `fliph_in_place`. Composed
/// directly into the destination via `(h0 - 1 - y, x)` then
/// `(width - 1 - new_x, new_y)`, which collapses to writing
/// `(y, x)` directly.
/// - `Rotate270FlipH`: rotate270 then `fliph_in_place`. Collapses
/// to writing `(h0 - 1 - y, w0 - 1 - x)`.
fn apply_orientation_fallible(
mut img: ::image::DynamicImage,
orientation: ::image::metadata::Orientation,
) -> Result<::image::DynamicImage> {
use ::image::metadata::Orientation;
match orientation {
// No-op / in-place variants: zero source-sized alloc on any
// pixel variant — upstream `apply_orientation` dispatches to
// `*_in_place` helpers or returns immediately. See
// `image::DynamicImage::apply_orientation` arms at
// `images/dynimage.rs:1163-1180`:
// - NoTransforms → `()` (no-op)
// - FlipHorizontal → `fliph_in_place`
// - FlipVertical → `flipv_in_place`
// - Rotate180 → `rotate180_in_place`
Orientation::NoTransforms
| Orientation::FlipHorizontal
| Orientation::FlipVertical
| Orientation::Rotate180 => {
img.apply_orientation(orientation);
Ok(img)
}
// Allocating variants: every `DynamicImage` element type
// (u8 / u16 / f32) routes through the generic-rotate path in
// `apply_rotate_fallible`. No infallible fallback remains. Each
// orientation maps one-to-one onto its `RotateKind`.
Orientation::Rotate90 => apply_rotate_fallible(img, RotateKind::Rotate90),
Orientation::Rotate270 => apply_rotate_fallible(img, RotateKind::Rotate270),
Orientation::Rotate90FlipH => apply_rotate_fallible(img, RotateKind::Rotate90FlipH),
Orientation::Rotate270FlipH => apply_rotate_fallible(img, RotateKind::Rotate270FlipH),
}
}
/// Internal kind tag for the four EXIF rotation orientations that
/// allocate a fresh source-sized buffer. Carried as a typed enum (not
/// re-derived from `image::Orientation`) so the manual-rotation
/// dispatch in [`apply_rotate_fallible`] can `match` on a closed set.
#[derive(Debug, Clone, Copy)]
enum RotateKind {
/// Clockwise 90° (transpose, then mirror left-to-right).
Rotate90,
/// Clockwise 270° (transpose, then mirror top-to-bottom).
Rotate270,
/// `Rotate90` then horizontal flip.
Rotate90FlipH,
/// `Rotate270` then horizontal flip.
Rotate270FlipH,
}
/// Manual fallible rotation for every [`DynamicImage`] element type
/// (u8: Luma8 / LumaA8 / Rgb8 / Rgba8; u16: Luma16 / LumaA16 / Rgb16
/// / Rgba16; f32: Rgb32F / Rgba32F). The rotated buffer is built
/// INSIDE a `try_reserve_exact`-backed `Vec<T>` (via [`rotate_buf`])
/// so an allocator failure surfaces as [`Error::OutOfMemory`]
/// exactly once, before the pixel-copy loop runs (no probe race
/// window, no infallible fallback to image-rs).
///
/// **Why all element types.** The rotate is generic over `T: Copy + Default` so
/// the same fallible path covers u16 and f32, not just u8: image-rs
/// 0.25's PNG decoder emits 16-bit variants for `BitDepth::Sixteen`
/// PNG inputs (`codecs/png.rs:64-71`), which can be reached from
/// [`load_image`], so a non-u8 variant on an infallible
/// `apply_orientation` fallback would abort on allocator pressure.
/// `DynamicImage` is `#[non_exhaustive]` upstream, so a wildcard
/// arm is required by the compiler; we surface any future-added
/// variant as a typed [`Error::UnknownEnumValue`] rather than silently
/// falling back to image-rs's infallible `apply_orientation` — the
/// "no abort on allocator pressure" contract holds for every
/// variant the helper handles today (the ten listed above), and a
/// new variant produces a recoverable `Err` until its rotate arm
/// is wired in.
fn apply_rotate_fallible(
img: ::image::DynamicImage,
rotation: RotateKind,
) -> Result<::image::DynamicImage> {
use ::image::{DynamicImage, ImageBuffer, Luma, LumaA, Rgb, Rgba};
let w = u64::from(img.width());
let h = u64::from(img.height());
let bytes_per_pixel = u64::from(img.color().bytes_per_pixel());
// `width * height * bytes_per_pixel`, computed in u64 to keep the
// overflow check host-arch-independent. `w * h` alone always fits
// (`u32::MAX^2 < 2^64`), but the product with `bytes_per_pixel` (up to
// 16 for `Rgba32F`) can exceed `u64::MAX` (`u32::MAX^2 * 16 ≈ 2.95e20`),
// so the `checked_mul` chain is what rejects such hostile dimensions.
// The byte budget is `bytes_per_pixel`-based (not subpixel-count)
// so 16-bit and f32 variants are bounded by their true memory
// footprint, not by the channel count alone.
let bytes = w
.checked_mul(h)
.and_then(|wh| wh.checked_mul(bytes_per_pixel))
.ok_or_else(|| {
Error::ArithmeticOverflow(ArithmeticOverflowPayload::with_operands(
"apply_rotate_fallible: rotated buffer size (w * h * bytes_per_pixel)",
"u64",
[("w", w), ("h", h), ("bytes_per_pixel", bytes_per_pixel)],
))
})?;
if bytes > MAX_DECODED_IMAGE_BYTES {
return Err(Error::CapExceeded(CapExceededPayload::new(
"apply_rotate_fallible: rotated buffer",
"MAX_DECODED_IMAGE_BYTES",
MAX_DECODED_IMAGE_BYTES,
bytes,
)));
}
let src_w = img.width();
let src_h = img.height();
let (out_w, out_h) = rotated_dims(src_w, src_h, rotation);
// Per-variant dispatch into the generic
// `rotate_buf<T: Copy + Default>` helper (the `Rgba8` arm instead
// takes the SIMD-routed `rotate_buf_u8`). `channels` is the per-pixel
// *subpixel count* (number of `T` elements per pixel), NOT a
// byte count — so `Rgb<u16>` is 3 (three u16 subpixels per
// pixel), `Rgba<f32>` is 4 (four f32 subpixels per pixel), etc.
// `ImageBuffer::from_raw` validates the container length is
// `width * height * CHANNEL_COUNT` in subpixel units, which
// exactly matches the buffer we hand it.
match img {
DynamicImage::ImageLuma8(buf) => {
let dst = rotate_buf::<u8>(buf.as_raw(), src_w, src_h, 1, rotation)?;
let out: ::image::GrayImage = ImageBuffer::from_raw(out_w, out_h, dst).expect(
"ImageBuffer::from_raw: dst buffer length matches w*h*1 by construction in rotate_buf",
);
Ok(DynamicImage::ImageLuma8(out))
}
DynamicImage::ImageLumaA8(buf) => {
let dst = rotate_buf::<u8>(buf.as_raw(), src_w, src_h, 2, rotation)?;
let out: ::image::GrayAlphaImage = ImageBuffer::from_raw(out_w, out_h, dst).expect(
"ImageBuffer::from_raw: dst buffer length matches w*h*2 by construction in rotate_buf",
);
Ok(DynamicImage::ImageLumaA8(out))
}
DynamicImage::ImageRgb8(buf) => {
let dst = rotate_buf::<u8>(buf.as_raw(), src_w, src_h, 3, rotation)?;
let out: ::image::RgbImage = ImageBuffer::from_raw(out_w, out_h, dst).expect(
"ImageBuffer::from_raw: dst buffer length matches w*h*3 by construction in rotate_buf",
);
Ok(DynamicImage::ImageRgb8(out))
}
DynamicImage::ImageRgba8(buf) => {
// SIMD: Rgba8 (u8 + channels=4) is the hot path — dispatch
// through the NEON kernel (`simd::vlm::rotate_buf::rotate_buf_u8`)
// which uses a 4-pixel-tile `vld1q_u8` load + per-pixel u32
// scattered store. The other u8 channel counts (1/2/3) and
// every u16/f32 arm continue to use the generic `rotate_buf<T>`.
let dst = rotate_buf_u8(buf.as_raw(), src_w, src_h, 4, rotation)?;
let out: ::image::RgbaImage = ImageBuffer::from_raw(out_w, out_h, dst).expect(
"ImageBuffer::from_raw: dst buffer length matches w*h*4 by construction in rotate_buf",
);
Ok(DynamicImage::ImageRgba8(out))
}
DynamicImage::ImageLuma16(buf) => {
let dst = rotate_buf::<u16>(buf.as_raw(), src_w, src_h, 1, rotation)?;
let out: ImageBuffer<Luma<u16>, Vec<u16>> = ImageBuffer::from_raw(out_w, out_h, dst).expect(
"ImageBuffer::from_raw: Luma16 dst length matches w*h*1 u16 subpixels by construction",
);
Ok(DynamicImage::ImageLuma16(out))
}
DynamicImage::ImageLumaA16(buf) => {
let dst = rotate_buf::<u16>(buf.as_raw(), src_w, src_h, 2, rotation)?;
let out: ImageBuffer<LumaA<u16>, Vec<u16>> = ImageBuffer::from_raw(out_w, out_h, dst).expect(
"ImageBuffer::from_raw: LumaA16 dst length matches w*h*2 u16 subpixels by construction",
);
Ok(DynamicImage::ImageLumaA16(out))
}
DynamicImage::ImageRgb16(buf) => {
let dst = rotate_buf::<u16>(buf.as_raw(), src_w, src_h, 3, rotation)?;
let out: ImageBuffer<Rgb<u16>, Vec<u16>> = ImageBuffer::from_raw(out_w, out_h, dst).expect(
"ImageBuffer::from_raw: Rgb16 dst length matches w*h*3 u16 subpixels by construction",
);
Ok(DynamicImage::ImageRgb16(out))
}
DynamicImage::ImageRgba16(buf) => {
let dst = rotate_buf::<u16>(buf.as_raw(), src_w, src_h, 4, rotation)?;
let out: ImageBuffer<Rgba<u16>, Vec<u16>> = ImageBuffer::from_raw(out_w, out_h, dst).expect(
"ImageBuffer::from_raw: Rgba16 dst length matches w*h*4 u16 subpixels by construction",
);
Ok(DynamicImage::ImageRgba16(out))
}
DynamicImage::ImageRgb32F(buf) => {
let dst = rotate_buf::<f32>(buf.as_raw(), src_w, src_h, 3, rotation)?;
let out: ImageBuffer<Rgb<f32>, Vec<f32>> = ImageBuffer::from_raw(out_w, out_h, dst).expect(
"ImageBuffer::from_raw: Rgb32F dst length matches w*h*3 f32 subpixels by construction",
);
Ok(DynamicImage::ImageRgb32F(out))
}
DynamicImage::ImageRgba32F(buf) => {
let dst = rotate_buf::<f32>(buf.as_raw(), src_w, src_h, 4, rotation)?;
let out: ImageBuffer<Rgba<f32>, Vec<f32>> = ImageBuffer::from_raw(out_w, out_h, dst).expect(
"ImageBuffer::from_raw: Rgba32F dst length matches w*h*4 f32 subpixels by construction",
);
Ok(DynamicImage::ImageRgba32F(out))
}
// `DynamicImage` is `#[non_exhaustive]` upstream — a wildcard
// arm is required by the compiler even though the ten arms
// above cover every variant defined in image-rs 0.25. Surface
// any future addition as a typed `UnknownEnumValue` error rather than
// silently calling the infallible `apply_orientation` (which
// would reintroduce the abort-on-OOM behavior this path avoids).
// Recoverable by definition (it's an `Err`, not a panic), and
// upgrading to a proper rotate arm becomes a localized edit
// here the day a new variant ships.
other => Err(Error::UnknownEnumValue(UnknownEnumValuePayload::new(
"apply_rotate_fallible: DynamicImage color variant (image-rs added a new pixel type that \
mlxrs has not yet wired into the fallible rotate path — please file an issue and extend \
the match arms above)",
format!("{:?}", other.color()),
&[
"L8", "La8", "Rgb8", "Rgba8", "L16", "La16", "Rgb16", "Rgba16", "Rgb32F", "Rgba32F",
],
))),
}
}
/// Rotated output dimensions for [`RotateKind`]. All four rotate
/// variants swap width and height (90°/270° rotation, with or
/// without a subsequent in-place horizontal flip that does not
/// change extent).
fn rotated_dims(src_w: u32, src_h: u32, _rotation: RotateKind) -> (u32, u32) {
(src_h, src_w)
}
/// Manual rotation of a `T`-typed pixel buffer for the four EXIF
/// rotate orientations, into a `Vec<T>` whose backing storage is
/// reserved via `try_reserve_exact` (recoverable-OOM).
///
/// **Generic over the element type** (`T: Copy + Default`) so the
/// same body covers every `DynamicImage` element width:
/// - u8: Luma8 / LumaA8 / Rgb8 / Rgba8
/// - u16: Luma16 / LumaA16 / Rgb16 / Rgba16
/// - f32: Rgb32F / Rgba32F
///
/// Rotation is a *permutation of pixel positions*, not a per-channel
/// projection — the index math is identical for every element type;
/// only the per-pixel memcpy length (`channels` subpixels of `T`)
/// scales with the variant. `Copy` is required for the per-pixel
/// `copy_from_slice`; `Default` is required to zero-init the
/// reservation before the permutation writes land (cheap: u8/u16
/// default to 0, f32 to 0.0).
///
/// **Pixel conventions** (cross-checked against image-rs
/// `imageops/affine.rs` in `image-0.25.10`):
/// - `Rotate90` (CW 90°): `image[y][x] -> dst[x][h-1-y]`, i.e.
/// `dst.put_pixel(h-1-y, x, src.get_pixel(x, y))` with output
/// dims `(h, w)`. (Source matches `rotate90_in` at
/// `imageops/affine.rs:65-71`.)
/// - `Rotate270` (CW 270°): `dst.put_pixel(y, w-1-x, src.get_pixel(x, y))`
/// with output dims `(h, w)`. (Matches `rotate270_in` at
/// `imageops/affine.rs:108-114`.)
/// - `Rotate90FlipH`: rotate90 then horizontal flip — the rotated
/// x-coordinate `x_rot = h-1-y` becomes `(out_w-1) - x_rot
/// = (h-1) - (h-1-y) = y`. Output: `dst.put_pixel(y, x, src.get_pixel(x, y))`
/// with dims `(h, w)`.
/// - `Rotate270FlipH`: rotate270 then horizontal flip — the rotated
/// x-coordinate `x_rot = y` becomes `(out_w-1) - y = (h-1) - y`.
/// Output: `dst.put_pixel(h-1-y, w-1-x, src.get_pixel(x, y))`
/// with dims `(h, w)`.
///
/// `channels` is the per-pixel *subpixel count* (1 for Luma, 2 for
/// LumaA, 3 for Rgb, 4 for Rgba), NOT a byte count. The per-pixel
/// stride is `channels * size_of::<T>()` bytes, but the allocator and
/// copy operate in `T`-element units, so the `elements` value computed
/// in the body is the subpixel-element count handed to
/// `Vec::<T>::try_reserve_exact`.
///
/// `src.len()` MUST equal `src_w * src_h * channels` (in `T` units);
/// the caller (the ten per-variant arms in [`apply_rotate_fallible`])
/// guarantees this via the `ImageBuffer::as_raw()` contract. The
/// byte budget has already been validated by the caller against
/// [`MAX_DECODED_IMAGE_BYTES`] (using true `bytes_per_pixel`), so
/// the subpixel-element count is well under `usize::MAX`.
fn rotate_buf<T: Copy + Default>(
src: &[T],
src_w: u32,
src_h: u32,
channels: usize,
rotation: RotateKind,
) -> Result<Vec<T>> {
let w_usize = src_w as usize;
let h_usize = src_h as usize;
// `elements = w * h * channels` (in `T` subpixel units; total
// bytes = elements * size_of::<T>() is already validated <=
// MAX_DECODED_IMAGE_BYTES by the caller). Recompute here in
// usize (each `u32 -> usize` cast is lossless on 32- and 64-bit hosts) so the buffer
// reservation has a precise length. Defend against an unexpected
// overflow regardless — `try_reserve_exact` on a pathologically
// large `usize` would still return Err, but the explicit overflow
// check yields a typed `ArithmeticOverflow` rather than the less
// specific `OutOfMemory`.
let elements = w_usize
.checked_mul(h_usize)
.and_then(|wh| wh.checked_mul(channels))
.ok_or_else(|| {
Error::ArithmeticOverflow(ArithmeticOverflowPayload::with_operands(
"rotate_buf: elements (w * h * channels)",
"usize",
[
("w", w_usize as u64),
("h", h_usize as u64),
("channels", channels as u64),
],
))
})?;
debug_assert_eq!(
src.len(),
elements,
"rotate_buf: src.len() must equal w*h*channels by ImageBuffer::as_raw() contract"
);
// Allocate the destination buffer fallibly. `try_reserve_exact`
// returns Err on allocator failure rather than aborting; convert
// to `Error::OutOfMemory`. We then `resize(elements, T::default())`
// so the length matches and the slice indices below are valid. No
// second allocation occurs (`Vec::resize` does not reallocate when
// `len + n <= capacity`, which is the case immediately after a
// `try_reserve_exact(elements)` from `len == 0`).
let mut dst: Vec<T> = Vec::new();
dst
.try_reserve_exact(elements)
.map_err(|_| Error::OutOfMemory)?;
dst.resize(elements, T::default());
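/*
The reserve-then-resize pattern above can be checked in isolation. A small
sketch, assuming a hypothetical `zeroed_fallible` helper (not part of this
module) that returns the std `TryReserveError` instead of the crate's typed
`Error::OutOfMemory`:

```rust
use std::collections::TryReserveError;

/// Hypothetical helper: a default-filled `Vec<T>` of length `n` whose only
/// allocation is the fallible `try_reserve_exact` call.
pub fn zeroed_fallible<T: Copy + Default>(n: usize) -> Result<Vec<T>, TryReserveError> {
    let mut v: Vec<T> = Vec::new();
    // Fallible: returns `Err` instead of aborting when the request
    // cannot be satisfied.
    v.try_reserve_exact(n)?;
    let reserved = v.capacity();
    // Fills within the capacity reserved above.
    v.resize(n, T::default());
    // The capacity is unchanged, so `resize` did not reallocate.
    debug_assert_eq!(v.capacity(), reserved);
    Ok(v)
}
```

A request that cannot be represented (for example `usize::MAX` bytes) comes
back as an `Err` rather than a process abort, which is the property the
rotate path relies on.
*/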
let (out_w, _out_h) = rotated_dims(src_w, src_h, rotation);
let out_w_usize = out_w as usize;
// Iterate the source once. For each source pixel at (x, y), copy
// its `channels` `T` elements into the destination position
// implied by `rotation`. The inner per-pixel `copy_from_slice` is
// a contiguous memcpy of `channels * size_of::<T>()` bytes (LLVM
// unrolls / converts to single loads/stores for small fixed sizes
// — same code shape as the u8-only path, just parameterized
// by `T`).
//
// Index math (verified against `imageops/affine.rs` upstream):
// Rotate90 : (x, y) -> (h_usize - 1 - y, x) out dims (h, w)
// Rotate270 : (x, y) -> (y, w_usize - 1 - x) out dims (h, w)
// Rotate90FlipH : (x, y) -> (y, x) out dims (h, w)
// Rotate270FlipH : (x, y) -> (h_usize - 1 - y, w_usize - 1 - x) out dims (h, w)
for y in 0..h_usize {
for x in 0..w_usize {
let (nx, ny) = match rotation {
RotateKind::Rotate90 => (h_usize - 1 - y, x),
RotateKind::Rotate270 => (y, w_usize - 1 - x),
RotateKind::Rotate90FlipH => (y, x),
RotateKind::Rotate270FlipH => (h_usize - 1 - y, w_usize - 1 - x),
};
let src_off = (y * w_usize + x) * channels;
let dst_off = (ny * out_w_usize + nx) * channels;
dst[dst_off..dst_off + channels].copy_from_slice(&src[src_off..src_off + channels]);
}
}
Ok(dst)
}
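/*
The index math in `rotate_buf` is easy to get backwards, so a self-contained
sketch with a worked 3x2 example may help. `Kind` and `rotate_sketch` are
illustrative stand-ins for the private `RotateKind` and `rotate_buf`; the
sketch returns `None` where the real helper returns typed errors.

```rust
/// Illustrative stand-in for the module's private `RotateKind`.
#[derive(Clone, Copy, Debug)]
pub enum Kind {
    Rotate90,
    Rotate270,
    Rotate90FlipH,
    Rotate270FlipH,
}

/// Rotate a row-major `w x h` buffer holding `channels` elements per pixel.
/// The output always has dimensions `h x w` (width and height swap).
pub fn rotate_sketch<T: Copy + Default>(
    src: &[T],
    w: usize,
    h: usize,
    channels: usize,
    kind: Kind,
) -> Option<Vec<T>> {
    let elements = w.checked_mul(h)?.checked_mul(channels)?;
    if src.len() != elements {
        return None;
    }
    // Single fallible allocation; `resize` then fills inside that capacity.
    let mut dst: Vec<T> = Vec::new();
    dst.try_reserve_exact(elements).ok()?;
    dst.resize(elements, T::default());
    let out_w = h;
    for y in 0..h {
        for x in 0..w {
            // Destination (column, row) for the source pixel at (x, y).
            let (nx, ny) = match kind {
                Kind::Rotate90 => (h - 1 - y, x),
                Kind::Rotate270 => (y, w - 1 - x),
                Kind::Rotate90FlipH => (y, x),
                Kind::Rotate270FlipH => (h - 1 - y, w - 1 - x),
            };
            let s = (y * w + x) * channels;
            let d = (ny * out_w + nx) * channels;
            dst[d..d + channels].copy_from_slice(&src[s..s + channels]);
        }
    }
    Some(dst)
}
```

For the single-channel source rows `1 2 3` / `4 5 6`, `Rotate90` yields rows
`4 1` / `5 2` / `6 3` and `Rotate90FlipH` yields the plain transpose
`1 4` / `2 5` / `3 6`.
*/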
/// SIMD-routed u8 rotate helper. It shares the allocation and
/// length contract of [`rotate_buf`] but dispatches the nested pixel loop
/// through [`crate::simd::vlm::rotate_buf::rotate_buf_u8`], whose
/// aarch64 kernel uses a 4-pixel-tile `vld1q_u8` load plus a per-pixel u32
/// scattered store on the `channels = 4` (Rgba8) hot path.
///
/// `channels = 1/2/3` fall through to the SIMD dispatcher's scalar arm —
/// bit-identical to [`rotate_buf::<u8>`]'s per-pixel `copy_from_slice`
/// shape. Only `Rgba8` (the dominant call site post-image-decode) hits
/// the NEON tile in the SIMD dispatcher.
fn rotate_buf_u8(
src: &[u8],
src_w: u32,
src_h: u32,
channels: usize,
rotation: RotateKind,
) -> Result<Vec<u8>> {
let w_usize = src_w as usize;
let h_usize = src_h as usize;
let elements = w_usize
.checked_mul(h_usize)
.and_then(|wh| wh.checked_mul(channels))
.ok_or_else(|| {
Error::ArithmeticOverflow(ArithmeticOverflowPayload::with_operands(
"rotate_buf_u8: elements (w * h * channels)",
"usize",
[
("w", w_usize as u64),
("h", h_usize as u64),
("channels", channels as u64),
],
))
})?;
debug_assert_eq!(
src.len(),
elements,
"rotate_buf_u8: src.len() must equal w*h*channels by ImageBuffer::as_raw() contract"
);
let mut dst: Vec<u8> = Vec::new();
dst
.try_reserve_exact(elements)
.map_err(|_| Error::OutOfMemory)?;
dst.resize(elements, 0u8);
let simd_rotation = match rotation {
RotateKind::Rotate90 => crate::simd::vlm::rotate_buf::RotateKind::Rotate90,
RotateKind::Rotate270 => crate::simd::vlm::rotate_buf::RotateKind::Rotate270,
RotateKind::Rotate90FlipH => crate::simd::vlm::rotate_buf::RotateKind::Rotate90FlipH,
RotateKind::Rotate270FlipH => crate::simd::vlm::rotate_buf::RotateKind::Rotate270FlipH,
};
crate::simd::vlm::rotate_buf::rotate_buf_u8(
&mut dst,
src,
w_usize,
h_usize,
channels,
simd_rotation,
);
Ok(dst)
}
/// Resize `img` to `(height, width)` using `filter`.
///
/// Mirrors swift `MediaProcessing.resampleBicubic` /
/// `resampleLanczos` (`MediaProcessing.swift:110-132` / `81-103`).
///
/// The swift reference applies a separate y-scale + aspect-ratio adjust
/// (then a final crop) to "ensure exact dimensions" (lines 113-131). The
/// kernel this function forwards to (`vlm::resize::resize_rgba8`) writes
/// a destination of exactly the requested `width × height`, as
/// `image::imageops::resize(image, nwidth, nheight, filter)` did in the
/// earlier implementation, so the trailing crop step is unnecessary.
///
/// **Aspect-ratio preservation:** none. This is a *forced* resize to the
/// target dimensions, mirroring the swift `resampleBicubic` (which also
/// distorts to the requested size — it computes independent x- and
/// y-scale factors at lines 113-114 and applies them separately). The
/// python `resize_image` (`mlx_vlm/utils.py:835-839`) computes an
/// aspect-ratio-preserving thumbnail; that variant is a per-model
/// concern and is intentionally not exposed here. Callers that need it
/// can compute the target tuple themselves before calling `resize`.
///
/// **Return type — fallible target-dimension guard.**
/// The signature is `-> Result<DynamicImage>` rather than the
/// infallible `-> DynamicImage` of the swift `resampleBicubic(_,
/// to:) -> CIImage` (lines 110-132) / python `Image.resize(new_size)
/// -> Image` references. The earlier faithful-port rationale assumed
/// `target` came from a TRUSTED `ImageProcessorConfig::size` (model
/// metadata baked into the binary or a trusted JSON), so a
/// pathological dimension was a caller bug. That assumption no longer
/// holds: [`ImageProcessorConfig::size`] is now populated from a
/// LOADED `preprocessor_config.json` / `processor_config.json` (see
/// [`crate::vlm::load`]), which is UNTRUSTED on-disk input. A
/// hostile/malformed config with an enormous `size` would otherwise
/// drive the source RGBA materialization and the destination resize
/// alloc to panic-abort the process — taking down image AND video
/// preprocessing on the first request. We therefore validate the
/// target dimensions BEFORE any allocation and surface an over-budget /
/// zero / overflow target as a recoverable `Err` instead of an abort.
///
/// **Allocation contract — fully fallible.** Byte counts are bounded on
/// both ends: the source is bounded by [`MAX_DECODED_IMAGE_BYTES`] via
/// [`load_image`]'s decoder cap, and the destination is bounded by the
/// same [`MAX_DECODED_IMAGE_BYTES`] ceiling enforced on
/// `height * width * 4` by the target guard below (mirroring
/// [`pad_to_square`]'s canvas gate). The cap bounds the SIZE; fallibility
/// is also guaranteed. EVERY allocation in the resize path routes through
/// `try_reserve_exact` and surfaces allocator failure as
/// [`Error::OutOfMemory`] instead of aborting:
/// - the source RGBA buffer (formerly an infallible `img.to_rgba8()`
/// clone) is reserved and filled manually — a borrowed `as_rgba8()`
/// fast path or a per-pixel `dynamic_image_rgba_pixel` projection;
/// - the resize kernel (`vlm::resize::resize_rgba8`) reserves
/// its horizontal + vertical coefficient tables, the inter-pass
/// intermediate, and the destination buffer ALL via `try_reserve_exact`
/// — replacing the dropped `fast_image_resize` dependency, whose
/// internal scratch (coefficient tables, per-row work buffers)
/// allocated infallibly inside the crate and could abort despite our
/// `Result`.
///
/// No infallible `vec!` / external-crate alloc / `to_rgba8` remains in
/// this path, so a just-under-cap hostile target (≈11585×11585 RGBA8 ≈
/// 512 MiB) can no longer force a ~512 MiB infallible alloc, and the
/// `Result` signature is honest. The only residual abort would be a
/// genuine allocator-internal failure, not a reservation: the follow-up
/// `Vec::resize` only fills an ALREADY-reserved buffer, which does not
/// reallocate and therefore cannot fail. The kernel's output is
/// bit-exact with PIL `Image.resize` (the reference that mlx-vlm
/// preprocessing expects). The [`preprocess`] and
/// [`crate::vlm::video::process_frames`] composers inherit this
/// transitively (both call `resize` via `?`).
///
/// # Errors
/// - [`Error::OutOfRange`] if either target dimension is `0`.
/// - [`Error::ArithmeticOverflow`] if `height * width * 4` overflows
/// `u64`, or if `source_width * source_height * 4` overflows `usize`.
/// - [`Error::CapExceeded`] if `height * width * 4` exceeds
/// [`MAX_DECODED_IMAGE_BYTES`], or if the RGBA8-expanded source staging
/// size exceeds [`MAX_DECODED_IMAGE_BYTES`] — the message carries the
/// offending dims and the cap. (The over-cap cases use `CapExceeded`
/// rather than [`Error::OutOfMemory`] so they can name the dims +
/// ceiling, matching [`pad_to_square`]'s canvas gate.) The source guard
/// matters because `load_image`'s cap is on the *source* pixel format:
/// a `Luma8` (1 B/px) image accepted just under the cap would expand
/// 4x as RGBA8 staging.
/// - [`Error::OutOfMemory`] if the allocator cannot satisfy the
/// `try_reserve_exact` for the source RGBA buffer or for any buffer the
/// resize kernel allocates (coefficient tables, inter-pass
/// intermediate, destination — all bounded ≤512 MiB by the caps above)
/// — a recoverable typed error instead of a process abort.
pub fn resize(
img: &::image::DynamicImage,
target: (u32, u32),
filter: ResizeFilter,
) -> Result<::image::DynamicImage> {
let (height, width) = target;
// Target-dimension guard: `target` now flows from an
// UNTRUSTED loaded processor config, so validate it BEFORE the
// source RGBA materialization and the `width * height * 4`
// destination alloc. Mirrors `pad_to_square`'s canvas gate so the
// per-step caps stay consistent.
if width == 0 {
return Err(Error::OutOfRange(OutOfRangePayload::new(
"resize: target width",
"must be non-zero",
format!("{width}"),
)));
}
if height == 0 {
return Err(Error::OutOfRange(OutOfRangePayload::new(
"resize: target height",
"must be non-zero",
format!("{height}"),
)));
}
// `height * width * 4` (RGBA8 destination bytes). u64 throughout so the
// check is host-width-independent. `height * width` always fits
// (`u32::MAX^2 < 2^64`), but the `* 4` can exceed `u64::MAX`
// (`u32::MAX^2 * 4 ≈ 7.4e19`), so the `checked_mul` chain is what
// rejects a truly hostile dimension product.
let dst_bytes = u64::from(height)
.checked_mul(u64::from(width))
.and_then(|hw| hw.checked_mul(4))
.ok_or_else(|| {
Error::ArithmeticOverflow(ArithmeticOverflowPayload::with_operands(
"resize: destination bytes (height * width * 4)",
"u64",
[("height", u64::from(height)), ("width", u64::from(width))],
))
})?;
if dst_bytes > MAX_DECODED_IMAGE_BYTES {
return Err(Error::CapExceeded(CapExceededPayload::new(
"resize: destination RGBA8",
"MAX_DECODED_IMAGE_BYTES",
MAX_DECODED_IMAGE_BYTES,
dst_bytes,
)));
}
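/*
The target guard above (zero check, overflow check, cap check) can be
reproduced as a standalone function to see where the boundaries fall. This is
a sketch under two assumptions: `CAP_BYTES` is 512 MiB, the figure the docs
quote for `MAX_DECODED_IMAGE_BYTES`, and `TargetError` is a hypothetical
stand-in for the crate's `OutOfRange` / `ArithmeticOverflow` / `CapExceeded`
errors.

```rust
/// Assumed value of the module's cap (512 MiB), for illustration only.
pub const CAP_BYTES: u64 = 512 * 1024 * 1024;

/// Hypothetical stand-in for the crate's typed errors.
#[derive(Debug, PartialEq, Eq)]
pub enum TargetError {
    Zero,
    Overflow,
    OverCap { bytes: u64 },
}

/// Bytes needed for an RGBA8 destination of `height x width`, or the reason
/// the target is rejected before any allocation happens.
pub fn rgba8_target_bytes(height: u32, width: u32) -> Result<u64, TargetError> {
    if height == 0 || width == 0 {
        return Err(TargetError::Zero);
    }
    // `height * width` always fits in u64; the `* 4` is the step that can
    // overflow for dimensions near `u32::MAX`.
    let bytes = u64::from(height)
        .checked_mul(u64::from(width))
        .and_then(|hw| hw.checked_mul(4))
        .ok_or(TargetError::Overflow)?;
    if bytes > CAP_BYTES {
        return Err(TargetError::OverCap { bytes });
    }
    Ok(bytes)
}
```

Under the 512 MiB assumption, 11585 x 11585 is the largest square target that
passes (536,848,900 bytes against a 536,870,912-byte cap); 11586 x 11586 is
rejected as over-cap, and `u32::MAX x u32::MAX` is rejected as an overflow.
*/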
// SIMD-accelerated, fully-fallible resize via mlxrs's OWN kernel
// (`vlm::resize`) — `fast_image_resize` is dropped. NEON on
// aarch64 (scalar fallback always compiled, `mlxrs_force_scalar`
// honored), bit-exact with PIL `Image.resize`. Decode-side stays on
// `image` (above in `load_image`); only the resize hot path is owned.
// Public API of this fn is unchanged.
//
// Pixel-type: RGBA8 for parity with the prior behavior (image-rs's
// `imageops::resize` over a `DynamicImage` projects to `Rgba8`
// unconditionally; downstream `image_to_array` drops alpha as before).
//
// FALLIBLE source materialization: the prior
// `img.to_rgba8()` was an OWNED source-sized copy/convert backed by an
// INFALLIBLE `Vec` (image 0.25 returns a fresh `RgbaImage` and clones
// even when the source is already `ImageRgba8`; only the consuming
// `into_rgba8()` path avoids the clone, and we can't use it on a
// `&DynamicImage`). Even though the source byte count is bounded by
// [`MAX_DECODED_IMAGE_BYTES`] via `load_image`'s decoder cap, the
// allocation itself would `abort()` under allocator pressure — making
// the `Result` signature dishonest. We now reserve the source RGBA
// buffer via `try_reserve_exact` (`error::try_with_capacity`) and fill
// it ourselves, mirroring `image_to_array` / `pad_to_square`: a
// borrowed `as_rgba8()` fast path when the source is already RGBA8,
// else a per-pixel `dynamic_image_rgba_pixel` projection (the same
// `dynamic_map!` color-space conversion `to_rgba8()` performed).
// Allocator failure surfaces as [`Error::OutOfMemory`].
let src_w = img.width();
let src_h = img.height();
// `src_w * src_h * 4` (RGBA8 source bytes) in usize. The source came
// through `load_image`'s 512 MiB decoder cap so this product is bounded
// in practice, but a `checked_mul` keeps the reservation honest on a
// 32-bit `usize` (where the cap's byte count could still wrap) and
// routes any pathological product to a recoverable `ArithmeticOverflow`
// instead of a silent under-allocation.
let src_bytes = (src_w as usize)
.checked_mul(src_h as usize)
.and_then(|wh| wh.checked_mul(4))
.ok_or_else(|| {
Error::ArithmeticOverflow(ArithmeticOverflowPayload::with_operands(
"resize: source bytes (src_w * src_h * 4)",
"usize",
[("src_w", src_w as u64), ("src_h", src_h as u64)],
))
})?;
// Source-staging cap: `load_image`'s 512 MiB ceiling is
// enforced on `decoder.total_bytes()` — the SOURCE pixel format. A
// low-bytes-per-pixel source (e.g. `Luma8`, 1 B/px) accepted just under
// that cap expands 4x when projected to the RGBA8 staging buffer below,
// so an under-cap Luma8 image could drive a ~2 GiB RGBA staging
// allocation and break the documented bounded-memory invariant. The
// destination already has this guard (`dst_bytes` above); apply the
// same `MAX_DECODED_IMAGE_BYTES` ceiling to the RGBA-expanded SOURCE
// staging before its `try_reserve_exact`. `u64` so the comparison is
// host-width-independent (matches the `dst_bytes` gate).
if src_bytes as u64 > MAX_DECODED_IMAGE_BYTES {
return Err(Error::CapExceeded(CapExceededPayload::new(
"resize: source RGBA8 staging (Luma8/Rgb8/etc. expanded to RGBA8)",
"MAX_DECODED_IMAGE_BYTES",
MAX_DECODED_IMAGE_BYTES,
src_bytes as u64,
)));
}
let mut src_buf = crate::error::try_with_capacity::<u8>(src_bytes)?;
if let Some(rgba) = img.as_rgba8() {
// Already RGBA8: borrow the backing `&[u8]` and copy exactly
// `src_bytes` (slice to guard an over-long backing buffer, matching
// `image_to_array`'s `.get(..total)` discipline so the fill cannot
// grow `src_buf` past the `try_reserve_exact` reservation).
let raw = rgba.as_raw().get(..src_bytes).ok_or_else(|| {
Error::LengthMismatch(LengthMismatchPayload::new(
"resize: rgba backing buffer bytes vs W*H*4",
src_bytes,
rgba.as_raw().len(),
))
})?;
src_buf.extend_from_slice(raw);
} else {
// Non-RGBA8 source (Luma8 / Rgb8 / Rgb16 / Rgb32F / …): per-pixel
// `DynamicImage::get_pixel(x, y) -> Rgba<u8>` projection (alpha
// preserved — this feeds the resize, downstream `image_to_array`
// drops it). NO intermediate source-sized image allocation.
for y in 0..src_h {
for x in 0..src_w {
src_buf.extend_from_slice(&dynamic_image_rgba_pixel(img, x, y));
}
}
}
debug_assert_eq!(
src_buf.len(),
src_bytes,
"resize source RGBA fill length must equal pre-computed src_bytes",
);
// FALLIBLE resize via mlxrs's own kernel (`vlm::resize`) —
// the previous `fast_image_resize` dependency is dropped. Every buffer
// the kernel allocates (the horizontal + vertical coefficient tables,
// the inter-pass intermediate, and the destination) routes through
// `try_reserve_exact`, so allocator failure surfaces as
// [`Error::OutOfMemory`] instead of aborting — closing the last abort
// path that `fast_image_resize`'s infallible internal scratch left open
// for an untrusted-config target. The kernel is bit-exact with PIL
// `Image.resize` (the reference mlx-vlm preprocessing expects); the
// public API of this fn is unchanged.
//
// `resize_rgba8` takes `(w, h)` in pixels (NOT bytes); `src_w`/`src_h`
// are the source dims and `(width, height)` the validated target. The
// returned `Vec<u8>` is exactly `width * height * 4` bytes (validated
// ≤ `MAX_DECODED_IMAGE_BYTES` by the target guard above).
let dst_buf = crate::vlm::resize::resize_rgba8(
&src_buf,
src_w as usize,
src_h as usize,
width as usize,
height as usize,
filter.to_internal(),
)?;
debug_assert_eq!(
dst_buf.len(),
dst_bytes as usize,
"resize destination RGBA length must equal pre-computed dst_bytes",
);
Ok(::image::DynamicImage::ImageRgba8(
::image::ImageBuffer::from_raw(width, height, dst_buf)
// `resize_rgba8` returns exactly `width * height * 4` bytes — the
// precise length `ImageBuffer::from_raw` requires for a
// `width x height` RGBA buffer.
.expect("ImageBuffer::from_raw: dst buffer length matches width * height * 4 by construction"),
))
}
/// Resize `img` to `(target_h, target_w)` using Lanczos3 interpolation.
///
/// Convenience wrapper around [`resize`] that fixes the filter to
/// [`ResizeFilter::Lanczos3`]. Mirrors swift
/// [`MediaProcessing.resampleLanczos`](https://github.com/ml-explore/mlx-swift-lm/blob/main/Libraries/MLXVLM/MediaProcessing.swift#L81-L103)
/// and PIL `Image.resize(LANCZOS)`. Lanczos3 (`a = 3`, sinc-windowed sinc)
/// matches PIL's `Image.LANCZOS` kernel exactly (`LANCZOS` is the current
/// name of the old `ANTIALIAS` filter, which Pillow deprecated in 9.1.0
/// and removed in 10.0.0; both are the `a=3` Lanczos convolution).
///
/// The argument order is `(target_h, target_w)` — the swift API takes a
/// `CGSize(width:, height:)` but we mirror the python image-processor
/// convention (`(height, width)`) that the rest of [`resize`] /
/// [`ImageProcessorConfig::size`] uses. Output dimensions are exact:
/// the crate's own resize kernel (`vlm::resize`) produces exactly the
/// requested `(w, h)`, so no trailing crop step is needed (unlike the
/// swift implementation, which has to `cropped(to: exactRect)` after
/// `lanczosScaleTransform` produces a near-target output).
///
/// # Errors
/// Propagates [`resize`]'s target-dimension guard verbatim
/// ([`Error::OutOfRange`] on zero dims, [`Error::ArithmeticOverflow`] on
/// overflow, [`Error::CapExceeded`] on over-cap target dims).
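///
/// # Examples
/// A minimal, self-contained sketch of the `a = 3` Lanczos window named
/// above, assuming the textbook definition `L(x) = sinc(x) * sinc(x / 3)`
/// for `|x| < 3` and `0` elsewhere. The `sinc` and `lanczos3` helpers are
/// hypothetical and only illustrate the kernel shape; they are not the
/// `vlm::resize` implementation.
///
/// ```rust
/// use std::f64::consts::PI;
///
/// // Normalized sinc, with the removable singularity at 0 filled in.
/// fn sinc(x: f64) -> f64 {
///     if x == 0.0 { 1.0 } else { (PI * x).sin() / (PI * x) }
/// }
///
/// // Lanczos window with support radius a = 3 (hypothetical helper).
/// fn lanczos3(x: f64) -> f64 {
///     if x.abs() < 3.0 { sinc(x) * sinc(x / 3.0) } else { 0.0 }
/// }
///
/// fn main() {
///     // Unit weight at the sample center.
///     assert_eq!(lanczos3(0.0), 1.0);
///     // Numerically zero at the other integer offsets inside the support.
///     assert!(lanczos3(1.0).abs() < 1e-12);
///     assert!(lanczos3(2.0).abs() < 1e-12);
///     // Negative first side lobe, and exactly zero outside the support.
///     assert!(lanczos3(1.5) < 0.0);
///     assert_eq!(lanczos3(3.0), 0.0);
/// }
/// ```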
pub fn resize_lanczos(
img: &::image::DynamicImage,
target_h: u32,
target_w: u32,
) -> Result<::image::DynamicImage> {
resize(img, (target_h, target_w), ResizeFilter::Lanczos3)
}
/// Center crop `img` to `(target_h, target_w)`.
///
/// Mirrors swift
/// [`MediaProcessing.centerCrop(_:size:)`](https://github.com/ml-explore/mlx-swift-lm/blob/main/Libraries/MLXVLM/MediaProcessing.swift#L213-L224)
/// and the python HF `BaseImageProcessor.center_crop` (`crop_size` field
/// at `mlx_vlm/models/base.py:140-153`).
///
/// **Early-return parity:** swift `rectSmallerOrEqual`
/// (`MediaProcessing.swift:196-198`) returns true only when *both* axes
/// already fit within the target (`source.width <= target.width &&
/// source.height <= target.height`); only then is the source returned
/// unchanged. When just one axis exceeds the target, swift's
/// `centerCrop` helper at lines 201-210 clamps each crop dim with
/// `min(source, target)` and computes centered offsets — the bigger
/// axis is cropped, the smaller axis is kept at the source extent
/// (centered offset of 0).
///
/// The geometric center is `(W - crop_w) / 2`, `(H - crop_h) / 2`
/// (integer division: when the source and crop extents differ by an odd
/// amount the crop is biased toward the top-left by half a pixel, matching
/// `crop_imm`'s unsigned-floor semantics and PIL `Image.crop` behavior).
///
/// Infallible by reference parity (swift signature returns `CIImage`
/// non-throwing; python `center_crop` returns `np.ndarray`).
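///
/// # Examples
/// A small sketch of the crop-rectangle arithmetic described above, kept
/// free of the `image` crate so it runs on its own. `center_crop_rect` is
/// a hypothetical helper, not part of this module; it returns
/// `(x, y, crop_w, crop_h)`, or `None` for the early-return case where
/// both axes already fit.
///
/// ```rust
/// // Hypothetical helper mirroring the early return, the per-axis
/// // `min(source, target)` clamp, and the integer-floor offsets.
/// fn center_crop_rect(
///     w: u32,
///     h: u32,
///     target_h: u32,
///     target_w: u32,
/// ) -> Option<(u32, u32, u32, u32)> {
///     if w <= target_w && h <= target_h {
///         return None; // source returned unchanged
///     }
///     let crop_w = w.min(target_w);
///     let crop_h = h.min(target_h);
///     Some(((w - crop_w) / 2, (h - crop_h) / 2, crop_w, crop_h))
/// }
///
/// fn main() {
///     // Both axes larger than the target: a plain centered crop.
///     assert_eq!(center_crop_rect(640, 480, 224, 224), Some((208, 128, 224, 224)));
///     // Only the height exceeds the target: width is kept whole at x = 0.
///     assert_eq!(center_crop_rect(100, 300, 224, 224), Some((0, 38, 100, 224)));
///     // Both axes already fit: nothing to crop.
///     assert_eq!(center_crop_rect(200, 200, 224, 224), None);
///     // Odd difference: the offset floors toward the top-left.
///     assert_eq!(center_crop_rect(10, 10, 7, 7), Some((1, 1, 7, 7)));
/// }
/// ```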
pub fn center_crop(
img: &::image::DynamicImage,
target_h: u32,
target_w: u32,
) -> ::image::DynamicImage {
let w = img.width();
let h = img.height();
// Swift `rectSmallerOrEqual` early-return: BOTH axes must already
// fit. If only one axis is larger than the target we still need to
// crop that bigger axis (using `min(source, target)` for the smaller
// axis), matching swift's `min(extent, target)` clamp at
// `MediaProcessing.swift:201-210`.
if w <= target_w && h <= target_h {
return img.clone();
}
// Clamp each crop dimension to `min(source, target)` so a partial-fit
// case crops only the bigger axis (the smaller axis is kept at the
// source extent with a centered offset of 0). For the fully-bigger
// case (both axes > target) this collapses to `(target_w, target_h)`.
let crop_w = w.min(target_w);
let crop_h = h.min(target_h);
// Integer-floor center offsets; PIL `Image.crop` and the swift
// `centerCrop` rect helper compute `(extent - crop) / 2` likewise.
// When `crop_w == w` (smaller axis kept whole) the x offset is `0`.
let x = (w - crop_w) / 2;
let y = (h - crop_h) / 2;
img.crop_imm(x, y, crop_w, crop_h)
}
/// Pad `img` to a square by filling the shorter side with `fill`.
///
/// Mirrors python `expand2square` (`mlx_vlm/models/base.py:251-262`)
/// and `expand_to_square` (`mlx_vlm/models/fastvlm/processing.py:29-61`)
/// — the canonical pre-resize step that LLaVA-family processors apply
/// to preserve aspect ratio before a square encoder resize. The swift
/// `MediaProcessing` module does not expose a `padSquare` helper
/// directly; the per-model swift processors handle aspect-ratio
/// preservation in their own `preprocess` step.
///
/// **Padding policy:** the shorter side is symmetrically padded —
/// `(long - short) / 2` pixels on the top / left edge (integer floor; for
/// odd differences the bottom / right edge gets the extra row / column,
/// matching python `result.paste(img, (0, (width - height) // 2))` for a
/// landscape input, which centers it). If `width == height`, the source is
/// returned **unchanged** (the input `DynamicImage` is moved out — no
/// allocation, no variant conversion).
///
/// **Ownership / signature:** `img` is taken **by value** so the
/// already-square fast path can hand the input back without a clone.
/// `DynamicImage::clone()` deep-copies the entire decoded pixel
/// buffer; for a near-budget input that buffer can itself be hundreds
/// of MiB, and Rust's infallible `Clone` would abort the process on
/// allocator failure — defeating the recoverable-OOM contract the
/// padded path enforces below. Callers that need to keep the source
/// alongside the output should clone upstream, where the failure mode
/// is the caller's to choose.
///
/// **Color order (padded path only):** `fill` is an `[R, G, B]` u8
/// triple regardless of the source dtype — the *padded* output is
/// always `Rgb8`. The already-square fast path returns the input
/// unchanged (any `DynamicImage` variant); callers that need a
/// uniform output dtype must convert post-hoc (e.g. `out.to_rgb8()`).
/// Callers needing `[0.0, 1.0]` float-space padding should pad after
/// the [`image_to_array`] + [`rescale`] steps in array space (one
/// `pad` op, not yet exposed here; per-model concern when needed).
///
/// **End-to-end fallible canvas (Rust safety, not parity):** the python
/// reference needs no explicit guard because
/// `PIL.Image.new(mode, (size, size))` raises `MemoryError` on OOM, an
/// exception that propagates cleanly up the processor stack. Rust's
/// `RgbImage::from_pixel(size, size, ...)`, `Vec::with_capacity`, and
/// `DynamicImage::to_rgb8()` *abort* the process on allocator failure
/// (the standard `Vec` growth path and `image`'s infallible buffer
/// constructors end in `handle_alloc_error`, which aborts by default). A
/// `100_000 x 1` source would otherwise drive a `100_000² × 3` = 30 GB
/// (about 28 GiB) canvas allocation. To preserve the exception-like
/// recoverability the python contract assumes, this function:
/// 1. Checks `size × size × 3` for `u64` overflow *and* against
/// [`MAX_DECODED_IMAGE_BYTES`] (the same 512 MiB ceiling
/// [`load_image`] enforces);
/// 2. Allocates the pixel buffer via `Vec::try_reserve_exact` so an
/// allocator failure surfaces as [`Error::OutOfMemory`] rather than
/// a panic-abort;
/// 3. Uses `image::ImageBuffer::from_raw` on a uniform-fill buffer to
/// keep the `RgbImage::from_pixel` semantics without its panicking
/// backing alloc;
/// 4. Writes source pixels into the *already-reserved* canvas slice
/// in-place — either row-wise `copy_from_slice` when the source is
/// already `ImageRgb8` (zero intermediate alloc), or via the
/// `DynamicImage::get_pixel` color-space-converting accessor for
/// non-`Rgb8` variants (one `Rgba<u8>` per pixel on the stack;
/// again no intermediate full-image alloc). The prior
/// `img.to_rgb8()` call materialized a fresh decoded-byte-sized
/// copy infallibly — a near-budget nonsquare input (e.g.
/// `13377×13376` RGB ≈ 511 MiB) would pass the canvas gate, then
/// panic-abort on the ~511 MiB `to_rgb8` clone. The
/// per-pixel-write path eliminates that second source-sized
/// allocation entirely.
///
/// The [`MAX_DECODED_IMAGE_BYTES`] budget bounds the canvas alone; the
/// source itself is already-decoded (its own allocation was bounded at
/// [`load_image`] time, or by the caller if constructed via a custom
/// path). The per-pixel iteration touches that already-resident memory
/// without spawning a second copy.
///
/// Oversized inputs return [`Error::CapExceeded`] (or
/// [`Error::ArithmeticOverflow`] on a product overflow) with the
/// requested vs allowed byte count; allocator failures return
/// [`Error::OutOfMemory`].
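///
/// # Examples
/// A minimal sketch of the canvas budget check and offset arithmetic for
/// the non-square path, under stated assumptions: `CAP_BYTES` stands in
/// for [`MAX_DECODED_IMAGE_BYTES`] using the 512 MiB value quoted above,
/// and `pad_plan` is a hypothetical helper that reports errors as plain
/// strings rather than this crate's [`Error`] variants. The square fast
/// path is left out.
///
/// ```rust
/// // Assumed value: the 512 MiB ceiling described in the text.
/// const CAP_BYTES: u64 = 512 * 1024 * 1024;
///
/// // Hypothetical helper: `(size, x_off, y_off)` for a padded canvas.
/// fn pad_plan(w: u32, h: u32) -> Result<(u32, u32, u32), &'static str> {
///     let size = w.max(h);
///     // `size * size * 3` in u64, with the final `* 3` overflow-checked.
///     let bytes = u64::from(size)
///         .checked_mul(u64::from(size))
///         .and_then(|sq| sq.checked_mul(3))
///         .ok_or("overflow")?;
///     if bytes > CAP_BYTES {
///         return Err("cap exceeded");
///     }
///     // Center on the shorter axis; the longer axis starts at 0.
///     let (x_off, y_off) = if w > h { (0, (w - h) / 2) } else { ((h - w) / 2, 0) };
///     Ok((size, x_off, y_off))
/// }
///
/// fn main() {
///     // Landscape: pad top and bottom.
///     assert_eq!(pad_plan(640, 480), Ok((640, 0, 80)));
///     // Portrait with an odd difference: 2 columns left, 3 right.
///     assert_eq!(pad_plan(3, 8), Ok((8, 2, 0)));
///     // The `100_000 x 1` case from the text is rejected before allocating.
///     assert_eq!(pad_plan(100_000, 1), Err("cap exceeded"));
///     // A hostile dimension overflows the u64 product.
///     assert_eq!(pad_plan(u32::MAX, 1), Err("overflow"));
/// }
/// ```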
pub fn pad_to_square(img: ::image::DynamicImage, fill: [u8; 3]) -> Result<::image::DynamicImage> {
let w = img.width();
let h = img.height();
if w == h {
// Square fast path: return the input unchanged. NOT `img.clone()`
// — `DynamicImage::clone()` deep-copies the entire decoded buffer
// via the infallible `Vec` clone, which `abort()`s on allocator
// failure for near-budget inputs. By taking `img` by value we hand
// the same allocation back to the caller; no second source-sized
// copy ever happens here.
return Ok(img);
}
let size = w.max(h);
// `size * size * 3` byte budget. Use u64 throughout so the check is
// identical on 32-bit and 64-bit hosts (and so `MAX_DECODED_IMAGE_BYTES`
// can be compared without lossy casts). `size * size` always fits in u64
// (`u32::MAX^2 < 2^64`), but the `* 3` wraps once `size` exceeds about
// 2.48e9, so the `checked_mul` chain only fires for a truly hostile
// dimension product.
let size_u64 = u64::from(size);
let bytes = size_u64
.checked_mul(size_u64)
.and_then(|sq| sq.checked_mul(3))
.ok_or_else(|| {
Error::ArithmeticOverflow(ArithmeticOverflowPayload::with_operands(
"pad_to_square: canvas bytes (size * size * 3)",
"u64",
[("size", size_u64)],
))
})?;
if bytes > MAX_DECODED_IMAGE_BYTES {
return Err(Error::CapExceeded(CapExceededPayload::new(
"pad_to_square: canvas (size x size x 3)",
"MAX_DECODED_IMAGE_BYTES",
MAX_DECODED_IMAGE_BYTES,
bytes,
)));
}
// `bytes <= MAX_DECODED_IMAGE_BYTES = 512 MiB`, which is below `2^32`, so
// the cast to `usize` is lossless on every host, including 32-bit ones
// (we ship aarch64-darwin and x86_64-linux, both 64-bit).
let bytes_usize = bytes as usize;
// Recoverable OOM at the canvas allocation. `vec![value; n]` and
// `Vec::with_capacity` would `abort()` on allocator failure;
// `try_reserve_exact` surfaces it as `Error::OutOfMemory`.
let mut canvas_buf: Vec<u8> = Vec::new();
canvas_buf
.try_reserve_exact(bytes_usize)
.map_err(|_| Error::OutOfMemory)?;
// Uniform RGB fill via the `pad_canvas_fill` dispatcher
// (`crate::simd::vlm::pad_canvas_fill`). On `aarch64` this routes to
// a 48-byte LCM(3, 16) NEON pre-broadcast tile (~2-4× faster than
// the new scalar at 4096²; ~10-50× faster than the prior per-3-byte
// `extend_from_slice` idiom this replaces — see the
// bench in `mlxrs/benches/simd_pad_canvas_fill.rs`); elsewhere it
// falls back to the scalar `chunks_exact_mut(3) + copy_from_slice`
// path. Tracking ([#151]).
//
// The dispatcher takes `&mut [MaybeUninit<u8>]` (type-encoded uninit
// safety — see the kernel-module doc), so we pass the pre-reserved
// spare capacity **directly** and `set_len` after every byte has
// been written. No `from_raw_parts_mut` cast over uninit backing
// memory (which would be UB regardless of subsequent writes, per
// `from_raw_parts_mut`'s "properly initialized" precondition).
{
let spare = canvas_buf.spare_capacity_mut();
// `bytes_usize <= spare.len()` because `try_reserve_exact` guarantees
// at least `bytes_usize` of spare capacity on an empty `Vec`. Take the
// first `bytes_usize` slots of spare (the canvas region).
debug_assert!(
spare.len() >= bytes_usize,
"try_reserve_exact must have reserved at least bytes_usize"
);
crate::simd::vlm::pad_canvas_fill(&mut spare[..bytes_usize], fill);
}
// SAFETY: `pad_canvas_fill` wrote every byte in `0..bytes_usize` of
// the spare capacity (its function-level contract: "every byte of
// `out` is written before this returns" — both the scalar
// `chunks_exact_mut(3)` path and the NEON 48-byte tile path cover
// the full slice). `Vec::set_len`'s preconditions:
// (1) `bytes_usize <= canvas_buf.capacity()` — `try_reserve_exact`
// above reserved exactly that much capacity (a
// `try_reserve_exact` failure would have early-returned
// `Error::OutOfMemory`);
// (2) elements at `[old_len..new_len]` are initialized — the
// kernel-contract above guarantees this.
// Both hold.
unsafe { canvas_buf.set_len(bytes_usize) };
debug_assert_eq!(
canvas_buf.len(),
bytes_usize,
"canvas fill length must equal pre-computed bytes",
);
// Symmetric center offset on the shorter axis; longer axis stays at 0.
let (x_off, y_off) = if w > h {
(0u32, (w - h) / 2)
} else {
((h - w) / 2, 0u32)
};
// `size` was bounded above (`bytes = size * size * 3 <= 512 MiB`),
// so `size <= 13_377` — well within `u32`/`usize` on any 64-bit host.
// Cast to `usize` for slice indexing.
let size_usize = size as usize;
let w_usize = w as usize;
let h_usize = h as usize;
let x_off_usize = x_off as usize;
let y_off_usize = y_off as usize;
// Write source pixels directly into the already-reserved canvas
// buffer — NO intermediate `to_rgb8` allocation. Two paths:
// - source is `ImageRgb8`: row-wise `copy_from_slice` (one memcpy
// of `w*3` bytes per source row). Zero per-pixel overhead, zero
// intermediate alloc.
// - other variant (Luma8 / Rgba8 / Rgb16 / …): per-pixel
// `DynamicImage::get_pixel(x, y) -> Rgba<u8>` (color-space
// conversion handled by `image`'s `dynamic_map!` dispatch). We
// drop alpha and keep the RGB channels: the same projection
// `image_to_array` uses and the prior `to_rgb8()` performed.
if let Some(src_rgb) = img.as_rgb8() {
let src_raw = src_rgb.as_raw();
// `src_raw.len() >= w * h * 3` by the `ImageBuffer` invariant (the
// backing container holds at least `width * height * channels`
// subpixels). The `src_stride * h_usize` source reads and the
// `dst_stride`-strided destination writes below therefore stay within
// their respective buffers.
let src_stride = w_usize * 3;
let dst_stride = size_usize * 3;
let dst_x_byte = x_off_usize * 3;
for y_src in 0..h_usize {
let dst_row_off = (y_off_usize + y_src) * dst_stride + dst_x_byte;
let src_row_off = y_src * src_stride;
canvas_buf[dst_row_off..dst_row_off + src_stride]
.copy_from_slice(&src_raw[src_row_off..src_row_off + src_stride]);
}
} else {
// Per-pixel path for non-`Rgb8` sources. Reuses
// [`dynamic_image_rgb_pixel`] (the shared `to_rgba() -> drop alpha`
// projection) so this branch and `image_to_array`'s non-Rgb8 branch
// produce byte-identical RGB triples; with the projection defined in
// one place, the two call sites cannot drift apart.
let dst_stride = size_usize * 3;
for y_src in 0..h_usize {
let dst_row_off = (y_off_usize + y_src) * dst_stride + x_off_usize * 3;
for x_src in 0..w_usize {
let rgb = dynamic_image_rgb_pixel(&img, x_src as u32, y_src as u32);
let off = dst_row_off + x_src * 3;
canvas_buf[off] = rgb[0];
canvas_buf[off + 1] = rgb[1];
canvas_buf[off + 2] = rgb[2];
}
}
}
// `from_raw` only returns `None` when `buf.len() < width * height *
// channels`. By construction `canvas_buf.len() == size * size * 3`
// (`pad_canvas_fill` initialized exactly `bytes_usize` bytes and
// `set_len(bytes_usize)` fixed the length; the source overlay above
// writes in place via index assignment and does not change it).
let canvas: ::image::RgbImage = ::image::ImageBuffer::from_raw(size, size, canvas_buf)
.expect("ImageBuffer::from_raw: canvas buffer length matches size * size * 3 by construction");
Ok(::image::DynamicImage::ImageRgb8(canvas))
}
/// Convert a [`image::DynamicImage`] to an `Array` of shape `[H, W, 3]`,
/// dtype `f32`, value range `[0.0, 255.0]` (BEFORE [`rescale`]).
///
/// Mirrors swift `MediaProcessing.asMLXArray` (`MediaProcessing.swift:
/// 164-193`) up to channel layout: the swift reference renders RGBAf at
/// line 171, slices the 4th alpha channel at line 187 (`array[0..., 0...,
/// ..<3]`), and *additionally* reshapes to planar `[1, C, H, W]` at line
/// 190. We stop at the channel-last `[H, W, 3]` step — channel-last is
/// the natural layout for the subsequent [`normalize_imagenet`] +
/// [`rescale`] (the `(3,)` mean/std broadcast cleanly over the last
/// axis), and the per-model planar conversion (`transpose_axes(&[2, 0,
/// 1])` then optional batch axis) is a model-input detail the per-model
/// processor owns.
///
/// **Alpha:** dropped. The swift reference also drops it explicitly at
/// line 187. RGBA images are converted to RGB by discarding the alpha
/// channel (no compositing onto a background) to match the swift
/// behavior.
///
/// **`color_order`:** if [`ColorOrder::Bgr`], the per-pixel R and B
/// channels are swapped during the buffer build (no MLX transpose / no
/// extra Array allocation).
///
/// **Memory ceiling:** none at the `image_to_array` boundary itself.
/// [`load_image`] enforces the `image` crate's 512 MiB
/// `Limits::default().max_alloc` guard before any decoded buffer is
/// returned, so the `image_to_array` input is already size-bounded when
/// it comes through that path. The `h * w * 3` f32 buffer allocated
/// here is the unavoidable `decoded -> f32` widening — the swift
/// reference allocates the same buffer at `Data(count: w * h *
/// bytesPerPixel)` in `MediaProcessing.swift:176`, and python
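/// `np.asarray(image)` does too. For an 8-bit RGB `load_image` source at
/// the decoder default, the upper bound is roughly `4 * 512 MiB = 2 GiB`
/// of f32s; a source with fewer than 3 bytes per pixel (e.g. `Luma8`)
/// widens by a larger factor, because every pixel still yields three f32
/// channels. Callers that hand a `DynamicImage` from a different source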
/// (raw `image::open`, network decoders, etc.) inherit whatever limit
/// that source imposed; pre-validating `img.dimensions()` (a cheap
/// O(1) field read) is the standard escape hatch.
///
/// **No infallible source clone:** the prior
/// implementation called `img.to_rgb8()` unconditionally as its first
/// step. `DynamicImage::to_rgb8()` is documented as "Returns a copy
/// of this image as an RGB image" (image 0.25
/// `DynamicImage::to_rgb8`) — it clones the backing buffer for *every*
/// variant including the already-`Rgb8` case (the buffer is cloned
/// via the infallible `Vec::clone` because the underlying `RgbImage`
/// is `Clone`). For a near-budget input (e.g. an `ImageRgb8` whose
/// decoded buffer is ~512 MiB) this materialized a second source-sized
/// allocation infallibly before the recoverable `try_reserve_exact`
/// gate ever ran — `Vec::clone` aborts on allocator failure.
/// The current implementation eliminates that second source-sized
/// clone:
/// 1. Reserve the output f32 buffer first via `try_reserve_exact`.
/// 2. `as_rgb8()` fast path: read directly from the source's backing
/// `&[u8]` (no clone) and widen to f32.
/// 3. Non-`Rgb8` (`Luma8`/`Rgba8`/`Rgb16`/`Rgb32F`/…): per-pixel
/// `dynamic_image_rgb_pixel` projection (shared private helper) —
/// one `Rgba<u8>` on the stack per pixel, no intermediate
/// full-image alloc. The same projection [`pad_to_square`]'s
/// non-`Rgb8` branch uses, so any future tweak to the per-pixel
/// RGB extraction lives in one place.
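///
/// # Examples
/// The reserve-then-widen order above can be sketched on a raw RGB byte
/// slice. `widen_rgb` is a hypothetical scalar stand-in that uses safe
/// `Vec` calls; the real function fills pre-reserved spare capacity
/// through the SIMD dispatchers and returns an `Array`, which this sketch
/// does not attempt to reproduce.
///
/// ```rust
/// // Hypothetical helper: widen interleaved u8 RGB triples to f32,
/// // optionally swapping R and B. Returns `None` if the reservation fails.
/// fn widen_rgb(raw: &[u8], bgr: bool) -> Option<Vec<f32>> {
///     let mut out: Vec<f32> = Vec::new();
///     // Reserve the whole output first so the fill below never reallocates.
///     out.try_reserve_exact(raw.len()).ok()?;
///     for px in raw.chunks_exact(3) {
///         let (first, last) = if bgr { (px[2], px[0]) } else { (px[0], px[2]) };
///         out.extend_from_slice(&[f32::from(first), f32::from(px[1]), f32::from(last)]);
///     }
///     Some(out)
/// }
///
/// fn main() {
///     let raw = [255u8, 128, 0, 1, 2, 3]; // two RGB pixels
///     // RGB order: values stay in `[0.0, 255.0]`, channels untouched.
///     assert_eq!(
///         widen_rgb(&raw, false),
///         Some(vec![255.0, 128.0, 0.0, 1.0, 2.0, 3.0])
///     );
///     // BGR order: R and B swap per pixel, G stays in the middle.
///     assert_eq!(
///         widen_rgb(&raw, true),
///         Some(vec![0.0, 128.0, 255.0, 3.0, 2.0, 1.0])
///     );
/// }
/// ```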
pub fn image_to_array(img: &::image::DynamicImage, color_order: ColorOrder) -> Result<Array> {
let w = img.width();
let h = img.height();
let w_usize = w as usize;
let h_usize = h as usize;
// FFI-bound shape product overflow guard. `Array::from_slice` validates
// shape-product vs buffer length but does so in `usize` arithmetic
// *after* our cast; on a 32-bit usize the multiplication
// `h_usize * w_usize * 3` can wrap silently. Catch it here with a
// recoverable `Error::ArithmeticOverflow` so callers see a clean error rather
// than a panic downstream. This MUST run before any allocation so a
// hostile dimension product cannot abort in the allocator.
let total = h_usize
.checked_mul(w_usize)
.and_then(|hw| hw.checked_mul(3))
.ok_or_else(|| {
Error::ArithmeticOverflow(ArithmeticOverflowPayload::with_operands(
"image_to_array: total elements (h * w * 3)",
"usize",
[("h", h_usize as u64), ("w", w_usize as u64)],
))
})?;
// Recoverable OOM at the f32 widening boundary. `Vec::with_capacity`
// would `abort()` on a hostile-but-non-overflowing image (the
// `checked_mul` only proves `total` fits `usize`, not that the
// `total * 4` byte alloc succeeds). `try_reserve_exact` surfaces an
// allocator failure as a recoverable `Error::OutOfMemory` so callers
// get a typed Err instead of process termination — matches the
// allocation-discipline pattern `mlxrs::error::Error::OutOfMemory`
// exists for. NO `to_rgb8()` clone runs before this gate.
let mut buf: Vec<f32> = Vec::new();
buf
.try_reserve_exact(total)
.map_err(|_| Error::OutOfMemory)?;
// Fast path: source is already `ImageRgb8`. Read its backing `&[u8]`
// directly (no clone, no per-pixel dispatch) and widen to f32.
//
// `as_rgb8()` returns `Option<&RgbImage>` (borrow, not clone). When
// `Some`, `rgb.as_raw()` is `&[u8]` with length AT LEAST
// `width * height * 3` — the `ImageBuffer::as_raw()` contract allows
// a backing buffer longer than the logical extent (callers can
// construct via `from_raw` with an oversized Vec). Slice to exactly
// `total = H*W*3` bytes via `.get(..total)` so the fill loop iterates
// the correct extent — without this slice an overlong-backing-buffer
// source would grow `buf` past the `try_reserve_exact(total)`
// reservation via infallible allocation, reintroducing the
// abort-on-OOM hazard.
if let Some(rgb) = img.as_rgb8() {
let raw = rgb.as_raw().get(..total).ok_or_else(|| {
Error::LengthMismatch(LengthMismatchPayload::new(
"image_to_array: rgb backing buffer bytes vs H*W*3",
total,
rgb.as_raw().len(),
))
})?;
match color_order {
ColorOrder::Rgb => {
// SIMD dispatcher (`crate::simd::vlm::rgb_widen`). On
// `aarch64` this routes to a `vld1q_u8` + four `vst1q_f32`
// 16-byte-per-tile NEON kernel (pure byte-for-byte widen, no
// de-interleave needed for RGB-in-RGB-out); elsewhere it
// falls back to the scalar `MaybeUninit::write` path. The
// prior `buf.extend(raw.iter().map(|&b| f32::from(b)))`
// shape LLVM auto-vectorized cleanly, but the hand-rolled
// NEON arm pins the contract against compiler-version drift.
// Tracking [#148].
//
// The dispatcher takes `&mut [MaybeUninit<f32>]` (type-
// encoded uninit safety); we pass the
// pre-reserved spare capacity directly and `set_len(total)`
// after every f32 has been written.
{
let spare = buf.spare_capacity_mut();
debug_assert!(
spare.len() >= total,
"try_reserve_exact must have reserved at least total f32s"
);
crate::simd::vlm::rgb_widen(&mut spare[..total], raw);
}
// SAFETY: `rgb_widen` wrote every f32 in `0..total` of the
// spare capacity (function-level contract: "every f32 of
// `out` is written before this returns" — both scalar and
// NEON arms cover the full slice). `Vec::set_len`'s
// preconditions:
// (1) `total <= buf.capacity()` — `try_reserve_exact`
// above reserved exactly that much;
// (2) elements at `[0..total]` are initialized — by
// kernel contract.
unsafe { buf.set_len(total) };
}
ColorOrder::Bgr => {
// Per-pixel R↔B swap via the `bgr_widen` dispatcher
// (`crate::simd::vlm::bgr_widen`). On `aarch64` this routes
// to a `vld3q_u8` + permuted `vst3q_f32` 16-pixel-per-tile
// NEON kernel (the R↔B swap is encoded by feeding the
// de-interleaved planes to the interleave-store in reversed
// R/B order); elsewhere it falls back to the scalar
// `chunks_exact_mut(3) + MaybeUninit::write` path. Tracking
// ([#149]).
//
// The dispatcher takes `&mut [MaybeUninit<f32>]` (type-encoded
// uninit safety — see the kernel-module doc), so we pass the
// pre-reserved spare capacity **directly** and `set_len` after
// every f32 has been written. No `from_raw_parts_mut` cast
// over uninit backing memory (which would be UB regardless of
// subsequent writes, per `from_raw_parts_mut`'s "properly
// initialized" precondition).
{
let spare = buf.spare_capacity_mut();
// `total <= spare.len()` because `try_reserve_exact(total)`
// above reserved exactly `total` extra capacity. Take the
// first `total` slots of spare (one f32 per input byte).
debug_assert!(
spare.len() >= total,
"try_reserve_exact must have reserved at least total f32s"
);
crate::simd::vlm::bgr_widen(&mut spare[..total], raw);
}
// SAFETY: `bgr_widen` wrote every f32 in `0..total` of the
// spare capacity (its function-level contract: "every f32 of
// `out` is written before this returns" — both the scalar
// `chunks_exact_mut(3)` path and the NEON 16-pixel tile path
// cover the full slice). `Vec::set_len`'s preconditions:
// (1) `total <= buf.capacity()` — `try_reserve_exact` above
// reserved exactly that much capacity (a
// `try_reserve_exact` failure would have early-returned
// `Error::OutOfMemory`);
// (2) elements at `[0..total]` are initialized — the
// kernel-contract above guarantees this.
// Both hold.
unsafe { buf.set_len(total) };
}
}
} else {
// Non-`Rgb8` source (Luma8 / Rgba8 / Rgb16 / Rgb32F / …):
// build a contiguous `Vec<u8>` of length `H*W*3` via the shared
// [`dynamic_image_rgb_pixel`] helper (one `Rgba<u8>` on the stack
// per pixel — NO intermediate source-sized RGB image allocation),
// THEN hand the bulk buffer to the same SIMD dispatcher
// the `Rgb8` fast path uses above. The earlier per-pixel
// `buf.push(f32::from(...))` shape paid an iterator + per-element
// capacity-check overhead even though `try_reserve_exact` had
// already reserved exact capacity (#121).
//
// The intermediate `Vec<u8>` is itself fallibly reserved through
// the crate-private `error::try_with_capacity` helper so allocator
// failure surfaces as [`Error::OutOfMemory`] identically to the f32 buf
// (no second abort-on-OOM seam). The intermediate is bounded by
// the same `total` overflow-checked shape product above (its
// byte count == 1×total, vs the f32 buf's 4×total), so the
// byte-budget audit table's "Bounded-memory" Y still holds
// end-to-end: `MAX_DECODED_IMAGE_BYTES` (512 MiB) at
// [`load_image`] caps the SOURCE byte count; for a source with at
// least 3 bytes per pixel the f32 widening here is at most 4× of
// that and this intermediate adds at most 1× more (5× total = ~2.5 GiB
// peak for a near-cap source, unchanged from the prior `push` shape,
// which materialized the same f32 buf and merely paid per-element
// overhead). A `Luma8` source expands further, since each 1-byte pixel
// becomes three output channels.
let mut rgb_buf = crate::error::try_with_capacity::<u8>(total)?;
for y in 0..h {
for x in 0..w {
let rgb = dynamic_image_rgb_pixel(img, x, y);
rgb_buf.push(rgb[0]);
rgb_buf.push(rgb[1]);
rgb_buf.push(rgb[2]);
}
}
debug_assert_eq!(
rgb_buf.len(),
total,
"non-Rgb8 RGB intermediate fill length must equal pre-computed total"
);
// Dispatch through the same SIMD widener the Rgb8 fast path
// uses: NEON kernel on aarch64, auto-vectorized scalar elsewhere.
// The kernel contract (every f32 in `0..total` of spare is written
// before return) and the post-call `set_len` safety chain are
// identical, so every input variant, not just `ImageRgb8`, goes
// through the same code shape and gets the same SIMD treatment.
match color_order {
ColorOrder::Rgb => {
let spare = buf.spare_capacity_mut();
debug_assert!(
spare.len() >= total,
"try_reserve_exact must have reserved at least total f32s"
);
crate::simd::vlm::rgb_widen(&mut spare[..total], &rgb_buf);
}
ColorOrder::Bgr => {
let spare = buf.spare_capacity_mut();
debug_assert!(
spare.len() >= total,
"try_reserve_exact must have reserved at least total f32s"
);
crate::simd::vlm::bgr_widen(&mut spare[..total], &rgb_buf);
}
}
// SAFETY: identical to the `Rgb8` fast path above — `rgb_widen` /
// `bgr_widen` write every f32 in `0..total` per their function-level
// contract; `try_reserve_exact(total)` reserved exactly that much
// capacity; both `Vec::set_len` preconditions hold.
unsafe { buf.set_len(total) };
}
debug_assert_eq!(
buf.len(),
total,
"buf fill length must equal pre-computed total"
);
Array::from_slice(&buf, &(h_usize, w_usize, 3))
}
/// Multiply `arr` by `scale` (typically `1.0 / 255.0`).
///
/// Mirrors the rescale step folded into the swift
/// `MediaProcessing.normalize` colorMatrix (`MediaProcessing.swift:145-156`,
/// "input * factor + bias", where `factor = 1/std` and the `1/255`
/// rescale is pre-applied by callers that pass `mean/255` and
/// `std/255`). The python image-processor surface (`mlx_vlm`) breaks
/// rescale out as a separate step (the HF `BaseImageProcessor.rescale`
/// contract); we expose it as its own primitive for that parity, and
/// [`preprocess`] composes it before [`normalize`].
///
/// **Dtype requirement:** `arr` must be a floating-point dtype
/// (`F16` / `BF16` / `F32` / `F64`). The swift reference's CIFilter
/// colorMatrix only operates on float pixel buffers
/// (`MediaProcessing.swift:171` always renders `CIFormat.RGBAf`) and
/// the python `BaseImageProcessor.rescale` converts to f32 before
/// multiplying; rescaling a u8/i32 input by a sub-unit factor in the
/// input dtype would silently floor to zero (e.g.
/// `astype(1/255, U8) = 0`). Non-float inputs are rejected with
/// [`Error::UnsupportedDtype`].
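///
/// As a small worked illustration of that flooring hazard (the pixel
/// value `200` is an assumption chosen for the example):
/// ```text
/// f32 input: 200.0 * (1/255) ≈ 200.0 * 0.0039216 ≈ 0.78431
/// u8 input:  the factor casts to 0, so 200 * 0 = 0   (value lost)
/// ```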
///
/// Returns a *new* array; the source is unchanged (mlx's standard
/// out-of-place op semantics).
pub fn rescale(arr: &Array, scale: f32) -> Result<Array> {
let dtype = arr.dtype()?;
require_float_dtype("rescale", dtype)?;
// Build a `(1,)` scalar as f32, then cast it to the input's dtype so
// an f16/bf16/f64 input is not silently promoted to f32 (same
// dtype-fidelity discipline as `embeddings::scalar_like`). For an f32
// input the cast is a no-op. Non-float inputs are rejected above.
let s = Array::full::<f32>(&(1,), scale)?;
let s = astype(&s, dtype)?;
multiply(arr, &s)
}
/// Per-channel normalization: `(x - mean[c]) / std[c]`.
///
/// `arr` shape: `[..., 3]` (channel-last). The mean/std tuples are
/// broadcast across all leading dims by reshaping them to `[1, 1, 3]`
/// (when `arr` is `[H, W, 3]`) — generally `arr.ndim() - 1` leading
/// 1-dims so the broadcast applies cleanly regardless of batch axis.
///
/// Mirrors swift
/// [`MediaProcessing.normalize`](https://github.com/ml-explore/mlx-swift-lm/blob/main/Libraries/MLXVLM/MediaProcessing.swift#L135-L157)
/// and torchvision
/// [`Normalize`](https://pytorch.org/vision/main/generated/torchvision.transforms.Normalize.html)
/// (`output[c] = (input[c] - mean[c]) / std[c]`). The swift
/// implementation expresses `(x - mean) / std` via the CIFilter
/// colorMatrix "input * (1/std) + (-mean/std)" trick (algebraically
/// equivalent — see the swift comment block lines 142-148). We use the
/// direct subtract + divide for readability; the math is the same.
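///
/// As an illustrative check of that equivalence (the values `x = 0.5`,
/// `mean = 0.485`, and `std = 0.229` are assumed for the example, using
/// the common ImageNet red-channel constants), both forms agree up to
/// rounding:
/// ```text
/// direct:      (0.5 - 0.485) / 0.229 = 0.015 / 0.229  ≈ 0.065502
/// colorMatrix: 0.5 * (1/0.229) + (-0.485/0.229)
///              ≈ 2.183406 - 2.117904                  ≈ 0.065502
/// ```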
///
/// **Dtype requirement:** `arr` must be a floating-point dtype
/// (`F16` / `BF16` / `F32` / `F64`). ImageNet mean/std values are
/// sub-unit f32s; casting them to an integer dtype would floor every
/// channel to zero and division by `astype(0.229, U8) = 0` would be
/// undefined. Both reference implementations always normalize in
/// float — swift `MediaProcessing.normalize` operates on the f32
/// CIFormat.RGBAf buffer (`MediaProcessing.swift:135-156`) and the
/// HF python `BaseImageProcessor.normalize` converts to f32 before
/// the subtract / divide. Non-float inputs are rejected with
/// [`Error::UnsupportedDtype`].
///
/// **Dtype fidelity (float inputs):** the mean/std arrays adopt the
/// input dtype (so an f16/bf16 input is not silently promoted to f32),
/// matching the embeddings crate's `scalar_like` discipline.
///
/// **Layout note:** the swift / torchvision references both operate on
/// the layout natural to their stack (CIFilter on `[H, W, C]`-rendered
/// CIFormat.RGBAf; torchvision on planar `[C, H, W]` with a `[C, 1, 1]`
/// broadcast). We chose channel-last `[..., 3]` because [`image_to_array`]
/// emits that layout and the `(3,)` mean/std broadcasts over the trailing
/// axis without an extra transpose. Per-model processors that operate
/// on planar `[C, H, W]` data can adapt by giving the mean/std tensors
/// trailing singleton axes (shape `[3, 1, 1]`) themselves before calling
/// [`subtract`] / [`divide`] directly.
pub fn normalize(arr: &Array, mean: &[f32; 3], std: &[f32; 3]) -> Result<Array> {
let ndim = arr.ndim();
if ndim == 0 {
return Err(Error::RankMismatch(RankMismatchPayload::new(
"normalize: input must be rank >= 1 (at least 1 dimension)",
ndim as u32,
arr.shape(),
)));
}
// Validate trailing channel dim == 3 with a clear error before falling
// through to mlx's less-friendly broadcast failure.
let shape = arr.shape();
let trailing = shape[ndim - 1];
if trailing != 3 {
return Err(Error::LengthMismatch(LengthMismatchPayload::new(
"normalize: trailing channel dim (must be 3 for RGB)",
3,
trailing,
)));
}
let dtype = arr.dtype()?;
require_float_dtype("normalize", dtype)?;
// Build (3,) mean and std arrays in the input dtype, then reshape to
// [1, ..., 1, 3] so they broadcast over every leading axis of `arr`.
let mean_arr = make_channel_broadcast(mean, ndim, dtype)?;
let std_arr = make_channel_broadcast(std, ndim, dtype)?;
let centered = subtract(arr, &mean_arr)?;
divide(&centered, &std_arr)
}
/// ImageNet-named alias for [`normalize`] — same `(x - mean) / std`
/// per-channel semantics. Retained for source-compatibility; new code
/// should prefer [`normalize`] (matches swift `MediaProcessing.normalize`
/// and torchvision `Normalize` naming).
pub fn normalize_imagenet(arr: &Array, mean: &[f32; 3], std: &[f32; 3]) -> Result<Array> {
normalize(arr, mean, std)
}
/// Reject non-float dtypes for primitives that need fractional arithmetic.
///
/// The swift reference's CIFilter pipeline runs exclusively in float
/// space (CIFormat.RGBAf @ `MediaProcessing.swift:171`); the python
/// HF processors call `array.astype(np.float32)` before rescale /
/// normalize. We surface the dtype mismatch as a clean
/// `Error::UnsupportedDtype` rather than letting the caller discover it
/// as silent zeros downstream.
fn require_float_dtype(op: &'static str, dtype: Dtype) -> Result<()> {
match dtype {
Dtype::F16 | Dtype::BF16 | Dtype::F32 | Dtype::F64 => Ok(()),
_ => Err(Error::UnsupportedDtype(UnsupportedDtypePayload::new(
op,
dtype,
&[Dtype::F16, Dtype::BF16, Dtype::F32, Dtype::F64],
))),
}
}
/// Build a `[1, ..., 1, 3]`-shaped broadcast tensor from a length-3 f32
/// slice, cast to `dtype`. Helper for [`normalize`].
fn make_channel_broadcast(vals: &[f32; 3], ndim: usize, dtype: Dtype) -> Result<Array> {
// 1-D (3,) constant in f32, then astype to the input dtype.
let a = Array::from_slice(vals, &(3usize,))?;
let a = astype(&a, dtype)?;
// Reshape to [1, ..., 1, 3] (ndim-1 leading 1-dims + the channel axis).
// For ndim == 1 this is a no-op reshape back to (3,).
if ndim <= 1 {
return Ok(a);
}
// Build the target shape on the stack via a 16-dim ceiling: mlx
// arrays are bounded well below 16 dims in practice (CLIP/SigLIP/
// patchify all stay <= 5), and a stack buffer avoids the Vec
// allocation per `feedback_allocation_discipline`. If a caller ever
// hands us > 16 dims, the explicit guard below returns a typed
// `Error::CapExceeded` instead of indexing out of bounds.
const MAX_NDIM: usize = 16;
if ndim > MAX_NDIM {
return Err(Error::CapExceeded(CapExceededPayload::new(
"normalize: input ndim",
"MAX_NDIM",
MAX_NDIM as u64,
ndim as u64,
)));
}
let mut buf = [1usize; MAX_NDIM];
buf[ndim - 1] = 3;
reshape(&a, &&buf[..ndim])
}
/// Read a single `[R, G, B]` u8 triple at `(x, y)` from a [`DynamicImage`]
/// without materializing an intermediate full-image `RgbImage`.
///
/// Shared per-pixel projection for the non-`Rgb8` branches of
/// [`pad_to_square`] and [`image_to_array`]. Both callers used to embed
/// the same `get_pixel().0[..3]` projection inline; lifting it into one
/// helper structurally unifies them so any future tweak (alpha
/// premultiplication, gamma handling, etc.) lives in one place.
///
/// **Why not `img.to_rgb8()` once?** `DynamicImage::to_rgb8()` is
/// documented as "Returns a copy of this image as an RGB image"
/// (image 0.25 `DynamicImage::to_rgb8`) — it clones the backing buffer
/// for *every* variant including the already-`Rgb8` case, via the
/// infallible `Vec::clone`. For a near-budget input that buffer is
/// itself hundreds of MiB, so the clone aborts the process on allocator
/// failure — defeating the recoverable-OOM contract the two callers
/// enforce on their *output* allocations. The per-pixel path here
/// touches the source's already-resident memory in place (one
/// `Rgba<u8>` on the stack per pixel) and never spawns a second
/// source-sized copy.
///
/// **Color projection:** `DynamicImage::get_pixel(x, y)` returns
/// `Rgba<u8>` regardless of the underlying variant — see image 0.25
/// `dynimage.rs:1499-1501` for the `dynamic_map!(*self, ref p,
/// p.get_pixel(x, y).to_rgba().into_color())` dispatch. The
/// per-variant projections this composes through:
/// - `ImageLuma8`: grey → broadcast to `(L, L, L, 255)`.
/// - `ImageLumaA8`: grey + alpha → `(L, L, L, A)`.
/// - `ImageRgb8`: `(R, G, B)` → `(R, G, B, 255)` (we don't take this
/// path here — see `as_rgb8()` fast path in the callers).
/// - `ImageRgba8`: identity.
/// - `ImageRgb16` / `ImageRgba16`: 16-bit → 8-bit via the standard
/// `Subpixel: ColorConvert` shift-down.
/// - `ImageRgb32F` / `ImageRgba32F`: float → 8-bit via the standard
/// clamp + scale.
///
/// We drop the alpha channel and return `[R, G, B]` — identical
/// projection to the prior `to_rgb8()` call, just without the
/// intermediate full-image allocation.
///
/// **Bounds:** caller must guarantee `x < img.width()` and
/// `y < img.height()`: `DynamicImage::get_pixel` panics on
/// out-of-bounds indices (the `image` crate documents this; the
/// `dynamic_map!` dispatch goes through the per-variant
/// `ImageBuffer::get_pixel`, which panics rather than returning
/// `Option`). Both callers iterate `0..h` × `0..w`, so this is
/// trivially satisfied.
fn dynamic_image_rgb_pixel(img: &::image::DynamicImage, x: u32, y: u32) -> [u8; 3] {
// `GenericImageView` is brought into scope locally so `get_pixel`
// resolves on the opaque `DynamicImage` type without polluting the
// module-level imports.
use ::image::GenericImageView as _;
let p = img.get_pixel(x, y);
[p.0[0], p.0[1], p.0[2]]
}
/// Project a single pixel of any [`::image::DynamicImage`] variant to an
/// 8-bit RGBA quad on the stack (no allocation).
///
/// The RGBA sibling of [`dynamic_image_rgb_pixel`]: keeps the alpha
/// channel instead of dropping it, for the [`resize`] source-buffer
/// materialization (which resizes in `U8x4` for parity with image-rs's
/// `imageops::resize` over a `DynamicImage`). `DynamicImage::get_pixel`
/// returns an `Rgba<u8>` regardless of the backing variant — image's
/// `dynamic_map!` dispatch performs the Luma-broadcast / 16-bit→8-bit /
/// float→u8 / opaque-alpha conversions, exactly what `to_rgba8()` did
/// per-pixel, so the borrowed-`as_rgba8()` fast path and this per-pixel
/// path produce byte-identical RGBA for the same source.
fn dynamic_image_rgba_pixel(img: &::image::DynamicImage, x: u32, y: u32) -> [u8; 4] {
use ::image::GenericImageView as _;
let p = img.get_pixel(x, y);
[p.0[0], p.0[1], p.0[2], p.0[3]]
}
/// Patchify `[H, W, C]` into `[H/p * W/p, p, p, C]` (ViT-style flat
/// patch sequence).
///
/// Mirrors the patchification step shared by every ViT-class VLM
/// encoder. The swift `MediaProcessing` module does not expose a
/// dedicated `patchify` helper (per-model image processors in
/// `MLXVLM/Models/*` perform their own patch extraction); we expose it
/// here as a *uniform-grid* primitive because every model that needs
/// patches needs at least this baseline transform, and exposing it as a
/// primitive keeps the per-model processor a thin caller. The
/// `transformers`-style `patchify` and `mlx-vlm`'s per-model patch
/// extractors are aspect-ratio-aware variants that are out of scope
/// (per-usecase per the no-per-model-arch rule).
///
/// Returns `Err(Error::RankMismatch)` if the input is not rank-3;
/// `Err(Error::OutOfRange)` if `patch_size == 0`; or
/// `Err(Error::DivisibilityConstraint)` if `H % p != 0 || W % p != 0`.
///
/// Layout: input `[H, W, C]` → reshape `[H/p, p, W/p, p, C]` →
/// transpose `[H/p, W/p, p, p, C]` → reshape `[H/p * W/p, p, p, C]`.
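///
/// A worked shape trace, assuming an illustrative `[4, 6, 3]` input with
/// `patch_size = 2` (so `hp = 2`, `wp = 3`; these sizes are chosen only
/// for the example):
/// ```text
/// [4, 6, 3]        input      [H, W, C]
/// [2, 2, 3, 2, 3]  reshape    [hp, p, wp, p, C]
/// [2, 3, 2, 2, 3]  transpose  [hp, wp, p, p, C]
/// [6, 2, 2, 3]     reshape    [hp * wp, p, p, C]
/// ```
/// Under this layout, input pixel `(y, x)` lands in patch
/// `(y / p) * wp + (x / p)` at in-patch offset `(y % p, x % p)`; for
/// example, `(3, 4)` maps to patch `1 * 3 + 2 = 5` at offset `(1, 0)`.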
pub fn patchify(arr: &Array, patch_size: usize) -> Result<Array> {
let shape = arr.shape();
if shape.len() != 3 {
return Err(Error::RankMismatch(RankMismatchPayload::new(
"patchify: input must be rank-3 [H, W, C]",
shape.len() as u32,
shape,
)));
}
if patch_size == 0 {
return Err(Error::OutOfRange(OutOfRangePayload::new(
"patchify: patch_size",
"must be > 0",
"0",
)));
}
let h = shape[0];
let w = shape[1];
let c = shape[2];
if !h.is_multiple_of(patch_size) {
return Err(Error::DivisibilityConstraint(
DivisibilityConstraintPayload::new(
"patchify: H by patch_size",
"H",
h as u64,
"patch_size",
patch_size as u64,
),
));
}
if !w.is_multiple_of(patch_size) {
return Err(Error::DivisibilityConstraint(
DivisibilityConstraintPayload::new(
"patchify: W by patch_size",
"W",
w as u64,
"patch_size",
patch_size as u64,
),
));
}
let hp = h / patch_size;
let wp = w / patch_size;
// Checked multiply for the final-stage shape — a hostile `(H, W,
// patch_size)` could overflow `usize` on the `hp * wp` product on a
// 32-bit target (or, with extreme inputs, on a 64-bit target via
// genuinely-large images). Surface as recoverable
// `Error::ArithmeticOverflow` rather than silently wrapping to a
// smaller-than-expected first axis (which would later cause
// reshape/broadcast misalignment).
let n_patches = hp.checked_mul(wp).ok_or_else(|| {
Error::ArithmeticOverflow(ArithmeticOverflowPayload::with_operands(
"patchify: n_patches (hp * wp)",
"usize",
[("hp", hp as u64), ("wp", wp as u64)],
))
})?;
// [H, W, C] → [hp, p, wp, p, C] (stack `[usize; 5]` buffer; no Vec
// alloc per `feedback_allocation_discipline`)
let stage1: [usize; 5] = [hp, patch_size, wp, patch_size, c];
let r1 = reshape(arr, &&stage1[..])?;
// → [hp, wp, p, p, C] (move axis 2 ahead of axis 1)
let t = transpose_axes(&r1, &[0, 2, 1, 3, 4])?;
// → [hp * wp, p, p, C]
reshape(&t, &(n_patches, patch_size, patch_size, c))
}
/// End-to-end preprocessing: optional resize → channel-last
/// `[H, W, 3]` f32 → optional `1/255` rescale → optional ImageNet
/// normalization → trailing layout post-step.
///
/// Mirrors the swift `MediaProcessing` pipeline composition documented
/// in the module example (`MediaProcessing.swift:25-39`):
/// ```text
/// resample → normalize → asMLXArray
/// ```
/// We re-order to `resample → asMLXArray → rescale → normalize` because
/// our [`image_to_array`] returns `[0, 255]` f32 (not the `[0, 1]`
/// f32 the swift CIFilter normalize expects on its input — swift's
/// version pre-bakes the `1/255` rescale into the colorMatrix factor),
/// which keeps each primitive single-purpose and lets callers swap
/// individual stages. The composite math is identical:
/// `(x/255 - mean) / std` = `(x - 255*mean) / (255*std)`.
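///
/// A quick numeric check of that identity, with illustrative values
/// (`x = 128`, `mean = 0.485`, and `std = 0.229` are assumptions for the
/// example, not defaults read from this crate):
/// ```text
/// rescale, then normalize: (128/255 - 0.485) / 0.229 ≈ 0.016961 / 0.229 ≈ 0.07406
/// folded form:             (128 - 123.675) / 58.395  = 4.325 / 58.395    ≈ 0.07406
/// ```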
///
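/// The output is channel-last `[H, W, 3]` by default (`Layout::Hwc`).
/// Per-model processors that need planar `[C, H, W]` or batched
/// `[1, C, H, W]` request it through [`ImageProcessorConfig::layout`],
/// which [`apply_layout`] applies as a post-step (one lazy
/// `transpose_axes`, plus `expand_dims_axes` for the batch axis; see the
/// module `Conventions > Channel layout` block for the rationale).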
///
/// **Allocation contract.** This composer returns `Result<Array>` and
/// is fully **bounded-memory**: the source is capped by
/// [`MAX_DECODED_IMAGE_BYTES`] (512 MiB) via [`load_image`], and the
/// [`resize`] destination is capped against the same ceiling by
/// [`resize`]'s target-dimension guard. Because `cfg.size` now flows
/// from an UNTRUSTED loaded processor config, an over-budget / zero /
/// overflowing `size` surfaces as a typed [`Error::OutOfRange`] /
/// [`Error::CapExceeded`] / [`Error::ArithmeticOverflow`] from `resize`
/// via `?`, NOT as a process abort. The remaining stages
/// ([`image_to_array`] f32 widening, [`rescale`], and [`normalize`])
/// are end-to-end recoverable ([`Error::OutOfMemory`] / mlx backend
/// `Result`). The only residual abort path is `resize`'s two
/// image-crate-internal `Vec` allocs (`to_rgba8` clone, `Image::new`
/// destination) under a genuine system-wide OOM at a now-bounded
/// ≤512 MiB request.
///
/// See the module-level audit table for the per-site breakdown.
// NOTE: `preprocess` does not return swift's `[1, C, H, W]` by default.
// The cross-model primitive always runs the ImageNet pipeline on
// `[H, W, 3]` (so the `(3,)` mean/std broadcast cleanly across the
// trailing axis; see the module-doc `Conventions > Channel layout`
// block). The trailing **planar layout** is opt-in via
// [`ImageProcessorConfig::layout`], a `Layout` enum defaulting to `Hwc`:
// pre-existing callers see no change, and per-model encoders that want
// `[1, C, H, W]` request `Layout::Bchw`, for which the composer applies
// one lazy transpose + expand_dims at zero memory cost. See [`Layout`]
// for the per-arm rationale; tracking issue
// [#120](https://github.com/Findit-AI/mlxrs/issues/120).
pub fn preprocess(img: &::image::DynamicImage, cfg: &ImageProcessorConfig) -> Result<Array> {
let resized;
let src = if cfg.do_resize {
resized = resize(img, cfg.size, cfg.resample)?;
&resized
} else {
img
};
let arr = image_to_array(src, cfg.color_order)?;
let arr = if cfg.do_rescale {
rescale(&arr, cfg.rescale_factor)?
} else {
arr
};
let arr = if cfg.do_normalize {
normalize(&arr, &cfg.mean, &cfg.std)?
} else {
arr
};
apply_layout(arr, cfg.layout)
}
/// Apply the trailing tensor layout post-step to a channel-last
/// `[H, W, 3]` array (the canonical [`preprocess`] output).
///
/// **Rank precondition.** The post-step targets the cross-model
/// preprocessor's rank-3 channel-last output specifically — `Hwc`,
/// `Chw`, and `Bchw` are all defined as `H W 3` permutations / batch
/// expansions of that exact shape. Non-rank-3 / non-trailing-3-channel
/// inputs are rejected with [`Error::RankMismatch`] (wrong ndim) or
/// [`Error::LengthMismatch`] (trailing channel dim != 3) before any FFI
/// call; per-model processors that produce different ranks (patchified
/// `[N, P, P, 3]`, batched `[B, H, W, 3]`, etc.) compose their own
/// trailing layout via `transpose_axes` / `expand_dims_axes` directly.
///
/// **Ownership.** Takes `arr` by **value** so the [`Layout::Hwc`] arm
/// is a literal identity (returns the input as-is) without going
/// through [`Array::try_clone`] — `Array` is intentionally non-`Clone`
/// (every duplication is a fallible mlx FFI bump), and the
/// pre-existing `Hwc` default must not pay a `try_clone` cost just to
/// pass through.
///
/// **Lazy / zero-copy on the non-identity arms.** Both [`transpose_axes`]
/// and `expand_dims_axes` update the [`Array`]'s shape/stride metadata
/// without copying the underlying buffer (mlx's standard no-op view
/// semantics). The `Chw` arm is one metadata-only `transpose_axes`; the
/// `Bchw` arm is the same transpose + one `expand_dims_axes(&[0])`
/// (which inserts a leading unit axis, again metadata-only).
///
/// **Axis convention.** `transpose_axes(&[2, 0, 1])` permutes the
/// rank-3 input `[H, W, 3]` → `[3, H, W]`, matching the swift
/// `MediaProcessing.asMLXArray` line 190
/// (`array.reshape((1, h, w, 3)).transposed(0, 3, 1, 2)`) modulo the
/// batch axis that `expand_dims_axes(&[0])` adds for the `Bchw` arm.
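///
/// A shape trace for an illustrative `[224, 224, 3]` input (the size is
/// an assumption chosen for the example), next to the swift path it
/// is meant to match:
/// ```text
/// Hwc:   [224, 224, 3]                                    (identity)
/// Chw:   [224, 224, 3]    -> transpose (2, 0, 1)     -> [3, 224, 224]
/// Bchw:  [3, 224, 224]    -> expand_dims (0)         -> [1, 3, 224, 224]
/// swift: [1, 224, 224, 3] -> transposed (0, 3, 1, 2) -> [1, 3, 224, 224]
/// ```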
///
/// # Errors
/// - [`Error::RankMismatch`] if the input is not rank-3;
/// [`Error::LengthMismatch`] if the trailing channel dim is not 3.
///
/// Tracking issue: [#120](https://github.com/Findit-AI/mlxrs/issues/120).
pub fn apply_layout(arr: Array, layout: Layout) -> Result<Array> {
use crate::ops::shape::expand_dims_axes;
// Validate rank-3 `[H, W, 3]` shape before any FFI call — surfaces a
// typed `RankMismatch` / `LengthMismatch` for batched / patchified / non-RGB inputs
// rather than letting mlx's permutation-validation error message
// surface (which would be less actionable).
let shape = arr.shape();
if shape.len() != 3 {
return Err(Error::RankMismatch(RankMismatchPayload::new(
"apply_layout: expected rank-3 [H, W, 3] input (per-model processors that produce \
non-rank-3 layouts should compose transpose_axes / expand_dims_axes directly)",
shape.len() as u32,
shape,
)));
}
if shape[2] != 3 {
return Err(Error::LengthMismatch(LengthMismatchPayload::new(
"apply_layout: trailing channel dim (must be 3 for RGB)",
3,
shape[2],
)));
}
match layout {
Layout::Hwc => {
// Identity — historical channel-last output. Pass through by
// value: no `try_clone`, no allocation.
Ok(arr)
}
Layout::Chw => {
// [H, W, 3] → [3, H, W]. One metadata-only transpose.
transpose_axes(&arr, &[2, 0, 1])
}
Layout::Bchw => {
// [H, W, 3] → [3, H, W] → [1, 3, H, W]. Matches swift
// `MediaProcessing.asMLXArray` (`MediaProcessing.swift:190` —
// `array.reshape((1, h, w, 3)).transposed(0, 3, 1, 2)`); we
// compose the equivalent via transpose + expand_dims since the
// input is already `[H, W, 3]` (no reshape needed). Both ops
// are metadata-only on the lazy `Array`.
let chw = transpose_axes(&arr, &[2, 0, 1])?;
expand_dims_axes(&chw, &[0])
}
}
}
/// Private regression tests for [`apply_orientation_fallible`] and
/// the truly-fallible [`rotate_buf`] helper.
///
/// **Rotate coverage.** The four EXIF rotate orientations
/// (Rotate90 / Rotate270 / Rotate90FlipH / Rotate270FlipH) are
/// exercised on all four u8 [`DynamicImage`] variants
/// (Luma8 / LumaA8 / Rgb8 / Rgba8) and compared byte-for-byte
/// against the corresponding `image::imageops` reference output.
/// The non-square 4x3 test image makes 90° dimension swaps visible
/// at the test boundary (output dims must be 3x4, not 4x3).
///
/// These live inline (not in the `tests/vlm_image.rs` integration
/// suite) because [`apply_orientation_fallible`] / [`rotate_buf`]
/// are private — exposing them just to test would widen the public
/// surface for no caller benefit.
#[cfg(test)]
mod apply_orientation_tests;
/// Private unit tests for the `vlm::image` helpers and enum string tags
/// the public-API integration suite (`tests/vlm_image.rs`) cannot reach:
/// the `as_str` / `Display` surface of [`ResizeFilter`] / [`ColorOrder`]
/// / [`Layout`], the private [`make_channel_broadcast`] rank arms, the
/// rank-0 and W-axis validation branches of [`normalize`] / [`patchify`],
/// and the `load_image` parse-error closure.
#[cfg(test)]
mod tests;