edgefirst-codec 0.23.2

Image codec for decoding JPEG/PNG into pre-allocated EdgeFirst tensors
# EdgeFirst Codec Architecture

## Overview

The `edgefirst-codec` crate decodes images into pre-allocated tensor
buffers. It is designed for real-time vision pipelines, where allocating
a new output buffer on every frame is an anti-pattern to avoid.

The core principle: **allocate once at init, decode in the hot loop**.

## Crate Position in the Workspace

```
edgefirst-tensor ← edgefirst-codec ← edgefirst-image (re-export)
                                    ← edgefirst-hal (re-export)
                                    ← crates/python (bindings)
```

`edgefirst-codec` depends only on `edgefirst-tensor` plus `zune-png`
(for PNG decoding) and `kamadak-exif` (for EXIF orientation). JPEG
decoding uses a custom from-scratch decoder with no external dependencies.
The crate has no dependency on `edgefirst-image` or any GPU libraries,
keeping the dependency graph clean.

## Module Map

| Module       | Purpose                                         |
|--------------|-------------------------------------------------|
| `lib.rs`     | Crate root, public re-exports                   |
| `error.rs`   | `CodecError` enum with capacity/dtype/format/IO |
| `pixel.rs`   | `ImagePixel` trait (u8, u16, i8, i16, f32)      |
| `options.rs` | `DecodeOptions` and `ImageInfo` structs         |
| `decoder.rs` | `ImageDecoder` struct with `JpegDecoderState`   |
| `traits.rs`  | `ImageLoad` extension trait for Tensor/TensorDyn|
| `jpeg/`      | Custom baseline JPEG decoder (see below)        |
| `png.rs`     | PNG decode with format conversion and native 16-bit support |

### JPEG Module Map (`jpeg/`)

| Module           | Purpose                                              |
|------------------|------------------------------------------------------|
| `mod.rs`         | `JpegDecoderState`, `decode_jpeg_into<T>()`, EXIF    |
| `types.rs`       | `Component`, `SamplingFactor`, `ImageHeader`, `QuantTable`, `ZIGZAG` |
| `markers.rs`     | SOF/SOS/DQT/DHT/DRI/APP marker parsing               |
| `bitstream.rs`   | 64-bit bit buffer with FF/00 byte-stuffing, bulk refill |
| `huffman.rs`     | 11-bit lookahead Huffman LUT, `decode_block()` with dequant fusion |
| `idct/mod.rs`    | IDCT dispatcher (scalar/NEON/SSE4.1/SSE2 selection via function pointers) |
| `idct/scalar.rs` | Two-pass Loeffler 8×8 IDCT with DC-only fast path    |
| `idct/neon.rs`   | NEON 8×8 IDCT: 4-wide Loeffler butterfly, 4×4 transpose, DC-only fill |
| `idct/sse2.rs`   | SSE2 8×8 IDCT: 4-wide Loeffler butterfly, emulated mullo_epi32 |
| `idct/sse41.rs`  | SSE4.1 8×8 IDCT: native mullo_epi32, min/max clamping |
| `color/mod.rs`   | Color conversion dispatcher                           |
| `color/scalar.rs`| BT.601 full-range YCbCr→RGB/RGBA/BGRA/Grey           |
| `color/neon.rs`  | NEON YCbCr→RGB/RGBA/BGRA: 8-pixel SIMD with vst3/vst4 |
| `color/sse2.rs`  | SSE2 YCbCr→RGBA/BGRA: 8-pixel SIMD with unpack interleave |
| `color/ssse3.rs` | SSSE3 YCbCr→RGB: 8-pixel SIMD with shuffle-based 3-channel interleave |
| `convert.rs`     | Vectorised u8→f32/u16/i16 pixel conversion (NEON + SSE2) |
| `upsample/mod.rs`| Chroma upsample dispatcher                            |
| `upsample/scalar.rs` | Bilinear 3:1 blend for horizontal 2× upsampling |
| `upsample/neon.rs`   | NEON horizontal 2× upsample: widening multiply-accumulate |
| `upsample/sse2.rs`   | SSE2 horizontal 2× upsample: 16-bit multiply with pack |
| `mcu.rs`         | MCU decode loop, `McuScratch`, strided output, NV12 path |

## Key Design Decisions

### Standalone `ImageDecoder` Struct

The decoder is a standalone struct rather than being embedded in
`ImageProcessor` or stored in thread-local state. This gives callers
explicit ownership and composability — one decoder per pipeline stage,
no hidden global state.

```rust
let mut decoder = ImageDecoder::new();
// Scratch buffers amortize across calls
loop {
    let info = tensor.load_image(&mut decoder, &bytes, &opts)?;
}
```

### `ImageLoad` Extension Trait

The primary user-facing API is the `ImageLoad` trait, implemented for both
`Tensor<T>` (where `T: ImagePixel`) and `TensorDyn`. This keeps the tensor
types in `edgefirst-tensor` unaware of codec internals.

### `&[u8]` as the Hot Path

The decode pipeline takes `&[u8]` as input, which covers the most common
cases (memory-mapped files, network buffers, camera frames). `Read`-based
wrappers buffer into `ImageDecoder.input_buffer` before delegating to the
`&[u8]` path.
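
As a rough illustration of that buffering step, the sketch below drains a `Read` source into a reusable `Vec<u8>` and then hands the bytes to a slice-based function. `InputBuffer`, `fill_from`, and `decode_slice` are hypothetical names invented for this example; they are not the crate's API, and the real wrapper may differ.

```rust
use std::io::Read;

/// Hypothetical stand-in for the decoder's reusable input buffer.
struct InputBuffer {
    buf: Vec<u8>,
}

impl InputBuffer {
    fn new() -> Self {
        Self { buf: Vec::new() }
    }

    /// Drain `reader` into the reusable buffer and return the bytes as a slice.
    /// `clear()` keeps the existing capacity, so a later frame of similar size
    /// does not need a fresh allocation.
    fn fill_from<R: Read>(&mut self, mut reader: R) -> std::io::Result<&[u8]> {
        self.buf.clear();
        reader.read_to_end(&mut self.buf)?;
        Ok(&self.buf)
    }
}

/// Placeholder for the slice-based hot path; here it only reports the length.
fn decode_slice(bytes: &[u8]) -> usize {
    bytes.len()
}
```

A caller would pass any `Read` implementation (a `File`, a `TcpStream`, a `Cursor`) to `fill_from` and forward the returned slice to the `&[u8]` path.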

### Strided Output

Decoders write row-by-row using the tensor's `effective_row_stride()`. This
supports tensors with GPU pitch alignment padding (e.g., 64-byte alignment
for Mali DMA-BUF import). The padding bytes between the end of each row and
the next stride boundary are left untouched.

```
Tensor buffer layout (1280×720 RGB, 64-byte aligned stride = 3840):
┌──────────────────────────────┬────┐
│ row 0: 1280×3 = 3840 bytes  │ 0  │  ← no padding (3840 % 64 == 0)
├──────────────────────────────┼────┤
│ row 1: 1280×3 = 3840 bytes  │ 0  │
├──────────────────────────────┼────┤
│ ...                          │    │
└──────────────────────────────┴────┘
```

For misaligned widths (e.g., 641 pixels × 3 = 1923 bytes, padded to 1984):
```
┌────────────────────────┬──────────┐
│ row 0: 641×3 = 1923    │ 61 pad   │  ← stride = 1984
├────────────────────────┼──────────┤
│ row 1: 641×3 = 1923    │ 61 pad   │
└────────────────────────┴──────────┘
```
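
The stride arithmetic in the two diagrams can be reproduced with a short sketch. `aligned_stride` and `write_rows_strided` are illustrative helpers written for this document, not functions exported by the crate, and the 64-byte alignment is taken from the example above.

```rust
/// Round a row of `width * channels` bytes up to the next multiple of `align`.
fn aligned_stride(width: usize, channels: usize, align: usize) -> usize {
    assert!(align > 0);
    let row_bytes = width * channels;
    (row_bytes + align - 1) / align * align
}

/// Copy tightly packed rows into a strided buffer.
/// Only the first `row_bytes` of every destination row are written, so the
/// padding bytes at the end of each row keep their previous contents.
fn write_rows_strided(src: &[u8], row_bytes: usize, dst: &mut [u8], stride: usize) {
    assert!(row_bytes > 0 && stride >= row_bytes);
    for (src_row, dst_row) in src.chunks_exact(row_bytes).zip(dst.chunks_mut(stride)) {
        dst_row[..row_bytes].copy_from_slice(src_row);
    }
}
```

With these helpers, `aligned_stride(1280, 3, 64)` gives 3840 (no padding) and `aligned_stride(641, 3, 64)` gives 1984 (61 padding bytes per row), matching the layouts shown.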

### Works Best with `ImageProcessor::create_image()`

While `ImageLoad` works with any `Tensor<T>` or `TensorDyn`, optimal
performance requires tensors allocated by `ImageProcessor::create_image()`:

- **DMA-BUF backing**: Zero-copy path to GPU for `convert()`
- **PBO backing**: When GL is the active transfer path
- **GPU pitch alignment**: Row stride padded for Mali DMA-BUF import

Free-standing `Tensor::new()` or `Tensor::image()` works but:
- Cannot produce PBO tensors (requires GL context)
- May not have GPU-aligned pitch (works, but `convert()` may use CPU path)

### Tensor Dimensions After Decode

When a smaller image (e.g., 640×480) is decoded into a larger tensor
(e.g., 1920×1080), the tensor's physical buffer and shape are unchanged.
`ImageInfo` reports the actual decoded dimensions. Callers use `Crop` with
`ImageProcessor::convert()` to process only the decoded region:

```rust
let info = tensor.load_image(&mut decoder, &bytes, &opts)?;
processor.convert(&tensor, &mut dst, rot, flip,
    Crop::new(0, 0, info.width, info.height))?;
```

## Decode Pipeline

### JPEG Decode Flow

The custom baseline JPEG decoder processes images through these stages:

1. **Marker parsing** (`markers.rs`): Parse SOF0, DQT, DHT, DRI, SOS, APP1
   segments. Build Huffman tables, quantisation tables, and extract EXIF data.
2. **Capacity validation**: Verify tensor dimensions ≥ decoded image size
   (accounting for EXIF rotation if enabled).
3. **MCU decode loop** (`mcu.rs`): For each MCU row:
   a. **Huffman decode** (`huffman.rs`): 11-bit lookahead LUT decodes DC/AC
      coefficients with dequantisation fused into the decode step.
   b. **IDCT** (`idct/`): Two-pass Loeffler 8×8 IDCT with DC-only fast
      path converts frequency coefficients → spatial pixel values.
   c. **Chroma upsample** (`upsample/`): Bilinear 3:1 blend expands
      subsampled Cb/Cr channels to full resolution.
   d. **Color conversion** (`color/`): BT.601 full-range
      YCbCr→RGB/RGBA/BGRA/Grey conversion with clamping.
   e. **Strided output**: Write converted pixels to tensor buffer at
      `effective_row_stride()` offsets.
4. **EXIF rotation/flip**: Apply orientation transform in-place (if enabled).
5. **Type conversion** (`convert.rs`): For non-u8 targets, convert pixel
   data via SIMD-vectorised paths: NEON/SSE2 for f32 (×1/255), u16 (×257),
   i16 (×257 XOR 0x8000); byte-level XOR for i8.
6. **Return** `ImageInfo` with decoded dimensions.
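
The capacity validation in step 2 can be pictured as follows. This is a minimal sketch that assumes a 90° or 270° EXIF rotation swaps width and height before the comparison; `output_size` and `fits` are hypothetical helpers, and the crate's actual check may consider more than these two dimensions.

```rust
/// Size of the image as it will land in the tensor: width and height swap
/// when the orientation implies a 90 or 270 degree rotation.
fn output_size(width: usize, height: usize, rotation_deg: u32) -> (usize, usize) {
    match rotation_deg % 360 {
        90 | 270 => (height, width),
        _ => (width, height),
    }
}

/// True when the (possibly rotated) image fits inside the tensor dimensions.
fn fits(tensor_w: usize, tensor_h: usize, img_w: usize, img_h: usize, rotation_deg: u32) -> bool {
    let (out_w, out_h) = output_size(img_w, img_h, rotation_deg);
    out_w <= tensor_w && out_h <= tensor_h
}
```

Under these assumptions a 640×480 image fits a 640×480 tensor unrotated, but not once a 90° rotation turns it into 480×640.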

**Key optimisations:**
- `JpegDecoderState` persists across frames — `McuScratch` buffers grow
  to the high-water mark and are reused. After the first decode at a given
  resolution, the JPEG decoder performs zero heap allocations.
- Dequantisation is fused into Huffman decode: `decode_block()` multiplies
  each coefficient by the quant table entry during decode, not as a
  separate pass.
- DC-only IDCT fast path: when all 63 AC coefficients are zero, the IDCT
  reduces to a constant fill (single multiply + shift).
- Function pointer dispatch for IDCT/color/upsample: selected once at init
  based on CPU feature detection (NEON on AArch64, SSE4.1 > SSE2 on x86-64,
  scalar fallback).
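
To make the DC-only fast path concrete, here is a scalar sketch. It assumes the conventional JPEG IDCT normalisation, in which a block with only a dequantised DC term produces the constant `dc / 8` plus the +128 level shift; the crate's fixed-point scaling may differ, and `is_dc_only` and `idct_dc_only` are names chosen for illustration.

```rust
/// True when all 63 AC coefficients of the block are zero.
fn is_dc_only(block: &[i32; 64]) -> bool {
    block[1..].iter().all(|&c| c == 0)
}

/// Fill the 8x8 output with the single value implied by the DC coefficient.
/// `(dc + 4) >> 3` is a rounded divide by 8; 128 is the JPEG level shift.
fn idct_dc_only(dc: i32, out: &mut [u8; 64]) {
    let value = (((dc + 4) >> 3) + 128).clamp(0, 255) as u8;
    out.fill(value);
}
```

A caller would test `is_dc_only` first and fall through to the full two-pass IDCT only when at least one AC coefficient is non-zero.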

### NEON SIMD Kernels (AArch64)

On AArch64, the decoder uses NEON intrinsics for the three hot-path kernels.
Each kernel is selected via `std::arch::is_aarch64_feature_detected!("neon")`
at init time.

| Kernel       | Strategy                                          | Throughput    |
|--------------|---------------------------------------------------|---------------|
| **IDCT**     | 4-wide Loeffler butterfly with int32x4_t, 4×4 transpose via vzip, DC-only fills 8 bytes via vdup/vst1 | 4 cols/rows per iteration |
| **Color**    | 7-bit fixed-point YCbCr→RGB/RGBA/BGRA, vmovl widening, vrshrq rounding shift, vqmovun saturation, vst3/vst4 interleaved store | 8 pixels per iteration |
| **Upsample** | Widening bilinear 3:1 blend via vmulq_n_u16, interleaved output via vst2 | 8→16 samples per iteration |

### SSE2/SSE4.1/SSSE3 SIMD Kernels (x86-64)

On x86-64, the decoder uses a tiered SIMD dispatch: SSE4.1 > SSE2 > scalar
for IDCT, SSSE3 > SSE2 for RGB color conversion. Each tier is selected at
init via `is_x86_feature_detected!()`.

| Kernel       | Tier    | Strategy                                          | Throughput    |
|--------------|---------|---------------------------------------------------|---------------|
| **IDCT**     | SSE4.1  | 4-wide Loeffler with native `_mm_mullo_epi32`, `_mm_min_epi32`/`_mm_max_epi32` clamp | 4 cols/rows per iteration |
| **IDCT**     | SSE2    | 4-wide Loeffler with emulated `mullo_epi32` (4 instructions), comparison-based clamp | 4 cols/rows per iteration |
| **Color RGB**| SSSE3   | 7-bit fixed-point YCbCr→RGB, `_mm_shuffle_epi8` for 3-channel interleave | 8 pixels per iteration |
| **Color RGBA/BGRA** | SSE2 | 7-bit fixed-point, `_mm_unpacklo_epi8` 4-channel interleave | 8 pixels per iteration |
| **Upsample** | SSE2    | 16-bit bilinear 3:1 blend via `_mm_mullo_epi16`, `_mm_packus_epi16` narrow, `_mm_unpacklo_epi8` interleave | 16→32 samples per iteration |

SSE4.1 IDCT improvements over SSE2:
- Native `_mm_mullo_epi32` replaces 4-instruction emulation (2× `_mm_mul_epu32` +
  shuffle + unpack), reducing IDCT instruction count by ~30%.
- `_mm_min_epi32`/`_mm_max_epi32` replaces 5-instruction comparison-based clamp
  with a 2-instruction branchless clamp.

SSSE3 RGB improvements over SSE2:
- `_mm_shuffle_epi8` with precomputed masks interleaves R/G/B bytes into packed
  RGB in 2 shuffles + 1 OR per 16 output bytes, replacing the SSE2 temp-buffer
  scatter (3 stores + 8-iteration scalar loop).
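
The tier selection itself can be sketched with the standard library's runtime feature-detection macros. The version below returns a tag rather than the function pointers the decoder is described as storing, and the `IdctTier` enum and `detect_idct_tier` function are hypothetical names used only here.

```rust
/// Illustrative tag for the IDCT implementation chosen at init time.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IdctTier {
    Scalar,
    Sse2,
    Sse41,
    Neon,
}

/// Pick the best available tier once; the result can be cached by the caller.
fn detect_idct_tier() -> IdctTier {
    #[cfg(target_arch = "x86_64")]
    {
        // Check the stronger tier first so SSE4.1 wins over SSE2.
        if std::arch::is_x86_feature_detected!("sse4.1") {
            return IdctTier::Sse41;
        }
        if std::arch::is_x86_feature_detected!("sse2") {
            return IdctTier::Sse2;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return IdctTier::Neon;
        }
    }
    IdctTier::Scalar
}
```

Running the detection once at init and storing the outcome keeps the per-block cost to an indirect call, with no feature checks inside the MCU loop.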

### Vectorised Type Conversion

The u8→T conversion step uses dedicated SIMD kernels (`convert.rs`) instead of
per-element `ImagePixel::from_u8()` calls. This is the critical optimisation for
f32 decode performance: f32 decode went from 4× slower than u8 decode to 1.17×
slower.

| Target | NEON Strategy                                   | SSE2 Strategy                            |
|--------|------------------------------------------------|------------------------------------------|
| **f32**| Load 16 bytes, `vmovl`→u16→u32, `vcvtq_f32_u32`, `vmulq_f32(1/255)` | Load 16 bytes, unpack→u32, `_mm_cvtepi32_ps`, `_mm_mul_ps(1/255)` |
| **u16**| Load 16 bytes, `vmovl_u8`→u16, `vmulq_u16(257)` | Load 16 bytes, unpack→u16, `_mm_mullo_epi16(257)` |
| **i16**| Same as u16 + `veorq_u16(0x8000)` XOR         | Same as u16 + `_mm_xor_si128(0x8000)`    |
| **i8** | `copy_from_slice` + bulk XOR 0x80 (auto-vectorised) | Same                                  |
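
A portable stand-in for the f32 row of the table is shown below. It processes 16 bytes per iteration to mirror the "load 16 bytes" shape of the SIMD kernels, but it is plain scalar Rust written for illustration; `u8_to_f32_row` is not a function from `convert.rs`.

```rust
/// Convert a row of u8 samples to f32 in the range 0.0..=1.0.
fn u8_to_f32_row(src: &[u8], dst: &mut [f32]) {
    assert_eq!(src.len(), dst.len());
    const SCALE: f32 = 1.0 / 255.0;

    let mut src_chunks = src.chunks_exact(16);
    let mut dst_chunks = dst.chunks_exact_mut(16);
    // Main loop: fixed-size chunks give the compiler a simple shape to vectorise.
    for (s, d) in (&mut src_chunks).zip(&mut dst_chunks) {
        for i in 0..16 {
            d[i] = s[i] as f32 * SCALE;
        }
    }
    // Tail: fewer than 16 samples left over.
    for (s, d) in src_chunks.remainder().iter().zip(dst_chunks.into_remainder()) {
        *d = *s as f32 * SCALE;
    }
}
```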

### NV12 Output Path

For NV12 output, the decoder skips YCbCr→RGB color conversion entirely:
- Y plane is copied directly from the IDCT output buffer
- Cb and Cr planes are interleaved pair-wise into the UV plane

This path is faster than RGB/RGBA because it avoids the fixed-point color
conversion entirely. It is intended for hardware video encoders and GPU
pipelines that consume NV12 natively. EXIF rotation is not supported for
NV12 output.
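
The pair-wise interleave of the chroma planes can be sketched in a few lines. NV12 stores Cb (U) first and Cr (V) second in each pair; `interleave_uv` is an illustrative helper, and the real path in `mcu.rs` also has to handle strides and subsampling.

```rust
/// Interleave separate Cb and Cr rows into an NV12-style UV row: U0 V0 U1 V1 ...
fn interleave_uv(cb: &[u8], cr: &[u8], uv: &mut [u8]) {
    assert_eq!(cb.len(), cr.len());
    assert_eq!(uv.len(), cb.len() * 2);
    for ((pair, &u), &v) in uv.chunks_exact_mut(2).zip(cb).zip(cr) {
        pair[0] = u;
        pair[1] = v;
    }
}
```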

### JPEG Decoder Architecture

```
JpegDecoderState
├── McuScratch (reusable across frames)
│   ├── component_bufs: Vec<Vec<u8>>   — per-component IDCT output
│   ├── cb_row / cr_row: Vec<u8>       — upsampled chroma rows
│   └── output_row: Vec<u8>            — color-converted output row
└── exif_scratch: Vec<u8>              — EXIF rotation workspace
```

The MCU loop processes one MCU row at a time:
1. Decode all blocks (Y, Cb, Cr) into `component_bufs`
2. For each pixel row in the MCU row:
   - Upsample chroma into `cb_row`/`cr_row`
   - Color-convert Y+Cb+Cr → `output_row`
   - Copy `output_row` → tensor at strided offset

### Chroma Subsampling Support

| Sampling | Description     | H/V Ratios | Upsample Path         |
|----------|-----------------|------------|------------------------|
| 4:4:4    | No subsampling  | 1:1 / 1:1  | Direct (no upsample)  |
| 4:2:2    | Horizontal 2×   | 2:1 / 1:1  | `upsample_h2()`       |
| 4:2:0    | Horizontal + Vertical 2× | 2:1 / 2:1 | `upsample_h2()` + row duplication |
| Greyscale| Single component| N/A        | `grey_copy()`          |
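
A scalar sketch of a horizontal 2× upsample with a 3:1 blend is given below. Each output sample weights the nearer input sample by 3 and the farther neighbour by 1. The rounding constants and edge replication follow the triangle filter popularised by libjpeg and are an assumption here; the function body is not copied from `upsample/scalar.rs`.

```rust
/// Expand `input` to twice its length using a 3:1 weighted blend of neighbours.
fn upsample_h2(input: &[u8], output: &mut [u8]) {
    assert_eq!(output.len(), input.len() * 2);
    let n = input.len();
    for i in 0..n {
        let cur = input[i] as u16;
        // Replicate the edge sample when there is no neighbour.
        let left = input[i.saturating_sub(1)] as u16;
        let right = input[(i + 1).min(n - 1)] as u16;
        // Alternating rounding terms (1 and 2) avoid a systematic bias.
        output[2 * i] = ((3 * cur + left + 1) >> 2) as u8;
        output[2 * i + 1] = ((3 * cur + right + 2) >> 2) as u8;
    }
}
```

For 4:2:0 sources, the table indicates that the same horizontal pass is combined with row duplication to cover the vertical direction.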

### PNG Decode Flow

1. Parse PNG headers via `zune-png` → get dimensions, colorspace, bit depth
2. Validate tensor capacity ≥ decoded dimensions
3. Choose decode strategy based on target type and source bit depth:
   - **u8/i8 targets**: Use `decode_into(&mut [u8])` — fast u8 path with
     optional XOR for i8
   - **u16/i16/f32 targets**: Use `decode()` → `DecodingResult` which
     preserves native 16-bit data from 16-bit PNGs
4. Convert pixel format if needed (e.g., RGBA→RGB, RGB→Grey)
5. Row-copy from decoded data → tensor buffer at stride offsets with pixel
   type conversion via `from_u8()` or `from_u16()` depending on source depth
6. Return `ImageInfo` with decoded dimensions

### Format Auto-Detection

The decoder inspects magic bytes:
- `FF D8 FF` → JPEG
- `89 50 4E 47` → PNG
- Otherwise → `CodecError::InvalidData`
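
The magic-byte check amounts to two prefix comparisons. In the sketch below, `ImageFormat` and `detect_format` are hypothetical names, and `None` stands in for the `CodecError::InvalidData` case.

```rust
/// Formats recognised by the magic-byte check.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ImageFormat {
    Jpeg,
    Png,
}

/// Inspect the leading bytes and report the container format, if recognised.
fn detect_format(bytes: &[u8]) -> Option<ImageFormat> {
    if bytes.starts_with(&[0xFF, 0xD8, 0xFF]) {
        Some(ImageFormat::Jpeg)
    } else if bytes.starts_with(&[0x89, 0x50, 0x4E, 0x47]) {
        Some(ImageFormat::Png)
    } else {
        None
    }
}
```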

## Tracing Spans

`ImageDecoder::decode_into` (and the trait-method `Tensor::load_image`)
emits a [`tracing::trace_span!`] tree describing each phase of the JPEG/PNG
decode. Spans are captured by
[`edgefirst_hal::trace::start_tracing`](https://github.com/EdgeFirstAI/hal/blob/main/crates/hal/src/trace.rs)
into Chrome JSON for Perfetto. The cost when no subscriber is active is a
single relaxed atomic load per call site.

### Naming convention

Span names follow `<crate>.<function>[.<operation>[.<sub-operation>]]`:

- **`<crate>.<function>`** — top-level span: the public function the user
  invoked. The codec exposes format-specific entry points (`codec.decode_jpeg`,
  `codec.decode_png`) selected automatically from the magic bytes.
- **`<crate>.<function>.<operation>`** — meaningful internal work
  (`codec.decode_jpeg.parse_markers`, `codec.decode_jpeg.mcu_loop`, etc.).
- A span is worth adding when the work inside it is meaningful for
  optimisation and has enough complexity to justify the overhead — roughly
  500 µs on Cortex-A53 as a guideline.

### Span tree

```text
codec.decode_jpeg                                       [user-facing fn]
│ fields: dtype = "u8" | "i8" | "u16" | "i16" | "f32", n_bytes
│
├── codec.decode_jpeg.parse_markers                     ← parse SOF0/DQT/DHT/DRI/SOS/APP1, read EXIF
├── codec.decode_jpeg.mcu_loop                          ← Huffman + IDCT + upsample + colour-convert
├── codec.decode_jpeg.apply_exif                        ← EXIF orientation transform (rotation/flip)
│   fields: rotation_deg, flip_h
└── codec.decode_jpeg.type_convert                      ← u8 → T conversion (skipped for u8 target)
    field: dtype

codec.decode_png                                        [user-facing fn]
│ fields: dtype, n_bytes
│
└── codec.decode_png.zune_decode                        ← delegate to zune-png
    field: path = "u8" | "native_u16"
```

### What each span measures (mapped to the JPEG / PNG decode pipeline)

| Span                              | What is happening inside | Reference equivalent |
|-----------------------------------|--------------------------|----------------------|
| `codec.decode_jpeg`               | Full JPEG decode: marker parsing, MCU decode loop, optional EXIF rotation, optional type conversion to non-u8 targets. | Baseline JPEG decode per ITU-T T.81 + EXIF 2.32. |
| `codec.decode_jpeg.parse_markers` | Walk the JPEG byte stream once: parse SOF0 (start-of-frame), DQT (quantisation tables), DHT (Huffman tables), DRI (restart interval), SOS (start-of-scan), and APP1 (EXIF) segments. Builds Huffman LUTs and quant tables, extracts EXIF bytes. | Equivalent to libjpeg's `jpeg_read_header` + DHT/DQT table builds. |
| `codec.decode_jpeg.mcu_loop`      | The core decode loop: for each MCU row, Huffman-decode + dequant-fuse blocks → two-pass Loeffler IDCT (scalar / NEON / SSE4.1 / SSE2 selected at init) → bilinear chroma upsample (4:2:0 / 4:2:2 → 4:4:4) → BT.601 YCbCr → RGB/RGBA/BGRA/Grey/NV12 colour conversion → strided write into the tensor. Allocation-free after warmup. | Equivalent to libjpeg's `jpeg_read_scanlines` loop, but with the IDCT / chroma-upsample / colour-convert kernels handwritten and SIMD-dispatched per-CPU. |
| `codec.decode_jpeg.apply_exif`    | In-place rotation / horizontal flip on the decoded scratch buffer using `kamadak-exif`'s orientation tag. Skipped when `apply_exif=false` or for NV12 output (which doesn't support rotation). | Equivalent to libjpeg-turbo's `jpegtran -copy all -rotate ...` orientation transform. |
| `codec.decode_jpeg.type_convert`  | Vectorised u8 → target-type conversion at the row level: `*1/255` for f32, `*257` for u16, `*257 ^ 0x8000` for i16, `^ 0x80` for i8. NEON or SSE2 dispatch per row. Skipped for u8 targets (those write directly into the tensor during the MCU loop). | No libjpeg equivalent — this is the ML-quantisation conversion layer specific to EdgeFirst's tensor types. |
| `codec.decode_png`                | Full PNG decode: header parse, native or 8-bit `zune-png` decode, optional EXIF rotation, format conversion to the requested `PixelFormat`. | PNG decode per ISO/IEC 15948 (RFC 2083). |
| `codec.decode_png.zune_decode`    | The bulk of PNG cost: zlib inflate + PNG filter reversal inside [`zune-png`](https://docs.rs/zune-png). `path = "u8"` is the strided-output fast path; `path = "native_u16"` preserves 16-bit-per-channel PNGs and is used for u16/i16/f32 tensor targets. | Equivalent to libpng's `png_read_image`. |

[`tracing::trace_span!`]: https://docs.rs/tracing/latest/tracing/macro.trace_span.html

## Supported Pixel Formats

| Output Format | JPEG | PNG  | Notes                           |
|---------------|------|------|---------------------------------|
| RGB           | ✓    | ✓    | Native JPEG output              |
| RGBA          | ✓    | ✓    | Alpha = 255 for JPEG            |
| Grey          | ✓    | ✓    | Luminance only                  |
| BGRA          | ✓    | ✓    | B/R channel swap from RGB/RGBA  |
| NV12          | ✓    | —    | Y plane + interleaved UV (4:2:0)|

## Supported Source Features

The codec implements a **strict subset** of the JPEG and PNG specifications.
Inputs that fall outside the subset surface a typed
`CodecError::Unsupported(UnsupportedFeature)`. See the per-feature table in
[`README.md`](README.md#decoder-limitations) for the full matrix and the
typed error variant that each rejected case carries.

The codec does **not** transparently fall back to another decoder for
unsupported inputs and does **not** attempt to transcode them. The
contract is "accept this strict subset; reject everything else with a
precise typed error."

## Data Type Support

| Type  | JPEG               | PNG (8-bit source)   | PNG (16-bit source) |
|-------|--------------------|----------------------|---------------------|
| `u8`  | Direct copy        | Direct copy          | `>> 8`              |
| `u16` | `* 257` scaling    | `* 257` scaling      | Direct copy         |
| `i8`  | XOR 0x80           | XOR 0x80             | `(>> 8) XOR 0x80`  |
| `i16` | `* 257` then XOR   | `* 257` then XOR     | XOR 0x8000          |
| `f32` | `/ 255.0`          | `/ 255.0`            | `/ 65535.0`         |

### XOR Trick for Signed Types

Signed integer decoding flips the top bit to map unsigned pixel data into
the signed range, which is the standard approach for ML quantisation:

- **i8**: `(u8_value ^ 0x80) as i8` — maps `0→-128`, `128→0`, `255→127`
- **i16**: `(u16_value ^ 0x8000) as i16` — maps `0→-32768`, `32768→0`, `65535→32767`
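
Both mappings are one-liners in Rust. Flipping the top bit is equivalent to subtracting 128 (or 32768) from the unsigned value, which the sketch below makes explicit; the function names are chosen for this example.

```rust
/// Map an unsigned 8-bit sample to the signed range by flipping the top bit.
fn u8_to_i8(v: u8) -> i8 {
    (v ^ 0x80) as i8
}

/// Map an unsigned 16-bit sample to the signed range by flipping the top bit.
fn u16_to_i16(v: u16) -> i16 {
    (v ^ 0x8000) as i16
}
```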

### u16 Scaling from u8

When JPEG (8-bit) data is decoded into `u16`, each byte is scaled to the full
16-bit range: `u8_value as u16 * 257`. This maps `0→0`, `128→32896`, `255→65535`
exactly (257 = 0x0101).
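
Because 257 is `0x0101`, the multiplication simply replicates the byte into both halves of the 16-bit word. The short sketch below shows the scaling; `u8_to_u16` is an illustrative name.

```rust
/// Scale an 8-bit sample to the full 16-bit range.
/// Multiplying by 257 (0x0101) copies the byte into the high and low halves.
fn u8_to_u16(v: u8) -> u16 {
    v as u16 * 257
}
```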

## Scratch Buffer Strategy

### JPEG (`JpegDecoderState`)

The custom JPEG decoder uses `JpegDecoderState` which persists across frames.
The internal `McuScratch` buffers grow to the high-water mark and are reused.
After the first decode at a given resolution, subsequent JPEG decodes perform
**zero heap allocations** in the entire decode path.

**Allocation-free after warmup:**
- `McuScratch` component buffers, chroma rows, output row
- Huffman table lookups (tables are rebuilt from marker data each frame using
  pre-allocated `Vec` storage)
- IDCT workspace (stack-allocated `[i32; 64]`)
- Bitstream reader (borrows input `&[u8]`)
- Row-copy and stride padding logic
- Pixel type conversion (u8→u16, u8→i8 XOR, u8→f32)

**EXIF rotation** (`exif_scratch`) uses a reusable `Vec<u8>` that grows to
the high-water mark. `kamadak-exif::Reader::read_raw()` allocates on each
call — disable with `DecodeOptions::with_exif(false)` in the hot loop if
the application handles orientation separately.
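
The "grow to the high-water mark" behaviour can be illustrated with a small wrapper around `Vec<u8>`. `Scratch` and `get` are hypothetical names; the sketch shows the general pattern under the assumption that a buffer is only ever enlarged, and it is not the definition of `McuScratch`.

```rust
/// Reusable byte buffer that only ever grows.
struct Scratch {
    buf: Vec<u8>,
}

impl Scratch {
    fn new() -> Self {
        Self { buf: Vec::new() }
    }

    /// Return a slice of exactly `len` bytes.
    /// The backing storage is resized only when `len` exceeds the largest
    /// size requested so far, so repeated calls at a steady size do not allocate.
    fn get(&mut self, len: usize) -> &mut [u8] {
        if self.buf.len() < len {
            self.buf.resize(len, 0);
        }
        &mut self.buf[..len]
    }
}
```

After one call at the largest frame size, later calls at that size or smaller reuse the same allocation, which is the property the sections above rely on.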

### PNG (`zune-png`)

PNG decoding uses `zune-png` which allocates internal decoder state on each
call. The edgefirst-codec PNG layer reuses `ImageDecoder.input_buffer` for
`Read`-based input but the zune-png library itself allocates per-frame.

### Allocation Sources by Layer

| Layer                    | After Warmup     | Notes                        |
|--------------------------|------------------|------------------------------|
| JPEG `McuScratch`        | No allocations   | Grows to high-water mark     |
| JPEG Huffman/quant tables| No allocations   | Rebuilt from marker data     |
| JPEG IDCT workspace      | No allocations   | Stack-allocated `[i32; 64]`  |
| Row-copy / stride        | No allocations   | Operates on pre-allocated buffers |
| Pixel conversion         | No allocations   | In-place or element-wise     |
| EXIF reader              | 1 `Vec` / call   | `to_vec()` on EXIF data; skip with `apply_exif(false)` |
| zune-png `decode()`      | 1 `Vec` / call   | Returns owned `Vec<u16/u8>`  |
| zune-png `decode_into()` | ~3 `brk` / call  | Internal filter state        |