edgefirst-decoder 0.23.2

ML model output decoding for YOLO and ModelPack object detection and segmentation
# edgefirst-decoder Architecture

## Overview

`edgefirst-decoder` is the post-processing layer that turns raw inference
outputs from object-detection and segmentation models into typed
`DetectBox` and `Segmentation` values. It supports both floating-point and
quantized (int8 / uint8) inputs without an intermediate dequantization
buffer, configurable NMS modes, end-to-end YOLO26 models with embedded NMS,
and a fused proto-mask GPU rendering path that bypasses CPU mask
materialization. The crate is configured through a single
[`DecoderBuilder`](https://docs.rs/edgefirst-decoder/latest/edgefirst_decoder/struct.DecoderBuilder.html)
that ingests JSON or YAML model metadata and selects the right code path
based on the output tensor layout.

## Module Map

| Module | Source | Responsibility |
|--------|--------|----------------|
| [`lib.rs`](https://github.com/EdgeFirstAI/hal/blob/main/crates/decoder/src/lib.rs) | local | Public surface: `Decoder`, `DetectBox`, `Segmentation`, `Quantization`, `BoundingBox`, dequantize utilities |
| [`decoder/`](https://github.com/EdgeFirstAI/hal/blob/main/crates/decoder/src/decoder/) | local | `DecoderBuilder`, `ConfigOutputs`, model-type selection, per-scale bridge, post-processing |
| [`yolo.rs`](https://github.com/EdgeFirstAI/hal/blob/main/crates/decoder/src/yolo.rs) | local | YOLOv5/8/11 detection + segmentation kernels |
| [`modelpack.rs`](https://github.com/EdgeFirstAI/hal/blob/main/crates/decoder/src/modelpack.rs) | local | Au-Zone ModelPack format kernels |
| [`per_scale/`](https://github.com/EdgeFirstAI/hal/blob/main/crates/decoder/src/per_scale/) | local | Per-scale split-tensor decoder framework (NEON-optimized hot path) |
| [`schema.rs`](https://github.com/EdgeFirstAI/hal/blob/main/crates/decoder/src/schema.rs) | local | `SchemaV2` parser: the model metadata document used by EdgeFirst Studio |
| [`float.rs`](https://github.com/EdgeFirstAI/hal/blob/main/crates/decoder/src/float.rs) / [`byte.rs`](https://github.com/EdgeFirstAI/hal/blob/main/crates/decoder/src/byte.rs) | local | NMS implementations (float and byte-quantized) |
| [`error.rs`](https://github.com/EdgeFirstAI/hal/blob/main/crates/decoder/src/error.rs) | local | `DecoderError`, `DecoderResult` |

## Key Types and Traits

- [`Decoder`](https://docs.rs/edgefirst-decoder/latest/edgefirst_decoder/struct.Decoder.html): built once, then called per inference frame.
- [`DecoderBuilder`](https://docs.rs/edgefirst-decoder/latest/edgefirst_decoder/struct.DecoderBuilder.html): fluent configuration with sensible defaults; consumes JSON/YAML or programmatic `ConfigOutputs`.
- [`DetectBox`](https://docs.rs/edgefirst-decoder/latest/edgefirst_decoder/struct.DetectBox.html): output bounding box + score + class label.
- [`Segmentation`](https://docs.rs/edgefirst-decoder/latest/edgefirst_decoder/struct.Segmentation.html): per-detection mask matrix.
- [`Quantization`](https://docs.rs/edgefirst-decoder/latest/edgefirst_decoder/struct.Quantization.html): `(scale, zero_point)` for int8/uint8 outputs.
- [`Nms`](https://docs.rs/edgefirst-decoder/latest/edgefirst_decoder/configs/enum.Nms.html): `Auto` / `ClassAgnostic` / `ClassAware`. Bypass is expressed as `Option<Nms>::None` on the decoder configuration, not as a variant of the enum.
- [`SchemaV2`](https://docs.rs/edgefirst-decoder/latest/edgefirst_decoder/schema/struct.SchemaV2.html): model metadata document (current schema version).

## Internal Architecture

### Builder → Decoder

```mermaid
flowchart LR
    Builder[DecoderBuilder]
    Decoder[Decoder]
    Det[decode<br/>→ bboxes, scores, classes]
    Seg[decode_segmentation<br/>→ bboxes, scores, classes, masks]
    Proto[decode_quantized_proto<br/>→ raw protos + coefficients]

    Builder --> Decoder
    Decoder --> Det
    Decoder --> Seg
    Decoder --> Proto

    style Builder fill:#e1f5ff
    style Decoder fill:#fff4e1
    style Det fill:#e8f5e9
    style Seg fill:#e8f5e9
    style Proto fill:#87ceeb
```

### Detection pipeline

```mermaid
flowchart TD
    Raw[Model raw output<br/>quantized or float]
    E2E{End-to-end?<br/>decoder_version = yolo26}

    Quant{Quantized?}
    Dequant[Dequantization<br/>scale, zero_point]
    Parse[Parse boxes &amp; scores<br/>XYWH → XYXY]
    NMS[Non-Maximum Suppression<br/>IoU threshold filtering]
    Filter[Filter by score threshold]

    E2EParse[Parse post-NMS output<br/>XYXY + conf + class directly]
    E2EFilter[Filter by score threshold]

    Det[Detection boxes]
    Seg[Segmentation masks]

    Raw --> E2E
    E2E -->|Yes| E2EParse
    E2EParse --> E2EFilter
    E2EFilter --> Det
    E2EFilter --> Seg

    E2E -->|No| Quant
    Quant -->|Yes| Dequant
    Quant -->|No| Parse
    Dequant --> Parse
    Parse --> NMS
    NMS --> Filter
    Filter --> Det
    Filter --> Seg

    style E2E fill:#fff4e1
    style Dequant fill:#fff4e1
    style NMS fill:#ffeb9c
    style E2EParse fill:#87ceeb
    style Det fill:#90ee90
    style Seg fill:#90ee90
```
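
The non-end-to-end branch of the diagram can be illustrated with a minimal, std-only sketch. It is not the crate's implementation: `BoxXyxy`, `xywh_to_xyxy`, `iou`, and `nms` are hypothetical names introduced here, the suppression is a plain greedy class-agnostic loop, and the score threshold is applied before suppression (as the `score_filter` stage in the span tree suggests) rather than after it.

```rust
/// Axis-aligned box in corner (XYXY) form. Hypothetical type for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BoxXyxy {
    pub x1: f32,
    pub y1: f32,
    pub x2: f32,
    pub y2: f32,
}

/// Convert a centre/size (XYWH) box to corner (XYXY) form.
pub fn xywh_to_xyxy(cx: f32, cy: f32, w: f32, h: f32) -> BoxXyxy {
    BoxXyxy {
        x1: cx - w / 2.0,
        y1: cy - h / 2.0,
        x2: cx + w / 2.0,
        y2: cy + h / 2.0,
    }
}

/// Intersection-over-union of two boxes; 0.0 when the union is empty.
pub fn iou(a: &BoxXyxy, b: &BoxXyxy) -> f32 {
    let iw = (a.x2.min(b.x2) - a.x1.max(b.x1)).max(0.0);
    let ih = (a.y2.min(b.y2) - a.y1.max(b.y1)).max(0.0);
    let inter = iw * ih;
    let area_a = (a.x2 - a.x1) * (a.y2 - a.y1);
    let area_b = (b.x2 - b.x1) * (b.y2 - b.y1);
    let union = area_a + area_b - inter;
    if union > 0.0 { inter / union } else { 0.0 }
}

/// Greedy class-agnostic NMS. Returns the indices of kept boxes,
/// highest score first.
pub fn nms(boxes: &[BoxXyxy], scores: &[f32], score_thr: f32, iou_thr: f32) -> Vec<usize> {
    // Drop low-confidence candidates, then visit the rest by descending score.
    let mut order: Vec<usize> = (0..boxes.len())
        .filter(|&i| scores[i] >= score_thr)
        .collect();
    order.sort_by(|&a, &b| scores[b].total_cmp(&scores[a]));

    let mut keep: Vec<usize> = Vec::new();
    for i in order {
        // Keep a box only if it does not overlap any already-kept box too much.
        if keep.iter().all(|&k| iou(&boxes[i], &boxes[k]) <= iou_thr) {
            keep.push(i);
        }
    }
    keep
}
```

A class-aware variant would only compare boxes that share a class label; the structure of the loop stays the same.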

### Model-type selection

The builder classifies a model's output topology by **shape alone**, not by
output count. The result is one of:

| Variant | Tensors | Format |
|---------|---------|--------|
| `YoloDet` | 1 (detection) | Standard YOLO detection |
| `YoloSegDet` | 2 (detection + protos) | YOLO detection + segmentation |
| `YoloSegDet2Way` | 3 (detection + mask_coefs + protos) | INT8 segmentation with separate mask-coef quant scale |
| `YoloSplitDet` | 2 (boxes + scores) | Split-output detection |
| `YoloSplitSegDet` | 4 (boxes + scores + mask_coefs + protos) | Split-output segmentation |
| `YoloEndToEndDet` | 1 | Post-NMS `[B, N, 6+]` |
| `YoloEndToEndSegDet` | 2 | Post-NMS + protos |
| `YoloSplitEndToEndDet` | 3 | Split post-NMS (boxes + scores + classes) |
| `YoloSplitEndToEndSegDet` | 5 | Split post-NMS + mask_coefs + protos |
| `ModelPackDet` | 2 (boxes + scores) | ModelPack detection |
| `ModelPackSegDet` | 3 (boxes + scores + segmentation) | ModelPack segmentation |
| `ModelPackDetSplit` | N (detection layers) | ModelPack split detection |
| `ModelPackSegDetSplit` | N+1 (detection layers + segmentation) | ModelPack split segmentation |
| `ModelPackSeg` | 1 (segmentation) | ModelPack semantic segmentation |

#### `YoloSegDet2Way` shape-based classification

INT8 TFLite segmentation models split mask coefficients off the combined
detection tensor because the unbounded linear projection that produces them
needs a separate quantization scale from the (bounded, similarly-ranged)
boxes and class scores. The builder identifies the three outputs by shape:

| Output | Shape | Identification rule |
|--------|-------|---------------------|
| `detection` | `[1, nc+4, N]` | 3D, feature_dim ≠ 32 |
| `mask_coefs` | `[1, 32, N]` | 3D, feature_dim == 32 |
| `protos` | `[1, H/4, W/4, 32]` | 4D |

The feature dimension is the smaller of the two non-batch dimensions
(channels-first: `shape[1]`; channels-last: `shape[2]`).

**Edge case (nc=28):** when `num_classes + 4 == 32`, detection and
`mask_coefs` collide on feature dim. In this case the model emits 2 outputs
(unsplit detection + protos) and is decoded as `YoloSegDet`.
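
The identification rules in the table can be sketched as a small shape classifier. This is an illustration under stated assumptions, not the builder's code: `Role`, `NUM_PROTOS`, and `classify` are hypothetical names, and the proto count is fixed at 32 as in the table.

```rust
/// Role assigned to one output of a three-tensor segmentation model.
/// Hypothetical enum for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum Role {
    Detection,
    MaskCoefs,
    Protos,
}

/// Number of mask prototypes assumed by this sketch (32 in the table above).
pub const NUM_PROTOS: usize = 32;

/// Classify an output purely from its shape.
/// 4D tensors are protos; 3D tensors are split on the feature dimension,
/// taken as the smaller of the two non-batch dimensions.
pub fn classify(shape: &[usize]) -> Option<Role> {
    match shape {
        [_, _, _, _] => Some(Role::Protos),
        [_, a, b] => {
            let feature_dim = (*a).min(*b);
            if feature_dim == NUM_PROTOS {
                Some(Role::MaskCoefs)
            } else {
                Some(Role::Detection)
            }
        }
        _ => None, // ranks other than 3 or 4 are not handled here
    }
}
```

For an 80-class model at 640×640, `classify(&[1, 84, 8400])` yields `Role::Detection` and `classify(&[1, 32, 8400])` yields `Role::MaskCoefs`. With 28 classes the detection tensor is also `[1, 32, 8400]`, so a shape-only rule cannot tell the two apart, which is the collision described above.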

#### YOLO26 end-to-end selection

`decoder_version: "yolo26"` triggers `DecoderVersion::is_end_to_end() == true`
and selects one of the `YoloEndToEnd*` variants. NMS is bypassed entirely;
the model emits post-NMS `[B, N, 6+]` rows of `(x1, y1, x2, y2, conf,
class, ...)`.

For non-end-to-end YOLO26 exports (`end2end=false`), use
`decoder_version: "yolov8"` with an explicit `nms` configuration.
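
As a rough illustration of why the end-to-end path needs no NMS, the sketch below parses a flattened post-NMS tensor of `(x1, y1, x2, y2, conf, class, ...)` rows and applies only a score threshold. `Detection` and `parse_end_to_end` are hypothetical names for this example, and the layout (row-major, one row per candidate, extra columns ignored) is an assumption.

```rust
/// One decoded detection. Hypothetical type for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Detection {
    pub bbox: [f32; 4], // x1, y1, x2, y2
    pub score: f32,
    pub class: usize,
}

/// Parse a flattened `[N, row_len]` post-NMS tensor.
/// Columns beyond the first six (e.g. mask coefficients) are ignored here.
pub fn parse_end_to_end(rows: &[f32], row_len: usize, score_thr: f32) -> Vec<Detection> {
    assert!(row_len >= 6, "rows need at least x1, y1, x2, y2, conf, class");
    rows.chunks_exact(row_len)
        // The model already ran NMS, so only the score threshold remains.
        .filter(|r| r[4] >= score_thr)
        .map(|r| Detection {
            bbox: [r[0], r[1], r[2], r[3]],
            score: r[4],
            class: r[5] as usize,
        })
        .collect()
}
```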

### Output tensor physical-order contract

Every output declared to the decoder — whether programmatically via
`hal_decoder_params_add_output` / `DecoderBuilder`, or through YAML/JSON
config — must list `shape` and `dshape` fields in **physical memory order,
outermost axis first, innermost axis last**. The decoder derives C-contiguous
strides from `shape` and wraps the raw buffer bytes with those strides; it
never reorders bytes. When `dshape` is also supplied, HAL uses it to permute
the stride tuple into the decoder's canonical logical order for the role
(e.g. `[batch, height, width, num_protos]` for protos); only stride indices
change, not bytes.

When `dshape` is omitted, HAL assumes `shape` is already in the canonical
order for the role. This is appropriate for producers like Ultralytics
ONNX/TFLite flat-detection.
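
The stride derivation and axis permutation described above can be sketched with plain slices. This is a minimal illustration, not HAL's code: `c_strides` and `permute` are hypothetical helpers, strides are counted in elements, and axis names are plain strings.

```rust
/// C-contiguous (row-major) strides, in elements, for a shape listed
/// outermost axis first.
pub fn c_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    // Walk from the second-innermost axis outwards.
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

/// Reorder shape and strides into a canonical axis order without moving data.
/// `physical` names each axis of `shape`; `canonical` is the order wanted.
/// Returns `None` if the ranks differ or a canonical name is missing.
pub fn permute(
    shape: &[usize],
    physical: &[&str],
    canonical: &[&str],
) -> Option<(Vec<usize>, Vec<usize>)> {
    if physical.len() != shape.len() || canonical.len() != shape.len() {
        return None;
    }
    let strides = c_strides(shape);
    let mut out_shape = Vec::with_capacity(shape.len());
    let mut out_strides = Vec::with_capacity(shape.len());
    for name in canonical {
        let i = physical.iter().position(|p| p == name)?;
        out_shape.push(shape[i]);
        out_strides.push(strides[i]);
    }
    Some((out_shape, out_strides))
}
```

For an NCHW proto tensor `[1, 32, 160, 160]`, the physical strides are `[819200, 25600, 160, 1]`. Viewing it as `[batch, height, width, num_protos]` gives shape `[1, 160, 160, 32]` with strides `[819200, 160, 1, 25600]`: only the index arithmetic changes, the buffer is untouched.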

`DecoderBuilder::build` validates each output at construction time:

- `dshape.len()` must equal `shape.len()` when `dshape` is present.
- Each `dshape[i].size` must equal `shape[i]` — catches the common mistake
  of declaring `dshape` in a different order than `shape`.
- No axis name may appear twice within a single output's `dshape`.
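
The three checks above might look roughly like the following. This is a sketch under assumptions: `DshapeError` and `validate_dshape` are hypothetical names, and `dshape` is modelled as `(axis_name, size)` pairs rather than the crate's actual configuration types.

```rust
/// Validation failures mirrored from the rules above. Hypothetical enum.
#[derive(Debug, PartialEq, Eq)]
pub enum DshapeError {
    RankMismatch,
    SizeMismatch { axis: usize },
    DuplicateAxis(String),
}

/// Check a named `dshape` against the physical `shape`.
pub fn validate_dshape(shape: &[usize], dshape: &[(&str, usize)]) -> Result<(), DshapeError> {
    // Rule 1: same rank.
    if dshape.len() != shape.len() {
        return Err(DshapeError::RankMismatch);
    }
    for (i, (name, size)) in dshape.iter().enumerate() {
        // Rule 2: each named axis must match the physical extent at the same index.
        if *size != shape[i] {
            return Err(DshapeError::SizeMismatch { axis: i });
        }
        // Rule 3: no axis name may repeat.
        if dshape[..i].iter().any(|(n, _)| n == name) {
            return Err(DshapeError::DuplicateAxis(name.to_string()));
        }
    }
    Ok(())
}
```

Note that size checks of this kind cannot distinguish two axes with equal extents (for example a square `height` and `width`), so they complement the producer's responsibility for declaring the right order rather than replacing it.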

Mis-declaring physical order causes every element access to index the wrong
byte. This was the root cause of two production bugs: a vertical-stripe mask
artifact on i.MX 8M Plus TFLite segmentation, and a coordinate mis-decode
with Ara-2 anchor-first split-tensor boxes. HAL cannot detect the mismatch
at runtime — it has no visibility into how the inference engine laid out
bytes. **Producer ownership:** the entity emitting the model metadata is
responsible for matching the physical layout it produced.

Common physical layouts by framework:

| Framework | Typical proto layout | Declaration |
|-----------|---------------------|-------------|
| TFLite (NNStreamer) | `[1, H, W, C]` NHWC | `shape=[1,H,W,C]`, `dshape=[batch,height,width,num_protos]` |
| ONNX / PyTorch | `[1, C, H, W]` NCHW | `shape=[1,C,H,W]`, `dshape=[batch,num_protos,height,width]` |
| Ara-2 DVM | `[1, N, 1, 4]` anchor-first | `shape=[1,N,1,4]`, `dshape=[batch,num_boxes,padding,box_coords]` |
| Ultralytics flat | already canonical | `shape=[1,C,N]`, `dshape` may be omitted |

### Quantization on output tensors

The per-scale decode path reads quantization parameters **from the
output `Tensor`** (`tensor.quantization()`), not from the schema
metadata. The integrator is responsible for propagating each output
tensor's `(scale, zero_point)` onto the HAL `Tensor` before calling
`decode_*` / `decode_*_proto`:

```rust
if dtype != DType::F32 {
    let q = edgefirst_decoder::Quantization::from((scale, zero_point));
    tensor.set_quantization(q)?;
}
```

Failure modes when this step is skipped on quantized outputs:

- **Per-scale path:** returns `DecoderError::QuantMissing` for split
  tensors that have no per-scale quantization in the schema, or
  silently identity-dequantizes (scale = 1.0, zero = 0) for the flat
  variants — scores stay in raw int8 / uint8 units and never cross
  thresholds, producing zero detections.
- **Flat-detection path:** reads quantization from the schema if
  present and falls back to identity otherwise; same silent-zero
  failure mode for models whose quantization is only carried on the
  runtime tensor.

A producer that allocates the HAL output tensors should attach
quantization at allocation time. A producer that receives output
tensors from another layer (e.g. a TFLite delegate's output binding)
must read the runtime's quantization metadata and call
`set_quantization` before the first decode of that tensor.
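
To make the role of `(scale, zero_point)` concrete, here is a std-only sketch of the usual affine convention `real = scale * (q - zero_point)` and of comparing scores in the integer domain. `dequantize` and `quantize_threshold` are hypothetical helpers written for this example; whether HAL's kernels use exactly this arithmetic is an assumption.

```rust
/// Affine dequantization of one int8 value.
pub fn dequantize(q: i8, scale: f32, zero_point: i32) -> f32 {
    (q as i32 - zero_point) as f32 * scale
}

/// Map a real-valued threshold into the quantized domain so scores can be
/// compared as integers without dequantizing every element.
/// Assumes `scale > 0`.
pub fn quantize_threshold(thr: f32, scale: f32, zero_point: i32) -> i32 {
    // q >= ceil(thr / scale) + zero_point  <=>  dequantize(q) >= thr
    (thr / scale).ceil() as i32 + zero_point
}
```

With `scale = 1/256` and `zero_point = -128`, a raw score of `100` dequantizes to `0.890625`, and a threshold of `0.25` maps to the integer `-64`. With the identity parameters `(1.0, 0)` the same raw value is read as `100.0`, which shows how far the decoded numbers drift when the real parameters never reach the tensor.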

### Per-scale split-tensor framework

The [`per_scale`](https://github.com/EdgeFirstAI/hal/blob/main/crates/decoder/src/per_scale/)
module is the high-performance hot path used for split-tensor models
(Ara-2 DVM, certain TFLite exports). Detection heads come pre-split into
per-scale tensors (e.g. 80×80, 40×40, 20×20). The pipeline:

1. **Plan** ([`per_scale/plan.rs`](https://github.com/EdgeFirstAI/hal/blob/main/crates/decoder/src/per_scale/plan.rs)): walks the schema once, validating shapes and producing a stride-typed `Plan` that captures every per-scale tensor's role.
2. **Pipeline** ([`per_scale/pipeline.rs`](https://github.com/EdgeFirstAI/hal/blob/main/crates/decoder/src/per_scale/pipeline.rs)): drives the per-frame decode using the `Plan`. Hot inner loops are NEON-vectorized on aarch64 (see `kernels/`).
3. **Helper** ([`per_scale/helper.rs`](https://github.com/EdgeFirstAI/hal/blob/main/crates/decoder/src/per_scale/helper.rs)): small math primitives (sigmoid, dequant) shared between the planner and the pipeline.

The per-scale path achieved 17–45× mask-materialize speedup on i.MX 8M Plus
in the v0.18 → v0.20 cycle through batched GEMM, NEON FP16 fused-multiply,
and tile-transpose layout changes. See
[`BENCHMARKS.md`](https://github.com/EdgeFirstAI/hal/blob/main/BENCHMARKS.md)
for the empirical numbers and
[`CHANGELOG.md`](https://github.com/EdgeFirstAI/hal/blob/main/CHANGELOG.md)
for the release-by-release history.
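
For orientation, the per-scale geometry can be sketched in a few lines. The helpers below are hypothetical and follow the anchor-free convention of the Ultralytics reference (`dist2bbox` with cell centres at `+0.5`); they are scalar illustrations, not the NEON kernels, and the stride set `8, 16, 32` is an assumption matching the 80×80 / 40×40 / 20×20 example.

```rust
/// Grid size (h, w) of one scale for a given input size and stride.
pub fn level_grid(input_h: usize, input_w: usize, stride: usize) -> (usize, usize) {
    (input_h / stride, input_w / stride)
}

/// Total anchor count across all scales.
pub fn total_anchors(input_h: usize, input_w: usize, strides: &[usize]) -> usize {
    strides
        .iter()
        .map(|&s| {
            let (h, w) = level_grid(input_h, input_w, s);
            h * w
        })
        .sum()
}

/// Decode one LTRB distance tuple (in grid units) at cell `(gx, gy)` into an
/// XYXY box in input pixels.
pub fn ltrb_to_xyxy(gx: usize, gy: usize, ltrb: [f32; 4], stride: usize) -> [f32; 4] {
    // Anchor point sits at the cell centre.
    let (cx, cy) = (gx as f32 + 0.5, gy as f32 + 0.5);
    let s = stride as f32;
    [
        (cx - ltrb[0]) * s,
        (cy - ltrb[1]) * s,
        (cx + ltrb[2]) * s,
        (cy + ltrb[3]) * s,
    ]
}
```

For a 640×640 input, `total_anchors(640, 640, &[8, 16, 32])` is `6400 + 1600 + 400 = 8400`, so the finest scale carries most of the per-anchor work.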

### Mask rendering APIs

YOLO segmentation models produce **proto masks** (shared basis at reduced
resolution, typically 160×160) and **mask coefficients** (per-detection
linear combination weights):

```
mask_raw[i] = coefficients[i] @ protos       # (proto_h, proto_w)
```

The decoder exposes four mask APIs that pair with image-side rendering:

| Workflow | Decoder API (public) | Image-side render | Use case |
|----------|----------------------|-------------------|----------|
| Materialized masks | [`Decoder::decode()`](https://docs.rs/edgefirst-decoder/latest/edgefirst_decoder/struct.Decoder.html#method.decode) | `processor.draw_decoded_masks()` | When you need mask matrices on the CPU side |
| Proto data (preferred for GPU) | [`Decoder::decode_proto()`](https://docs.rs/edgefirst-decoder/latest/edgefirst_decoder/struct.Decoder.html#method.decode_proto) | `processor.draw_proto_masks()` | Fused proto→pixel GPU path; never materializes full-res masks on CPU |
| Tracked materialized | [`Decoder::decode_tracked()`](https://docs.rs/edgefirst-decoder/latest/edgefirst_decoder/struct.Decoder.html#method.decode_tracked) | `processor.draw_masks_tracked()` | Single-call decode + track + render |
| Tracked proto | [`Decoder::decode_proto_tracked()`](https://docs.rs/edgefirst-decoder/latest/edgefirst_decoder/struct.Decoder.html#method.decode_proto_tracked) | `processor.draw_proto_masks()` (with track-augmented detections) | Tracked GPU-fused path |

`decode()` and `decode_proto()` dispatch internally to crate-private
`decode_quantized` / `decode_float` / `decode_quantized_proto` /
`decode_float_proto` based on the model's output dtype; external
callers never invoke those helpers directly.

The proto-data path is the recommended GPU path. It avoids the CPU cost of
materializing full-resolution masks; the GPU evaluates `sigmoid(coeffs @
upsampled_protos)` per output pixel using a fragment shader. See
[`../image/ARCHITECTURE.md`](https://github.com/EdgeFirstAI/hal/blob/main/crates/image/ARCHITECTURE.md)
for the GPU-side fused algorithm.
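
The `sigmoid(coeffs @ protos)` relationship can be illustrated on the CPU with a scalar sketch. It assumes a channels-last proto layout flattened as `[h, w, k]`; `sigmoid` and `materialize_mask` are hypothetical helpers, and neither upsampling nor the crop to the detection box is shown.

```rust
/// Logistic function.
pub fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Compute one detection's mask at proto resolution.
/// `protos` is a flattened `[h, w, k]` array, `coeffs` holds the `k` weights
/// for this detection. Returns `h * w` probabilities in row-major order.
pub fn materialize_mask(protos: &[f32], coeffs: &[f32], h: usize, w: usize) -> Vec<f32> {
    let k = coeffs.len();
    assert!(k > 0, "need at least one prototype");
    assert_eq!(protos.len(), h * w * k, "protos must be h * w * k elements");
    protos
        .chunks_exact(k)
        // Dot product of this pixel's k proto values with the coefficients.
        .map(|px| sigmoid(px.iter().zip(coeffs).map(|(p, c)| p * c).sum::<f32>()))
        .collect()
}
```

At the typical 160×160×32 proto size this is 25,600 dot products of length 32 per detection, which is the per-detection CPU cost the proto-data path hands to the GPU instead.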

## Tracing Spans

Every public decode entry point emits a [`tracing::trace_span!`] tree. The
spans are recorded as Chrome JSON when [`edgefirst_hal::trace::start_tracing`](https://github.com/EdgeFirstAI/hal/blob/main/crates/hal/src/trace.rs)
is active and have near-zero overhead otherwise (a single relaxed atomic
load per call site).

### Naming convention

Span names follow `<crate>.<function>[.<operation>[.<sub-operation>]]`:

- **`<crate>.<function>`** — top-level span: the public function the user
  invoked (e.g. `decoder.decode`, `decoder.decode_proto`). For this crate
  those are the entry points on [`Decoder`](https://docs.rs/edgefirst-decoder/latest/edgefirst_decoder/struct.Decoder.html).
- **`<crate>.<function>.<operation>`** — meaningful internal work done as
  part of that function (e.g. `decoder.decode.yolo_quant_flat`,
  `decoder.decode.process_masks`).
- **`<crate>.<function>.<operation>.<sub-operation>`** — further
  decomposition where it aids optimisation (e.g.
  `decoder.decode.per_scale_to_masks.process_masks`).
- **`<crate>.<shared_op>[.<sub-operation>]`** — shared building blocks
  reached from *multiple* user-facing functions; the function prefix is
  omitted because a single source location can only carry one static name
  (e.g. `decoder.nms_get_boxes.suppress` and `decoder.per_scale_run.level`
  are both invoked from `decoder.decode` and `decoder.decode_proto`). The
  parent span in the recorded trace identifies the actual caller.

A span is worth adding when the work inside it is **meaningful for
optimisation and has enough complexity to justify the overhead** — roughly
500 µs on Cortex-A53 as a guideline.

### Span tree

```text
decoder.decode                                          [user-facing fn]
│ fields: path = "yolo_seg_det" | "yolo_split_seg_det" | "modelpack" |
│         n_outputs
│
├── decoder.decode.yolo_quant_flat                      [flat-tensor kernels]
├── decoder.decode.yolo_float_flat
├── decoder.decode.yolo_quant_split
├── decoder.decode.yolo_float_split
│
├── decoder.decode.process_masks                        [coeff @ protos → per-detect mask, materialised path]
│   fields: n, mode = "float" | "quant"
│
└── decoder.decode.per_scale_to_masks                   [per-scale → materialised masks bridge]
    ├── decoder.decode.per_scale_to_masks.get_boxes     ← wraps decoder.nms_get_boxes
    │   └── decoder.nms_get_boxes
    └── decoder.decode.per_scale_to_masks.process_masks ← wraps decoder.decode.process_masks
        └── decoder.decode.process_masks

decoder.decode_proto                                    [user-facing fn]
│ fields: path, n_outputs
│
├── decoder.decode_proto.extract_proto_data             [build ProtoData; no mask materialisation]
│   fields: n, num_protos, layout, mode = "float" | "quant"
│
└── decoder.decode_proto.per_scale_to_proto_data        [per-scale → ProtoData bridge]
    ├── decoder.decode_proto.per_scale_to_proto_data.get_boxes
    │   └── decoder.nms_get_boxes
    └── decoder.decode_proto.per_scale_to_proto_data.extract
        └── decoder.decode_proto.extract_proto_data

# Shared building blocks (reached from both decode and decode_proto;
# parent in the trace tells you which one)

decoder.per_scale_run                                   [the per-scale NEON hot path]
│ fields: n_levels, encoding, nc, nm
├── decoder.per_scale_run.resolve_bindings
├── decoder.per_scale_run.level                         ← per FPN scale (80×80, 40×40, 20×20)
│   │ fields: li, stride, h, w, anchors, layout
│   ├── decoder.per_scale_run.level.boxes               ← DFL or LTRB box decode
│   │   field: encoding
│   ├── decoder.per_scale_run.level.scores              ← class-score dequant + sigmoid
│   │   field: activation
│   └── decoder.per_scale_run.level.mask_coefs          ← mask coefficient dequant (32-D)
├── decoder.per_scale_run.protos                        ← proto-mask dequant (160×160×32)
└── decoder.per_scale_run.widen_f32                     ← f16 → f32 widen or zero-copy borrow
    field: kind = "f32_borrow" | "f16_widen"

decoder.nms_get_boxes                                   [post-NMS candidate selection]
│ fields: n_candidates, n_after_topk, n_after_nms, n_detections
├── decoder.nms_get_boxes.score_filter                  ← max-class score threshold filter
├── decoder.nms_get_boxes.top_k                         ← partial sort, retain pre_nms_top_k
│   field: k
├── decoder.nms_get_boxes.suppress                      ← class-agnostic / class-aware IoU NMS
└── decoder.nms_get_boxes.dequant_boxes                 ← int8 → f32 on survivors only (quant path)
    field: n
```

### What each span measures (mapped to reference YOLO post-processing)

| Span                                            | Reference equivalent (Ultralytics)                      | What it does |
|-------------------------------------------------|---------------------------------------------------------|--------------|
| `decoder.decode`                                | `non_max_suppression(...)` + segmentation `process_mask`| Top-level decode entry. The `path` field tells you which model topology was selected at builder time. |
| `decoder.decode_proto`                          | `decode` + skip `process_mask` (returns coeffs/protos)  | Returns proto data instead of materialised masks; pair with `image.materialize_masks` or the fused GPU shader. |
| `decoder.decode.yolo_{quant,float}_{flat,split}`| `preds.transpose(-1, -2)` + `xywh2xyxy` + per-row filter| Flat-tensor and legacy split-tensor YOLO post-processing (pre-per-scale). Each kernel walks `(4 + nc, anchors)` once. |
| `decoder.decode.process_masks`                  | `process_mask` / `process_mask_native`                  | `sigmoid(coefficients @ protos)` + crop to detection bbox. The `mode` field is the input dtype (float kernel vs. fused i8/i16 path). |
| `decoder.decode_proto.extract_proto_data`       | n/a (HAL-specific)                                      | Pack per-detection coefficient rows + protos into a `ProtoData` for the GPU fused shader. The `n == 0` early-exit avoids a ~819 KB proto copy. |
| `decoder.decode.per_scale_to_masks`             | n/a (HAL-specific bridge)                               | Plumbing between the per-scale outputs and the materialised-mask post-processing. |
| `decoder.decode_proto.per_scale_to_proto_data`  | n/a (HAL-specific bridge)                               | Same bridge into the proto-extraction post-processing. |
| `decoder.per_scale_run`                         | Anchor-free `dist2bbox` over multi-scale FPN heads      | The NEON-vectorised hot path for Ara-2 DVM and certain TFLite exports where the head is pre-split per FPN scale. Shared across `decode` and `decode_proto`. |
| `decoder.per_scale_run.resolve_bindings`        | Plan-time tensor → role mapping                         | Walks the schema-derived `Plan` once to match input tensors to box / score / mc / protos roles. |
| `decoder.per_scale_run.level`                   | One FPN scale (e.g. 80×80, 40×40, 20×20)                | All per-anchor work for one scale. Use the `li` and `anchors` fields to attribute cost to the dominant scale (level 0 has ~16× more anchors than level 2). |
| `decoder.per_scale_run.level.boxes`             | `dist2bbox(dfl(x[:4*reg_max]))` or `ltrb2bbox`          | Box decode; DFL is the per-anchor bottleneck on Cortex-A53. |
| `decoder.per_scale_run.level.scores`            | `sigmoid(class_logits)` per anchor                      | ~32% of decode time on imx95-evk per the per-stage estimate. The `activation` field distinguishes `sigmoid` from `none`. |
| `decoder.per_scale_run.level.mask_coefs`        | Mask-coefficient dequant (no sigmoid)                   | 32-D coefficient stream per anchor. |
| `decoder.per_scale_run.protos`                  | `protos` head dequant (NHWC or NCHW → NHWC)             | One-time per-frame proto-mask dequant; the GPU shader consumes the resulting f32/f16 array as a texture. |
| `decoder.per_scale_run.widen_f32`               | n/a (HAL-specific)                                      | If the per-scale path produced f16 buffers, widen to f32 for the legacy NMS kernels. `kind = "f32_borrow"` means zero allocation. |
| `decoder.nms_get_boxes`                         | `non_max_suppression`                                   | Composite span over score_filter + top_k + suppress + dequant_boxes. The `n_candidates → n_after_topk → n_after_nms → n_detections` fields tell you where candidates were dropped. Shared across detection paths. |
| `decoder.nms_get_boxes.score_filter`            | `xc = candidates.amax(1) > conf`                        | Per-row max-class score threshold filter. |
| `decoder.nms_get_boxes.top_k`                   | `x[x[:, 4].argsort(descending=True)[:max_nms]]`         | Partial sort to `pre_nms_top_k` candidates (default 300; raise to anchor count for COCO mAP at `conf=0.001`). |
| `decoder.nms_get_boxes.suppress`                | `torchvision.ops.nms` or `batched_nms`                  | IoU-based suppression. Class-agnostic or class-aware per the decoder's `Nms` setting. |
| `decoder.nms_get_boxes.dequant_boxes`           | (quant path only)                                       | Int8 → f32 dequant applied only to survivors of NMS — avoids dequantising filtered candidates. |

[`tracing::trace_span!`]: https://docs.rs/tracing/latest/tracing/macro.trace_span.html

## Performance Considerations

- **Quantized integer math**: `decode_quantized` and `decode_quantized_proto` operate directly on int8/uint8 buffers, avoiding the cost of producing an intermediate dequantized `f32` tensor.
- **Vectorized operations via ndarray**: bulk box and score arithmetic uses ndarray's iterator-fused operations.
- **Parallel processing with Rayon**: per-detection mask matmul is parallelized when the candidate pool is large enough to amortize the rayon overhead.
- **Early termination in NMS loops**: once the partial score-sort top-K is reached, NMS exits without examining lower-confidence anchors.
- **NEON FP16 hot paths on aarch64**: see [`per_scale/kernels/`](https://github.com/EdgeFirstAI/hal/blob/main/crates/decoder/src/per_scale/kernels/). Stable Rust lacks f16 intrinsics, so the kernels use inline `.arch_extension fp16` assembly. Tile-transpose plus 2^k injection (k ∈ [-14, 15]) keeps the GEMM in F16 throughout the inner loop.
- **Tracing spans**: emitted via `tracing::trace_span!` at every public decode entry point. Near-zero cost when no subscriber is active. See [`README.md#performance-tracing`](https://github.com/EdgeFirstAI/hal/blob/main/README.md#performance-tracing).

### `pre_nms_top_k` for deployment vs. mAP evaluation

The default `pre_nms_top_k = 300` is tuned for deployment workloads where
`score_threshold ≥ 0.25` already filters most candidates. For COCO-style mAP
evaluation at `score_threshold = 0.001`, **raise this cap** (typically to the
full anchor count, e.g. 8400 for 640×640 YOLO) or set it to `0` (no limit).
The default silently truncates ~74% of valid candidates at validation
thresholds, costing ~9 pp box mAP — a measurement artifact, not a model
quality issue. The decoder math is correct in both cases.
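
The effect of the cap can be sketched with a small selection helper. `top_k_indices` is a hypothetical function for this example; it uses the standard library's `select_nth_unstable_by` as a stand-in for the crate's partial sort and treats `k == 0` as "no limit", as described above.

```rust
/// Return the indices of the `k` highest-scoring candidates, best first.
/// `k == 0` means no limit. Tie order is unspecified.
pub fn top_k_indices(scores: &[f32], k: usize) -> Vec<usize> {
    let by_score_desc = |a: &usize, b: &usize| scores[*b].total_cmp(&scores[*a]);
    let mut idx: Vec<usize> = (0..scores.len()).collect();
    if k != 0 && k < idx.len() {
        // Partition so that the first k entries are the k best (unordered),
        // then discard the rest before the final sort.
        idx.select_nth_unstable_by(k, by_score_desc);
        idx.truncate(k);
    }
    idx.sort_by(by_score_desc);
    idx
}
```

With a cap of 300 and a 640×640 model, at most 300 of the 8400 anchors can ever reach suppression, regardless of how many pass a `0.001` score threshold. That is why the cap matters for evaluation but rarely for deployment thresholds.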

## Inter-Crate Interfaces

| Direction | Crate | Interface |
|-----------|-------|-----------|
| Depends on | [`edgefirst-tensor`](https://github.com/EdgeFirstAI/hal/blob/main/crates/tensor/) | `TensorDyn`, `Tensor<T>`, `TensorMap` for reading model output buffers |
| Optional dep | [`edgefirst-tracker`](https://github.com/EdgeFirstAI/hal/blob/main/crates/tracker/) (feature `tracker`) | `Tracker<DetectBox>` for `decode_tracked()` |
| Consumed by | [`edgefirst-image`](https://github.com/EdgeFirstAI/hal/blob/main/crates/image/) | `DetectBox`, `Segmentation`, proto data for `draw_*` rendering APIs |
| Consumed by | [`edgefirst-hal`](https://github.com/EdgeFirstAI/hal/blob/main/crates/hal/) | re-export as `edgefirst_hal::decoder` |
| Consumed by | [`edgefirst-hal-capi`](https://github.com/EdgeFirstAI/hal/blob/main/crates/capi/) | C-API bindings for `Decoder`, `DetectBox`, `Segmentation` |

## Cross-References

- Project architecture: [../../ARCHITECTURE.md](https://github.com/EdgeFirstAI/hal/blob/main/ARCHITECTURE.md)
- Image-side mask rendering: [../image/ARCHITECTURE.md](https://github.com/EdgeFirstAI/hal/blob/main/crates/image/ARCHITECTURE.md)
- Tracker integration: [../tracker/ARCHITECTURE.md](https://github.com/EdgeFirstAI/hal/blob/main/crates/tracker/ARCHITECTURE.md)
- Performance tracing usage: [README.md#performance-tracing](https://github.com/EdgeFirstAI/hal/blob/main/README.md#performance-tracing)
- Optimization guide (cross-crate user rules): [README.md#optimization-guide](https://github.com/EdgeFirstAI/hal/blob/main/README.md#optimization-guide)