# edgefirst-image Architecture
## Overview
`edgefirst-image` provides hardware-accelerated image format conversion,
resizing, rotation, cropping, and segmentation-mask rendering for EdgeFirst
inference pipelines. The crate's central type is
[`ImageProcessor`](https://docs.rs/edgefirst-image/latest/edgefirst_image/struct.ImageProcessor.html),
an orchestrator that probes available hardware once at construction time and
then dispatches per-call to the most efficient backend in the chain
**OpenGL → G2D → CPU**. The processor owns the lifecycle of the GL thread,
the EGL/PBO caches, and the GPU shader programs that implement the visual
operations.
This crate carries the largest body of platform-specific code in the
EdgeFirst HAL. Most of the architectural surface area is concerned with
keeping zero-copy paths working across i.MX 8M Plus (Vivante), i.MX 95
(Mali/Panfrost), and desktop Mesa, while respecting the lifecycle and
shutdown quirks of each driver stack.
## Module Map
| [`lib.rs`](https://github.com/EdgeFirstAI/hal/blob/main/crates/image/src/lib.rs) | local | Public surface: `ImageProcessor`, `ImageProcessorTrait`, `Rotation`, `Flip`, `Crop`, `MaskOverlay`, `load_image` / `save_jpeg` / `save_png` |
| [`cpu/`](https://github.com/EdgeFirstAI/hal/tree/main/crates/image/src/cpu) | local | `CPUProcessor` — fast_image_resize + rayon, plus the f16 mask kernels |
| [`g2d.rs`](https://github.com/EdgeFirstAI/hal/blob/main/crates/image/src/g2d.rs) | local | `G2DProcessor` — NXP i.MX G2D 2D-engine bindings |
| [`gl/`](https://github.com/EdgeFirstAI/hal/tree/main/crates/image/src/gl) | local | OpenGL backend: threaded wrapper, context, EGL+PBO caches, shaders, DMA-BUF import |
| [`error.rs`](https://github.com/EdgeFirstAI/hal/blob/main/crates/image/src/error.rs) | local | `Error` (with `From<std::io::Error>` for ergonomic `?` propagation in user code) |
## Key Types and Traits
- [`ImageProcessor`](https://docs.rs/edgefirst-image/latest/edgefirst_image/struct.ImageProcessor.html) — the orchestrator. Owns CPU + G2D + GL backends and dispatches per call.
- [`ImageProcessorTrait`](https://docs.rs/edgefirst-image/latest/edgefirst_image/trait.ImageProcessorTrait.html) — the convert/draw API common to every backend.
- [`Rotation`](https://docs.rs/edgefirst-image/latest/edgefirst_image/enum.Rotation.html), [`Flip`](https://docs.rs/edgefirst-image/latest/edgefirst_image/enum.Flip.html), [`Crop`](https://docs.rs/edgefirst-image/latest/edgefirst_image/struct.Crop.html) — geometric parameters; `Crop::letterbox()` preserves aspect ratio.
- [`MaskOverlay`](https://docs.rs/edgefirst-image/latest/edgefirst_image/struct.MaskOverlay.html) — composite control for mask-rendering APIs (`background`, `opacity`).
- [`load_image`](https://docs.rs/edgefirst-image/latest/edgefirst_image/fn.load_image.html) / [`save_jpeg`](https://docs.rs/edgefirst-image/latest/edgefirst_image/fn.save_jpeg.html) / [`save_png`](https://docs.rs/edgefirst-image/latest/edgefirst_image/fn.save_png.html) — JPEG/PNG decode/encode with EXIF orientation handling.
## Internal Architecture
### Backend dispatch
```mermaid
classDiagram
class ImageProcessorTrait {
<<trait>>
+convert(src, dst, rotation, flip, crop)
+draw_decoded_masks(dst, detections, segmentations)
+draw_proto_masks(dst, detections, proto_data)
+set_class_colors(colors)
}
class ImageProcessor {
cpu: Option~CPUProcessor~
g2d: Option~G2DProcessor~
opengl: Option~GLProcessorThreaded~
+new() orchestrator with fallback chain
+create_image(w, h, PixelFormat, DType, mem) GPU-optimal alloc
}
class G2DProcessor { NXP i.MX G2D hardware }
class GLProcessorThreaded { Owns the GL thread + message channel }
class GLProcessorST { Single-threaded GL impl, owns EGL + GL state }
class CPUProcessor { fast_image_resize + rayon }
ImageProcessorTrait <|.. ImageProcessor
ImageProcessorTrait <|.. G2DProcessor
ImageProcessorTrait <|.. GLProcessorThreaded
ImageProcessorTrait <|.. GLProcessorST
ImageProcessorTrait <|.. CPUProcessor
ImageProcessor o-- G2DProcessor
ImageProcessor o-- GLProcessorThreaded
ImageProcessor o-- CPUProcessor
GLProcessorThreaded *-- GLProcessorST : owns via thread
```
`ImageProcessor` dispatch priority is **OpenGL (GPU) → G2D (where supported)
→ CPU (always available)**. Environment variables `EDGEFIRST_DISABLE_GL`,
`EDGEFIRST_DISABLE_G2D`, `EDGEFIRST_DISABLE_CPU` and `EDGEFIRST_FORCE_BACKEND`
override this chain at runtime; see [`README.md` Environment
Variables](https://github.com/EdgeFirstAI/hal/blob/main/crates/image/README.md#environment-variables).
### TensorDyn as the image type
The image-side type system reuses [`edgefirst_tensor::TensorDyn`](https://docs.rs/edgefirst-tensor/latest/edgefirst_tensor/struct.TensorDyn.html)
as the dtype-erased image carrier. `TensorDyn` wraps a `Tensor<T>` and a
`PixelFormat`; the format describes the spatial layout, the `DType` describes
element storage. Width / height / channels are **not stored** — they
are computed from shape + format on every access. Row stride is **optional
metadata** (`Tensor::row_stride()` / `set_row_stride()` /
`effective_row_stride()`); it is left unset for tightly packed buffers and
is required for padded DMA-BUF imports where the producer's stride differs
from `width * bytes_per_pixel`.
| `Rgb`, `Rgba`, `Bgra`, `Grey`, `Yuyv`, `Vyuy` | `[H, W, C]` | Interleaved (channels-last) |
| `PlanarRgb`, `PlanarRgba` | `[C, H, W]` | Channels-first |
| `Nv12` | `[H*3/2, W]` | 2D — Y plane (H rows) + UV (H/2 rows) |
| `Nv16` | `[H*2, W]` | 2D — Y plane (H rows) + UV (H rows) |
For multi-plane DMA-BUF NV12/NV16 (V4L2 `NV12M` from VPU/NeoISP), Y and UV
live in separate allocations. `Tensor::from_planes(luma, chroma,
PixelFormat::Nv12)` keeps each plane's fd independent for zero-copy GPU
import via per-plane EGL attributes (`DMA_BUF_PLANE0_FD` / `DMA_BUF_PLANE1_FD`).
### `create_image()` and zero-copy memory selection
`ImageProcessor::create_image()` is the preferred way to allocate destination
tensors for `convert()`. It selects the optimal backend based on a probe done
at `ImageProcessor::new()` time:
```mermaid
flowchart TD
Create["create_image(w, h, PixelFormat, DType, mem)"]
DMA{DMA-buf roundtrip<br/>verified at init?}
PBO{OpenGL PBO<br/>available?}
Mem[MemTensor<br/>heap fallback]
Create --> DMA
DMA -->|Yes| UseDMA["DmaTensor<br/>Zero-copy EGLImage import"]
DMA -->|No| PBO
PBO -->|Yes| UsePBO["PboTensor<br/>Zero-copy GL buffer binding"]
PBO -->|No| Mem
style UseDMA fill:#90ee90
style UsePBO fill:#87ceeb
style Mem fill:#ffeb9c
```
| DMA-buf | GPU supports `EGL_EXT_image_dma_buf_import` | Zero-copy: GPU reads/writes the DMA buffer directly | NXP i.MX 8M Plus (Vivante), i.MX 95 (Mali/Panfrost) |
| PBO | GLES 3.0 available, DMA-buf roundtrip fails | Zero-copy GL: `GL_PIXEL_UNPACK_BUFFER` / `GL_PIXEL_PACK_BUFFER` | NVIDIA desktop, hosts without DMA-heap permissions |
| Mem | No GPU or GL unavailable | CPU `memcpy` via `glTexImage2D` / `glReadnPixels` | Universal fallback |
**Why PBO matters:** on desktop Linux with NVIDIA GPUs, DMA-buf allocation
succeeds (`/dev/dma_heap/system`) but the NVIDIA EGL driver cannot import
those buffers — the `verify_dma_buf_roundtrip()` check catches this at init.
Without PBO, every `convert()` would fall back to CPU `memcpy` for upload and
readback. PBO keeps the data in GPU-accessible memory, enabling the same
zero-copy shader pipeline used on DMA platforms.
### GL transfer backend selection
| `DmaBuf` | `verify_dma_buf_roundtrip()` passes | `EGL_EXT_image_dma_buf_import` (zero-copy) | EGLImage export (zero-copy) |
| `Pbo` | GLES 3.0 available, DMA-buf fails | `GL_PIXEL_UNPACK_BUFFER` | `GL_PIXEL_PACK_BUFFER` |
| `Sync` | Final fallback | `glTexImage2D` (host pointer) | `glReadnPixels` (host pointer) |
### GL thread architecture
`GLProcessorThreaded` is the public, thread-safe wrapper. It spawns a
dedicated OS thread that owns the EGL context and all GL state
(`GLProcessorST`). All operations are sent as `GLProcessorMessage` enum
variants through a channel and block on a oneshot reply. This design is
required because EGL contexts are thread-local — every GL call must happen
on the thread that created the context.
The [`PboOps`](https://docs.rs/edgefirst-tensor/latest/edgefirst_tensor/trait.PboOps.html)
trait bridges the tensor crate and the GL thread. `PboTensor` (defined in
the tensor crate) holds an `Arc<dyn PboOps>`; the image crate's
`GlPboOps` implementation of that trait is what owns the `WeakSender` to
the GL-thread channel. When the tensor needs to map / unmap / delete the
PBO, it calls into the `PboOps` impl, which sends a message through the
channel. The weak sender ensures PBO tensors don't prevent GL thread
shutdown — see
[`crates/tensor/ARCHITECTURE.md#pbo-tensors-and-the-weaksender-pattern`](https://github.com/EdgeFirstAI/hal/blob/main/crates/tensor/ARCHITECTURE.md#pbo-tensors-and-the-weaksender-pattern).
### GLES 3.1 context and the optional compute path
At context creation time, the GL thread attempts a GLES 3.1 context first;
on failure it falls back to GLES 3.0 (no compute shaders).
When GLES 3.1 is available, an opt-in compute shader path can perform the
HWC→CHW proto-tensor repack on the GPU. Enable it with
`EDGEFIRST_PROTO_COMPUTE=1`. If compilation fails at runtime, the
implementation logs a warning and falls back to CPU repack transparently —
no API changes.
### PBO convert dispatch
When `convert()` is called with PBO-backed images, the GL thread must not
call `tensor.map()` on those images — that would send a message back to
itself and deadlock. The convert path dispatches to specialized methods:
| DMA → DMA | `convert_dest_dma` (EGLImage on both sides) |
| PBO → PBO | `convert_pbo_to_pbo` (UNPACK + PACK) |
| Mem/DMA → PBO | `convert_any_to_pbo` (texture + PACK) |
| PBO → Mem | `convert_pbo_to_mem` (UNPACK + ReadnPixels) |
| Mem/DMA → Mem/DMA | `convert_dest_non_dma` (texture + memcpy) |
**PBO destination renderbuffer setup:** When the destination is a PBO tensor
and the shader needs a pre-initialized renderbuffer (e.g., for letterbox
padding), the code must use `setup_renderbuffer_from_pbo()` — **never**
`setup_renderbuffer_non_dma()`. The latter calls `dst.map()`, which sends a
`PboMap` message to the GL thread via the channel. Since we are already
executing on the GL thread (processing an `ImageConvert` message), this
deadlocks. `setup_renderbuffer_from_pbo()` avoids this by binding the PBO as
`GL_PIXEL_UNPACK_BUFFER` and loading data with `glTexImage2D(NULL)`, which
reads directly from the PBO without any channel round-trip.
`convert_pbo_to_mem()` is not affected because its destination is a Mem
tensor, so `dst.map()` is a direct memory operation with no GL thread
message.
### EGL image cache
The OpenGL backend maintains two independent LRU caches of EGLImages —
`src_egl_cache` for source tensors and `dst_egl_cache` for destination
tensors. Each entry is keyed by `(BufferIdentity.id, chroma_id)`.
`BufferIdentity` is defined in
[`crates/tensor/src/lib.rs`](https://github.com/EdgeFirstAI/hal/blob/main/crates/tensor/src/lib.rs)
and pairs a globally unique monotonic ID with an `Arc<()>` liveness guard:
```text
BufferIdentity {
id: u64, // unique per allocation; new value from each from_fd() call
guard: Arc<()>, // cache entries hold Weak<()>; when tensor drops, weak dies
}
```
When a tensor is freed, its `Arc<()>` guard drops. The cache holds only a
`Weak<()>` reference, so `sweep()` detects dead entries without an explicit
removal call.
| Same tensor object reused across frames | Hit on every frame | Fast — no EGLImage re-import |
| New tensor wrapping the same fd each frame | Miss on every frame | Slow — re-imports each call |
| `dst` of call N reused as `src` of call N+1 | Hit in both caches | Two separate entries, no collision |
Key implications:
- `hal_tensor_from_fd()` and `hal_import_image()` always allocate a new
`BufferIdentity`. Callers that re-wrap the same fd each frame will see a
cache miss every `convert()`. **Hold the tensor alive across frames.**
- Content written into a DMA-BUF between calls (e.g. by a V4L2 decoder) is
visible on the next call. EGLImage is a handle to live physical memory,
not a snapshot.
- The `last_bound_src_egl` optimization skips
`glEGLImageTargetTexture2DOES` when the source EGLImage has not changed
between consecutive calls. Safe — the texture already points at the
correct DMA-BUF memory.
- A `glFinish()` is issued at the end of every `convert()`. This guarantees
GPU reads of source and writes to destination are complete before return,
making it safe to chain calls.
See [Appendix C: DMA-BUF Identity and Tensor Caching](https://github.com/EdgeFirstAI/hal/blob/main/ARCHITECTURE.md#appendix-c-dma-buf-identity-and-tensor-caching)
in the project ARCHITECTURE.md for the cross-crate cache story (V4L2
fd recycling, inode-keyed cache, GStreamer adaptor integration).
### Vivante NV12 → PlanarRgb two-pass workaround
A single-pass NV12 → PlanarRgb shader causes a GPU hang on the Vivante
GC7000UL (NXP i.MX 8M Plus). The workaround splits the conversion:
```text
Pass 1: NV12 → RGBA (intermediate)
All geometry: resize, crop, rotation, flip, letterbox
Pass 2: RGBA → PlanarRgb (at destination resolution)
Deinterleaves RGBA to three planes via sampler2D variants
```
Pass 1 reuses the existing `packed_rgb_intermediate_tex` texture — no new
GPU resources allocated. Pass 2 uses the same shader infrastructure as
direct RGBA → PlanarRgb. The two-pass path is selected automatically when
`is_vivante && src_fmt == Nv12 && dst_fmt.layout() == Planar`. No API
changes required from callers.
## Mask Rendering
YOLO segmentation models produce **proto masks** (shared basis at reduced
resolution, typically 160×160) and per-detection **mask coefficients**:
```text
mask_raw[i] = coefficients[i] @ protos # (proto_h, proto_w)
```
The image crate exposes three rendering pipelines paired with the decoder's
mask APIs:
| Materialized | [`Decoder::decode()`](https://docs.rs/edgefirst-decoder/latest/edgefirst_decoder/struct.Decoder.html#method.decode) | `draw_decoded_masks` | Already have mask matrices |
| Fused proto path | [`Decoder::decode_proto()`](https://docs.rs/edgefirst-decoder/latest/edgefirst_decoder/struct.Decoder.html#method.decode_proto) | `draw_proto_masks` | Real-time GPU overlay (preferred) |
| Tracked materialized | [`Decoder::decode_tracked()`](https://docs.rs/edgefirst-decoder/latest/edgefirst_decoder/struct.Decoder.html#method.decode_tracked) (via [`edgefirst-tracker`](https://github.com/EdgeFirstAI/hal/blob/main/crates/tracker/)) | `draw_masks_tracked` | Single-call decode + track + render |
| Tracked proto | [`Decoder::decode_proto_tracked()`](https://docs.rs/edgefirst-decoder/latest/edgefirst_decoder/struct.Decoder.html#method.decode_proto_tracked) | `draw_proto_masks` (with track-augmented detections) | Tracked GPU-fused path |
### MaskOverlay
```rust,ignore
pub struct MaskOverlay<'a> {
pub background: Option<&'a TensorDyn>, // blit before drawing masks
pub opacity: f32, // 0.0 invisible, 1.0 opaque
pub letterbox: Option<[f32; 4]>, // [xmin, ymin, xmax, ymax] in
// model-input normalized space;
// maps decoder output back to
// original image coords when set
pub color_mode: ColorMode, // Class | Instance | Track
// (Track currently behaves
// like Instance — see below)
}
```
### Fused proto→pixel algorithm (`draw_proto_masks`)
Instead of computing the matmul at proto resolution and upsampling the
result, the fused path upsamples the proto field itself and evaluates the
dot product at every output pixel:
```text
For each output pixel (x, y) inside detection bbox at 640×640:
bilinear_sample(protos, proto_coords(x, y)) → 32 interpolated values
dot(coefficients, interpolated_protos) → raw logit
sigmoid(raw) → mask value [0, 1]
threshold at 0.5 → blend color onto pixel
```
Algebraically equivalent to bilinear-after-matmul (both bilinear
interpolation and the dot product are linear), but avoids materializing the
intermediate tensors. Key design choices:
- **No proto-resolution crop** — the full 160×160 proto field is sampled,
avoiding the boundary erosion artifact of crop-before-upsample approaches.
- **Sigmoid after interpolation** — sigmoid is nonlinear, so applying it
after spatial operations preserves dynamic range through interpolation.
- The draw path uses the sigmoid value directly for alpha-blend weighting.
This is mathematically equivalent to Ultralytics' `retina_masks=True`
(`process_mask_native`) for binary mask output. Empirical validation across
26 matched detections on COCO val2017 confirms **0.993 mean mask IoU**
between the two methods.
### GPU implementation (OpenGL)
Draw path (`draw_proto_masks`) — sigmoid shaders with alpha blending:
The fragment shader computes `sigmoid(logit)` and blends the detection color
onto the framebuffer using `GL_SRC_ALPHA / GL_ONE_MINUS_SRC_ALPHA`. The GPU
renders one quad per detection; the fragment shader evaluates the mask at
every output pixel.
#### Shader variants
| int8-nearest | `R8I` quantized | Nearest neighbor |
| int8-bilinear | `R8I` quantized | Manual 4-tap bilinear |
| f32 | `R32F` float | Hardware `GL_LINEAR` |
| f16 | `R16F` half | Hardware `GL_LINEAR` |
Control quantized proto interpolation via
`processor.set_int8_interpolation_mode(...)`.
> **G2D limitation:** the NXP G2D hardware accelerator does not support
> mask rendering. On platforms where G2D is the primary backend (e.g.
> i.MX 8M Plus without EGL), all `draw_*` methods return
> `NotImplemented`. Use an OpenGL-capable processor (pass an
> `egl_display`) or fall back to CPU rendering.
#### Vivante shader performance cliff and the eager-materialize workaround
The fused `draw_proto_masks` path is the preferred GPU pipeline on
Mali / Panfrost (i.MX 95) and on desktop GPUs, where the per-pixel
sigmoid + dot product runs comfortably under real-time even at large
detection counts. **On Vivante GC7000UL (i.MX 8M Plus)** the same
shader can fall off a cliff for models with many detections or large
proto fields, reaching multi-second per-frame latency. The driver's
fragment-shader scheduling on this part does not amortise the
per-quad-per-pixel work the way mainline GPUs do.
The safe pattern on Vivante (and as a defensive choice for any
deployment that may run on it) is to **materialise masks once on the
CPU** via `materialize_masks` immediately after decode, free the proto
data, and then render with the cheap `draw_decoded_masks` blit:
```rust
// On the decode thread:
let masks = processor.materialize_masks(
&boxes, &scores, &classes, &proto_data,
letterbox_norm,
MaskResolution::Scaled { width: dst_w, height: dst_h },
)?;
drop(proto_data); // free the proto tensor immediately
// On the render thread (or the same thread, just later):
processor.draw_decoded_masks(&mut frame, &boxes, &masks, overlay)?;
```
`materialize_masks` runs the same batched-GEMM kernel exercised by
`mask_benchmark` (see [`TESTING.md`](https://github.com/EdgeFirstAI/hal/blob/main/crates/image/TESTING.md#benchmarks)),
which is well-amortised across detections; `draw_decoded_masks`
becomes a cheap RGBA blit. The combined CPU + GPU cost on i.MX 8M
Plus is comfortably real-time at typical COCO detection counts. On
Mali / desktop GPUs, switching back to the fused `draw_proto_masks`
is straightforward — the data flow is the same up to the choice of
render call.
## Process-Shutdown Resource Cleanup
The OpenGL headless renderer
([`crates/image/src/gl/`](https://github.com/EdgeFirstAI/hal/tree/main/crates/image/src/gl))
loads EGL and OpenGL ES via dynamically loaded shared libraries
(`libEGL.so.1`, `libGLESv2.so`). When the process exits — particularly from
a Python interpreter running PyO3 extensions — EGL resource cleanup can
crash with heap corruption, segfaults, or panics. This is a well-documented
industry-wide problem with no clean solution.
The crash arises from a fundamental conflict between four systems:
1. **Python finalization order is non-deterministic.** During
`Py_FinalizeEx()`, Python destroys modules and objects in arbitrary
order. A PyO3 `#[pyclass]` wrapping `GlContext` may have its Rust `Drop`
invoked after dependent state is gone.
2. **Linux `atexit` handler ordering is unreliable.** glibc's
`__cxa_finalize` interacts with `dlclose` in non-deterministic ways
between handlers registered by different shared libraries
([glibc #21032](https://sourceware.org/bugzilla/show_bug.cgi?id=21032)).
3. **Mesa's `_eglAtExit` use-after-free.** Mesa's atexit handler frees
per-thread EGL state (`_EGLThreadInfo`). If `Drop` calls
`eglReleaseThread()` after this handler runs, it dereferences freed
memory ([Ubuntu Bug #1946621](https://bugs.launchpad.net/ubuntu/+source/mesa/+bug/1946621)).
4. **Vendor EGL driver bugs.** Some vendor drivers (Qualcomm Adreno, older
NVIDIA) misbehave during cleanup. The `catch_unwind` guard absorbs
driver-side panics.
### Defense-in-depth solution
```text
┌─────────────────────────────────────────────────────────┐
│ Layer 1: Box::leak — EGL library handle │
│ Prevents dlclose from unmapping shared library code │
├─────────────────────────────────────────────────────────┤
│ Layer 2: ManuallyDrop<Rc<Egl>> — EGL instance │
│ Prevents khronos-egl Drop from calling │
│ eglReleaseThread() into freed Mesa state │
├─────────────────────────────────────────────────────────┤
│ Layer 3: catch_unwind — EGL cleanup calls │
│ Catches panics from eglDestroyContext/eglMakeCurrent │
│ if function pointers are invalidated │
├─────────────────────────────────────────────────────────┤
│ Layer 4: Skip eglTerminate entirely │
│ Display lives in a process-global OnceLock and is │
│ intentionally leaked; eglTerminate is never called │
└─────────────────────────────────────────────────────────┘
```
`GlContext::drop` calls `eglMakeCurrent(EGL_NO_CONTEXT)` and
`eglDestroyContext` inside `catch_unwind`, then intentionally skips
dropping the `Rc<Egl>` wrapper. The EGL library handle is leaked via
`Box::leak` at load time so it is never `dlclose`'d. The EGL display
itself lives in the `SHARED_DISPLAY` `OnceLock` (`SharedEglDisplay`) and
is never terminated — the EGL spec ref-counts `eglTerminate` against
`eglInitialize`, but calling it from `GlContext::drop` would tear the
display down for every other live `GlContext` in the process, so the
HAL leaks it on purpose.
### Industry precedent
- **Chromium/ANGLE** skips full EGL teardown on GPU process exit, treating
cleanup as more dangerous than no cleanup during shutdown.
- **wgpu** wraps `glow::Context` in `ManuallyDrop` and loads EGL with
`RTLD_NODELETE` to prevent library unloading. The `khronos-egl` crate
itself adopted `RTLD_NOW | RTLD_NODELETE` after
[khronos-egl #14](https://github.com/timothee-haudebourg/khronos-egl/issues/14),
citing the same glibc root cause.
- **Smithay** supports skipping `eglTerminate` via a feature flag for the
same class of driver bugs.
### Limitations
`catch_unwind` catches Rust panics but cannot catch fatal signals
(`SIGABRT`, `SIGSEGV`, `SIGBUS`) from heap corruption inside the C driver.
The defense layers prevent the most common failure modes, but a
sufficiently broken driver could still crash the process — none has been
observed on the supported platforms (Vivante, Mali/Panfrost, Mesa x86_64).
### G2D resource cleanup
On NXP i.MX, `libg2d.so.2` and `libEGL.so.1` share kernel state through the
Vivante `galcore` device (`/dev/galcore`). When both libraries are loaded,
calling `dlclose` on either one can trigger heap corruption (`corrupted
double-linked list`) during process exit — the atexit handlers from the
shared `galcore` driver become inconsistent.
For production code, `G2DProcessor` is dropped normally; the EGL
`Box::leak` (Layer 1) keeps shared `galcore` state intact. For benchmark
code (where many G2D processors are created/destroyed in one process), the
`crates/bench` harness wraps G2D processor instances in `ManuallyDrop` to
avoid repeated `g2d_close` + `dlclose` cycles that exhaust driver
resources.
### Resource cleanup policy
- **EGL displays** — never terminated. The process-global
`SharedEglDisplay` is created once via `OnceLock` and leaked on
process exit. `eglTerminate` is the *EGL way* to release a display
but in practice causes driver crashes (see the defense-in-depth
section above), so the HAL never calls it.
- **EGL contexts** — `eglDestroyContext` in `GlContext::drop`, inside
`catch_unwind`. No EGL surfaces are created — the HAL uses
surfaceless contexts (`EGL_KHR_surfaceless_context` +
`EGL_KHR_no_config_context`) and renders exclusively through FBOs
backed by EGLImages.
- **DMA buffers** — fd `close()` in `Drop`.
- **G2D contexts** — `g2d_close` in `G2DProcessor::drop`.
Intentional leaks: the EGL library handle (`Box::leak`), the `Rc<Egl>`
wrapper (`ManuallyDrop`), and the shared EGL display
(`SHARED_DISPLAY: OnceLock`). All three are process-lifetime objects;
the OS reclaims them at exit. GPU contexts, DMA buffers, and G2D
contexts are released eagerly by their `Drop` impls.
## GL Command Serialization (GL_MUTEX)
Multiple `ImageProcessor` instances can coexist in the same process. EGL
and OpenGL ES specify that independent contexts on separate threads should
not interfere, but several embedded GPU drivers violate this:
- **Vivante `galcore` (i.MX 8M Plus)** — concurrent `eglInitialize`,
`eglCreateContext`, DMA-BUF import ioctls, and `eglTerminate` from
multiple threads corrupt driver-internal state. Causes SIGSEGV (null
pointer at offset `0x18` in `galcore` ioctl) and futex deadlocks.
- **Broadcom V3D 7.1.10.2 (Raspberry Pi 5)** — concurrent `eglTerminate`
breaks ref-counting, causing `EGL(NotInitialized)` on surviving
contexts; subsequent GL operations fail with `GL_INVALID_OPERATION`.
- **ARM Mali-G310 (i.MX 95)** — Panfrost handles concurrent EGL/GL
correctly. No issues observed.
### Solution
A global `GL_MUTEX` (`std::sync::Mutex<()>` in
[`crates/image/src/gl/context.rs`](https://github.com/EdgeFirstAI/hal/blob/main/crates/image/src/gl/context.rs))
serializes **all** EGL and GL operations across every `GLProcessorST`
instance. Acquired in three places:
1. **Initialization** — wraps `GLProcessorST::new()` (display creation,
context setup, shader compilation, DMA-BUF roundtrip verification).
2. **Message dispatch** — wraps every incoming GL-thread message
(convert, draw masks, PBO create/download, etc.) so only one GL thread
executes driver calls at a time.
3. **Teardown** — wraps `GLProcessorST::drop()` → `GlContext::drop()`
so `eglDestroyContext` (and `eglMakeCurrent(EGL_NO_CONTEXT)`) are
serialized. The shared display is never terminated, so teardown
does not race against display state.
The mutex uses `unwrap_or_else(|e| e.into_inner())` to recover from
poisoning: if a prior GL operation panicked, subsequent operations on
other instances can still proceed rather than propagating a poison error.
### Performance implications
All GL operations are serialized — no concurrent GPU execution across
`ImageProcessor` instances. Acceptable because:
- The primary use case (edge AI inference pipelines) typically uses a
single processor per pipeline. Multiple instances exist mainly in test
scenarios.
- GPU operations are I/O-bound on embedded targets; mutex overhead
(microseconds) is negligible compared to DMA transfers and shader
execution (milliseconds).
- The alternative (concurrent GPU access) crashes on Vivante.
Future work could relax to init/teardown-only serialization on drivers
known to be safe for concurrent runtime ops (e.g. Mali), but the current
approach prioritizes correctness across all targets.
## Performance Considerations
| Reuse tensors across frames | Each new tensor allocates a fresh `BufferIdentity`. EGL image cache is keyed by `(BufferIdentity.id, chroma_id)`. New ID → cache miss → full `eglCreateImageKHR` import (~100–300 µs). Hold tensors alive. |
| Allocate via `create_image()` | The processor selects DMA-buf, PBO, or heap based on the runtime GPU probe at `new()` time. Bypassing with `Tensor::new(memory=...)` forces a slow transfer path on every `convert()`. |
| One `ImageProcessor` per pipeline | Each instance owns its OpenGL context, GL thread, and per-thread caches (the EGL display is process-global and shared). Multiple instances still serialize on `GL_MUTEX`, so concurrent use across instances buys nothing. |
| Native CPU feature builds (Rule 6) | A build-time concern. `RUSTFLAGS` controls whether the f16 mask kernel at [`crates/image/src/cpu/masks.rs`](https://github.com/EdgeFirstAI/hal/blob/main/crates/image/src/cpu/masks.rs) compiles to native widening instructions or to the soft-float `__extendhfsf2` helper. Distributed binaries stay on triple baseline ISA; benchmark hosts opt in via `RUSTFLAGS` overrides. |
See the [Optimization Guide](https://github.com/EdgeFirstAI/hal/blob/main/README.md#optimization-guide)
in the project README for the user-facing rules and validation patterns.
## Inter-Crate Interfaces
| Depends on | [`edgefirst-tensor`](https://github.com/EdgeFirstAI/hal/blob/main/crates/tensor/) | `TensorDyn`, `Tensor<T>`, `BufferIdentity`, `PboOps` impl |
| Depends on (unconditional) | [`edgefirst-decoder`](https://github.com/EdgeFirstAI/hal/blob/main/crates/decoder/) | `DetectBox`, `Segmentation`, proto data for `draw_*` |
| Depends on (feature `tracker`) | [`edgefirst-tracker`](https://github.com/EdgeFirstAI/hal/blob/main/crates/tracker/) | `Tracker<DetectBox>` for `draw_masks_tracked` |
| Consumed by | [`edgefirst-hal`](https://github.com/EdgeFirstAI/hal/blob/main/crates/hal/) | re-export as `edgefirst_hal::image` |
| Consumed by | [`edgefirst-hal-capi`](https://github.com/EdgeFirstAI/hal/blob/main/crates/capi/) | C bindings for `ImageProcessor` and rendering APIs (does **not** bridge to Python) |
| Consumed by | [`crates/python`](https://github.com/EdgeFirstAI/hal/blob/main/crates/python/) | PyO3 binding over the Rust umbrella crate (does not go through the C API) |
## Platform-Specific Notes
| Linux NXP i.MX 8M Plus (Vivante GC7000UL) | OpenGL, G2D, CPU | NV12 → PlanarRgb requires the two-pass workaround |
| Linux NXP i.MX 95 (Mali-G310 / Panfrost) | OpenGL, CPU | Concurrent GL works; `EDGEFIRST_OPENGL_RENDERSURFACE=1` required for Neutron NPU DMA-BUF destinations |
| Linux desktop / NVIDIA | OpenGL (PBO path), CPU | DMA-buf import unsupported; PBO path provides zero-copy |
| Linux desktop / Mesa x86_64 | OpenGL, CPU | DMA-heap permission required for DMA path |
| macOS | CPU | No GPU/G2D |
| Other Unix | CPU | No GPU/G2D |
## Cross-References
- Project architecture: [../../ARCHITECTURE.md](https://github.com/EdgeFirstAI/hal/blob/main/ARCHITECTURE.md)
- Tensor architecture (DMA-BUF, BufferIdentity, PboOps): [../tensor/ARCHITECTURE.md](https://github.com/EdgeFirstAI/hal/blob/main/crates/tensor/ARCHITECTURE.md)
- Decoder architecture (proto mask APIs): [../decoder/ARCHITECTURE.md](https://github.com/EdgeFirstAI/hal/blob/main/crates/decoder/ARCHITECTURE.md)
- DMA-BUF identity story: [ARCHITECTURE.md#appendix-c-dma-buf-identity-and-tensor-caching](https://github.com/EdgeFirstAI/hal/blob/main/ARCHITECTURE.md#appendix-c-dma-buf-identity-and-tensor-caching)
- Optimization guide: [README.md#optimization-guide](https://github.com/EdgeFirstAI/hal/blob/main/README.md#optimization-guide)
- Performance tracing usage: [README.md#performance-tracing](https://github.com/EdgeFirstAI/hal/blob/main/README.md#performance-tracing)