eorst 1.0.1

Earth Observation and Remote Sensing Toolkit - a library for raster processing pipelines.

# Filters Benchmark

Compares the zero-copy OpenCV path (`Mat::new_rows_cols_with_data` + `Mat::new_rows_cols_with_data_mut`) against the deprecated copy-based path (`arrayview2_to_mat` + `mat_to_array2`) for each filter operation.

## How to Run

```bash
cargo bench -p eorst --features use_opencv --bench filters_benchmark
```

Each benchmark group runs zero-copy and copy variants across three raster sizes.
Total runtime is approximately 3-5 minutes.

## What's Being Compared

**Zero-copy path** (current `Filters` trait implementation):
- Input: `Mat::new_rows_cols_with_data` borrows `Array2` data (0 copies)
- Output: `Mat::new_rows_cols_with_data_mut` writes directly into pre-allocated `Array2` (0 copies)
- Total: 1 allocation (output `Array2::zeros`), 0 memcpys
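
The pattern above can be sketched without the `opencv`/`ndarray` dependencies. The function name `erode_zero_copy` and the identity kernel are stand-ins for illustration only, not the library's API; what matters is the shape of the data flow: the filter borrows the input and writes directly into a caller-provided buffer.

```rust
// Sketch of the zero-copy pattern: borrow the input, write the output
// in place. The real path hands these buffers to OpenCV via
// Mat::new_rows_cols_with_data / Mat::new_rows_cols_with_data_mut;
// a trivial stand-in kernel keeps this example dependency-free.
fn erode_zero_copy(input: &[u8], output: &mut [u8]) {
    for (dst, src) in output.iter_mut().zip(input) {
        *dst = *src; // identity stand-in for the 3x3 erode kernel
    }
}

fn main() {
    let input = vec![7u8; 1024];
    // The single allocation (Array2::zeros in the real path).
    let mut output = vec![0u8; 1024];
    erode_zero_copy(&input, &mut output);
    // No copies beyond the kernel's own writes.
    assert_eq!(output, input);
}
```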

**Copy path** (deprecated `arrayview2_to_mat` + `mat_to_array2`):
- Input: allocates new `Mat`, copies all data from `Array2` into it
- Output: OpenCV allocates result `Mat`, then copies all data back into a new `Array2`
- Total: 3 allocations, 2 memcpys (input + output)
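
For contrast, a matching std-only sketch of the deprecated flow (again with a hypothetical function name and an identity stand-in kernel); each `Vec` below models one of the three allocations listed above:

```rust
// Sketch of the deprecated copy path: 3 allocations, 2 memcpys.
fn erode_with_copies(input: &[u8]) -> Vec<u8> {
    // arrayview2_to_mat: allocate an input Mat and copy the data in.
    let mat_in: Vec<u8> = input.to_vec(); // alloc 1, memcpy 1
    // OpenCV allocates the result Mat; the kernel fills it.
    let mut mat_out = vec![0u8; mat_in.len()]; // alloc 2
    mat_out.copy_from_slice(&mat_in); // identity stand-in for the kernel
    // mat_to_array2: allocate the Array2 and copy the result back out.
    mat_out.to_vec() // alloc 3, memcpy 2
}

fn main() {
    let input = vec![7u8; 1024];
    let result = erode_with_copies(&input);
    assert_eq!(result, input);
}
```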

## Results

Benchmarked on the CI machine (Linux, optimized release build). Times are median values from 100 samples.

### Erode (u8, 3×3 ellipse kernel)

| Size | Zero-Copy | Copy | Speedup |
|------|-----------|------|---------|
| 1024×1024 | 61.5 µs | 106.4 µs | **1.73×** |
| 2048×2048 | 300.4 µs | 3.879 ms | **12.91×** |
| 4096×4096 | 1.519 ms | 20.042 ms | **13.20×** |

### Dilate (u8, 3×3 ellipse kernel)

| Size | Zero-Copy | Copy | Speedup |
|------|-----------|------|---------|
| 1024×1024 | 60.8 µs | 105.2 µs | **1.73×** |
| 2048×2048 | 275.8 µs | 548.5 µs | **1.99×** |
| 4096×4096 | 1.452 ms | 20.146 ms | **13.88×** |

### Median Blur (u8, kernel 3)

| Size | Zero-Copy | Copy | Speedup |
|------|-----------|------|---------|
| 1024×1024 | 120.0 µs | 156.1 µs | **1.30×** |
| 2048×2048 | 496.8 µs | 761.9 µs | **1.53×** |
| 4096×4096 | 2.226 ms | 20.012 ms | **8.99×** |

### Gaussian Blur (f32, 5×5 kernel, σ=1.0)

| Size | Zero-Copy | Copy | Speedup |
|------|-----------|------|---------|
| 1024×1024 | 359.2 µs | 641.5 µs | **1.79×** |
| 2048×2048 | 1.950 ms | 20.802 ms | **10.67×** |
| 4096×4096 | 22.694 ms | 68.777 ms | **3.03×** |

## Analysis

### Why the speedup grows with size

At 1024×1024 the copy overhead is small relative to the OpenCV computation itself, so eliminating the two `memcpy` calls yields only a modest 1.3×–1.8× gain.

At 2048×2048 and 4096×4096 the effect compounds:

1. **More data to copy**: a 2048² u8 raster is 4 MB per memcpy (8 MB for the input + output round trip); a 4096² raster is 16 MB per memcpy (32 MB total).
2. **Allocation overhead**: The copy path allocates 3 large buffers per call (input Mat, output Mat, result Array2). At 4096×4096 that's three 16 MB (u8) or 64 MB (f32) allocations per iteration. The allocator's bookkeeping and potential page faults add significant overhead.
3. **Cache pressure**: Three separate large allocations scatter data across memory, reducing cache locality for the OpenCV kernel.
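
The copy volumes in point 1 are straightforward arithmetic (using MB = 1024² bytes, one byte per u8 pixel, 4 bytes per f32 pixel):

```rust
fn main() {
    for n in [1024u64, 2048, 4096] {
        let u8_mb = n * n / (1024 * 1024); // one u8 raster, in MB
        let f32_mb = 4 * u8_mb;            // f32 rasters are 4x larger
        println!(
            "{n}x{n}: u8 {u8_mb} MB per memcpy ({} MB round-trip), f32 {f32_mb} MB",
            2 * u8_mb
        );
    }
    // 2048x2048 u8: 4 MB per memcpy, 8 MB round-trip.
    // 4096x4096 u8: 16 MB per memcpy (32 MB round-trip); f32: 64 MB.
    assert_eq!(2048u64 * 2048 / (1024 * 1024), 4);
    assert_eq!(4096u64 * 4096 / (1024 * 1024), 16);
}
```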

The zero-copy path has only 1 allocation (the output `Array2`) and the OpenCV kernel reads/writes the same contiguous memory regions.

### Why the 11.5× number from the initial report was misleading

The original benchmark at 1024×1024 f32 gaussian reported an 11.5× speedup (366 µs vs 4.2 ms). That 4.2 ms figure was a **stale `criterion` baseline** from a previous run — `criterion` compares each new result against its cached historical baseline, and the old baseline happened to be an outlier. Re-running with fresh baselines gives the correct 1.79× speedup at that size.
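
If a result looks implausible, one way to rule out a stale baseline is to discard `criterion`'s cached data before re-running (the path assumes the default cargo layout; adjust it if you use a custom target directory):

```bash
# Remove criterion's cached baselines and historical measurements so the
# next `cargo bench` run starts from fresh comparisons.
rm -rf target/criterion
```

Alternatively, `criterion` supports named baselines (`cargo bench -- --save-baseline <name>` and `--baseline <name>`), which make it explicit which run you are comparing against.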

The large-size results (2048×2048 and 4096×4096) are genuine and reproducible across runs.