av-denoise 0.1.0

# av-denoise

Fast and efficient NLMEANS video denoising using CubeCL.

This project is heavily inspired by [KNLMeansCL](https://github.com/Khanattila/KNLMeansCL) and FFmpeg's nlmeans
implementation, but is built to be a more standalone tool and to make use of more modern tooling, so it can better
leverage modern hardware instead of relying on the now rather outdated OpenCL.

## Table of contents

- [Features](#features)
- [Benchmarks](#benchmarks)
  - [Apples-to-apples spatial NL-means (strength 1.0)](#apples-to-apples-spatial-nl-means-strength-10)
  - [av-denoise feature cost (strength 1.0, default patch/search)](#av-denoise-feature-cost-strength-10-default-patchsearch)
- [Hardware support](#hardware-support)
  - [Notes about the JIT](#notes-about-the-jit)
- [Installing](#installing)
  - [Cargo install](#cargo-install)
  - [From source](#from-source)
  - [As a library](#as-a-library)
- [Example commands](#example-commands)
- [Binary usage](#binary-usage)

## Features

- Library and binary offering.
  * The binary supports both STDIN (y4m) and FFMS2 ingestion and emits y4m frames to STDOUT.
- **Spatial** and **Temporal** support
- **Luma**, **Chroma** and **YUV*** specific denoising kernels.
   * YUV is for YUV 4:4:4 only.
- Adjustable nlmeans tuning parameters with sensible defaults.
- **Prefilter support** for more accurate denoising with less detail loss.
   * Includes an on-GPU bilateral filter out of the box.
   * You can specify the reference frame yourself using the library rather than CLI.
- **Motion compensation** for high-quality temporal denoising on heavy motion.
   * MVTools-inspired hierarchical block matching, fully on-GPU.
   * Enabled with `--motion-compensation`, tuned via `--mc-blksize`,
     `--mc-overlap`, `--mc-search`, `--mc-pyramid-levels`.
- _**Fast!**_ - Around **2x** faster than FFmpeg's OpenCL implementation.
   * Be aware that the `STDIN` mode of the binary cannot fully utilise larger modern GPUs, because it cannot
     parallelise across scenes. It will likely be about as fast as FFmpeg while using much less GPU compute.

## Benchmarks

Numbers below come from `scripts/bench_runs.py` (`just compare-perf`), which pipes
each tool to `ffmpeg -f null -` so the encoder is not measured. Throughput is
total frames divided by wall-clock elapsed.

- Input is a 3,450-frame 1080p FFV1 clip.
- `av-denoise` using the `vulkan` backend.
- Running on an `AMD AI Pro R9700` (AMD 9070XT equivalent) GPU.
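
As a rough illustration of the throughput figure described above (total frames divided by wall-clock seconds), the following sketch uses a hypothetical `fps` helper. The 50-second elapsed time is made up for the example and is not a measured result.

```bash
# Frames per second from a frame count and an elapsed wall-clock time in seconds.
# `fps` is a hypothetical helper, not part of av-denoise or its bench scripts.
fps() {
  awk -v frames="$1" -v seconds="$2" 'BEGIN { printf "%.2f\n", frames / seconds }'
}

# Hypothetical run: the 3,450-frame clip finishing in 50 seconds.
fps 3450 50   # prints 69.00
```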

### Apples-to-apples spatial NL-means (strength 1.0)

Patch and search sizes are matched across both tools. Note that av-denoise takes radii,
whereas ffmpeg takes absolute window sizes.

| patch / search | av-denoise (fps) | ffmpeg nlmeans_opencl (fps) | speedup |
|----------------|-----------------:|----------------------------:|--------:|
| p=5, r=11      |        **72.57** |                       30.25 |  ~2.40x |
| p=7, r=15      |        **42.41** |                       16.33 |  ~2.60x |
| p=9, r=15      |        **41.84** |                       16.26 |  ~2.57x |
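
Because the two tools express window dimensions differently, converting between them is simple arithmetic. The sketch below uses two hypothetical helpers and assumes odd window sizes with a size of `2 * radius + 1`. The CLI help states this relation for the patch; applying it to the search window as well is an assumption.

```bash
# Convert an ffmpeg-style window size (odd pixel count) to a radius.
# Both helpers are hypothetical and exist only for illustration.
size_to_radius() {
  local size="$1"
  # A window of `size` pixels has (size - 1) / 2 pixels on each side of the centre.
  echo $(( (size - 1) / 2 ))
}

# Convert a radius back to the full window size.
radius_to_size() {
  local radius="$1"
  echo $(( 2 * radius + 1 ))
}

size_to_radius 7   # prints 3
radius_to_size 3   # prints 7
```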

### av-denoise feature cost (strength 1.0, default patch/search)

All runs denoise luma and chroma. The spatial baseline is the reference. _Lower fps = more work._

| run                              |   fps | notes                               |
|----------------------------------|------:|-------------------------------------|
| spatial baseline                 | 97.25 | `--temporal-radius 0`               |
| spatial + bilateral prefilter    | 93.50 | adds one on-GPU pass per frame      |
| temporal r=1                     | 72.73 | 3-frame window                      |
| temporal r=2                     | 62.07 | 5-frame window                      |
| temporal r=1 + motion comp       | 64.03 | hierarchical block matching enabled |
| temporal r=2 + motion comp       | 54.29 |                                     |
| temporal r=1 + prefilter         | 69.58 |                                     |
| full r=1 (temporal+MC+prefilter) | 60.97 |                                     |
| full r=2 (temporal+MC+prefilter) | 52.18 |                                     |

Reproduce with `just compare-perf` (config: `scripts/bench_runs.toml`).

## Hardware support

The project supports the following accelerators/GPUs:

- **AMD GPUs** (via the `rocm` or `vulkan` features)
- **Intel GPUs** (via the `vulkan` feature)
- **Nvidia GPUs** (via the `cuda` or `vulkan` features)
- **Apple Silicon** (via the `metal` feature)
- **CPU** (via the `cpu` feature)
  * _WARNING! The CPU backend within CubeCL is still very new, and is not as optimised as a manually written kernel.<br/>
    As such, I do not recommend using this backend outside of testing._

### Notes about the JIT

It is important to note that `av-denoise` internally uses a JIT (Just In Time) compiler for its kernels; this means
that the kernels are compiled and optimised for your specific hardware _at runtime._ As such, the first couple of
calls will have significant overhead as the system compiles, optimises and caches the kernels.

Additionally, because the kernels are compiled at runtime, the environment you run the tool in
must also provide access to the hardware-specific headers and compilers.

This primarily has the following impacts:

- The `rocm` backend requires the AMD HIP compiler and headers, typically vendored via the ROCm dev SDK.
- The `cuda` backend requires the NVIDIA CUDA headers and nvcc, typically vendored via the CUDA devel toolkit.
- The `cpu` backend should not require any special dependencies directly, as it should already be vendored.
- The `vulkan` and `metal` backends should "just work" on non-containerised hosts. If you are building for
  Docker, then the `vulkan` backend requires `vulkan-icd-loader` and the relevant GPU-specific driver,
  e.g. `vulkan-radeon` or `vulkan-intel`.

Since both the CUDA and ROCm backends are very heavy in terms of dependencies, I recommend just using the `vulkan`
backend for those devices. It should be more or less the same performance, without all the library headache.

## Installing

`av-denoise` is available in both library _and_ binary form. By default only the `cpu` and `vulkan` features
are enabled, since they are typically the accelerators you will want to use.

When compiling the binary, you need to enable the `binary` feature at minimum, but I recommend that most users
enable the `binary-full` feature instead if they are unsure how they are going to be ingesting frames.

The following (non-accelerator) features are available:

- `binary` - Enables the dependencies and code required to compile `av-denoise` as a binary.
   * This pulls in `ffms2` as a hard dependency. This means you must install `ffms2` before you can compile and link
     the binary.

### Cargo install

```bash
cargo install --locked av-denoise --features binary
```

### From source

```bash
git clone https://github.com/ChillFish8/av-denoise.git
# Enter the cloned repository before building.
cd av-denoise
cargo build --release --features binary
cp ./target/release/av-denoise ./av-denoise
```

### As a library

```bash
cargo add av-denoise
```

## Example commands

**Y/UV Denoise - ROCm/Vulkan - On GPU 1 - Light Denoise - Spatial - strength=luma:1.2,chroma:1.2**
```bash
av-denoise file \
  --accelerators rocm,vulkan \
  --device discrete:1 \
  --channel-mode luma,chroma \
  --strength 1.2 \
  --input ./sample.mkv \
    | ffmpeg -hide_banner -loglevel info -y -f yuv4mpegpipe -i - -c:v ffv1 ./output.mkv
```

**Y/UV Denoise - Vulkan - On iGPU 0 - Split Denoise - Temporal (radius=1) - strength=luma:2.0,chroma:1.5**
```bash
av-denoise file \
  --accelerators vulkan \
  --device integrated:0 \
  --channel-mode luma,chroma \
  --temporal-radius 1 \
  --luma-strength 2.0 \
  --chroma-strength 1.5 \
  --input ./sample.mkv \
    | ffmpeg -hide_banner -loglevel info -y -f yuv4mpegpipe -i - -c:v ffv1 ./output.mkv
```

**Y-Only Denoise - Metal - On GPU 0 - Heavy Denoise - Spatial - strength=luma:3.0**
```bash
av-denoise file \
  --accelerators metal \
  --device discrete:0 \
  --channel-mode luma \
  --strength 3.0 \
  --input ./sample.mkv \
    | ffmpeg -hide_banner -loglevel info -y -f yuv4mpegpipe -i - -c:v ffv1 ./output.mkv
```

**YUV Fused Denoise - Vulkan - On Default GPU - Medium Denoise - Spatial - strength=yuv:2.0**
```bash
av-denoise file \
  --accelerators vulkan \
  --channel-mode yuv \
  --strength 2.0 \
  --input ./sample.mkv \
    | ffmpeg -hide_banner -loglevel info -y -f yuv4mpegpipe -i - -c:v ffv1 ./output.mkv
```

**Y/UV Denoise - Vulkan - On GPU 0 - Temporal (radius=2) + Motion Compensation - Anime / Heavy Motion**
```bash
av-denoise file \
  --accelerators vulkan \
  --device discrete:0 \
  --channel-mode luma,chroma \
  --temporal-radius 2 \
  --motion-compensation \
  --strength 1.5 \
  --input ./anime.mkv \
    | ffmpeg -hide_banner -loglevel info -y -f yuv4mpegpipe -i - -c:v ffv1 ./output.mkv
```
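
The examples above all use the `file` subcommand. For the `stdin` subcommand, a pipeline might look like the sketch below, wrapped in a hypothetical `denoise_via_stdin` helper. It assumes that `stdin` accepts the same options in the same position as the `file` examples, and that a first `ffmpeg` call decodes the source to y4m on its STDOUT. Neither assumption is confirmed by the help text, so treat this as a starting point rather than a verified command.

```bash
# Hypothetical helper: decode INPUT to y4m, denoise it through STDIN mode,
# and encode the result to OUTPUT as FFV1.
denoise_via_stdin() {
  local input="$1" output="$2"
  # ASSUMPTION: the global options are accepted after `stdin`, mirroring the `file` examples.
  ffmpeg -hide_banner -loglevel error -i "$input" -f yuv4mpegpipe - \
    | av-denoise stdin \
        --accelerators vulkan \
        --channel-mode luma,chroma \
        --strength 1.2 \
    | ffmpeg -hide_banner -loglevel info -y -f yuv4mpegpipe -i - -c:v ffv1 "$output"
}
```

It would then be called as `denoise_via_stdin ./sample.mkv ./output.mkv`.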

## Binary usage

```text
Fast and efficient video denoising

Usage: av-denoise [OPTIONS] <COMMAND>

Commands:
  file   Denoise a video file, splitting work by scene
  stdin  Denoise a y4m stream coming in on stdin, writing y4m on stdout
  help   Print this message or the help of the given subcommand(s)

Options:
  -a, --algorithm <ALGORITHM>
          Denoising algorithm to run.

          Only `nlmeans` is currently available.

          [default: nlmeans]

  -A, --accelerators <ACCELERATORS>
          Which hardware backends to try, in order of preference.

          The first backend that initialises is used. If none work the program exits with an error. The list is comma-separated, for example `vulkan,cpu`.

          [default: vulkan cpu]

  -d, --device <DEVICE>
          Which device to use on the chosen backend.

          Accepted values:

          `default` lets the backend pick.

          `discrete[:N]` picks the Nth discrete GPU (default 0). Works on CUDA, ROCm, and Vulkan.

          `integrated[:N]` picks the Nth integrated GPU. Vulkan only.

          `virtual[:N]` picks the Nth virtual GPU. Vulkan only.

          `cpu` uses the software backend.

          [default: default]

      --channel-mode <CHANNEL_MODE>
          Which planes of the video to clean (comma-separated).

          `luma` cleans only the brightness plane.

          `chroma` cleans only the colour planes at their native size.

          `luma,chroma` cleans both as two independent passes, which is usually what you want for noisy footage.

          `yuv` cleans all three planes in one fused pass. This needs a YUV444 source and cannot be combined with the other modes.

          Possible values:
          - luma:   Clean only the brightness plane (Y). Colour passes through
          - chroma: Clean only the colour planes (U, V). Brightness passes through
          - yuv:    Clean all three planes together in one pass. Needs a YUV444 source and cannot be combined with the other modes

          [default: luma]

      --prefilter <PREFILTER>
          Reference image used when comparing patches.

          `none` uses the noisy input directly (the cheapest option).

          `bilateral:<sigma_s>,<sigma_r>` runs a quick on-GPU bilateral blur first, then compares patches against that cleaner image. `sigma_s` is the spatial blur radius in pixels and `sigma_r` is the colour-similarity threshold in `[0, 1]`. A good starting point is `bilateral:3.0,0.02`.

          Prefiltering keeps more detail at the cost of one extra GPU pass per frame.

          [default: none]

      --temporal-radius <TEMPORAL_RADIUS>
          How many neighbouring frames to look at on each side when cleaning a frame.

          `0` (default) means no temporal blending: each frame is cleaned on its own.

          Values above `0` look at that many frames before and after the current one. Larger values give stronger cleanup but use more memory and add latency.

          In `file` mode this is reset at every scene change, so raising it never causes blending across cuts.

          [default: 0]

      --search-radius <SEARCH_RADIUS>
          How far away to look for similar patches inside a frame.

          Larger values find more matches but cost quadratically more work. Library default is 2.

      --patch-radius <PATCH_RADIUS>
          Size of each patch being compared. The patch is `(2*patch_radius + 1)` pixels square.

          Larger patches preserve fine structure better but cost more GPU memory. Library default is 4.

      --strength <STRENGTH>
          Cleaning strength. Higher numbers smooth more.

          Must be a finite number greater than 0. Library default is 1.2.

          This value applies to both planes unless `--luma-strength` or `--chroma-strength` is set.

      --luma-strength <LUMA_STRENGTH>
          Strength override for the brightness plane only.

          Falls back to `--strength` (or the library default) when not set. Ignored when luma is not being denoised, or when `--channel-mode yuv` is used.

      --chroma-strength <CHROMA_STRENGTH>
          Strength override for the colour planes only.

          Falls back to `--strength` (or the library default) when not set. Ignored when chroma is not being denoised, or when `--channel-mode yuv` is used.

      --self-weight <SELF_WEIGHT>
          How much weight to give the centre pixel itself when averaging.

          Library default is 1.0. Must be a finite number `>= 0`. Setting to 0 gives pure NLM (centre pixel only counts if a similar patch was found nearby).

      --motion-compensation
          Turn on motion compensation for temporal denoising.

          When the camera or content moves between frames, the brightness at the same `(x, y)` is different content in each frame. Without help, temporal cleanup will blur moving edges.

          Motion compensation looks at where each block of pixels moved between frames, then shifts neighbour frames to line up with the current frame before cleaning. This keeps detail sharp on anime, fast pans, and action footage.

          Has no effect when `--temporal-radius 0`.

      --mc-blksize <MC_BLKSIZE>
          Size of each motion-search block, in pixels. Must be even.

          Larger blocks are more stable but track motion less accurately on small details.

          [default: 16]

      --mc-overlap <MC_OVERLAP>
          How many pixels neighbouring motion blocks may overlap.

          Must be less than `--mc-blksize`. Higher overlap smooths the transitions between blocks but does more work.

          [default: 8]

      --mc-search <MC_SEARCH>
          How many pixels of motion to search for at the finest level.

          The coarse pyramid pass reaches further (search radius times 2 for a 2-level pyramid), so for typical content the default is fine. Raise it for very fast motion.

          [default: 4]

      --mc-pyramid-levels <MC_PYRAMID_LEVELS>
          How many levels the motion-search pyramid uses.

          `1` does a single full-resolution search (cheaper, weaker on large motion).

          `2` (default) does a coarse pass on a half-size image first, then refines at full resolution. This handles much larger motion at modest extra cost.

          [default: 2]

  -h, --help
          Print help (see a summary with '-h')
```
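
The constraints and the pyramid reach described in the `--mc-*` help text above can be sketched in shell. `mc_check` and `mc_reach` are hypothetical helpers, not part of `av-denoise`. The reach formula matches the documented 2-level case and assumes that each additional level doubles the reach, which the help text does not state.

```bash
# Check the documented constraints on the block-matching parameters:
# the block size must be even and the overlap must be smaller than the block size.
mc_check() {
  local blksize="$1" overlap="$2"
  (( blksize % 2 == 0 )) || { echo "blksize must be even" >&2; return 1; }
  (( overlap < blksize )) || { echo "overlap must be less than blksize" >&2; return 1; }
}

# Approximate full-resolution reach, in pixels, of the coarsest search pass.
# ASSUMPTION: every pyramid level beyond the first doubles the reach.
mc_reach() {
  local search="$1" levels="$2"
  echo $(( search * (1 << (levels - 1)) ))
}

mc_check 16 8 && mc_reach 4 2   # defaults: prints 8
mc_reach 4 1                    # single-level search: prints 4
```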