av-denoise
Fast and efficient NLMEANS video denoising using CubeCL.
This project is heavily inspired by KNLmeansCL and FFmpeg's nlmeans implementation, but it is built to be a more standalone tool and to make use of more modern tooling that better leverages modern hardware, instead of relying on the now rather outdated OpenCL.
Features
- Library and binary offering.
- The binary supports both STDIN (y4m) and FFMS2 ingestion and emits y4m frames to STDOUT.
- Spatial and Temporal support
- Luma, Chroma and YUV* specific denoising kernels.
- YUV is for YUV 4:4:4 only.
- Adjustable nlmeans tuning parameters with sensible defaults.
- Prefilter support for more accurate denoising with less detail loss.
- Includes an on-GPU bilateral filter out of the box.
- You can specify the reference frame yourself using the library rather than CLI.
- Motion compensation for high-quality temporal denoising on heavy motion.
- MVTools-inspired hierarchical block matching, fully on-GPU.
- Enabled with `--motion-compensation`, tuned via `--mc-blksize`, `--mc-overlap`, `--mc-search`, `--mc-pyramid-levels`.
- Fast! - Around 2x faster than FFmpeg's OpenCL implementation.
- Be aware that the `STDIN` mode for the binary cannot fully utilise larger modern GPUs, because it cannot parallelize across scenes. It will likely be just as fast as FFmpeg while using much less GPU compute.
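The constraints on the motion-compensation flags listed above can be sketched as a small validation step. This is only an illustration: `McParams`, `validate` and `coarse_reach` are hypothetical names, not the crate's API. The defaults and the "even block size" and "overlap less than block size" rules come from the CLI help, while the zero-level check and the generalisation of the coarse reach beyond two pyramid levels are assumptions.

```rust
/// Hypothetical container for the `--mc-*` options (not the crate's real type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct McParams {
    blksize: u32,
    overlap: u32,
    search: u32,
    pyramid_levels: u32,
}

impl Default for McParams {
    /// Defaults as quoted in the CLI help: 16, 8, 4 and 2.
    fn default() -> Self {
        McParams { blksize: 16, overlap: 8, search: 4, pyramid_levels: 2 }
    }
}

impl McParams {
    /// Checks the documented constraints on block size and overlap.
    fn validate(&self) -> Result<(), String> {
        if self.blksize == 0 || self.blksize % 2 != 0 {
            return Err(format!("block size must be even and non-zero, got {}", self.blksize));
        }
        if self.overlap >= self.blksize {
            return Err(format!(
                "overlap ({}) must be less than block size ({})",
                self.overlap, self.blksize
            ));
        }
        // Assumption: at least one pyramid level is required.
        if self.pyramid_levels == 0 {
            return Err("pyramid levels must be at least 1".to_string());
        }
        Ok(())
    }

    /// Approximate reach of the coarsest pass in full-resolution pixels.
    /// The help text only states "search radius times 2" for two levels;
    /// doubling per extra level is an assumption.
    fn coarse_reach(&self) -> u32 {
        let factor = 1u32
            .checked_shl(self.pyramid_levels.saturating_sub(1))
            .unwrap_or(u32::MAX);
        self.search.saturating_mul(factor)
    }
}
```

With the default values this sketch accepts the configuration and reports a coarse reach of 8 pixels, matching the "search radius times 2" description for a 2-level pyramid.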
Benchmarks
Numbers below come from `scripts/bench_runs.py` (`just compare-perf`), which pipes
each tool to `ffmpeg -f null -` so the encoder is not measured. Throughput is
total frames divided by wall-clock elapsed.
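As a rough illustration of the throughput definition above, the arithmetic can be written as follows. This is not the benchmark script itself (which is a Python file); the function names are hypothetical, and the speedup helper simply takes the ratio of two throughputs.

```rust
/// Throughput in frames per second: total frames divided by wall-clock seconds.
/// Returns `None` when the elapsed time is not a positive, finite number.
fn throughput_fps(total_frames: u64, elapsed_secs: f64) -> Option<f64> {
    if elapsed_secs.is_finite() && elapsed_secs > 0.0 {
        Some(total_frames as f64 / elapsed_secs)
    } else {
        None
    }
}

/// Speedup of tool A over tool B, as the ratio of their throughputs.
fn speedup(fps_a: f64, fps_b: f64) -> f64 {
    fps_a / fps_b
}
```

For example, a 3,450-frame clip processed in a hypothetical 50 seconds gives 69 fps, and `speedup(72.57, 30.25)` is roughly 2.40.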
- Input is a 3,450-frame 1080p FFV1 clip.
- `av-denoise` using the `vulkan` backend.
- Running on an `AMD AI Pro R9700` (AMD 9070XT equivalent) GPU.
Apples-to-apples spatial NL-means (strength 1.0)
Patch and search sizes are matched across both tools. Note that av-denoise takes radii, whereas ffmpeg takes the absolute size.
| patch / search | av-denoise (fps) | ffmpeg nlmeans_opencl (fps) | speedup |
|---|---|---|---|
| p=5, r=11 | 72.57 | 30.25 | ~2.40x |
| p=7, r=15 | 42.41 | 16.33 | ~2.60x |
| p=9, r=15 | 41.84 | 16.26 | ~2.57x |
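Matching settings between a tool that takes radii and one that takes absolute sizes comes down to a small conversion. The sketch below uses the `2*radius + 1` relationship that the CLI help gives for the patch size; applying the same relationship to the search window is an assumption, and the helper names are hypothetical.

```rust
/// Absolute window size (in pixels) covered by a given radius.
fn size_from_radius(radius: u32) -> u32 {
    2 * radius + 1
}

/// Radius corresponding to an absolute window size.
/// Only odd sizes have an exact radius, so even sizes (and 0) yield `None`.
fn radius_from_size(size: u32) -> Option<u32> {
    if size % 2 == 1 {
        Some((size - 1) / 2)
    } else {
        None
    }
}
```

Under this relationship an absolute size of 5 corresponds to a radius of 2, 11 to a radius of 5, and 15 to a radius of 7.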
av-denoise feature cost (strength 1.0, default patch/search)
All luma+chroma. Spatial baseline is the reference. Lower fps = more work.
| run | fps | notes |
|---|---|---|
| spatial baseline | 97.25 | --temporal-radius 0 |
| spatial + bilateral prefilter | 93.50 | adds one on-GPU pass per frame |
| temporal r=1 | 72.73 | 3-frame window |
| temporal r=2 | 62.07 | 5-frame window |
| temporal r=1 + motion comp | 64.03 | hierarchical block matching enabled |
| temporal r=2 + motion comp | 54.29 | |
| temporal r=1 + prefilter | 69.58 | |
| full r=1 (temporal+MC+prefilter) | 60.97 | |
| full r=2 (temporal+MC+prefilter) | 52.18 | |
Reproduce with `just compare-perf` (config: `scripts/bench_runs.toml`).
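To read the feature-cost table in terms of work per frame rather than fps, a few helper calculations are enough. The sketch below is illustrative only and uses hypothetical function names; the window size follows the "3-frame window" and "5-frame window" notes in the table.

```rust
/// Number of frames in the temporal window for a given radius:
/// the current frame plus `radius` frames on each side.
fn temporal_window(radius: u32) -> u32 {
    2 * radius + 1
}

/// Average time spent per frame, in milliseconds, for a given throughput.
fn frame_time_ms(fps: f64) -> f64 {
    1000.0 / fps
}

/// How many times more wall-clock time a run needs per frame than the baseline.
fn relative_cost(baseline_fps: f64, run_fps: f64) -> f64 {
    baseline_fps / run_fps
}
```

Using the table's figures, the spatial baseline at 97.25 fps spends about 10.3 ms per frame, and the temporal r=1 run at 72.73 fps costs about 1.34 times as much per frame.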
Hardware support
The project supports the following accelerators/GPUs:
- AMD GPUs (via the `rocm` or `vulkan` features)
- Intel GPUs (via the `vulkan` feature)
- Nvidia GPUs (via the `cuda` or `vulkan` features)
- Apple Silicon (via the `metal` feature)
- CPU (via the `cpu` feature)
  - WARNING! The CPU backend within CubeCL is still very new, and is not as optimised as a manually written kernel. As such, I do not recommend using this backend outside of testing.
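The `--accelerators` option documented under Binary usage takes a comma-separated preference list and uses the first backend that initialises. A minimal sketch of that selection logic might look like the following. The `Backend` enum, `parse_preferences`, `pick_backend` and the `try_init` callback are all hypothetical stand-ins; how the real tool probes each backend is not shown here.

```rust
/// Hypothetical backend names mirroring the feature flags listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Vulkan,
    Cuda,
    Rocm,
    Metal,
    Cpu,
}

/// Parses a comma-separated preference list such as "vulkan,cpu".
fn parse_preferences(list: &str) -> Result<Vec<Backend>, String> {
    list.split(',')
        .map(|name| match name.trim() {
            "vulkan" => Ok(Backend::Vulkan),
            "cuda" => Ok(Backend::Cuda),
            "rocm" => Ok(Backend::Rocm),
            "metal" => Ok(Backend::Metal),
            "cpu" => Ok(Backend::Cpu),
            other => Err(format!("unknown backend: {other:?}")),
        })
        .collect()
}

/// Returns the first backend, in preference order, for which `try_init`
/// reports success. `None` means no backend could be initialised.
fn pick_backend(
    prefs: &[Backend],
    mut try_init: impl FnMut(Backend) -> bool,
) -> Option<Backend> {
    prefs.iter().copied().find(|&backend| try_init(backend))
}
```

A caller would treat a `None` result as the "none work" case and exit with an error.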
Notes about the JIT
It is important to note that `av-denoise` internally uses a JIT (Just In Time) compiler for its kernels; this means
that the kernels are compiled and optimised for your specific hardware at runtime. As such, the first couple of
calls will have significant overhead as the system compiles, optimises and caches the kernels.
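When timing anything that goes through a JIT like this, it helps to run a few untimed warm-up calls first so that compilation and caching are not counted. The sketch below shows the general idea using only the standard library. `mean_after_warmup` is a hypothetical helper and `work` stands in for whatever call is being measured; it is not part of av-denoise.

```rust
use std::time::{Duration, Instant};

/// Runs `work` `warmup` times without timing it, then `iters` times with
/// timing, and returns the mean duration of the timed calls.
/// Returns `None` if `iters` is zero.
fn mean_after_warmup<F: FnMut()>(mut work: F, warmup: u32, iters: u32) -> Option<Duration> {
    if iters == 0 {
        return None;
    }
    // Warm-up calls absorb one-off costs such as kernel compilation.
    for _ in 0..warmup {
        work();
    }
    let start = Instant::now();
    for _ in 0..iters {
        work();
    }
    Some(start.elapsed() / iters)
}
```

How many warm-up calls are enough depends on the workload, so the count is left to the caller.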
Additionally, because the kernels are compiled at runtime, whatever environment you run the tool in must also provide access to the hardware-specific headers and compilers.
This primarily has the following impacts:
- The `rocm` backend requires the AMD HIP compiler and headers, typically vendored via the ROCm dev SDK.
- The `cuda` backend requires the NVIDIA CUDA headers and nvcc, typically vendored via the CUDA devel toolkit.
- The `cpu` backend should not require any special dependencies directly, as it should already be vendored.
- The `vulkan` and `metal` backends should "just work" on non-containerised hosts. If you are building for Docker, then the `vulkan` backend requires `vulkan-icd-loader` and then the relevant GPU-specific driver, such as `vulkan-radeon` or `vulkan-intel`.
Since both the CUDA and ROCm backends are very heavy in terms of dependencies, I recommend just using the vulkan
backend for those devices. It should be more or less the same performance, without all the library headache.
Installing
`av-denoise` is available in both library and binary form. By default, only the `cpu` and `vulkan` features
are enabled, since they are typically the default accelerators you will want to use.
When compiling the binary, you want to enable the `binary` feature at minimum, but I recommend that most users
enable the `binary-full` feature instead if you are ever unsure about how you are going to be ingesting frames.
The following (non-accelerator) features are available:
- `binary` - Enables the dependencies and code required to compile `av-denoise` as a binary.
  - This pulls in `ffms2` as a hard dependency. This means you must install `ffms2` before you can compile and link the binary.
Cargo install
From source
As a library
Example commands
Y/UV Denoise - ROCm/Vulkan - On GPU 1 - Light Denoise - Spatial - strength=luma:1.2,chroma:1.2
Y/UV Denoise - Vulkan - On iGPU 0 - Split Denoise - Temporal (radius=1) - strength=luma:2.0,chroma:1.5
Y-Only Denoise - Metal - On GPU 0 - Heavy Denoise - Spatial - strength=luma:3.0
YUV Fused Denoise - Vulkan - On Default GPU - Medium Denoise - Spatial - strength=yuv:2.0
Y/UV Denoise - Vulkan - On GPU 0 - Temporal (radius=2) + Motion Compensation - Anime / Heavy Motion
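The per-plane strengths named in the example titles above follow the fallback rule described in the CLI help: `--luma-strength` and `--chroma-strength` override `--strength`, which in turn falls back to the library default of 1.2. A hedged sketch of that resolution is shown below. `resolve_strengths` is a hypothetical helper, and validating only the resolved values is an assumption about behaviour the help text does not spell out.

```rust
/// Library default strength as quoted in the CLI help.
const DEFAULT_STRENGTH: f32 = 1.2;

/// Resolves the effective (luma, chroma) strengths from the optional
/// `--strength`, `--luma-strength` and `--chroma-strength` values.
fn resolve_strengths(
    strength: Option<f32>,
    luma: Option<f32>,
    chroma: Option<f32>,
) -> Result<(f32, f32), String> {
    // Shared value used by any plane without its own override.
    let base = strength.unwrap_or(DEFAULT_STRENGTH);
    let resolved = (luma.unwrap_or(base), chroma.unwrap_or(base));
    for value in [resolved.0, resolved.1] {
        // The help requires a finite number greater than 0.
        if !(value.is_finite() && value > 0.0) {
            return Err(format!(
                "strength must be a finite number greater than 0, got {value}"
            ));
        }
    }
    Ok(resolved)
}
```

For instance, passing only a luma override of 2.0 and a chroma override of 1.5 resolves to `(2.0, 1.5)`, while passing nothing resolves both planes to the default.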
Binary usage
Fast and efficient video denoising
Usage: av-denoise [OPTIONS] <COMMAND>
Commands:
file Denoise a video file, splitting work by scene
stdin Denoise a y4m stream coming in on stdin, writing y4m on stdout
help Print this message or the help of the given subcommand(s)
Options:
-a, --algorithm <ALGORITHM>
Denoising algorithm to run.
Only `nlmeans` is currently available.
[default: nlmeans]
-A, --accelerators <ACCELERATORS>
Which hardware backends to try, in order of preference.
The first backend that initialises is used. If none work the program exits with an error.
The list is comma-separated, for example `vulkan,cpu`.
[default: vulkan cpu]
-d, --device <DEVICE>
Which device to use on the chosen backend.
Accepted values:
`default` lets the backend pick.
`discrete[:N]` picks the Nth discrete GPU (default 0). Works on CUDA, ROCm, and Vulkan.
`integrated[:N]` picks the Nth integrated GPU. Vulkan only.
`virtual[:N]` picks the Nth virtual GPU. Vulkan only.
`cpu` uses the software backend.
[default: default]
--channel-mode <CHANNEL_MODE>
Which planes of the video to clean (comma-separated).
`luma` cleans only the brightness plane.
`chroma` cleans only the colour planes at their native size.
`luma,chroma` cleans both as two independent passes, which is usually what you want for noisy footage.
`yuv` cleans all three planes in one fused pass.
`yuv` needs a YUV444 source and cannot be combined with the other modes.
Possible values:
- luma: Clean only the brightness plane (Y). Colour passes through
- chroma: Clean only the colour planes (U, V). Brightness passes through
- yuv: Clean all three planes together in one pass. Needs a YUV444 source and cannot be combined with the other modes
[default: luma]
--prefilter <PREFILTER>
Reference image used when comparing patches.
`none` uses the noisy input directly (the cheapest option).
`bilateral:<sigma_s>,<sigma_r>` runs a quick on-GPU bilateral blur first, then compares patches against that cleaner image.
`sigma_s` is the spatial blur radius in pixels.
`sigma_r` is the colour-similarity threshold in `[0, 1]`.
A good starting point is `bilateral:3.0,0.02`.
Prefiltering keeps more detail at the cost of one extra GPU pass per frame.
[default: none]
--temporal-radius <TEMPORAL_RADIUS>
How many neighbouring frames to look at on each side when cleaning a frame.
`0` (default) means no temporal blending: each frame is cleaned on its own.
Values above `0` look at that many frames before and after the current one.
Larger values give stronger cleanup but use more memory and add latency.
In `file` mode this is reset at every scene change, so raising it never causes blending across cuts.
[default: 0]
--search-radius <SEARCH_RADIUS>
How far away to look for similar patches inside a frame.
Larger values find more matches but cost quadratically more work. Library default is 2.
--patch-radius <PATCH_RADIUS>
Size of each patch being compared. The patch is `(2*patch_radius + 1)` pixels square.
Larger patches preserve fine structure better but cost more GPU memory. Library default is 4.
--strength <STRENGTH>
Cleaning strength. Higher numbers smooth more.
Must be a finite number greater than 0.
Library default is 1.2.
This value applies to both planes unless `--luma-strength` or `--chroma-strength` is set.
--luma-strength <LUMA_STRENGTH>
Strength override for the brightness plane only.
Falls back to `--strength` (or the library default) when not set.
Ignored when luma is not being denoised, or when `--channel-mode yuv` is used.
--chroma-strength <CHROMA_STRENGTH>
Strength override for the colour planes only.
Falls back to `--strength` (or the library default) when not set.
Ignored when chroma is not being denoised, or when `--channel-mode yuv` is used.
--self-weight <SELF_WEIGHT>
How much weight to give the centre pixel itself when averaging.
Library default is 1.0. Must be a finite number `>= 0`.
Setting to 0 gives pure NLM (centre pixel only counts if a similar patch was found nearby).
--motion-compensation
Turn on motion compensation for temporal denoising.
When the camera or content moves between frames, the brightness at the same `(x, y)` is different content in each frame.
Without help, temporal cleanup will blur moving edges.
Motion compensation looks at where each block of pixels moved between frames, then shifts neighbour frames to line up with the current frame before cleaning.
This keeps detail sharp on anime, fast pans, and action footage.
Has no effect when `--temporal-radius 0`.
--mc-blksize <MC_BLKSIZE>
Size of each motion-search block, in pixels. Must be even.
Larger blocks are more stable but track motion less accurately on small details.
[default: 16]
--mc-overlap <MC_OVERLAP>
How many pixels neighbouring motion blocks may overlap.
Must be less than `--mc-blksize`. Higher overlap smooths the transitions between blocks but does more work.
[default: 8]
--mc-search <MC_SEARCH>
How many pixels of motion to search for at the finest level.
The coarse pyramid pass reaches further (search radius times 2 for a 2-level pyramid), so for typical content the default is fine.
Raise it for very fast motion.
[default: 4]
--mc-pyramid-levels <MC_PYRAMID_LEVELS>
How many levels the motion-search pyramid uses.
`1` does a single full-resolution search (cheaper, weaker on large motion).
`2` (default) does a coarse pass on a half-size image first, then refines at full resolution.
This handles much larger motion at modest extra cost.
[default: 2]
-h, --help
Print help (see a summary with '-h')
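The `--prefilter` value described in the help above has a small grammar: either `none` or `bilateral:<sigma_s>,<sigma_r>`. The following is a sketch of how such a value could be parsed and range-checked. The `Prefilter` enum and `parse_prefilter` function are hypothetical and are not the crate's own parser; requiring `sigma_s` to be positive is an assumption, while the `[0, 1]` range for `sigma_r` comes from the help text.

```rust
/// Hypothetical representation of the `--prefilter` option.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Prefilter {
    /// `none`: compare patches on the noisy input directly.
    Disabled,
    /// `bilateral:<sigma_s>,<sigma_r>`: bilateral blur before comparison.
    Bilateral { sigma_s: f32, sigma_r: f32 },
}

/// Parses `none` or `bilateral:<sigma_s>,<sigma_r>`.
fn parse_prefilter(spec: &str) -> Result<Prefilter, String> {
    if spec == "none" {
        return Ok(Prefilter::Disabled);
    }
    let args = spec
        .strip_prefix("bilateral:")
        .ok_or_else(|| format!("unknown prefilter: {spec:?}"))?;
    let (s, r) = args
        .split_once(',')
        .ok_or_else(|| "expected bilateral:<sigma_s>,<sigma_r>".to_string())?;
    let sigma_s: f32 = s.trim().parse().map_err(|e| format!("bad sigma_s: {e}"))?;
    let sigma_r: f32 = r.trim().parse().map_err(|e| format!("bad sigma_r: {e}"))?;
    // Assumption: the spatial sigma must be a positive, finite number.
    if !(sigma_s.is_finite() && sigma_s > 0.0) {
        return Err("sigma_s must be a positive, finite number".to_string());
    }
    // The help text gives the range of sigma_r as [0, 1].
    if !(0.0..=1.0).contains(&sigma_r) {
        return Err("sigma_r must be in [0, 1]".to_string());
    }
    Ok(Prefilter::Bilateral { sigma_s, sigma_r })
}
```

The suggested starting point from the help, `bilateral:3.0,0.02`, parses to a `Bilateral` value with `sigma_s` of 3.0 and `sigma_r` of 0.02.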