oxideav-videotoolbox

macOS VideoToolbox hardware decode/encode bridge for the oxideav framework.

Why a bridge crate?

Apple's VideoToolbox exposes the dedicated media engine on Apple Silicon (and the equivalent IP on Intel Macs). For codecs the chip supports natively, the hardware path is 5-50× faster than software decoding and orders of magnitude more energy-efficient.

This crate is a thin, runtime-loaded bridge: it has no compile-time link dependency on VideoToolbox and contains no Objective-C or Swift. The framework is opened via libloading on first use.

Fallback behaviour

Two distinct failure paths fall back automatically to the pure-Rust codec:

Load failure — older macOS, missing framework, sandboxed environment without VT entitlements. register() logs and returns without registering, so the SW codec is the only candidate at dispatch.
Init failure — VTDecompressionSessionCreate / VTCompressionSessionCreate returns a non-zero OSStatus for the requested parameters. Common triggers: stream above the device's max resolution, hardware encoder slot already busy (concurrent-session cap), unsupported pixel format, codec profile the device doesn't accelerate. The factory returns Err; the registry's make_decoder_with / make_encoder_with retries the next-priority impl (typically the SW one).

Pipelines that require hardware (e.g. real-time low-latency capture where the SW path can't keep up) can opt out of the SW fallback by setting CodecPreferences { require_hardware: true, .. } — the registry will then surface the OSStatus error instead of degrading silently.
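
The retry behaviour described above can be sketched as a walk over candidates that are already in priority order. This is a minimal illustration, not the registry's actual code: ResolveError, Factory, resolve and demo_candidates are hypothetical names, and the i32 stands in for an OSStatus.

```rust
/// Why resolution failed (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum ResolveError {
    /// No candidate was eligible at all.
    NoCandidate,
    /// Every eligible candidate failed; carries the last OSStatus-style code.
    Init(i32),
}

/// Hypothetical factory entry: a name, a hardware flag, and a constructor
/// that returns an OSStatus-style code on failure.
pub struct Factory {
    pub name: &'static str,
    pub hardware: bool,
    pub make: fn() -> Result<(), i32>,
}

/// Try each candidate in order and return the name of the first one that
/// initialises. With `require_hardware`, software candidates are skipped,
/// so a hardware init failure is surfaced instead of silently degrading.
pub fn resolve(
    candidates: &[Factory],
    require_hardware: bool,
) -> Result<&'static str, ResolveError> {
    let mut last = ResolveError::NoCandidate;
    for f in candidates {
        if require_hardware && !f.hardware {
            continue;
        }
        match (f.make)() {
            Ok(()) => return Ok(f.name),
            // Init failure: remember the code and try the next candidate.
            Err(status) => last = ResolveError::Init(status),
        }
    }
    Err(last)
}

// Demo constructors: the hardware one fails with -12909, the software one works.
fn vt_init_fails() -> Result<(), i32> {
    Err(-12909)
}
fn sw_init_ok() -> Result<(), i32> {
    Ok(())
}

/// Two candidates, hardware first, mirroring the dispatch order on macOS.
pub fn demo_candidates() -> Vec<Factory> {
    vec![
        Factory { name: "h264_videotoolbox", hardware: true, make: vt_init_fails },
        Factory { name: "h264_sw", hardware: false, make: sw_init_ok },
    ]
}
```

With these demo candidates, resolve(&demo_candidates(), false) yields the software decoder, while resolve(&demo_candidates(), true) returns the hardware error code.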

Platform gating

The whole crate is #![cfg(target_os = "macos")]. On Linux / Windows it compiles to an empty rlib; the umbrella oxideav crate gates the register call behind the same cfg.
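
As a rough illustration of the gating, the sketch below puts a stand-in module and its register call behind the same cfg. The module name, function names and string registry are placeholders, not the umbrella crate's real API.

```rust
/// Stand-in for the bridge crate: only exists when building for macOS.
#[cfg(target_os = "macos")]
mod videotoolbox {
    pub fn register(names: &mut Vec<&'static str>) {
        names.push("h264_videotoolbox");
    }
}

/// Stand-in for the umbrella crate's registration entry point.
pub fn register_all(names: &mut Vec<&'static str>) {
    // The pure-Rust implementation is registered on every platform.
    names.push("h264_sw");
    // The hardware bridge is compiled in (and called) only on macOS.
    #[cfg(target_os = "macos")]
    videotoolbox::register(names);
}
```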

Priority

Hardware factories register with CodecCapabilities::with_priority(10) — lower numbers win at resolution time, so on macOS hardware paths are preferred over the pure-Rust impls (which sit at priority 100+).

Opt-out

Users who want to force the pure-Rust path globally can pass --no-hwaccel to the oxideav CLI. This sets CodecPreferences { no_hardware: true }, which the pipeline forwards to make_decoder_with / make_encoder_with so that HW factories are skipped at dispatch time. The runtime context still registers VT, so oxideav list shows the *_videotoolbox rows regardless of the flag; only resolution is affected.
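
The difference between listing and resolution can be sketched as follows. Entry, Prefs, registered, list and resolution_order are hypothetical names for this illustration; only the priority values (10 for hardware, 100 for software) and the no_hardware flag come from the text above.

```rust
/// Hypothetical registry row.
#[derive(Clone, Debug, PartialEq)]
pub struct Entry {
    pub name: &'static str,
    pub priority: u32,
    pub hardware: bool,
}

/// Hypothetical subset of the preferences struct.
#[derive(Default)]
pub struct Prefs {
    pub no_hardware: bool,
}

/// Both implementations are always registered, in arbitrary order.
pub fn registered() -> Vec<Entry> {
    vec![
        Entry { name: "h264_sw", priority: 100, hardware: false },
        Entry { name: "h264_videotoolbox", priority: 10, hardware: true },
    ]
}

/// Listing ignores preferences: every registered row is shown.
pub fn list(entries: &[Entry]) -> Vec<&'static str> {
    entries.iter().map(|e| e.name).collect()
}

/// Resolution drops hardware rows when `no_hardware` is set, then orders
/// the rest by ascending priority (lower numbers win).
pub fn resolution_order(entries: &[Entry], prefs: &Prefs) -> Vec<&'static str> {
    let mut eligible: Vec<&Entry> = entries
        .iter()
        .filter(|e| !(prefs.no_hardware && e.hardware))
        .collect();
    eligible.sort_by_key(|e| e.priority);
    eligible.into_iter().map(|e| e.name).collect()
}
```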

Coverage roadmap

Codec	Decode (M-series)	Encode (M-series)	Status
H.264	hardware	hardware	wired (≈ 51 dB PSNR_Y)
HEVC	hardware	hardware	wired (≈ 54 dB PSNR_Y)
ProRes	hardware	hardware	wired (≈ 52 dB PSNR_Y)
JPEG (MJPEG)	hardware	hardware	wired (≈ 36 dB PSNR_Y)
MPEG-2	hardware	— (no VT encoder)	wired (≈ 61 dB PSNR_Y, decode-only)
VP9	hardware (M1+)	— (no VT encoder)	wired (decode-only)
MPEG-4 Pt 2	hardware	— (no VT encoder)	wired (decode-only, VOL→ESDS extension atoms, ≈ 72 dB PSNR_Y)
AV1	hardware (M3+) / VT-internal SW elsewhere	hardware (M3+)	decode wired (round 8); encode roadmap

Round 1: scaffolding.
Round 2: H.264 + HEVC decode + encode.
Round 3: JPEG (MJPEG) + ProRes decode + encode via a shared blob-codec module (blob.rs). Frames are single blobs built on CMVideoFormatDescriptionCreate(width, height, codecType) rather than the parameter-set extraction H.264/HEVC need.
Round 4: MPEG-2 video decode (kCMVideoCodecType_MPEG2Video). Decode-only, since VideoToolbox exposes an MPEG-2 decoder but no encoder; an elementary-stream framer (FrameSplit::Mpeg2Es) carves the incoming bitstream into per-picture access units.
Round 5: VP9 decode (kCMVideoCodecType_VP9 = 'vp09'). Decode-only (no VT VP9 encoder); hardware decode on M1+ Apple Silicon, with VT falling back to software on older Macs that lack the dedicated VP9 IP. VP9 frames are container-framed (IVF / Matroska / MP4), so FrameSplit::Whole applies unchanged and no per-picture splitter is needed.
Round 6: MPEG-4 Part 2 video decode (kCMVideoCodecType_MPEG4Video = 'mp4v'). This is the DivX / Xvid / Visual ASP / SP family, not H.264 (which is MPEG-4 Part 10 and ships via 'avc1'). Decode-only as well: VideoToolbox exposes no MPEG-4 Pt 2 compression session. A new FrameSplit::Mpeg4PartTwoEs framer splits the elementary stream on VOP start codes (00 00 01 B6) and attaches preceding VOS / Visual Object / VO / VOL / GOV / user-data headers to the first VOP.
Round 7: VOL→ESDS extension-atom path. On hosts where VT enforces VOL-via-extradata for MPEG-4 Pt 2, the decoder extracts the configuration prefix (everything up to the first VOP start code) from the first packet, wraps it in a proper ISO/IEC 14496-1 ES_Descriptor + DecoderConfigDescriptor + DecoderSpecificInfo blob (ObjectTypeIndication = 0x20, streamType = VisualStream), and supplies it to CMVideoFormatDescriptionCreate via kCMFormatDescriptionExtension_SampleDescriptionExtensionAtoms keyed by "esds". Measured PSNR_Y vs ffmpeg's software decode jumps to ≈ 72.8 dB (sample-exact within IDCT tolerance) on the integration fixture.
Round 8 (this commit): AV1 video decode (kCMVideoCodecType_AV1 = 'av01'). Decode-only for now; hardware decode is gated to Apple Silicon M3+, with VT falling back to its internal software AV1 path on older hardware where that path exists (and to oxideav-av1 via the registry's SW fallback elsewhere). AV1 frames are container-framed (IVF / Matroska / MP4 / WebM / RTP), so the existing FrameSplit::Whole path applies unchanged.
Remaining roadmap: AV1 encode (the M3+ 'av01' compression session, macOS 14+), plus an optional av1C extension-atom path analogous to round 7's ESDS for hosts that require the AV1 Sequence Header out-of-band.

Round 8 implementation notes

AV1 decode wiring via VideoToolbox. make_av1_decoder registers a decoder against CodecId::new("av1") (tags av01 / AV01 / V_AV1) backed by a BlobDecoder over kCMVideoCodecType_AV1 = 'av01' = 0x6176_3031. The factory is decode-only — a VT AV1 compression session exists on macOS 14+ for M3+ hardware but needs its own callback/pixel-buffer wiring and is a follow-up round.
Hardware path is M3+; older silicon falls back gracefully. Apple Silicon M3 (and later) carries dedicated AV1 hardware-decode IP that VT routes to automatically. On M1 / M2 and Intel hosts VideoToolbox either takes its internal software AV1 path (where the macOS build includes one) or returns a non-zero OSStatus at VTDecompressionSessionCreate. The registry's lower-priority pure-Rust oxideav-av1 decoder is the fallback in the latter case — the round-8 wiring does not invent a software path, it just exposes the hardware path when present.
Framing is FrameSplit::Whole — no in-codec splitter. AV1 is container-framed (IVF / Matroska / MP4 / WebM / RTP). Each demuxed Packet carries exactly one AV1 temporal unit (one or more OBUs that together compose a single decoded frame), and that temporal unit goes straight into a CMSampleBuffer for VT. AV1 has no Annex-B / picture-start-code mechanism that would require an in-codec carve like MPEG-2's or MPEG-4 Part 2's.
Configuration record (av1C) is a follow-up round. AV1 in ISOBMFF / Matroska carries an av1C configuration record whose payload is the AV1 Sequence Header OBU (per the AV1 ISOBMFF mapping under docs/container/mpeg4/av1-isobmff/). On hosts where VT requires the Sequence Header via kCMFormatDescriptionExtension_SampleDescriptionExtensionAtoms rather than extracted from the first packet, supplying the av1C blob via the extension atoms is the same pattern as MPEG-4 Part 2's ESDS wired in round 7 — the round-7 ESDS plumbing in BlobDecoder::ensure_session already supports an arbitrary extension-atom key, so adding av1C is a small follow-up once a host needs it.
Validated against ffmpeg as a black-box. av1_decode_against_ffmpeg asks ffmpeg -c:v libaom-av1 -f ivf to produce a 320×240 / 10-frame gradient stream (the same shape as the VP9 fixture), parses the IVF container (32-byte file header + per-frame 12-byte header + payload), feeds each frame as one Packet through make_av1_decoder, and compares to ffmpeg's own software decode at PSNR_Y ≥ 30 dB. The test self-skips when ffmpeg / libaom-av1 / the framework / the VT AV1 decoder is unavailable on the host — older OS, M1/M2 hosts without a VT AV1 fallback, ffmpeg builds without libaom-av1.
Two new unit tests. register_installs_av1_decode_only (decoder registered for CodecId::new("av1"), no encoder); av1_codec_type_is_av01_fourcc (the codec-type constant equals u32::from_be_bytes(*b"av01") = 0x6176_3031). The integration test av1_decode_against_ffmpeg additionally covers the end-to-end IVF→VT→PSNR path.
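
The FourCC arithmetic behind the codec-type constant can be checked with a few lines of standard-library Rust. The helper name fourcc and the constant name AV1 are illustrative only; the crate's own constant may be defined differently.

```rust
/// Pack a four-character code into a u32, most significant byte first,
/// which is how the 'av01'-style codec types in this README are written.
pub const fn fourcc(tag: &[u8; 4]) -> u32 {
    u32::from_be_bytes(*tag)
}

/// 'a' = 0x61, 'v' = 0x76, '0' = 0x30, '1' = 0x31.
pub const AV1: u32 = fourcc(b"av01");
```

The same helper covers the other tags mentioned in this README, for example fourcc(b"vp09") or fourcc(b"mp4v").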

Round 7 implementation notes

VOL→ESDS extension-atom path for MPEG-4 Part 2. Round 6 wired the decoder against (codec_type, width, height) only — fine on hosts where VT's MPEG-4 Pt 2 decoder is lenient about extracting the VOL from the bitstream prefix, but on stricter VT hosts session creation returned kVTVideoDecoderBadDataErr (the registry then fell back to the pure-Rust impl). Round 7 closes that gap. Before opening the session, BlobDecoder (under FrameSplit::Mpeg4PartTwoEs) sniffs the leading bytes of the first packet up to but not including the first VOP start code (00 00 01 B6), wraps them in a full ISO/IEC 14496-1 ESDS descriptor (ES_Descriptor 0x03 → DecoderConfigDescriptor 0x04 [ObjectTypeIndication 0x20 = MPEG-4 Visual, streamType 0x04 = VisualStream] → DecoderSpecificInfo 0x05 = VOL bytes, with a 1-byte SLConfigDescriptor 0x06 predefined=2 tail), and feeds the result through kCMFormatDescriptionExtension_SampleDescriptionExtensionAtoms = { "esds": CFData } on the format-description creation call. The ffmpeg-fixture decode test (mpeg4_part_two_decode_against_ffmpeg) now reaches ≈ 72.8 dB PSNR_Y vs ffmpeg's reference decode on the gradient fixture (10/10 frames returned).
Hosts that don't need the ESDS extension still work. The extraction is best-effort: if the first packet starts with a VOP (no headers to capture) or the buffer has no VOP marker, extract_mpeg4_part_two_vol returns None and the decoder falls back to the round-6 plain (codec_type, width, height) path. Existing JPEG / ProRes / MPEG-2 / VP9 framers go through the same code path without an extensions dictionary, so their session creation is byte-for-byte unchanged from round 6.
ISO/IEC 14496-1 reference. The ESDS descriptor structure is documented in docs/container/mpeg4/ISO_IEC_14496-1-System-2010.pdf §7.2.6; the oxideav-mp4 muxer uses the same shape for AAC mp4a sample entries, so the BER length helper and field layout are consistent with the rest of the workspace.
CoreFoundation symbol added. sys.rs now resolves CFDataCreate and exposes a cf_data(vt, &[u8]) helper that copies the slice into a fresh CFData. Used only by the MPEG-4 Pt 2 path so far; available for any future codec needing a binary-blob extension atom (FLAC dfLa, AV1 av1C, …).
Ten new unit tests. mpeg4_extract_vol_* (5 cases) cover the VOL-prefix extractor (returns prefix before VOP, includes GOV/user-data, None when no VOP, None when buffer starts with VOP, None on empty buffer); esds_* (5 cases) cover the ESDS builder (FullBox header zeros, ES_Descriptor tag, DCD tag + ObjectTypeIndication + streamType byte, DSI carries the VOL verbatim, SLC predefined = 2).
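
The descriptor nesting described in these notes can be sketched as below. This is an approximation written from the ISO/IEC 14496-1 layout, not the crate's builder: push_descriptor and build_esds are hypothetical names, and the zeroed ES_ID, bufferSizeDB and bitrate fields are assumptions made for the example.

```rust
/// Append one descriptor: tag byte, expandable length, then the payload.
/// The length uses 7 bits per byte with the high bit set on every byte
/// except the last (the spec caps this at four bytes; not enforced here).
fn push_descriptor(out: &mut Vec<u8>, tag: u8, payload: &[u8]) {
    out.push(tag);
    let len = payload.len();
    let mut bytes = vec![(len & 0x7F) as u8];
    let mut rest = len >> 7;
    while rest > 0 {
        bytes.push(((rest & 0x7F) as u8) | 0x80);
        rest >>= 7;
    }
    bytes.reverse();
    out.extend_from_slice(&bytes);
    out.extend_from_slice(payload);
}

/// Wrap VOL bytes in an "esds" payload:
/// FullBox header, ES_Descriptor (0x03) containing a
/// DecoderConfigDescriptor (0x04) with DecoderSpecificInfo (0x05),
/// followed by an SLConfigDescriptor (0x06) with predefined = 2.
pub fn build_esds(vol: &[u8]) -> Vec<u8> {
    let mut dcd: Vec<u8> = vec![
        0x20,            // ObjectTypeIndication: MPEG-4 Visual
        (0x04 << 2) | 1, // streamType = VisualStream, upStream = 0, reserved = 1
    ];
    dcd.extend_from_slice(&[0; 3]); // bufferSizeDB (assumed unknown)
    dcd.extend_from_slice(&[0; 4]); // maxBitrate (assumed unknown)
    dcd.extend_from_slice(&[0; 4]); // avgBitrate (assumed unknown)
    push_descriptor(&mut dcd, 0x05, vol);

    let mut es: Vec<u8> = vec![0x00, 0x00, 0x00]; // ES_ID = 0, no optional flags
    push_descriptor(&mut es, 0x04, &dcd);
    push_descriptor(&mut es, 0x06, &[0x02]);

    let mut esds = vec![0u8; 4]; // FullBox version + flags
    push_descriptor(&mut esds, 0x03, &es);
    esds
}
```

For a 5-byte VOL the result is 34 bytes long, with the VOL copied verbatim inside the DecoderSpecificInfo.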

Round 6 implementation notes

MPEG-4 Part 2 is decode-only. VideoToolbox ships an MPEG-4 Part 2 decoder (historically used for DivX / Xvid playback) but no MPEG-4 Pt 2 compression session, so make_mpeg4_part_two_decoder registers a decoder against CodecId::new("mpeg4") (tags mp4v / MP4V / M4S2 / m4s2 / DIVX / divx / DX50 / XVID / xvid / FMP4 / fmp4 / V_MPEG4/ISO/ASP) and there is deliberately no matching encoder factory. This is MPEG-4 Pt 2 — distinct from MPEG-4 Pt 10 (H.264), which uses 'avc1' and stays on its own CodecId::new("h264") row.
Elementary-stream framer. Like MPEG-2 ES, an MPEG-4 Pt 2 ES is not pre-framed. FrameSplit::Mpeg4PartTwoEs splits on the VOP start code (00 00 01 B6), attaching any leading VOS (B0) / Visual Object (B5) / VO (00..1F) / VOL (20..2F) / GOV (B3) / user-data (B2) bytes to the first VOP so the VOL travels with it. This is intrinsic bitstream framing (codec's job, not container's).
Session-creation caveat. VT's MPEG-4 Pt 2 decoder typically requires the VOL configuration to be supplied via kCMFormatDescriptionExtension_* extension atoms (the ESDS DecoderSpecificInfo shape), not extracted from the bitstream as it would be for MPEG-2. Building the format description from just (codec_type, width, height) can therefore return -12909 / kVTVideoDecoderBadDataErr at session create. When that happens the registry's SW fallback takes over and the pure-Rust MPEG-4 Pt 2 decoder handles the stream. Round 7 (above) closes this gap by extracting the VOL prefix from the elementary stream and supplying it via the format-description extension atoms, which enables the hardware path on those hosts.
Validated against ffmpeg as a black-box. The decode test generates an MPEG-4 Pt 2 elementary stream with ffmpeg -c:v mpeg4 -f m4v, feeds it through the VT bridge, and (when the session creates successfully) compares to ffmpeg's own software decode at PSNR_Y ≥ 30 dB. The test self-skips when ffmpeg or the framework is unavailable, and self-skips gracefully when the session-creation caveat above triggers — so CI on a runner without VOL-extradata leniency still passes.
Splitter unit tests. Four new unit tests cover the MPEG-4 Pt 2 access-unit splitter: single VOP with headers, two VOPs with first inheriting headers, no-VOP-found pass-through, and a regression test that confirms non-VOP start codes (B0, B3) don't trigger spurious splits.
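
A minimal sketch of the splitting rule, assuming cuts happen only at VOP start codes and that everything before the first VOP stays attached to it. split_on_vop is a hypothetical name, and headers that appear between later VOPs simply stay with the preceding unit here, which may differ from what FrameSplit::Mpeg4PartTwoEs does.

```rust
/// Split an MPEG-4 Part 2 elementary stream into access units.
/// A new unit begins at every VOP start code (00 00 01 B6) except the
/// first, so leading VOS / VO / VOL / GOV / user-data bytes travel with
/// the first VOP. Other start codes (B0, B3, ...) never cause a split.
/// A buffer without any VOP is passed through as a single unit.
pub fn split_on_vop(es: &[u8]) -> Vec<&[u8]> {
    const VOP: &[u8] = &[0x00, 0x00, 0x01, 0xB6];
    let starts: Vec<usize> = es
        .windows(4)
        .enumerate()
        .filter(|(_, w)| *w == VOP)
        .map(|(i, _)| i)
        .collect();
    if starts.is_empty() {
        return if es.is_empty() { Vec::new() } else { vec![es] };
    }
    let mut units = Vec::new();
    let mut begin = 0; // the first unit starts at byte 0 to keep the headers
    for &next in &starts[1..] {
        units.push(&es[begin..next]);
        begin = next;
    }
    units.push(&es[begin..]);
    units
}
```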

Round 5 implementation notes

VP9 is decode-only. VideoToolbox ships a VP9 decoder (M1+) but no VP9 compression session, so make_vp9_decoder registers a decoder against CodecId::new("vp9") (tags vp09 / VP90 / V_VP9) and there is deliberately no matching encoder factory.
Container-framed, no in-codec splitter. Unlike MPEG-2's elementary-stream input, VP9 has no Annex-B / picture-start-code mechanism — frames are framed by the surrounding container (IVF, Matroska, MP4) and arrive as one self-contained payload per Packet. BlobDecoder is therefore instantiated with FrameSplit::Whole; bytes flow straight from Packet::data into the CMSampleBuffer without any in-codec carving.
Validated against ffmpeg. The decode test asks ffmpeg -c:v libvpx-vp9 -f ivf to produce a 320×240 / 10-frame gradient stream, parses the IVF container (32-byte file header + per-frame 12-byte header + payload) to recover individual VP9 frames, feeds each as one Packet, and compares to ffmpeg's own software decode (PSNR_Y ≥ 30 dB threshold). The test self-skips when ffmpeg / libvpx-vp9 / the framework / the VT VP9 decoder is unavailable — older OS, Intel Macs without VP9 IP, ffmpeg builds without libvpx.
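
The IVF layout mentioned above (32-byte file header, then a 12-byte header plus payload per frame) is simple enough to parse by hand. The sketch below is one way to do it and is not taken from the test suite; IvfHeader and parse_ivf are hypothetical names.

```rust
/// The fields of the 32-byte IVF file header that a decoder test needs.
pub struct IvfHeader {
    pub fourcc: [u8; 4],
    pub width: u16,
    pub height: u16,
}

/// Parse an in-memory IVF file into its header and frame payloads.
/// Returns None if the signature is wrong or a frame is truncated.
pub fn parse_ivf(data: &[u8]) -> Option<(IvfHeader, Vec<&[u8]>)> {
    if data.len() < 32 || !data.starts_with(b"DKIF") {
        return None;
    }
    let header = IvfHeader {
        fourcc: [data[8], data[9], data[10], data[11]],
        width: u16::from_le_bytes([data[12], data[13]]),
        height: u16::from_le_bytes([data[14], data[15]]),
    };
    let mut frames = Vec::new();
    let mut pos = 32;
    while pos + 12 <= data.len() {
        // Frame header: 4-byte little-endian size, then an 8-byte timestamp.
        let size = u32::from_le_bytes([
            data[pos],
            data[pos + 1],
            data[pos + 2],
            data[pos + 3],
        ]) as usize;
        let start = pos + 12;
        let end = start.checked_add(size)?;
        if end > data.len() {
            return None;
        }
        frames.push(&data[start..end]);
        pos = end;
    }
    Some((header, frames))
}
```

Each returned slice corresponds to one Packet payload in the flow described above.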

Round 4 implementation notes

MPEG-2 is decode-only. VideoToolbox ships an MPEG-2 decoder but no MPEG-2 compression session, so make_mpeg2_decoder registers a decoder against CodecId::new("mpeg2video") (tags mp2v / MPG2 / mpg2 / hdv2 / m2v1 / V_MPEG2) and there is deliberately no matching encoder factory.
Elementary-stream framer. Unlike the container-framed JPEG/ProRes path (one Packet == one frame), an MPEG-2 elementary stream is not pre-framed. BlobDecoder gained a FrameSplit mode; FrameSplit::Mpeg2Es splits on the picture start code (00 00 01 00), attaching any leading sequence (b3) / GOP (b8) / extension (b5) headers to the first picture so VT can size the decoder. This is intrinsic bitstream framing (the codec's job), not container parsing.
Validated against ffmpeg as a black-box. The decode test generates an MPEG-2 elementary stream with ffmpeg (opaque validator), decodes it through VideoToolbox, and compares the result to ffmpeg's own software decode: PSNR_Y ≈ 61 dB on a 320×240 / 10-frame gradient. The test self-skips when ffmpeg or the framework is unavailable.
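
The PSNR_Y figures quoted throughout this README follow the usual definition for 8-bit samples, 10 · log10(255² / MSE) over the luma plane. The function below is a generic sketch of that formula with a hypothetical name, not the crate's test helper.

```rust
/// PSNR in dB between two 8-bit luma planes of equal, non-zero length.
/// Returns None on a length mismatch or empty input; identical planes
/// have zero error and therefore yield positive infinity.
pub fn psnr_y(reference: &[u8], decoded: &[u8]) -> Option<f64> {
    if reference.len() != decoded.len() || reference.is_empty() {
        return None;
    }
    let sse: f64 = reference
        .iter()
        .zip(decoded)
        .map(|(&a, &b)| {
            let d = a as f64 - b as f64;
            d * d
        })
        .sum();
    let mse = sse / reference.len() as f64;
    Some(if mse == 0.0 {
        f64::INFINITY
    } else {
        10.0 * (255.0f64 * 255.0 / mse).log10()
    })
}
```

An error of exactly one code value on every sample (MSE = 1) works out to about 48.13 dB.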

Round 3 implementation notes

Blob codec module (src/blob.rs) — BlobDecoder and BlobEncoder share one VTDecompression/VTCompression driver for every codec whose format description is (width, height, codecType) with no parameter sets. Currently used by MJPEG ('jpeg') and ProRes ('apcn').
Pixel-format adaptive callback — VT decoders return different CVPixelBuffer formats depending on the codec: H.264/HEVC honour the NV12 destination-attribute request ('420v'), but ProRes returns 16-bit biplanar 4:2:2 ('sv22') regardless. The blob decoder callback inspects CVPixelBufferGetPixelFormatType and dispatches to one of four converters: NV12 ('420v'/'420f'), packed UYVY ('2vuy'), packed YUY2 ('yuvs'), or biplanar 16-bit 4:2:2 ('sv22').
ProRes profile selection — defaults to ProRes 422 ('apcn') for both encode and decode. The decoder format description carries the codec-type, and VT internally dispatches to the right ProRes flavour when it sees the frame header ('icpf' magic at offset 4). Explicit profile selection via CodecParameters::tag is a future-round item.
Roundtrip tests use a smooth diagonal gradient — the previous test pattern (col + row/2 + frame*10) % 255 had a modulo-wraparound discontinuity that JPEG's DCT could not represent without ~10 dB of error. The new gradient (clipped to video-range [16, 235]) reaches ≥ 36 dB on every codec.
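
The wraparound problem in the old test pattern is easy to demonstrate. In the sketch below, old_pattern is the formula quoted above; gradient is only a guess at what a smooth diagonal gradient confined to [16, 235] could look like, since the crate's actual formula is not given here, and max_step is a hypothetical helper.

```rust
/// The previous test pattern: wraps modulo 255, so the value drops
/// from 254 to 0 between two neighbouring samples.
pub fn old_pattern(col: u32, row: u32, frame: u32) -> u8 {
    ((col + row / 2 + frame * 10) % 255) as u8
}

/// Hypothetical replacement: a diagonal ramp from 16 (top-left) to 235
/// (bottom-right), so neighbouring samples differ by at most one step.
pub fn gradient(col: u32, row: u32, width: u32, height: u32) -> u8 {
    let span = (width + height - 2).max(1);
    (16 + (col + row) * 219 / span).min(235) as u8
}

/// Largest absolute difference between horizontally adjacent samples.
pub fn max_step<F: Fn(u32) -> u8>(sample: F, width: u32) -> u8 {
    (1..width)
        .map(|c| (sample(c) as i16 - sample(c - 1) as i16).abs() as u8)
        .max()
        .unwrap_or(0)
}
```

On a 320-sample row the old pattern has a jump of 254 code values at the wrap point, while the ramp never changes by more than 1 between neighbours.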

Workspace policy

Calling a system OS framework via FFI is the same shape as calling libc::malloc — it's the platform, not a copied algorithm. The workspace's clean-room rule (no embedding source from libvpx, libwebp, libjxl, etc.) does not apply to this crate.

License

MIT.

oxideav-videotoolbox 0.0.3