linear-srgb 0.4.0

Fast linear↔sRGB color space conversion with FMA acceleration and LUT support

linear-srgb

Fast linear↔sRGB color space conversion with runtime CPU dispatch.

Quick Start

use linear_srgb::default::*;

// Single values
let linear = srgb_to_linear(0.5f32);
let srgb = linear_to_srgb(linear);

// Slices (SIMD-accelerated)
let mut values = vec![0.5f32; 10000];
srgb_to_linear_slice(&mut values);
linear_to_srgb_slice(&mut values);

// u8 ↔ f32 (image processing)
let linear = srgb_u8_to_linear(128);
let srgb_byte = linear_to_srgb_u8(linear);

Which Function Should I Use?

                    ┌─────────────────────────┐
                    │ How many values?        │
                    └───────────┬─────────────┘
                                │
              ┌─────────────────┼─────────────────┐
              ▼                 ▼                 ▼
         ┌────────┐        ┌────────┐        ┌────────────┐
         │ One    │        │ Slice  │        │ Building   │
         │ value  │        │ [f32]  │        │ own SIMD?  │
         └───┬────┘        └───┬────┘        └─────┬──────┘
             │                 │                   │
             ▼                 ▼                   ▼
    ┌─────────────────┐  ┌──────────────┐   ┌─────────────────┐
    │ srgb_to_linear  │  │ *_slice()    │   │ Inside your own │
    │ linear_to_srgb  │  │              │   │ #[multiversed]? │
    │ srgb_u8_to_     │  │ Dispatch once│   └────────┬────────┘
    │   linear (LUT)  │  │ loop is fast │            │
    └─────────────────┘  └──────────────┘     ┌──────┴──────┐
                                              ▼             ▼
                                           ┌─────┐      ┌─────┐
                                           │ Yes │      │ No  │
                                           └──┬──┘      └──┬──┘
                                              │            │
                                              ▼            ▼
                                    ┌──────────────┐  ┌──────────────┐
                                    │ default::    │  │ *_x8() or    │
                                    │ inline::*    │  │ *_x8_slice() │
                                    │              │  │              │
                                    │ No dispatch, │  │ Has dispatch │
                                    │ #[inline]    │  │ (that's fine)│
                                    └──────────────┘  └──────────────┘

Quick reference:

| Your situation | Use this |
| --- | --- |
| One f32 value | srgb_to_linear(x) / linear_to_srgb(x) |
| One u8 value | srgb_u8_to_linear(x) (LUT, 20x faster than scalar) |
| &mut [f32] slice | srgb_to_linear_slice() / linear_to_srgb_slice() |
| &[u8] → &mut [f32] | srgb_u8_to_linear_slice() |
| &[f32] → &mut [u8] | linear_to_srgb_u8_slice() |
| &mut [f32x8] slice | linear_to_srgb_x8_slice() (dispatch once) |
| Inside #[multiversed] | default::inline::* (no dispatch) |
| Standalone x8 call | linear_to_srgb_x8() (has dispatch, that's fine) |

Performance Guide

The default module exposes the fastest measured implementation for each conversion type. Each choice below is backed by the throughput figures listed under Benchmark Results.

Why Each Default Was Chosen

| Conversion | Default Implementation | Why |
| --- | --- | --- |
| u8 → f32 | LUT direct lookup | 3-4 Gelem/s. 256-entry table fits in L1 cache. Beats both scalar (170 Melem/s) and SIMD. |
| u16 → f32 | LUT direct lookup | 450-820 Melem/s. 2.5-16x faster than scalar powf. |
| f32 → f32 (sRGB→linear) | SIMD with dispatch | 1.6 Gelem/s. ~15-20% faster than scalar powf (1.4 Gelem/s). |
| f32 → f32 (linear→sRGB) | SIMD with dispatch | 440-480 Melem/s. ~2x faster than scalar for this direction. |
| f32 → u8 | SIMD with dispatch | 270-275 Melem/s. ~1.8x faster than scalar. |
| f32 → u16 | Scalar powf | 145-200 Melem/s. Beats LUT interpolation due to interpolation overhead. |
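
To illustrate why a direct lookup is so cheap for u8 input, here is a minimal sketch of a 256-entry table built with the standard library only. The names srgb_decode, build_u8_lut and decode_bytes are hypothetical, and the constants are the commonly quoted ones from the sRGB standard, so this is not the crate's actual implementation.

```rust
/// Decode one sRGB-encoded value in [0, 1] to linear light.
/// Assumption: commonly quoted IEC 61966-2-1 constants; the crate's
/// own constants may differ slightly.
fn srgb_decode(s: f32) -> f32 {
    if s <= 0.04045 {
        s / 12.92
    } else {
        ((s + 0.055) / 1.055).powf(2.4)
    }
}

/// Build one f32 entry per possible u8 input (256 * 4 bytes = 1 KiB).
fn build_u8_lut() -> [f32; 256] {
    let mut table = [0.0f32; 256];
    for (i, entry) in table.iter_mut().enumerate() {
        *entry = srgb_decode(i as f32 / 255.0);
    }
    table
}

/// Convert sRGB bytes to linear floats with one table load per element.
fn decode_bytes(table: &[f32; 256], src: &[u8], dst: &mut [f32]) {
    assert_eq!(src.len(), dst.len());
    for (d, &s) in dst.iter_mut().zip(src) {
        *d = table[s as usize];
    }
}
```

The expensive powf call runs only 256 times while the table is built; every later conversion is a single indexed load.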

Dispatch Overhead

The _dispatch variants use runtime CPU feature detection (AVX2, SSE4.1, NEON, etc.) via multiversed. This adds ~1-3ns per call, which is fully amortized even at 8 elements.

Bottom line: Always use the slice functions for batches. The dispatch cost is negligible.
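
The "dispatch once per slice" pattern can be sketched with the standard library alone. This is an illustrative assumption about the general technique, not the code that multiversed generates; scale_scalar, scale_avx2 and scale_slice are hypothetical names, and the loop body is a stand-in for a real conversion.

```rust
/// Portable fallback that works on every target.
fn scale_scalar(data: &mut [f32], k: f32) {
    for v in data.iter_mut() {
        *v *= k;
    }
}

/// Same loop compiled with AVX2 enabled, so the compiler is free to
/// vectorize it with 256-bit instructions.
/// Only sound to call after AVX2 support has been detected.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn scale_avx2(data: &mut [f32], k: f32) {
    for v in data.iter_mut() {
        *v *= k;
    }
}

/// Detect CPU features once, then run the whole slice through one variant.
fn scale_slice(data: &mut [f32], k: f32) {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx2") {
            // SAFETY: the runtime check above guarantees AVX2 is available.
            unsafe { scale_avx2(data, k) };
            return;
        }
    }
    scale_scalar(data, k);
}
```

The feature check happens once per scale_slice call, so its cost is spread across every element in the slice rather than paid per element or per 8-wide chunk.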

API Reference

Single Values

use linear_srgb::default::*;

// f32 conversions (scalar - fast for individual values)
let linear = srgb_to_linear(0.5f32);
let srgb = linear_to_srgb(0.214f32);

// f64 high-precision
let linear = srgb_to_linear_f64(0.5f64);

// u8 conversions (LUT-based)
let linear = srgb_u8_to_linear(128u8);           // u8 → f32
let srgb_byte = linear_to_srgb_u8(0.214f32);     // f32 → u8

Slice Processing (Recommended for Batches)

use linear_srgb::default::*;

// In-place f32 conversion (SIMD-accelerated)
let mut values = vec![0.5f32; 10000];
srgb_to_linear_slice(&mut values);  // Modifies in-place
linear_to_srgb_slice(&mut values);

// u8 → f32 (LUT-based, extremely fast)
let srgb_bytes: Vec<u8> = (0..=255).collect();
let mut linear = vec![0.0f32; 256];
srgb_u8_to_linear_slice(&srgb_bytes, &mut linear);

// f32 → u8 (SIMD-accelerated)
let linear_values: Vec<f32> = (0..256).map(|i| i as f32 / 255.0).collect();
let mut srgb_bytes = vec![0u8; 256];
linear_to_srgb_u8_slice(&linear_values, &mut srgb_bytes);

x8 SIMD Functions

For processing exactly 8 values with explicit SIMD:

use linear_srgb::default::*;
use wide::f32x8;

// With CPU dispatch (recommended for standalone use)
let srgb = f32x8::from([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]);
let linear = srgb_to_linear_x8(srgb);  // Uses _dispatch internally

// u8 array → f32x8
let srgb_bytes = [0u8, 32, 64, 96, 128, 160, 192, 255];
let linear = srgb_u8_to_linear_x8(srgb_bytes);

Custom Gamma (Non-sRGB)

For pure power-law gamma without the sRGB linear segment:

use linear_srgb::default::*;

// gamma 2.2 (common in legacy workflows)
let linear = gamma_to_linear(0.5f32, 2.2);
let encoded = linear_to_gamma(linear, 2.2);

// Also available for slices
let mut values = vec![0.5f32; 1000];
gamma_to_linear_slice(&mut values, 2.2);
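
For reference, a pure power-law curve is just a single exponentiation in each direction. The sketch below uses the hypothetical names gamma_decode and gamma_encode to avoid implying it matches the crate's internals; clamping negative input to zero is an assumption made here so that powf never returns NaN.

```rust
/// Decode a gamma-encoded value: linear = encoded^gamma.
fn gamma_decode(encoded: f32, gamma: f32) -> f32 {
    // Assumption: negative input is clamped to 0 before exponentiation.
    encoded.max(0.0).powf(gamma)
}

/// Encode a linear value: encoded = linear^(1 / gamma).
fn gamma_encode(linear: f32, gamma: f32) -> f32 {
    linear.max(0.0).powf(1.0 / gamma)
}
```

Unlike the sRGB curve, there is no linear segment near zero, so the encoding slope grows without bound as the input approaches 0.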

LUT for Custom Bit Depths

use linear_srgb::lut::{LinearTable16, EncodingTable16, lut_interp_linear_float};

// 16-bit linearization (65536 entries)
let lut = LinearTable16::new();
let linear = lut.lookup(32768);  // Direct lookup

// Interpolated encoding
let encode_lut = EncodingTable16::new();
let srgb = lut_interp_linear_float(0.5, encode_lut.as_slice());
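
Interpolated lookup can be sketched as follows. This is a generic linear interpolation over a table that covers the input range [0, 1]; lut_interp is a hypothetical name, and the crate's lut_interp_linear_float may differ in clamping and rounding details.

```rust
/// Linearly interpolate `table` at position `x` in [0, 1].
/// Entry 0 corresponds to x = 0 and the last entry to x = 1.
fn lut_interp(x: f32, table: &[f32]) -> f32 {
    assert!(table.len() >= 2, "need at least two entries to interpolate");
    let last = (table.len() - 1) as f32;
    // Map x onto a fractional table index.
    let pos = x.clamp(0.0, 1.0) * last;
    // Lower neighbour, capped so that i + 1 stays in bounds.
    let i = (pos as usize).min(table.len() - 2);
    let frac = pos - i as f32;
    table[i] + (table[i + 1] - table[i]) * frac
}
```

Each call needs two table loads, a multiply and a few adds, which is the interpolation overhead that the performance tables compare against direct lookup and scalar powf.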

Advanced: Using default::inline with #[multiversed]

If you're building your own SIMD-accelerated function with multiversed, use default::inline::* to avoid nested dispatch overhead:

use linear_srgb::default::inline::*;  // Clean names, no _inline suffix
use multiversed::multiversed;
use wide::f32x8;

#[multiversed]  // Your function handles dispatch
#[inline]
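pub fn process_pixels(data: &mut [f32]) {
    // Note: chunks_exact_mut(8) skips a trailing remainder of fewer than
    // 8 elements; handle that tail separately if data.len() % 8 != 0.
    for chunk in data.chunks_exact_mut(8) {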
        let v = f32x8::from([
            chunk[0], chunk[1], chunk[2], chunk[3],
            chunk[4], chunk[5], chunk[6], chunk[7],
        ]);

        // No dispatch here - your #[multiversed] already handled it
        let linear = srgb_to_linear_x8(v);
        let processed = linear * f32x8::splat(1.5);  // Your processing
        let result = linear_to_srgb_x8(processed);

        let arr: [f32; 8] = result.into();
        chunk.copy_from_slice(&arr);
    }
}

Why this matters:

  • default::* x8 functions: Include CPU feature detection (~1-3ns overhead per call)
  • default::inline::*: Pure SIMD code, #[inline(always)], zero overhead

If you call dispatched functions inside a loop within your own #[multiversed] function, you pay dispatch cost per iteration. Use default::inline::* to avoid this.

Benchmark Results

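Measured on AMD Ryzen and Intel CPUs with AVX2. All figures are medians: the conversion tables report throughput in elements per second, and the dispatch overhead table reports total time per run in nanoseconds.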

sRGB → Linear (Linearization)

| Input | Output | Method | Throughput | Notes |
| --- | --- | --- | --- | --- |
| u8 | f32 | LUT8 direct | 3.0-4.3 Gelem/s | Fastest. Used by default. |
| u8 | f32 | Scalar powf | 170-180 Melem/s | 20x slower than LUT |
| u16 | f32 | LUT16 direct | 450-820 Melem/s | 2.5-16x faster than scalar |
| f32 | f32 | SIMD dispatch | ~1.6 Gelem/s | Fastest. Used by default. |
| f32 | f32 | Scalar powf | 1.3-1.4 Gelem/s | ~15-20% slower than SIMD |

Linear → sRGB (Encoding)

| Input | Output | Method | Throughput | Notes |
| --- | --- | --- | --- | --- |
| f32 | f32 | SIMD dispatch | 440-480 Melem/s | Fastest. Used by default. |
| f32 | f32 | Scalar powf | 190-200 Melem/s | 2.4x slower |
| f32 | u8 | SIMD dispatch | 270-310 Melem/s | Fastest. Used by default. |
| f32 | u8 | Scalar powf | 145-160 Melem/s | 1.8x slower |
| f32 | u8 | LUT12 interp | 125-135 Melem/s | Slowest due to interp overhead |
| f32 | u16 | Scalar powf | 145-200 Melem/s | Fastest. Beats LUT interp. |
| f32 | u16 | LUT16 interp | 120-130 Melem/s | Interpolation overhead |

Dispatch Overhead

At small sizes (8-64 elements), dispatch overhead is measurable but acceptable:

| Size | Slice dispatch once | x8 dispatch per chunk | x8 inline (no dispatch) |
| --- | --- | --- | --- |
| 8 | 27.5 ns | 31.0 ns | 28.2 ns |
| 64 | 144 ns | 165 ns | 151 ns |
| 1024 | 2116 ns | 2487 ns | 2377 ns |

Conclusion: Slice functions (dispatch once) have essentially no overhead vs inline at practical sizes.
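
As a sanity check on the quoted ~1-3ns dispatch cost, the per-call overhead can be derived from the table by dividing the gap between the dispatched and inline x8 columns by the number of 8-element chunks. The helper below is a hypothetical illustration of that arithmetic, not part of the crate.

```rust
/// Extra nanoseconds per x8 call when dispatching per chunk instead of
/// using the inline variant, given total times for `elements` values.
fn per_call_overhead_ns(elements: u32, dispatched_ns: f64, inline_ns: f64) -> f64 {
    let calls = (elements / 8) as f64; // one x8 call per 8 elements
    (dispatched_ns - inline_ns) / calls
}
```

Plugging in the rows above gives (31.0 - 28.2) / 1 = 2.8 ns, (165 - 151) / 8 = 1.75 ns, and (2487 - 2377) / 128 ≈ 0.86 ns per call, which is in line with the ~1-3ns estimate.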

Module Organization

  • default - Recommended API. Re-exports optimal implementations.
  • default::inline - Dispatch-free variants for use inside #[multiversed].
  • simd - Full SIMD API with _dispatch and _inline variants.
  • scalar - Single-value functions. Use for individual conversions.
  • lut - Lookup tables for custom bit depths.

Deprecated Functions

These functions are marked #[deprecated] because faster alternatives exist. They remain available for benchmarking and compatibility.

| Deprecated | Speed vs Alternative | Use Instead |
| --- | --- | --- |
| scalar::srgb_u8_to_linear | 20x slower | simd::srgb_u8_to_linear (LUT) |
| SrgbConverter::linear_to_srgb_u8 | 2x slower | simd::linear_to_srgb_u8_slice |
| SrgbConverter::batch_linear_to_srgb | 2x slower | simd::linear_to_srgb_u8_slice |

Feature Flags

[dependencies]
linear-srgb = "0.4"  # std enabled by default

# Alternative: no_std (requires alloc for LUT generation)
linear-srgb = { version = "0.4", default-features = false }

# Alternative: enable unsafe optimizations
linear-srgb = { version = "0.4", features = ["unsafe_simd"] }

  • std (default): Required for runtime SIMD dispatch
  • unsafe_simd: Union-based bit manipulation, unchecked indexing

Accuracy

Implements IEC 61966-2-1:1999 sRGB transfer functions with:

  • C0-continuous piecewise function (no discontinuity at threshold)
  • Constants derived from moxcms reference implementation
  • f32: ~1e-5 roundtrip accuracy
  • f64: ~1e-10 roundtrip accuracy
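
The piecewise transfer functions and the continuity point can be made concrete with a small f64 sketch. It uses the commonly quoted rounded constants from the standard, which is an assumption: those constants leave a tiny step at the threshold, whereas the crate states that its constants are chosen to remove it. All function names here are hypothetical.

```rust
/// sRGB-encoded value in [0, 1] to linear light (rounded standard constants).
fn srgb_to_linear_ref(s: f64) -> f64 {
    if s <= 0.04045 {
        s / 12.92
    } else {
        ((s + 0.055) / 1.055).powf(2.4)
    }
}

/// Linear light in [0, 1] to sRGB-encoded value (rounded standard constants).
fn linear_to_srgb_ref(l: f64) -> f64 {
    if l <= 0.0031308 {
        l * 12.92
    } else {
        1.055 * l.powf(1.0 / 2.4) - 0.055
    }
}

/// Largest |encode(decode(s)) - s| over the 256 values s = i / 255.
fn max_roundtrip_error() -> f64 {
    (0..=255u32)
        .map(|i| {
            let s = i as f64 / 255.0;
            (linear_to_srgb_ref(srgb_to_linear_ref(s)) - s).abs()
        })
        .fold(0.0, f64::max)
}

/// Size of the step between the two encoding branches at the threshold.
fn threshold_gap() -> f64 {
    let t = 0.0031308_f64;
    let linear_branch = 12.92 * t;
    let power_branch = 1.055 * t.powf(1.0 / 2.4) - 0.055;
    (linear_branch - power_branch).abs()
}
```

With the rounded constants the step at the threshold is on the order of 1e-8, small enough to be invisible in 8-bit output but nonzero, which is the discontinuity that a C0-continuous set of constants avoids.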

License

MIT OR Apache-2.0

AI-Generated Code Notice

Developed with Claude (Anthropic). All code has been reviewed and benchmarked, but verify critical paths for your use case.