# linear-srgb

Fast linear ↔ sRGB color space conversion with runtime CPU dispatch.
## Quick Start

```rust
use linear_srgb::*;

// Single values
let linear = srgb_to_linear(0.5);
let srgb = linear_to_srgb(linear);

// Slices (SIMD-accelerated)
let mut values = vec![0.1, 0.5, 0.9];
srgb_to_linear_slice(&mut values);
linear_to_srgb_slice(&mut values);

// u8 ↔ f32 (image processing)
let linear = srgb_u8_to_linear(128);
let srgb_byte = linear_to_srgb_u8(linear);
```
## Which Function Should I Use?

```text
            ┌─────────────────────────┐
            │    How many values?     │
            └───────────┬─────────────┘
                        │
        ┌───────────────┼───────────────┐
        ▼               ▼               ▼
   ┌────────┐       ┌────────┐    ┌────────────┐
   │  One   │       │ Slice  │    │  Building  │
   │ value  │       │ [f32]  │    │ own SIMD?  │
   └───┬────┘       └───┬────┘    └─────┬──────┘
       │                │               │
       ▼                ▼               ▼
┌─────────────────┐ ┌──────────────┐ ┌─────────────────┐
│ srgb_to_linear  │ │  *_slice()   │ │ Inside your own │
│ linear_to_srgb  │ │              │ │ #[multiversed]? │
│ srgb_u8_to_     │ │ Dispatch once│ └────────┬────────┘
│ linear (LUT)    │ │ loop is fast │          │
└─────────────────┘ └──────────────┘   ┌──────┴──────┐
                                       ▼             ▼
                                    ┌─────┐       ┌─────┐
                                    │ Yes │       │ No  │
                                    └──┬──┘       └──┬──┘
                                       │             │
                                       ▼             ▼
                              ┌──────────────┐ ┌──────────────┐
                              │  default::   │ │  *_x8() or   │
                              │  inline::*   │ │ *_x8_slice() │
                              │              │ │              │
                              │ No dispatch, │ │ Has dispatch │
                              │  #[inline]   │ │ (that's fine)│
                              └──────────────┘ └──────────────┘
```
Quick reference:

| Your situation | Use this |
|---|---|
| One `f32` value | `srgb_to_linear(x)` / `linear_to_srgb(x)` |
| One `u8` value | `srgb_u8_to_linear(x)` (LUT, 20x faster than scalar) |
| `&mut [f32]` slice | `srgb_to_linear_slice()` / `linear_to_srgb_slice()` |
| `&[u8]` → `&mut [f32]` | `srgb_u8_to_linear_slice()` |
| `&[f32]` → `&mut [u8]` | `linear_to_srgb_u8_slice()` |
| `&mut [f32x8]` slice | `linear_to_srgb_x8_slice()` (dispatch once) |
| Inside `#[multiversed]` | `default::inline::*` (no dispatch) |
| Standalone x8 call | `linear_to_srgb_x8()` (has dispatch, that's fine) |
## Performance Guide

This crate is carefully tuned for maximum throughput. The `default` module exposes the fastest implementation for each conversion type, chosen based on extensive benchmarking.

### Why Each Default Was Chosen
| Conversion | Default Implementation | Why |
|---|---|---|
| u8 → f32 | LUT direct lookup | 3-4 Gelem/s. 256-entry table fits in L1 cache. Beats both scalar (170 Melem/s) and SIMD. |
| u16 → f32 | LUT direct lookup | 450-820 Melem/s. 2.5-16x faster than scalar powf. |
| f32 → f32 (sRGB→linear) | SIMD with dispatch | 1.6 Gelem/s. ~15-20% faster than scalar powf (1.4 Gelem/s). |
| f32 → f32 (linear→sRGB) | SIMD with dispatch | 440-480 Melem/s. ~2x faster than scalar for this direction. |
| f32 → u8 | SIMD with dispatch | 270-275 Melem/s. ~1.8x faster than scalar. |
| f32 → u16 | Scalar powf | 145-200 Melem/s. Beats LUT interpolation due to interpolation overhead. |
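The u8 → f32 LUT default in the table above can be sketched in a few lines; this is a self-contained illustration of the technique (helper names are hypothetical, not the crate's API), showing why a 256-entry table replaces a `powf` per pixel with one L1-cached load:

```rust
/// Reference sRGB → linear transfer function (IEC 61966-2-1).
fn srgb_to_linear_ref(s: f32) -> f32 {
    if s <= 0.04045 {
        s / 12.92
    } else {
        ((s + 0.055) / 1.055).powf(2.4)
    }
}

/// Build the 256-entry table once; afterwards every u8 input maps to
/// exactly one precomputed slot, so conversion is a single array index.
fn build_u8_lut() -> [f32; 256] {
    let mut lut = [0.0f32; 256];
    for (i, slot) in lut.iter_mut().enumerate() {
        *slot = srgb_to_linear_ref(i as f32 / 255.0);
    }
    lut
}

fn main() {
    let lut = build_u8_lut();
    // Endpoints map exactly, and the table is strictly increasing.
    assert_eq!(lut[0], 0.0);
    assert!((lut[255] - 1.0).abs() < 1e-6);
    assert!(lut.windows(2).all(|w| w[0] < w[1]));
    println!("lut[128] = {}", lut[128]);
}
```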
### Dispatch Overhead

The `_dispatch` variants use runtime CPU feature detection (AVX2, SSE4.1, NEON, etc.) via `multiversed`. This adds ~1-3 ns per call, which is fully amortized even at 8 elements.

**Bottom line:** Always use the slice functions for batches. The dispatch cost is negligible.
## API Reference

### Single Values

```rust
use linear_srgb::*;

// f32 conversions (scalar - fast for individual values)
let linear = srgb_to_linear(0.5);
let srgb = linear_to_srgb(linear);

// f64 high-precision
let linear = srgb_to_linear_f64(0.5);

// u8 conversions (LUT-based)
let linear = srgb_u8_to_linear(200);    // u8 → f32
let srgb_byte = linear_to_srgb_u8(0.6); // f32 → u8
```
### Slice Processing (Recommended for Batches)

```rust
use linear_srgb::*;

// In-place f32 conversion (SIMD-accelerated)
let mut values = vec![0.1, 0.5, 0.9];
srgb_to_linear_slice(&mut values); // Modifies in-place
linear_to_srgb_slice(&mut values);

// u8 → f32 (LUT-based, extremely fast)
let srgb_bytes: Vec<u8> = (0..=255).collect();
let mut linear = vec![0.0f32; srgb_bytes.len()];
srgb_u8_to_linear_slice(&srgb_bytes, &mut linear);

// f32 → u8 (SIMD-accelerated)
let linear_values: Vec<f32> = (0..256).map(|i| i as f32 / 255.0).collect();
let mut srgb_bytes = vec![0u8; linear_values.len()];
linear_to_srgb_u8_slice(&linear_values, &mut srgb_bytes);
```
### x8 SIMD Functions

For processing exactly 8 values with explicit SIMD:

```rust
use linear_srgb::*;
use wide::f32x8; // f32x8 type from the `wide` crate (import path assumed)

// With CPU dispatch (recommended for standalone use)
let srgb = f32x8::from([0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9]);
let linear = srgb_to_linear_x8(srgb); // Uses _dispatch internally

// u8 array → f32x8
let srgb_bytes = [0u8, 32, 64, 96, 128, 160, 192, 255];
let linear = srgb_u8_to_linear_x8(srgb_bytes);
```
### Custom Gamma (Non-sRGB)

For pure power-law gamma without the sRGB linear segment:

```rust
use linear_srgb::*;

// gamma 2.2 (common in legacy workflows)
let linear = gamma_to_linear(0.5, 2.2);
let encoded = linear_to_gamma(linear, 2.2);

// Also available for slices
let mut values = vec![0.1, 0.5, 0.9];
gamma_to_linear_slice(&mut values, 2.2);
```
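Pure power-law conversion is just `powf` in both directions, with no piecewise segment to special-case. A minimal self-contained sketch of the math (helper names here are illustrative, not the crate's signatures):

```rust
/// Pure power-law decode: encoded^gamma, no linear toe segment.
fn gamma_decode(encoded: f32, gamma: f32) -> f32 {
    encoded.powf(gamma)
}

/// Pure power-law encode: the inverse exponent.
fn gamma_encode(linear: f32, gamma: f32) -> f32 {
    linear.powf(1.0 / gamma)
}

fn main() {
    // Decode then re-encode should round-trip closely.
    let linear = gamma_decode(0.5, 2.2);
    let roundtrip = gamma_encode(linear, 2.2);
    assert!((roundtrip - 0.5).abs() < 1e-6);
    println!("0.5 at gamma 2.2 decodes to {linear}");
}
```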
### LUT for Custom Bit Depths

```rust
// NOTE: the concrete type/constructor names below were lost in formatting
// and are shown as placeholders — see the `lut` module docs for exact names.
use linear_srgb::lut::*;

// 16-bit linearization (65536 entries)
let lut = SrgbLut16::new();
let linear = lut.lookup(srgb_value); // Direct lookup

// Interpolated encoding
let encode_lut = EncodeLut::new();
let srgb = lut_interp_linear_float(linear, &encode_lut);
```
## Advanced: Using `default::inline` with `#[multiversed]`

If you're building your own SIMD-accelerated function with `multiversed`, use `default::inline::*` to avoid nested dispatch overhead:

```rust
use linear_srgb::default::inline::*; // Clean names, no _inline suffix
use multiversed::multiversed; // attribute macro (import path assumed)
use wide::f32x8;

// Your function handles dispatch; the illustrative body below just shows
// the inline conversions running inside it with no per-call dispatch.
#[multiversed]
fn linearize_chunks(chunks: &mut [f32x8]) {
    for c in chunks.iter_mut() {
        *c = srgb_to_linear_x8(*c);
    }
}
```

**Why this matters:**

- `default::*_x8` functions: include CPU feature detection (~1-3 ns overhead per call)
- `default::inline::*`: pure SIMD code, `#[inline(always)]`, zero overhead

If you call dispatched functions inside a loop within your own `#[multiversed]` function, you pay the dispatch cost on every iteration. Use `default::inline::*` to avoid this.
## Benchmark Results

Measured on AMD Ryzen / Intel with AVX2. Results show median throughput (elements per second).
### sRGB → Linear (Linearization)
| Input | Output | Method | Throughput | Notes |
|---|---|---|---|---|
| u8 | f32 | LUT8 direct | 3.0-4.3 Gelem/s | Fastest. Used by default. |
| u8 | f32 | Scalar powf | 170-180 Melem/s | 20x slower than LUT |
| u16 | f32 | LUT16 direct | 450-820 Melem/s | 2.5-16x faster than scalar |
| f32 | f32 | SIMD dispatch | ~1.6 Gelem/s | Fastest. Used by default. |
| f32 | f32 | Scalar powf | 1.3-1.4 Gelem/s | ~15-20% slower than SIMD |
### Linear → sRGB (Encoding)
| Input | Output | Method | Throughput | Notes |
|---|---|---|---|---|
| f32 | f32 | SIMD dispatch | 440-480 Melem/s | Fastest. Used by default. |
| f32 | f32 | Scalar powf | 190-200 Melem/s | 2.4x slower |
| f32 | u8 | SIMD dispatch | 270-310 Melem/s | Fastest. Used by default. |
| f32 | u8 | Scalar powf | 145-160 Melem/s | 1.8x slower |
| f32 | u8 | LUT12 interp | 125-135 Melem/s | Slowest due to interp overhead |
| f32 | u16 | Scalar powf | 145-200 Melem/s | Fastest. Beats LUT interp. |
| f32 | u16 | LUT16 interp | 120-130 Melem/s | Interpolation overhead |
### Dispatch Overhead

At small sizes (8-64 elements), dispatch overhead is measurable but acceptable:
| Size | Slice dispatch once | x8 dispatch per chunk | x8 inline (no dispatch) |
|---|---|---|---|
| 8 | 27.5 ns | 31.0 ns | 28.2 ns |
| 64 | 144 ns | 165 ns | 151 ns |
| 1024 | 2116 ns | 2487 ns | 2377 ns |
Conclusion: Slice functions (dispatch once) have essentially no overhead vs inline at practical sizes.
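The "dispatch once" pattern measured above can be sketched with a plain function pointer; the kernel here is a scalar stand-in (a real implementation would select an AVX2/NEON version via runtime feature detection), and all names are illustrative:

```rust
/// Scalar sRGB → linear kernel (stand-in for a SIMD-specialized version).
fn srgb_to_linear_kernel(s: f32) -> f32 {
    if s <= 0.04045 {
        s / 12.92
    } else {
        ((s + 0.055) / 1.055).powf(2.4)
    }
}

/// Select a kernel once, then run a tight loop: the selection cost
/// (~ns) is paid per call, not per element.
fn srgb_to_linear_slice_sketch(values: &mut [f32]) {
    // A real implementation would pick the best kernel here based on
    // detected CPU features; the point is this happens exactly once.
    let kernel: fn(f32) -> f32 = srgb_to_linear_kernel;
    for v in values.iter_mut() {
        *v = kernel(*v);
    }
}

fn main() {
    let mut values = [0.0, 0.04045, 0.5, 1.0];
    srgb_to_linear_slice_sketch(&mut values);
    assert_eq!(values[0], 0.0);
    assert!((values[3] - 1.0).abs() < 1e-6);
    println!("linearized: {values:?}");
}
```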
## Module Organization

- `default` - Recommended API. Re-exports optimal implementations.
- `default::inline` - Dispatch-free variants for use inside `#[multiversed]`.
- `simd` - Full SIMD API with `_dispatch` and `_inline` variants.
- `scalar` - Single-value functions. Use for individual conversions.
- `lut` - Lookup tables for custom bit depths.
## Deprecated Functions

These functions are marked `#[deprecated]` because faster alternatives exist. They remain available for benchmarking and compatibility.

| Deprecated | Speed vs Alternative | Use Instead |
|---|---|---|
| `scalar::srgb_u8_to_linear` | 20x slower | `simd::srgb_u8_to_linear` (LUT) |
| `SrgbConverter::linear_to_srgb_u8` | 2x slower | `simd::linear_to_srgb_u8_slice` |
| `SrgbConverter::batch_linear_to_srgb` | 2x slower | `simd::linear_to_srgb_u8_slice` |
## Feature Flags

```toml
[dependencies]
linear-srgb = "0.4" # std enabled by default

# no_std (requires alloc for LUT generation)
linear-srgb = { version = "0.3", default-features = false }

# Enable unsafe optimizations
linear-srgb = { version = "0.3", features = ["unsafe_simd"] }
```

- `std` (default): required for runtime SIMD dispatch
- `unsafe_simd`: union-based bit manipulation, unchecked indexing
## Accuracy

Implements the IEC 61966-2-1:1999 sRGB transfer functions with:
- C0-continuous piecewise function (no discontinuity at threshold)
- Constants derived from moxcms reference implementation
- f32: ~1e-5 roundtrip accuracy
- f64: ~1e-10 roundtrip accuracy
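The quoted f32 roundtrip figure can be checked against the standard's piecewise functions directly. A self-contained sketch using the IEC 61966-2-1 constants (reference implementations, not the crate's optimized code):

```rust
/// sRGB → linear, piecewise per IEC 61966-2-1.
fn srgb_to_linear_ref(s: f32) -> f32 {
    if s <= 0.04045 {
        s / 12.92
    } else {
        ((s + 0.055) / 1.055).powf(2.4)
    }
}

/// Linear → sRGB, the inverse piecewise function.
fn linear_to_srgb_ref(l: f32) -> f32 {
    if l <= 0.003_130_8 {
        l * 12.92
    } else {
        1.055 * l.powf(1.0 / 2.4) - 0.055
    }
}

fn main() {
    // Max roundtrip error over a dense grid of sRGB values in [0, 1].
    let mut max_err = 0.0f32;
    for i in 0..=10_000 {
        let s = i as f32 / 10_000.0;
        let err = (linear_to_srgb_ref(srgb_to_linear_ref(s)) - s).abs();
        max_err = max_err.max(err);
    }
    assert!(max_err < 1e-5); // consistent with the ~1e-5 figure above
    println!("max f32 roundtrip error: {max_err:e}");
}
```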
## License
MIT OR Apache-2.0
## AI-Generated Code Notice
Developed with Claude (Anthropic). All code has been reviewed and benchmarked, but verify critical paths for your use case.