jpegli-rs 0.12.0

# JpegEncoder API Design

## Current API Surface (v0.4)

### Configuration (StreamingEncoderBuilder)

```rust
// Entry point - creates builder
JpegEncoder::new(width: u32, height: u32) -> StreamingEncoderBuilder

// Builder methods (all return Self for chaining)
.quality(impl Into<Quality>)     // 1-100 or Quality enum
.distance(f32)                   // Butteraugli distance (0.5-3.0)
.progressive(bool)               // Enable progressive JPEG
.subsampling(Subsampling)        // S444, S422, S420, S440
.pixel_format(PixelFormat)       // Rgb, Rgba, Gray, Rgb16, etc.
.mode(JpegMode)                  // Baseline, Progressive
.optimize_huffman(bool)          // Two-pass Huffman optimization
.chroma_downsampling(ChromaDownsampling)  // Box, GammaAwareIterative
.sharp_yuv(bool)                 // Alias for GammaAwareIterative
.restart_interval(u16)           // MCUs between restart markers
.custom_quant_matrices(CustomQuantMatrices)  // Custom quant tables
.use_xyb(bool)                   // XYB color space mode

// Feature-gated
.parallel(bool)                  // [parallel] Multi-threaded encoding
.hybrid_trellis(bool)            // [experimental-hybrid-trellis]
.hybrid_config(HybridConfig)     // [experimental-hybrid-trellis]
.aq_map(AQStrengthMap)           // [experimental-hybrid-trellis]

// Terminal methods
.start() -> Result<StreamingEncoder>           // Start streaming
.encode(data: &[u8]) -> Result<Vec<u8>>        // One-shot encode
.encode_all_with_stop(data, stop) -> Result<Vec<u8>>  // With cancellation
.estimate_memory_usage() -> usize              // Memory estimate
```

### Encoder State (StreamingEncoder)

```rust
// Inspection
.rows_pushed() -> usize          // Rows received so far
.bytes_per_row() -> usize        // Expected bytes per row
.height() -> usize               // Total image height
.strip_height() -> usize         // Internal strip size

// Push data
.push_row(&[u8]) -> Result<()>
.push_row_with_stop(&[u8], impl Stop) -> Result<()>
.push_rows(&[u8], num_rows) -> Result<()>
.push_rows_with_stop(&[u8], num_rows, impl Stop) -> Result<()>

// Direct YCbCr input (bypass RGB conversion)
.push_ycbcr_strip_f32(y, cb, cr, num_rows) -> Result<()>
.push_ycbcr_strip_f32_subsampled(y, cb, cr, num_rows) -> Result<()>

// Finalize
.finish() -> Result<Vec<u8>>
.finish_with_stop(impl Stop) -> Result<Vec<u8>>
```

### Convenience Functions

```rust
jpegli::encode_rgb(w, h, data, quality) -> Result<Vec<u8>>
jpegli::encode_rgba(w, h, data, quality) -> Result<Vec<u8>>
jpegli::encode_gray(w, h, data, quality) -> Result<Vec<u8>>
jpegli::decode(data) -> Result<DecodedImage>
jpegli::decode_f32(data) -> Result<DecodedImageF32>
jpegli::decode_to_format(data, format) -> Result<DecodedImage>
```

---

## Proposed API Improvements (v0.5)

### Goal: Separate Config from Encoder

Current problem: `JpegEncoder::new(w, h)` creates a builder that has dimensions baked in.
This prevents reusing configuration across different image sizes.

### New Design

```rust
/// Reusable encoding configuration (no dimensions)
#[derive(Clone, Debug)]
pub struct EncoderConfig {
    quality: Quality,
    subsampling: Subsampling,
    pixel_format: PixelFormat,
    mode: JpegMode,
    optimize_huffman: bool,
    chroma_downsampling: ChromaDownsampling,
    restart_interval: u16,
    custom_quant_matrices: Option<CustomQuantMatrices>,
    use_xyb: bool,
    #[cfg(feature = "parallel")]
    parallel: bool,
}

impl EncoderConfig {
    /// Default config: Q85, 4:4:4, RGB, baseline, optimize on
    pub fn new() -> Self;

    // Builder methods (same names as current; these return &mut Self for chaining)
    pub fn quality(&mut self, q: impl Into<Quality>) -> &mut Self;
    pub fn distance(&mut self, d: f32) -> &mut Self;
    pub fn progressive(&mut self, enable: bool) -> &mut Self;
    pub fn subsampling(&mut self, s: Subsampling) -> &mut Self;
    pub fn pixel_format(&mut self, f: PixelFormat) -> &mut Self;
    // ... etc

    /// Estimate memory for specific dimensions
    pub fn estimate_memory(&self, width: u32, height: u32) -> MemoryEstimate;

    /// Estimate output size range (min, typical, max)
    pub fn estimate_output_size(&self, width: u32, height: u32) -> OutputSizeEstimate;

    /// Create encoder for specific dimensions
    pub fn encoder(&self, width: u32, height: u32) -> Result<Encoder>;

    /// One-shot encode with this config
    pub fn encode(&self, width: u32, height: u32, data: &[u8]) -> Result<Vec<u8>>;

    /// One-shot encode with cancellation
    pub fn encode_with_stop(
        &self,
        width: u32,
        height: u32,
        data: &[u8],
        stop: impl Stop
    ) -> Result<Vec<u8>>;
}

/// Detailed memory breakdown
#[derive(Clone, Debug)]
pub struct MemoryEstimate {
    /// Peak memory during encoding
    pub peak_bytes: usize,
    /// Memory for input buffering (one strip)
    pub input_buffer: usize,
    /// Memory for DCT coefficient storage
    pub coefficient_storage: usize,
    /// Memory for internal working buffers
    pub working_buffers: usize,
    /// Estimated output buffer (before actual encoding)
    pub output_buffer_estimate: usize,
}

impl MemoryEstimate {
    /// Total including output buffer estimate
    pub fn total(&self) -> usize;

    /// Peak during encoding (excludes final output)
    pub fn encoding_peak(&self) -> usize;
}

/// Output size estimate
#[derive(Clone, Debug)]
pub struct OutputSizeEstimate {
    /// Minimum likely size (highly compressible content)
    pub min_bytes: usize,
    /// Typical size for photographic content
    pub typical_bytes: usize,
    /// Maximum likely size (incompressible content)
    pub max_bytes: usize,
    /// Bits per pixel estimate
    pub typical_bpp: f32,
}

/// Stateful encoder (created from config + dimensions)
pub struct Encoder {
    // ... internal state
}

impl Encoder {
    /// Create with default config
    pub fn new(width: u32, height: u32) -> Result<Self>;

    /// Create from config
    pub fn with_config(config: &EncoderConfig, width: u32, height: u32) -> Result<Self>;

    // Status
    pub fn rows_pushed(&self) -> usize;
    pub fn rows_remaining(&self) -> usize;
    pub fn bytes_per_row(&self) -> usize;
    pub fn progress(&self) -> f32;  // 0.0 - 1.0
    pub fn is_complete(&self) -> bool;

    // Memory tracking
    pub fn current_memory_usage(&self) -> usize;
    pub fn peak_memory_usage(&self) -> usize;

    // Push data
    pub fn push_row(&mut self, row: &[u8]) -> Result<()>;
    pub fn push_row_cancellable(&mut self, row: &[u8], stop: &impl Stop) -> Result<()>;
    pub fn push_rows(&mut self, data: &[u8], num_rows: usize) -> Result<()>;
    pub fn push_rows_cancellable(&mut self, data: &[u8], num_rows: usize, stop: &impl Stop) -> Result<()>;

    // Direct YCbCr (for transcoding pipelines)
    pub fn push_ycbcr_f32(&mut self, y: &[f32], cb: &[f32], cr: &[f32], rows: usize) -> Result<()>;

    // Finish
    pub fn finish(self) -> Result<Vec<u8>>;
    pub fn finish_cancellable(self, stop: impl Stop) -> Result<Vec<u8>>;

    // Write to existing buffer (zero-copy for proxy servers)
    pub fn finish_into(self, output: &mut Vec<u8>) -> Result<usize>;
    pub fn finish_into_cancellable(self, output: &mut Vec<u8>, stop: impl Stop) -> Result<usize>;
}
```
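
The `&mut Self` setter signatures above have one practical consequence: a chain that starts from a temporary `EncoderConfig::new()` ends in a `&mut` borrow, so the caller either binds the config first or finishes the chain with `.clone()`. The sketch below shows both styles on a stand-in `ToyConfig` type. The type, its fields, and its defaults are hypothetical and exist only for illustration; they are not part of jpegli.

```rust
/// Hypothetical stand-in for `EncoderConfig`; fields are illustrative only.
#[derive(Clone, Debug, PartialEq)]
pub struct ToyConfig {
    quality: u8,
    progressive: bool,
}

impl ToyConfig {
    pub fn new() -> Self {
        ToyConfig { quality: 85, progressive: false }
    }

    // `&mut Self` setters, mirroring the proposed signatures.
    pub fn quality(&mut self, q: u8) -> &mut Self {
        self.quality = q;
        self
    }

    pub fn progressive(&mut self, enable: bool) -> &mut Self {
        self.progressive = enable;
        self
    }
}

/// Style 1: bind the config, then mutate it in place.
pub fn build_in_place() -> ToyConfig {
    let mut config = ToyConfig::new();
    config.quality(90).progressive(true);
    config
}

/// Style 2: chain on a temporary and clone to obtain an owned value.
pub fn build_with_clone() -> ToyConfig {
    ToyConfig::new().quality(90).progressive(true).clone()
}
```

Both functions produce the same config. A consuming builder (`self -> Self`) would avoid the extra clone, at the cost of making in-place updates on a bound config more awkward; which trade-off fits better here is a design choice this sketch does not settle.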

### Proxy Server Usage Pattern

```rust
use jpegli::{EncoderConfig, Subsampling};
use std::sync::Arc;

// Create shared config once at startup
let config = Arc::new(
    EncoderConfig::new()
        .quality(85)
        .subsampling(Subsampling::S420)
        .progressive(true)
        .optimize_huffman(true)
        .clone()
);

// Per-request handler
async fn handle_resize(
    config: Arc<EncoderConfig>,
    source: Image,
    target_width: u32,
    target_height: u32,
    cancel: CancellationToken,
) -> Result<Vec<u8>> {
    // Estimate memory before committing
    let estimate = config.estimate_memory(target_width, target_height);
    if estimate.peak_bytes > MAX_MEMORY_PER_REQUEST {
        return Err(Error::ImageTooLarge);
    }

    // Pre-allocate output buffer
    let output_estimate = config.estimate_output_size(target_width, target_height);
    let mut output = Vec::with_capacity(output_estimate.typical_bytes);

    // Create encoder
    let mut encoder = config.encoder(target_width, target_height)?;

    // Stream rows from resizer, checking cancellation
    let stop = cancel.as_stop();
    for row in source.resize_rows(target_width, target_height) {
        encoder.push_row_cancellable(&row, &stop)?;
    }

    // Finish into pre-allocated buffer
    let size = encoder.finish_into_cancellable(&mut output, stop)?;
    output.truncate(size);

    Ok(output)
}
```

### Memory Estimate Accuracy

The current `estimate_memory_usage()` returns a single approximate `usize`. The proposed `estimate_memory()` breaks the estimate down per buffer:

```rust
impl EncoderConfig {
    pub fn estimate_memory(&self, width: u32, height: u32) -> MemoryEstimate {
        let w = width as usize;
        let h = height as usize;
        let strip_h = self.subsampling.strip_height();
        let mcu = self.subsampling.mcu_size();

        // Pad to MCU boundaries
        let pw = (w + mcu - 1) / mcu * mcu;
        let ph = (h + mcu - 1) / mcu * mcu;

        // Block counts
        let y_blocks = (pw / 8) * (ph / 8);
        let c_blocks = self.subsampling.chroma_blocks(pw, ph);

        // Input buffer: one strip of RGB
        let input_buffer = w * strip_h * self.pixel_format.bytes_per_pixel();

        // Working buffers: f32 YCbCr planes for one strip
        let strip_f32 = pw * strip_h * 4 * 3;  // Y, Cb, Cr

        // Coefficient storage: all blocks as i16
        let coeff_storage = (y_blocks + 2 * c_blocks) * 64 * 2;  // i16 = 2 bytes

        // DCT working buffers (double-buffered f32)
        let dct_buffers = (pw / 8) * 2 * 256 * 2;  // 2 iMCU rows, f32

        // AQ map
        let aq_map = y_blocks * 4;  // f32 per block

        // Huffman optimization (if enabled)
        let huffman_buffers = if self.optimize_huffman {
            y_blocks * 2 + c_blocks * 4  // frequency counts
        } else {
            0
        };

        // Output estimate (quality-dependent)
        let bpp = self.quality.estimated_bpp();
        let output_estimate = (w * h) as f32 * bpp / 8.0;

        MemoryEstimate {
            // Peak counts every buffer that may be live at once, including Huffman counts
            peak_bytes: input_buffer + strip_f32 + coeff_storage + dct_buffers + aq_map + huffman_buffers,
            input_buffer,
            coefficient_storage: coeff_storage,
            working_buffers: strip_f32 + dct_buffers + aq_map + huffman_buffers,
            output_buffer_estimate: output_estimate as usize,
        }
    }
}
```
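
To make the padding and block-count arithmetic concrete, here is a standalone sketch for the 4:2:0 case. It assumes a 16-pixel MCU and chroma planes halved in both directions; the real `mcu_size()` and `chroma_blocks()` helpers may behave differently, and both function names below are hypothetical.

```rust
/// Round `n` up to the next multiple of `m` (same formula as the padding step).
pub fn pad_to_multiple(n: usize, m: usize) -> usize {
    (n + m - 1) / m * m
}

/// Coefficient storage in bytes for a 4:2:0 image, assuming a 16-pixel MCU
/// and i16 coefficients. Illustrative only.
pub fn coeff_storage_420(width: usize, height: usize) -> usize {
    const MCU: usize = 16; // assumed MCU size for 4:2:0

    let pw = pad_to_multiple(width, MCU);
    let ph = pad_to_multiple(height, MCU);

    // One luma block per 8x8 pixels of the padded image.
    let y_blocks = (pw / 8) * (ph / 8);
    // Assumption: each chroma plane has one block per 16x16 luma pixels.
    let c_blocks = (pw / 16) * (ph / 16);

    // 64 coefficients per block, 2 bytes each, for Y + Cb + Cr.
    (y_blocks + 2 * c_blocks) * 64 * 2
}
```

Under these assumptions an 800×600 image pads to 800×608, giving 7600 luma blocks, 1900 blocks per chroma plane, and 1,459,200 bytes (about 1.4 MiB) of coefficient storage.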

### Output Size Estimation

```rust
impl Quality {
    /// Estimated bits per pixel for this quality level
    pub fn estimated_bpp(&self) -> f32 {
        let d = self.to_distance();
        // Empirical formula from benchmarks
        match d {
            d if d < 0.5 => 4.5,   // Very high quality
            d if d < 1.0 => 3.0,   // High quality
            d if d < 1.5 => 2.0,   // Medium-high
            d if d < 2.0 => 1.5,   // Medium
            d if d < 3.0 => 1.0,   // Low
            _ => 0.7,             // Very low
        }
    }
}

impl EncoderConfig {
    pub fn estimate_output_size(&self, width: u32, height: u32) -> OutputSizeEstimate {
        // Convert before multiplying so oversized dimensions cannot overflow u32
        let pixels = width as f32 * height as f32;
        let base_bpp = self.quality.estimated_bpp();

        // Adjust for subsampling
        let subsample_factor = match self.subsampling {
            Subsampling::S444 => 1.0,
            Subsampling::S422 => 0.85,
            Subsampling::S420 => 0.75,
            Subsampling::S440 => 0.85,
        };

        // Adjust for progressive (typically 3% smaller)
        let prog_factor = if self.mode == JpegMode::Progressive { 0.97 } else { 1.0 };

        let typical_bpp = base_bpp * subsample_factor * prog_factor;

        OutputSizeEstimate {
            min_bytes: (pixels * typical_bpp * 0.3 / 8.0) as usize,
            typical_bytes: (pixels * typical_bpp / 8.0) as usize,
            max_bytes: (pixels * typical_bpp * 2.0 / 8.0) as usize,
            typical_bpp,
        }
    }
}
```
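
As a worked example of how the factors combine, the following standalone function applies the same multiplication outside of `EncoderConfig`. The function name is hypothetical, and it uses `f64` rather than `f32` purely to keep the arithmetic easy to check.

```rust
/// Typical output size in bytes from a base bits-per-pixel figure and the
/// subsampling/progressive factors described above. Illustrative only.
pub fn typical_output_bytes(
    width: u32,
    height: u32,
    base_bpp: f64,
    subsample_factor: f64,
    progressive: bool,
) -> usize {
    // Convert before multiplying to avoid integer overflow.
    let pixels = width as f64 * height as f64;
    let prog_factor = if progressive { 0.97 } else { 1.0 };
    let typical_bpp = base_bpp * subsample_factor * prog_factor;
    (pixels * typical_bpp / 8.0) as usize
}
```

For a 1920×1080 image at a base of 2.0 bpp, baseline 4:4:4 comes out at 518,400 bytes, while progressive 4:2:0 (factors 0.75 and 0.97) comes out at roughly 377 KB.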

---

## Migration Path

### v0.4 (Current)
```rust
let jpeg = JpegEncoder::new(800, 600)
    .quality(85)
    .progressive(true)
    .encode(&pixels)?;
```

### v0.5 (Proposed) - Same syntax still works
```rust
// Option 1: Existing syntax (backwards compatible)
let jpeg = JpegEncoder::new(800, 600)
    .quality(85)
    .progressive(true)
    .encode(&pixels)?;

// Option 2: Separate config (new)
// (setters return &mut Self, so bind the config before chaining)
let mut config = EncoderConfig::new();
config.quality(85).progressive(true);

let jpeg = config.encode(800, 600, &pixels)?;

// Option 3: Reusable config across sizes
let small = config.encode(400, 300, &small_pixels)?;
let large = config.encode(1600, 1200, &large_pixels)?;
```

---

## Resource Estimation API (TODO)

### Requirements

Memory estimation depends on:
- **Input method**: streaming (row-by-row) vs one-shot vs YCbCr direct
- **Dimensions**: width × height
- **Config**: subsampling, optimize_huffman, parallel, etc.

### Proposed Structs

```rust
/// Pre-encode resource estimate
pub struct ResourceEstimate {
    /// Peak memory required (only public field)
    pub peak_bytes: usize,

    // Internal tracking (not exposed, used for validation)
    // - total_alloc_count: usize
    // - total_alloc_bytes: usize
    // - max_single_alloc: usize
}

impl ResourceEstimate {
    /// Estimated compute time in milliseconds for current architecture
    pub fn compute_cost_ms(&self) -> f32;
}

/// Input method affects memory profile
pub enum InputMethod {
    /// One-shot: entire image in memory
    OneShot,
    /// Streaming: row-by-row, lower peak memory
    Streaming,
    /// Direct YCbCr: pre-converted planes
    YCbCrDirect,
    /// Direct YCbCr with pre-subsampled chroma
    YCbCrSubsampled,
}

impl EncoderConfig {
    /// Estimate resources for specific dimensions and input method
    pub fn estimate_resources(
        &self,
        width: u32,
        height: u32,
        input_method: InputMethod,
    ) -> ResourceEstimate;
}
```
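
One way to see why the input method matters is to look only at the bytes of input data held at once. The sketch below is a guess at how that term might vary. Every rule in it is an assumption made for illustration (for example, that one-shot encoding keeps the caller's full image alive and that the subsampled variant is 4:2:0), and `input_bytes` is a hypothetical helper, not part of the proposed API.

```rust
/// Mirrors the proposed `InputMethod` enum.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum InputMethod {
    OneShot,
    Streaming,
    YCbCrDirect,
    YCbCrSubsampled,
}

/// Bytes of input data resident at once. All four rules are assumptions,
/// not measurements of the encoder.
pub fn input_bytes(
    method: InputMethod,
    width: usize,
    height: usize,
    strip_height: usize,
    bytes_per_pixel: usize,
) -> usize {
    const F32: usize = 4; // bytes per f32 sample
    match method {
        // Assumed: the full interleaved image stays alive.
        InputMethod::OneShot => width * height * bytes_per_pixel,
        // Assumed: only one strip of interleaved pixels is buffered.
        InputMethod::Streaming => width * strip_height * bytes_per_pixel,
        // Assumed: three full-resolution f32 planes for one strip.
        InputMethod::YCbCrDirect => width * strip_height * F32 * 3,
        // Assumed: full-resolution Y plus two chroma planes halved in
        // both directions (4:2:0), rounding odd sizes up.
        InputMethod::YCbCrSubsampled => {
            let luma = width * strip_height * F32;
            let chroma = ((width + 1) / 2) * ((strip_height + 1) / 2) * F32;
            luma + 2 * chroma
        }
    }
}
```

With an 800×600 RGB image and a 16-row strip, these rules give 1,440,000 bytes for one-shot input against 38,400 bytes for streaming, which is the kind of gap `estimate_resources()` would need to reflect.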

### Post-Encode Metrics

```rust
/// Actual resource usage after encoding completes
pub struct EncodeMetrics {
    /// Actual peak memory during encoding
    pub peak_bytes: usize,
    /// Total allocations made
    pub alloc_count: usize,
    /// Total bytes allocated (may exceed peak due to churn)
    pub total_alloc_bytes: usize,
    /// Wall-clock time spent encoding
    pub elapsed_ms: f32,
    /// Output size in bytes
    pub output_bytes: usize,
}

impl Encoder {
    /// Finish encoding and return metrics along with output
    pub fn finish_with_metrics(self) -> Result<(Vec<u8>, EncodeMetrics)>;

    /// Finish into existing buffer, return metrics
    pub fn finish_into_with_metrics(
        self,
        output: &mut Vec<u8>,
    ) -> Result<EncodeMetrics>;
}
```
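
The difference between `peak_bytes` and `total_alloc_bytes` ("may exceed peak due to churn") can be illustrated with a small explicit tracker that internal buffers would report to when they are created and released. `AllocTracker` and its methods are hypothetical; this is one possible bookkeeping scheme, not the crate's implementation.

```rust
/// Hypothetical explicit tracker: buffers report their size on allocation
/// and release, so no global allocator hook is involved.
#[derive(Debug, Default)]
pub struct AllocTracker {
    current_bytes: usize,
    peak_bytes: usize,
    alloc_count: usize,
    total_alloc_bytes: usize,
}

impl AllocTracker {
    /// Record a new buffer of `bytes` bytes.
    pub fn record_alloc(&mut self, bytes: usize) {
        self.alloc_count += 1;
        self.total_alloc_bytes += bytes;
        self.current_bytes += bytes;
        self.peak_bytes = self.peak_bytes.max(self.current_bytes);
    }

    /// Record that a buffer of `bytes` bytes was released.
    pub fn record_free(&mut self, bytes: usize) {
        self.current_bytes = self.current_bytes.saturating_sub(bytes);
    }

    pub fn peak_bytes(&self) -> usize {
        self.peak_bytes
    }

    pub fn alloc_count(&self) -> usize {
        self.alloc_count
    }

    pub fn total_alloc_bytes(&self) -> usize {
        self.total_alloc_bytes
    }
}
```

Allocating 100 and 50 bytes, freeing the first buffer, then allocating 80 bytes yields a peak of 150 bytes but a total of 230 bytes across 3 allocations, which is the churn the field comment refers to.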

### Compute Cost Estimation

```rust
impl ResourceEstimate {
    /// Estimated encode time based on:
    /// - Image dimensions (pixels)
    /// - Config (progressive 2x slower, parallel speedup)
    /// - Current CPU (detected at runtime)
    pub fn compute_cost_ms(&self) -> f32 {
        // Base: ~90 MP/s sequential, ~45 MP/s progressive
        // Adjusted for: parallel (1.4x for large images), sharp_yuv (slower)
        // CPU detection: AVX2 vs SSE vs scalar fallback
        todo!("calibrate against benchmarks")
    }
}
```
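
A first approximation of the cost model can be derived directly from the throughput figures quoted in the comments (~90 MP/s sequential, ~45 MP/s progressive, 1.4x with `parallel`). The sketch below is a standalone function with a hypothetical name. It applies the parallel factor unconditionally and leaves out CPU detection and the `sharp_yuv` penalty, so it should be read as a starting point for calibration rather than a usable estimate.

```rust
/// Rough encode-time estimate in milliseconds from the throughput figures
/// quoted above. CPU detection and sharp_yuv adjustments are not modeled.
pub fn estimate_encode_ms(width: u32, height: u32, progressive: bool, parallel: bool) -> f32 {
    let megapixels = (width as f64 * height as f64) / 1_000_000.0;

    // Base throughput in megapixels per second.
    let mut mp_per_sec = if progressive { 45.0 } else { 90.0 };

    // Assumption: the 1.4x speedup applies regardless of image size.
    if parallel {
        mp_per_sec *= 1.4;
    }

    (megapixels / mp_per_sec * 1000.0) as f32
}
```

For 1920×1080 (about 2.07 MP) this gives roughly 23 ms for a sequential encode and roughly 46 ms for a progressive one.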

### Usage Pattern for Proxy Servers

```rust
async fn handle_resize(
    config: &EncoderConfig,
    width: u32,
    height: u32,
    pixels: &[u8],
    cancel: CancellationToken,
) -> Result<(Vec<u8>, EncodeMetrics)> {
    // Pre-flight check
    let estimate = config.estimate_resources(width, height, InputMethod::OneShot);

    if estimate.peak_bytes > MAX_MEMORY_PER_REQUEST {
        return Err(Error::ImageTooLarge);
    }

    if estimate.compute_cost_ms() > MAX_ENCODE_TIME_MS {
        return Err(Error::WouldTakeTooLong);
    }

    // Encode
    let mut encoder = config.encoder(width, height)?;
    // ... push rows with cancellation ...

    // Get actual metrics for logging/billing
    let (jpeg, metrics) = encoder.finish_with_metrics()?;

    log::info!(
        "Encoded {}x{} in {:.1}ms, peak {}KB, output {}KB",
        width, height,
        metrics.elapsed_ms,
        metrics.peak_bytes / 1024,
        metrics.output_bytes / 1024,
    );

    Ok((jpeg, metrics))
}
```

---

## Implementation TODO

- [ ] Extract `EncoderConfig` as dimension-independent config
- [ ] Add `InputMethod` enum
- [ ] Implement `estimate_resources()` with accurate memory modeling
- [ ] Add `compute_cost_ms()` with CPU detection
- [ ] Add allocation tracking (behind feature flag for zero overhead)
- [ ] Add `EncodeMetrics` returned from `finish_with_metrics()`
- [ ] Add `finish_into()` for zero-copy output to existing buffer
- [ ] Benchmark to calibrate `compute_cost_ms()` estimates

---

## Open Questions

1. Should `EncoderConfig` be `Clone + Send + Sync` for easy sharing?
2. Should we add `encode_to_writer(impl Write)` for zero-copy streaming output?
3. Should memory/timing tracking be opt-in via feature flag (slight overhead)?
4. Should we expose strip-level progress callbacks for large images?
5. How to handle allocation tracking without global allocator hooks?