Struct ZstdCompressor 

pub struct ZstdCompressor { /* private fields */ }

Zstandard compressor with optional pre-trained dictionary.

This compressor wraps the zstd crate and provides both raw compression (no dictionary) and dictionary-enhanced compression for improved ratios on structured data.

§Dictionary Lifecycle

When a dictionary is provided:

  1. The dictionary bytes are copied into native zstd encoder and decoder dictionaries at construction time
  2. Both encoder and decoder dictionaries are parsed once and reused for every block
  3. The dictionary memory lives as long as the compressor and is freed when it is dropped
  4. Multiple compressor instances can be built from the same dictionary bytes

This design trades a one-time copy and parse for simplicity and safety. In typical Hexz usage, one compressor instance exists per snapshot file, so the overhead is ~450 KB per open snapshot (110 KB dict × ~4x internal structures).

§Thread Safety

ZstdCompressor is Send + Sync. Compression and decompression operations do not mutate the compressor state, allowing safe concurrent use from multiple threads. Each operation allocates its own temporary encoder/decoder.
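The shared-use pattern this enables can be sketched with the standard library alone. The `Stateless` type below is a hypothetical stand-in that merely mirrors the `&self`-based API which makes concurrent use safe; ZstdCompressor would be shared the same way.

```rust
use std::sync::Arc;
use std::thread;

// Hypothetical stand-in for any Send + Sync compressor whose operations
// take &self; hexz's ZstdCompressor would slot in the same way.
struct Stateless;
impl Stateless {
    fn compress(&self, data: &[u8]) -> Vec<u8> {
        data.to_vec() // placeholder for real compression
    }
}

// One shared instance, used concurrently from four threads; returns the
// total number of bytes produced across all threads.
fn shared_use_demo() -> usize {
    let shared = Arc::new(Stateless);
    let handles: Vec<_> = (0..4u8)
        .map(|i| {
            let c = Arc::clone(&shared);
            thread::spawn(move || c.compress(&[i; 16]).len())
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```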

§Constraints

  • Dictionary compatibility: Blocks compressed with a dictionary MUST be decompressed with the exact same dictionary bytes. Attempting to decompress with a different or missing dictionary will fail with a compression error.
  • Level consistency: The compression level is stored in the encoder dictionary. Changing the level requires training a new dictionary.

§Examples

use hexz_core::algo::compression::{Compressor, zstd::ZstdCompressor};

// Create compressor without dictionary
let compressor = ZstdCompressor::new(3, None);
let data = b"test data";
let compressed = compressor.compress(data).unwrap();
let decompressed = compressor.decompress(&compressed).unwrap();
assert_eq!(data.as_slice(), decompressed.as_slice());

Implementations§

impl ZstdCompressor

pub fn new(level: i32, dict: Option<Vec<u8>>) -> Self

Creates a new Zstandard compressor with the specified compression level and optional dictionary.

§Parameters
  • level - Compression level from 1 to 22:

    • 1: Fastest compression (~450 MB/s), lower ratio
    • 3: Default, balanced speed/ratio (~350 MB/s)
    • 9: High compression (~85 MB/s), good ratio
    • 19-22: Maximum compression (~10-30 MB/s), best ratio
  • dict - Optional pre-trained dictionary bytes:

    • None: Use raw zstd compression
    • Some(dict_bytes): Use dictionary-enhanced compression
§Dictionary Handling

When a dictionary is provided, this function:

  1. Copies the dictionary bytes internally via EncoderDictionary::copy
  2. Parses the bytes into native zstd encoder/decoder dictionaries
  3. Manages the dictionary lifetime automatically (no leaks)

The dictionary memory (~110 KB for typical dictionaries) is properly managed and freed when the compressor is dropped. This provides:

  • Proper memory management without leaks
  • Dictionary reuse across millions of blocks
  • Memory safety with automatic cleanup
§Memory Usage

Approximate memory overhead per compressor instance:

  • No dictionary: ~10 KB (minimal bookkeeping)
  • With 110 KB dictionary: ~450 KB (copied bytes + encoder/decoder structures)
§Examples
use hexz_core::algo::compression::zstd::ZstdCompressor;

// Fast compression, no dictionary
let fast = ZstdCompressor::new(1, None);

// Balanced compression with dictionary
let dict = vec![0u8; 1024]; // Placeholder dictionary
let balanced = ZstdCompressor::new(3, Some(dict));

// Maximum compression for archival
let max = ZstdCompressor::new(22, None);
§Performance Notes

Creating a compressor is relatively expensive (~1 ms with dictionary due to parsing). Reuse compressor instances rather than creating them per-operation.

pub fn train(samples: &[Vec<u8>], max_size: usize) -> Result<Vec<u8>>

Trains a Zstandard dictionary from representative sample blocks.

Dictionary training analyzes a collection of sample data to identify common patterns, sequences, and statistical distributions. The resulting dictionary acts as a “seed” for the compressor, enabling better compression ratios on small blocks that would otherwise lack sufficient data to build effective models.

§Training Algorithm

The training process:

  1. Concatenates all samples into a training corpus
  2. Analyzes byte-level patterns using suffix arrays and frequency analysis
  3. Selects the most valuable patterns up to max_size bytes
  4. Optimizes dictionary layout for fast lookup during compression
  5. Returns the trained dictionary as a byte vector

This is a CPU-intensive operation (O(n log n) where n is total sample bytes) and should be done once during snapshot creation, not per-block.

§Parameters
  • samples - A slice of representative data blocks. Requirements:

    • Minimum count: 10 samples (20+ recommended, 50+ ideal)
    • Minimum total size: 100x max_size (e.g., 10 MB for 100 KB dictionary)
    • Representativeness: Must match production data patterns
    • Diversity: Include variety of structures, not just repeated copies
  • max_size - Maximum dictionary size in bytes. Recommendations:

    • Small blocks (16-32 KB): 64 KB dictionary
    • Medium blocks (64 KB): 110 KB dictionary (zstd’s recommended max)
    • Large blocks (128+ KB): Diminishing returns, consider skipping dictionary
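The sizing recommendations above can be encoded as a small helper. This function is hypothetical (not part of the hexz_core API); the cutoffs are this guide's heuristics, in bytes.

```rust
/// Suggested dictionary budget for a given block size, following the
/// recommendations above. Returns None where a dictionary is likely not
/// worth the overhead. (Hypothetical helper; cutoffs are heuristic.)
fn suggested_dict_size(block_size: usize) -> Option<usize> {
    match block_size {
        0..=32_768 => Some(64 * 1024),       // small blocks: 64 KB dict
        32_769..=65_536 => Some(110 * 1024), // medium blocks: 110 KB dict
        _ => None,                           // 128 KB+: diminishing returns
    }
}
```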
§Returns

Returns Ok(Vec<u8>) containing the trained dictionary bytes, or Err if training fails. The actual dictionary size may be less than max_size if fewer patterns were found.

§Errors

Returns Error::Compression if:

  • Samples are empty or too small (less than ~1 KB total)
  • max_size is invalid (0 or excessively large)
  • Internal zstd training algorithm fails (corrupted samples, out of memory)
§Performance Characteristics

Training time on AMD Ryzen 9 5950X:

  • 1 MB samples, 64 KB dict: ~50 ms
  • 10 MB samples, 110 KB dict: ~200 ms
  • 100 MB samples, 110 KB dict: ~2 seconds

Training is approximately O(n log n) in total sample size.

§Compression Ratio Impact

Expected compression ratio improvements with trained dictionary vs. raw compression:

| Block Size | Raw Zstd-3 | With Dict | Improvement |
|------------|------------|-----------|-------------|
| 16 KB      | 1.5x       | 2.4x      | +60%        |
| 32 KB      | 2.1x       | 3.2x      | +52%        |
| 64 KB      | 2.8x       | 3.9x      | +39%        |
| 128 KB     | 3.2x       | 3.7x      | +16%        |
| 256 KB+    | 3.5x       | 3.6x      | +3%         |

Measured on typical VM disk image blocks (ext4 filesystem data).

§Examples
§Basic Training
use hexz_core::algo::compression::zstd::ZstdCompressor;

// Collect 20 representative 64 KB blocks
let samples: Vec<Vec<u8>> = (0..20)
    .map(|i| vec![((i * 13) % 256) as u8; 65536])
    .collect();

// Train 110 KB dictionary
let dict = ZstdCompressor::train(&samples, 110 * 1024)?;
println!("Trained dictionary: {} bytes", dict.len());

// Use dictionary for compression
let compressor = ZstdCompressor::new(3, Some(dict));
§Training from File Samples
use hexz_core::algo::compression::zstd::ZstdCompressor;
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

// Read samples from disk (e.g., sampled from VM disk image)
let mut samples = Vec::new();
let mut file = File::open("disk.raw")?;
let file_size = file.metadata()?.len();
let block_size = 65536;
let sample_count: u64 = 50;
let step = file_size / sample_count;

for i in 0..sample_count {
    let mut buffer = vec![0u8; block_size];
    file.seek(SeekFrom::Start(i * step))?;
    file.read_exact(&mut buffer)?;
    samples.push(buffer);
}

let dict = ZstdCompressor::train(&samples, 110 * 1024)?;
println!("Trained from {} samples: {} bytes", samples.len(), dict.len());
§Validating Dictionary Quality
use hexz_core::algo::compression::{Compressor, zstd::ZstdCompressor};

let samples: Vec<Vec<u8>> = vec![vec![42u8; 32768]; 30];
let dict = ZstdCompressor::train(&samples, 64 * 1024)?;

// Compare compression ratios
let without_dict = ZstdCompressor::new(3, None);
let with_dict = ZstdCompressor::new(3, Some(dict));

let test_data = vec![42u8; 32768];
let compressed_raw = without_dict.compress(&test_data)?;
let compressed_dict = with_dict.compress(&test_data)?;

let improvement = (compressed_raw.len() as f64 / compressed_dict.len() as f64 - 1.0) * 100.0;
println!("Dictionary improved compression by {:.1}%", improvement);

// If improvement < 10%, dictionary may not be beneficial
§When Dictionary Training Fails or Performs Poorly

Dictionary training may produce poor results if:

  • Samples are unrepresentative: Training on zeros, compressing real data
  • Data is random/encrypted: No patterns exist to learn
  • Samples are too few: Less than 10 samples or less than 100x dict size
  • Data is already highly compressible: Text/logs may not benefit
  • Blocks are too large: 256 KB+ blocks have enough context without dictionary

If dictionary compression performs worse than raw compression, fall back to ZstdCompressor::new(level, None).
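The fallback rule can be made concrete. The helper below is illustrative (not a hexz_core API); it mirrors the improvement formula from the validation example above, with the 10% threshold taken from that guidance.

```rust
/// Decide whether a trained dictionary earns its keep, given compressed
/// sizes measured on the same held-out test block. The 10% threshold
/// follows the guidance above. (Hypothetical helper.)
fn keep_dictionary(raw_compressed: usize, dict_compressed: usize) -> bool {
    // improvement = raw/dict - 1, expressed as a percentage
    let improvement =
        (raw_compressed as f64 / dict_compressed as f64 - 1.0) * 100.0;
    improvement >= 10.0
}
```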

§Memory Usage During Training

Temporary memory allocated during training:

  • Input buffer: Sum of all sample sizes (e.g., 10 MB for 50 × 200 KB samples)
  • Working memory: ~10x max_size (e.g., ~1.1 MB for 110 KB dict)
  • Output dictionary: max_size (e.g., 110 KB)

Total peak memory: input_size + ~11×max_size (working memory plus the output dictionary). For typical usage (10 MB samples, 110 KB dict), peak memory is ~12 MB.
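The arithmetic above can be captured in a small estimator. This function is hypothetical; the 10× working-set multiplier is the approximation quoted above.

```rust
/// Estimated peak memory during training: the input corpus, ~10x the
/// dictionary budget of working memory, and the output dictionary itself.
/// (Hypothetical helper; multipliers are the approximations quoted above.)
fn training_peak_bytes(total_sample_bytes: usize, max_dict_size: usize) -> usize {
    total_sample_bytes + 10 * max_dict_size + max_dict_size
}
```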

Trait Implementations§

impl Compressor for ZstdCompressor

fn compress(&self, data: &[u8]) -> Result<Vec<u8>>

Compresses a block of data using Zstandard compression.

This method compresses data using the compression level and dictionary configured during construction. The output is a self-contained compressed block in zstd frame format.

§Parameters
  • data - The uncompressed input data to compress. Can be any size from 0 bytes to multiple gigabytes, though blocks of 64 KB to 1 MB are typical in Hexz.
§Returns

Returns Ok(Vec<u8>) containing the compressed data. The compressed size depends on:

  • Input data compressibility (random data: ~100%, structured data: 20-50%)
  • Compression level (higher levels = smaller output, slower compression)
  • Dictionary usage (can reduce output by 10-40% for small blocks)
§Errors

Returns Error::Compression if:

  • Internal zstd encoder initialization fails (rare, typically OOM)
  • Compression process fails (extremely rare with valid input)
§Dictionary Behavior
  • With dictionary: Uses streaming encoder with pre-parsed dictionary for maximum throughput. The dictionary is not embedded in the output; the decompressor must have the same dictionary.
  • Without dictionary: Uses simple one-shot encoding with zstd’s default dictionary learning from the input itself.
§Performance

Approximate throughput on modern hardware (AMD Ryzen 9 5950X):

  • Level 1: ~450 MB/s
  • Level 3: ~350 MB/s (default)
  • Level 9: ~85 MB/s
  • Level 19: ~28 MB/s

Dictionary overhead: ~5% slower than raw compression due to initialization.

§Examples
use hexz_core::algo::compression::{Compressor, zstd::ZstdCompressor};

let compressor = ZstdCompressor::new(3, None);
let data = b"Hello, world! Compression test data.";

let compressed = compressor.compress(data).unwrap();
println!("Compressed {} bytes to {} bytes", data.len(), compressed.len());

// Compressed data is self-contained and can be stored/transmitted
§Thread Safety

This method can be called concurrently from multiple threads on the same ZstdCompressor instance. Each call creates an independent encoder.

fn decompress(&self, data: &[u8]) -> Result<Vec<u8>>

Decompresses a Zstandard-compressed block into a new buffer.

This method reverses the compression performed by compress(), restoring the original uncompressed data. The decompressed output is allocated dynamically based on the compressed frame’s metadata.

§Parameters
  • data - The compressed input data in zstd frame format. Must have been compressed by a compatible ZstdCompressor (same dictionary, any level).
§Returns

Returns Ok(Vec<u8>) containing the decompressed data. The output size is determined by the compressed frame’s content size field (embedded during compression).

§Errors

Returns Error::Compression if:

  • data is not valid zstd-compressed data (corrupted or wrong format)
  • data was compressed with a dictionary, but this compressor has no dictionary
  • data was compressed without a dictionary, but this compressor has a dictionary
  • data was compressed with a different dictionary than this compressor
  • Internal decompression fails (checksum mismatch, corrupted data)
§Dictionary Compatibility

Critical: Dictionary-compressed data MUST be decompressed with the exact same dictionary bytes. The zstd format includes a dictionary ID checksum; mismatched dictionaries will cause decompression to fail with an error.

| Compressed With | Decompressed With | Result  |
|-----------------|-------------------|---------|
| No dictionary   | No dictionary     | Success |
| Dictionary A    | Dictionary A      | Success |
| No dictionary   | Dictionary A      | Error   |
| Dictionary A    | No dictionary     | Error   |
| Dictionary A    | Dictionary B      | Error   |
§Performance

Decompression speed is independent of compression level (level affects only compression time). Typical throughput on modern hardware:

  • Without dictionary: ~1100 MB/s
  • With dictionary: ~950 MB/s (10% overhead from dictionary lookups)

Decompression is roughly 3x faster than compression at level 3.

§Memory Allocation

This method allocates a new Vec<u8> to hold the decompressed output. For hot paths where the decompressed size is known, consider using decompress_into() to reuse buffers and avoid allocations.

§Examples
use hexz_core::algo::compression::{Compressor, zstd::ZstdCompressor};

let compressor = ZstdCompressor::new(3, None);
let original = b"Test data for compression";

let compressed = compressor.compress(original).unwrap();
let decompressed = compressor.decompress(&compressed).unwrap();

assert_eq!(original.as_slice(), decompressed.as_slice());
§Thread Safety

This method can be called concurrently from multiple threads on the same ZstdCompressor instance. Each call creates an independent decoder.

fn decompress_into(&self, data: &[u8], out: &mut [u8]) -> Result<usize>

Decompresses a Zstandard-compressed block into a caller-provided buffer.

This is a zero-allocation variant of decompress() that writes decompressed data directly into a pre-allocated buffer. This is ideal for hot paths where the decompressed size is known and buffers can be reused across multiple decompression operations.

§Parameters
  • data - The compressed input data in zstd frame format. Must have been compressed by a compatible ZstdCompressor (same dictionary, any level).
  • out - The output buffer to receive decompressed bytes. Must be large enough to hold the entire decompressed payload.
§Returns

Returns Ok(usize) containing the number of bytes written to out. This is always ≤ out.len().

§Errors

Returns Error::Compression if:

  • data is not valid zstd-compressed data
  • Dictionary mismatch (same rules as decompress())
  • out is too small to hold the decompressed data (buffer overflow protection)
  • Internal decompression fails (checksum mismatch, corrupted data)
§Buffer Sizing

The output buffer must be large enough to hold the full decompressed payload. If the buffer is too small, decompression will fail with an error rather than truncating output.

To determine the required size:

  • If you compressed the data, you know the original size
  • If reading from Hexz snapshots, the block size is in the index
  • The zstd frame header contains the content size (can be parsed)
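The third option can be sketched directly from the frame format specified in RFC 8878. The parser below is a minimal illustration, not a hexz_core API; in practice the `zstd` crate's own helpers should be preferred.

```rust
/// Minimal sketch of reading the Frame_Content_Size field from a zstd
/// frame header (per RFC 8878). Returns None when the size is not stored
/// in the frame or the header is malformed. (Illustrative only.)
fn frame_content_size(frame: &[u8]) -> Option<u64> {
    // A zstd frame starts with the little-endian magic number 0xFD2FB528.
    if frame.len() < 5 || frame[..4] != [0x28, 0xB5, 0x2F, 0xFD] {
        return None;
    }
    let desc = frame[4]; // Frame_Header_Descriptor
    let fcs_flag = desc >> 6; // Frame_Content_Size_flag (bits 7-6)
    let single_segment = desc & 0x20 != 0; // Single_Segment_flag (bit 5)
    let did_len = [0usize, 1, 2, 4][(desc & 0x03) as usize]; // Dictionary_ID size
    // Skip magic, descriptor, optional Window_Descriptor, and Dictionary_ID.
    let off = 5 + usize::from(!single_segment) + did_len;
    let fcs_len = match fcs_flag {
        0 if single_segment => 1,
        0 => return None, // content size not recorded in this frame
        1 => 2,
        2 => 4,
        _ => 8,
    };
    let bytes = frame.get(off..off + fcs_len)?;
    let mut size = 0u64;
    for (i, &b) in bytes.iter().enumerate() {
        size |= u64::from(b) << (8 * i); // little-endian accumulate
    }
    // For the 2-byte encoding the stored value is offset by 256.
    if fcs_flag == 1 {
        size += 256;
    }
    Some(size)
}
```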
§Performance

This method avoids heap allocation of the output buffer, making it suitable for high-throughput scenarios:

  • With reused buffer: 0 allocations per decompression
  • Throughput: Same as decompress() (~1000 MB/s)
  • Latency: ~5% lower than decompress() due to eliminated allocation

Recommended usage pattern for hot paths:

use hexz_core::algo::compression::{Compressor, zstd::ZstdCompressor};

let compressor = ZstdCompressor::new(3, None);
let original = vec![42u8; 65536]; // 64 KB data
let compressed = compressor.compress(&original).unwrap();
let mut reusable_buffer = vec![0u8; 65536]; // 64 KB buffer

// Reuse buffer for multiple decompressions
for _ in 0..1000 {
    let size = compressor.decompress_into(&compressed, &mut reusable_buffer).unwrap();
    // Process reusable_buffer[..size]
}
§Examples
§Basic Usage
use hexz_core::algo::compression::{Compressor, zstd::ZstdCompressor};

let compressor = ZstdCompressor::new(3, None);
let original = vec![42u8; 1024];

let compressed = compressor.compress(&original).unwrap();

// Decompress into pre-allocated buffer
let mut output = vec![0u8; 1024];
let size = compressor.decompress_into(&compressed, &mut output).unwrap();

assert_eq!(size, 1024);
assert_eq!(output, original);
§Buffer Too Small
use hexz_core::algo::compression::{Compressor, zstd::ZstdCompressor};

let compressor = ZstdCompressor::new(3, None);
let original = vec![42u8; 1024];
let compressed = compressor.compress(&original).unwrap();

// Provide insufficient buffer
let mut small_buffer = vec![0u8; 512];
let result = compressor.decompress_into(&compressed, &mut small_buffer);

// Per the Errors section above, an undersized buffer is reported as an error
assert!(result.is_err());
§Thread Safety

This method can be called concurrently from multiple threads on the same ZstdCompressor instance, provided each thread uses its own output buffer.

fn compress_into(&self, data: &[u8], out: &mut Vec<u8>) -> Result<()>

Compresses a block of data into a caller-provided output buffer. Read more
impl Debug for ZstdCompressor

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the compressor for debugging output.

Displays the compression level and whether a dictionary is present, without exposing sensitive dictionary contents.

§Output Format
ZstdCompressor { level: 3, has_dict: true }
§Examples
use hexz_core::algo::compression::zstd::ZstdCompressor;

let compressor = ZstdCompressor::new(5, None);
println!("{:?}", compressor);
// Outputs: ZstdCompressor { level: 5, has_dict: false }
