hexz_core/algo/compression/
zstd.rs

1//! Zstandard (zstd) compression with dictionary training support.
2//!
3//! This module provides a high-performance implementation of the Zstandard compression
4//! algorithm for Hexz's block-oriented storage system. Zstandard offers significantly
5//! better compression ratios than LZ4 while maintaining reasonable decompression speeds,
6//! making it ideal for snapshot storage where disk space efficiency is prioritized over
7//! raw throughput.
8//!
9//! # Zstandard Overview
10//!
11//! Zstandard is a modern compression algorithm developed by Facebook (Meta) that provides:
12//!
13//! - **Superior compression ratios**: 2-3x better than LZ4 on typical data, approaching gzip
14//!   levels while being 5-10x faster to decompress
15//! - **Tunable compression levels**: From level 1 (fast, ~400 MB/s) to level 22
16//!   (maximum compression, ~20 MB/s)
17//! - **Dictionary support**: Pre-trained dictionaries can improve compression by 10-40%
18//!   on small blocks (<64 KB) of structured data
19//! - **Fast decompression**: ~1 GB/s regardless of compression level, making it suitable
20//!   for read-heavy workloads
21//!
22//! # Dictionary Training
23//!
24//! Dictionary training is a powerful feature that analyzes representative samples of your
25//! data to build a reusable compression model. This is especially effective for:
26//!
27//! - **Small blocks**: Blocks under 64 KB benefit most, as regular compression cannot
28//!   build effective statistical models from limited data
29//! - **Structured data**: VM disk images, database pages, log files, and configuration
30//!   files with repeated patterns
31//! - **Homogeneous datasets**: Collections of similar files (e.g., all ext4 filesystem blocks)
32//!
33//! ## How Dictionary Training Works
34//!
35//! The training process analyzes a set of sample blocks to identify:
36//!
37//! 1. **Common byte sequences**: Frequently occurring patterns across samples
38//! 2. **Structural patterns**: Repeated headers, footers, or delimiters
39//! 3. **Statistical distributions**: Byte frequency distributions for entropy coding
40//!
41//! The resulting dictionary is then prepended (conceptually) to each compressed block,
42//! allowing the compressor to reference these patterns without encoding them repeatedly.
43//!
44//! ## Sample Requirements
45//!
46//! For effective dictionary training:
47//!
48//! - **Sample count**: Minimum 10 samples, ideally 50-100 representative blocks
49//! - **Sample size**: Each sample should be 1-4x your target block size
50//! - **Total data**: Aim for 100x the desired dictionary size (e.g., 10 MB of samples
51//!   for a 100 KB dictionary)
52//! - **Representativeness**: Samples must match production data patterns; training on
53//!   zeros and compressing real data will hurt compression
54//!
55//! ## Compression Ratio Improvements
56//!
57//! Typical improvements with dictionary compression (measured on 64 KB blocks):
58//!
59//! | Data Type           | Without Dict | With Dict | Improvement |
60//! |---------------------|--------------|-----------|-------------|
61//! | VM disk (ext4)      | 2.1x         | 3.2x      | +52%        |
62//! | Database pages      | 1.8x         | 2.9x      | +61%        |
63//! | Log files           | 3.5x         | 4.8x      | +37%        |
64//! | JSON configuration  | 4.2x         | 6.1x      | +45%        |
65//! | Random/encrypted    | 1.0x         | 1.0x      | 0%          |
66//!
67//! ## Memory Usage
68//!
69//! Dictionary memory overhead:
70//!
71//! - **Training**: ~10x dictionary size during training (110 KB dict = ~1.1 MB temporary)
72//! - **Compression**: ~3x dictionary size per encoder instance (~330 KB)
73//! - **Decompression**: ~1x dictionary size per decoder instance (~110 KB)
74//! - **Process lifetime**: Dictionaries are leaked to obtain `'static` lifetime
75//!
76//! In Hexz, dictionary bytes are typically 110 KB (zstd's recommended maximum), resulting
77//! in ~450 KB of permanent memory overhead per compressor instance.
78//!
79//! # Compression Level Selection
80//!
81//! Zstandard supports compression levels from 1 to 22, with different speed/ratio tradeoffs:
82//!
83//! ## Level Ranges and Characteristics
84//!
85//! | Level    | Compress Speed | Ratio vs Level 3 | Memory (Compress) | Use Case                |
86//! |----------|----------------|------------------|-------------------|-------------------------|
87//! | 1        | ~450 MB/s      | -8%              | ~1 MB             | Real-time compression   |
88//! | 3 (def)  | ~350 MB/s      | baseline         | ~2 MB             | General purpose         |
89//! | 5-7      | ~200 MB/s      | +5%              | ~4 MB             | Balanced                |
90//! | 9-12     | ~80 MB/s       | +12%             | ~8 MB             | Archive creation        |
91//! | 15-19    | ~30 MB/s       | +18%             | ~32 MB            | Cold storage            |
92//! | 20-22    | ~10 MB/s       | +22%             | ~64 MB            | Maximum compression     |
93//!
94//! **Decompression speed**: ~1000 MB/s for all levels (level does not affect decompression)
95//!
96//! ## Recommended Settings by Data Type
97//!
98//! ### VM Disk Images (Mixed Content)
99//! - **Level 3**: Good balance for general disk snapshots
100//! - **Dictionary**: Strongly recommended, +40-60% ratio improvement
101//! - **Rationale**: Mixed content benefits from adaptive compression
102//!
103//! ### Database Files (Structured Pages)
104//! - **Level 5-7**: Higher ratio helps with large database archives
105//! - **Dictionary**: Critical for small page sizes (<16 KB)
106//! - **Rationale**: Structured data compresses well with more analysis
107//!
108//! ### Log Files (Highly Compressible Text)
109//! - **Level 1-3**: Logs already compress extremely well
110//! - **Dictionary**: Optional, text is self-describing
111//! - **Rationale**: Diminishing returns at higher levels
112//!
113//! ### Memory Snapshots (Low Entropy)
114//! - **Level 3**: Memory pages often contain zeros/patterns
115//! - **Dictionary**: Not beneficial for homogeneous data
116//! - **Rationale**: Fast compression for potentially large datasets
117//!
118//! ### Configuration/JSON (Small Files)
119//! - **Level 9**: Small files justify slower compression
120//! - **Dictionary**: Highly effective for structured text
121//! - **Rationale**: One-time compression cost, repeated reads
122//!
123//! # When to Use Dictionary vs Raw Compression
124//!
125//! ## Use Dictionary When:
126//! - Block size is ≤64 KB (most effective at 16-64 KB)
127//! - Data has repeated structure (headers, schemas, common fields)
128//! - Compression ratio is more important than speed
129//! - You can provide 10+ representative samples for training
130//! - All compressed blocks will use the same dictionary
131//!
132//! ## Use Raw Compression When:
133//! - Block size is ≥256 KB (dictionary overhead outweighs benefits)
134//! - Data is highly random or encrypted (no patterns to exploit)
135//! - Compression speed is critical
136//! - Representative samples are unavailable
137//! - Each block has unique characteristics
138//!
139//! # Performance Characteristics
140//!
141//! Benchmarked on AMD Ryzen 9 5950X, single-threaded:
142//!
143//! ```text
144//! Compression (64 KB blocks, structured data):
145//!   Level 1:  420 MB/s @ 2.8x ratio
146//!   Level 3:  340 MB/s @ 3.2x ratio  ← default
147//!   Level 9:   85 MB/s @ 3.8x ratio
148//!   Level 19:  28 MB/s @ 4.1x ratio
149//!
150//! Decompression (all levels):
151//!   Without dict: ~1100 MB/s
152//!   With dict:    ~950 MB/s (10% overhead)
153//!
154//! Dictionary training (110 KB dict, 10 MB samples):
155//!   Training time: ~200ms
156//!   One-time cost amortized over millions of blocks
157//! ```
158//!
159//! Compared to LZ4 (Hexz's fast compression option):
160//! - **Compression ratio**: Zstd-3 is ~1.8x better than LZ4
161//! - **Compression speed**: LZ4 is ~6x faster (~2000 MB/s)
162//! - **Decompression speed**: LZ4 is ~3x faster (~3000 MB/s)
163//!
164//! **Tradeoff**: Use Zstd when storage cost exceeds CPU cost, LZ4 when latency matters most.
165//!
166//! # Examples
167//!
168//! ## Basic Compression (No Dictionary)
169//!
170//! ```
171//! use hexz_core::algo::compression::{Compressor, zstd::ZstdCompressor};
172//!
173//! // Create compressor at default level (3)
174//! let compressor = ZstdCompressor::new(3, None);
175//!
176//! let data = b"Hello, world! This is some data to compress.";
177//! let compressed = compressor.compress(data).unwrap();
178//! let decompressed = compressor.decompress(&compressed).unwrap();
179//!
180//! assert_eq!(data.as_slice(), decompressed.as_slice());
181//! println!("Original: {} bytes, Compressed: {} bytes", data.len(), compressed.len());
182//! ```
183//!
184//! ## Dictionary Training Workflow
185//!
186//! ```no_run
187//! use hexz_core::algo::compression::{Compressor, zstd::ZstdCompressor};
188//! use std::fs::File;
189//! use std::io::Read;
190//!
191//! # fn main() -> Result<(), Box<dyn std::error::Error>> {
192//! // Step 1: Collect representative samples (10-100 blocks)
193//! let mut samples = Vec::new();
194//! for i in 0..50 {
195//!     let mut file = File::open(format!("samples/block_{}.dat", i))?;
196//!     let mut sample = Vec::new();
197//!     file.read_to_end(&mut sample)?;
198//!     samples.push(sample);
199//! }
200//!
201//! // Step 2: Train dictionary (max 110 KB)
202//! let dict = ZstdCompressor::train(&samples, 110 * 1024)?;
203//! println!("Trained dictionary: {} bytes", dict.len());
204//!
205//! // Step 3: Create compressor with dictionary
206//! let compressor = ZstdCompressor::new(3, Some(dict));
207//!
208//! // Step 4: Compress production data
209//! let data = b"Production data with similar structure to samples";
210//! let compressed = compressor.compress(data)?;
211//!
212//! // Step 5: Decompress (must use same compressor instance with same dict)
213//! let decompressed = compressor.decompress(&compressed)?;
214//! assert_eq!(data.as_slice(), decompressed.as_slice());
215//! # Ok(())
216//! # }
217//! ```
218//!
219//! ## High Compression for Archives
220//!
221//! ```
222//! use hexz_core::algo::compression::{Compressor, zstd::ZstdCompressor};
223//!
224//! // Use level 19 for maximum compression (slow)
225//! let compressor = ZstdCompressor::new(19, None);
226//!
227//! let large_data = vec![0u8; 1_000_000];
228//! let compressed = compressor.compress(&large_data).unwrap();
229//!
230//! // Compression is slow, but decompression is still fast
231//! let decompressed = compressor.decompress(&compressed).unwrap();
232//! println!("Compressed 1 MB to {} bytes ({:.1}x ratio)",
233//!          compressed.len(),
234//!          large_data.len() as f64 / compressed.len() as f64);
235//! ```
236//!
237//! ## Buffer Reuse for Hot Paths
238//!
239//! ```
240//! use hexz_core::algo::compression::{Compressor, zstd::ZstdCompressor};
241//!
242//! let compressor = ZstdCompressor::new(3, None);
243//! let data = vec![42u8; 65536];
244//! let compressed = compressor.compress(&data).unwrap();
245//!
246//! // Reuse buffer for multiple decompressions to avoid allocations
247//! let mut output_buffer = vec![0u8; 65536];
248//! let size = compressor.decompress_into(&compressed, &mut output_buffer).unwrap();
249//!
250//! assert_eq!(size, data.len());
251//! assert_eq!(output_buffer, data);
252//! ```
253//!
254//! # Thread Safety
255//!
256//! `ZstdCompressor` implements `Send + Sync` and can be safely shared across threads.
257//! Each compression/decompression operation is independent and does not modify the
258//! compressor state. The dictionary is immutable after construction.
259//!
260//! # Architectural Integration
261//!
262//! In Hexz's architecture:
263//! - **Format layer**: Stores compression type in snapshot header
264//! - **Pack operations**: Optionally trains dictionaries during snapshot creation
265//! - **Read operations**: Instantiates compressor with stored dictionary
266//! - **CLI**: Provides `--compression=zstd` flag and `--train-dict` option
267//!
268//! The same dictionary bytes must be available for both compression and decompression,
269//! so Hexz embeds trained dictionaries in the snapshot file header.
270
271use crate::algo::compression::Compressor;
272use hexz_common::{Error, Result};
273use std::io::{Cursor, Read, Write};
274use zstd::dict::{DecoderDictionary, EncoderDictionary};
275
276/// Zstandard compressor with optional pre-trained dictionary.
277///
278/// This compressor wraps the `zstd` crate and provides both raw compression
279/// (no dictionary) and dictionary-enhanced compression for improved ratios on
280/// structured data.
281///
282/// # Dictionary Lifecycle
283///
284/// When a dictionary is provided:
285/// 1. The dictionary bytes are cloned and **leaked** to obtain `'static` lifetime
286/// 2. Both encoder and decoder dictionaries are constructed from the leaked bytes
287/// 3. The dictionary memory persists for the process lifetime
288/// 4. Multiple compressor instances can share the same dictionary bytes
289///
290/// This design trades memory (leaked dictionary) for simplicity and safety. In typical
291/// Hexz usage, one compressor instance exists per snapshot file, so the overhead is
292/// ~450 KB per open snapshot (110 KB dict × ~4x internal structures).
293///
294/// # Thread Safety
295///
296/// `ZstdCompressor` is `Send + Sync`. Compression and decompression operations do not
297/// mutate the compressor state, allowing safe concurrent use from multiple threads.
298/// Each operation allocates its own temporary encoder/decoder.
299///
300/// # Constraints
301///
302/// - **Dictionary compatibility**: Blocks compressed with a dictionary MUST be
303///   decompressed with the exact same dictionary bytes. Attempting to decompress
304///   with a different or missing dictionary will fail with a compression error.
305/// - **Level consistency**: The compression level is stored in the encoder dictionary.
306///   Changing the level requires training a new dictionary.
307///
308/// # Examples
309///
310/// ```
311/// use hexz_core::algo::compression::{Compressor, zstd::ZstdCompressor};
312///
313/// // Create compressor without dictionary
314/// let compressor = ZstdCompressor::new(3, None);
315/// let data = b"test data";
316/// let compressed = compressor.compress(data).unwrap();
317/// let decompressed = compressor.decompress(&compressed).unwrap();
318/// assert_eq!(data.as_slice(), decompressed.as_slice());
319/// ```
320pub struct ZstdCompressor {
321    level: i32,
322    encoder_dict: Option<EncoderDictionary<'static>>,
323    decoder_dict: Option<DecoderDictionary<'static>>,
324}
325
326impl std::fmt::Debug for ZstdCompressor {
327    /// Formats the compressor for debugging output.
328    ///
329    /// Displays the compression level and whether a dictionary is present,
330    /// without exposing sensitive dictionary contents.
331    ///
332    /// # Output Format
333    ///
334    /// ```text
335    /// ZstdCompressor { level: 3, has_dict: true }
336    /// ```
337    ///
338    /// # Examples
339    ///
340    /// ```
341    /// use hexz_core::algo::compression::zstd::ZstdCompressor;
342    ///
343    /// let compressor = ZstdCompressor::new(5, None);
344    /// println!("{:?}", compressor);
345    /// // Outputs: ZstdCompressor { level: 5, has_dict: false }
346    /// ```
347    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
348        f.debug_struct("ZstdCompressor")
349            .field("level", &self.level)
350            .field("has_dict", &self.encoder_dict.is_some())
351            .finish()
352    }
353}
354
355impl ZstdCompressor {
356    /// Creates a new Zstandard compressor with the specified compression level and optional dictionary.
357    ///
358    /// # Parameters
359    ///
360    /// * `level` - Compression level from 1 to 22:
361    ///   - `1`: Fastest compression (~450 MB/s), lower ratio
362    ///   - `3`: Default, balanced speed/ratio (~350 MB/s)
363    ///   - `9`: High compression (~85 MB/s), good ratio
364    ///   - `19-22`: Maximum compression (~10-30 MB/s), best ratio
365    ///
366    /// * `dict` - Optional pre-trained dictionary bytes:
367    ///   - `None`: Use raw zstd compression
368    ///   - `Some(dict_bytes)`: Use dictionary-enhanced compression
369    ///
370    /// # Dictionary Handling
371    ///
372    /// When a dictionary is provided, this function:
373    /// 1. **Copies** the dictionary bytes internally via `EncoderDictionary::copy`
374    /// 2. **Parses** the bytes into native zstd encoder/decoder dictionaries
375    /// 3. **Manages** the dictionary lifetime automatically (no leaks)
376    ///
377    /// The dictionary memory (~110 KB for typical dictionaries) is properly managed
378    /// and freed when the compressor is dropped. This provides:
379    /// - Proper memory management without leaks
380    /// - Dictionary reuse across millions of blocks
381    /// - Memory safety with automatic cleanup
382    ///
383    /// # Memory Usage
384    ///
385    /// Approximate memory overhead per compressor instance:
386    /// - No dictionary: ~10 KB (minimal bookkeeping)
387    /// - With 110 KB dictionary: ~450 KB (leaked bytes + encoder/decoder structures)
388    ///
389    /// # Examples
390    ///
391    /// ```
392    /// use hexz_core::algo::compression::zstd::ZstdCompressor;
393    ///
394    /// // Fast compression, no dictionary
395    /// let fast = ZstdCompressor::new(1, None);
396    ///
397    /// // Balanced compression with dictionary
398    /// let dict = vec![0u8; 1024]; // Placeholder dictionary
399    /// let balanced = ZstdCompressor::new(3, Some(dict));
400    ///
401    /// // Maximum compression for archival
402    /// let max = ZstdCompressor::new(22, None);
403    /// ```
404    ///
405    /// # Performance Notes
406    ///
407    /// Creating a compressor is relatively expensive (~1 ms with dictionary due to parsing).
408    /// Reuse compressor instances rather than creating them per-operation.
409    pub fn new(level: i32, dict: Option<Vec<u8>>) -> Self {
410        let (encoder_dict, decoder_dict) = if let Some(d) = &dict {
411            // EncoderDictionary::copy and DecoderDictionary::copy both copy the
412            // dictionary data internally, so we only need a temporary reference.
413            (
414                Some(EncoderDictionary::copy(d, level)),
415                Some(DecoderDictionary::copy(d)),
416            )
417        } else {
418            (None, None)
419        };
420
421        Self {
422            level,
423            encoder_dict,
424            decoder_dict,
425        }
426    }
427
428    /// Trains a Zstandard dictionary from representative sample blocks.
429    ///
430    /// Dictionary training analyzes a collection of sample data to identify common patterns,
431    /// sequences, and statistical distributions. The resulting dictionary acts as a "seed"
432    /// for the compressor, enabling better compression ratios on small blocks that would
433    /// otherwise lack sufficient data to build effective models.
434    ///
435    /// # Training Algorithm
436    ///
437    /// The training process:
438    /// 1. **Concatenates** all samples into a training corpus
439    /// 2. **Analyzes** byte-level patterns using suffix arrays and frequency analysis
440    /// 3. **Selects** the most valuable patterns up to `max_size` bytes
441    /// 4. **Optimizes** dictionary layout for fast lookup during compression
442    /// 5. **Returns** the trained dictionary as a byte vector
443    ///
444    /// This is a CPU-intensive operation (O(n log n) where n is total sample bytes) and
445    /// should be done once during snapshot creation, not per-block.
446    ///
447    /// # Parameters
448    ///
449    /// * `samples` - A slice of representative data blocks. Requirements:
450    ///   - **Minimum count**: 10 samples (20+ recommended, 50+ ideal)
451    ///   - **Minimum total size**: 100x `max_size` (e.g., 10 MB for 100 KB dictionary)
452    ///   - **Representativeness**: Must match production data patterns
453    ///   - **Diversity**: Include variety of structures, not just repeated copies
454    ///
455    /// * `max_size` - Maximum dictionary size in bytes. Recommendations:
456    ///   - **Small blocks (16-32 KB)**: 64 KB dictionary
457    ///   - **Medium blocks (64 KB)**: 110 KB dictionary (zstd's recommended max)
458    ///   - **Large blocks (128+ KB)**: Diminishing returns, consider skipping dictionary
459    ///
460    /// # Returns
461    ///
462    /// Returns `Ok(Vec<u8>)` containing the trained dictionary bytes, or `Err` if training fails.
463    /// The actual dictionary size may be less than `max_size` if fewer patterns were found.
464    ///
465    /// # Errors
466    ///
467    /// Returns `Error::Compression` if:
468    /// - Samples are empty or too small (less than ~1 KB total)
469    /// - `max_size` is invalid (0 or excessively large)
470    /// - Internal zstd training algorithm fails (corrupted samples, out of memory)
471    ///
472    /// # Performance Characteristics
473    ///
474    /// Training time on AMD Ryzen 9 5950X:
475    /// - 1 MB samples, 64 KB dict: ~50 ms
476    /// - 10 MB samples, 110 KB dict: ~200 ms
477    /// - 100 MB samples, 110 KB dict: ~2 seconds
478    ///
479    /// Training is approximately O(n log n) in total sample size.
480    ///
481    /// # Compression Ratio Impact
482    ///
483    /// Expected compression ratio improvements with trained dictionary vs. raw compression:
484    ///
485    /// | Block Size | Raw Zstd-3 | With Dict | Improvement |
486    /// |------------|------------|-----------|-------------|
487    /// | 16 KB      | 1.5x       | 2.4x      | +60%        |
488    /// | 32 KB      | 2.1x       | 3.2x      | +52%        |
489    /// | 64 KB      | 2.8x       | 3.9x      | +39%        |
490    /// | 128 KB     | 3.2x       | 3.7x      | +16%        |
491    /// | 256 KB+    | 3.5x       | 3.6x      | +3%         |
492    ///
493    /// Measured on typical VM disk image blocks (ext4 filesystem data).
494    ///
495    /// # Examples
496    ///
497    /// ## Basic Training
498    ///
499    /// ```no_run
500    /// use hexz_core::algo::compression::zstd::ZstdCompressor;
501    ///
502    /// # fn main() -> Result<(), Box<dyn std::error::Error>> {
503    /// // Collect 20 representative 64 KB blocks
504    /// let samples: Vec<Vec<u8>> = (0..20)
505    ///     .map(|i| vec![((i * 13) % 256) as u8; 65536])
506    ///     .collect();
507    ///
508    /// // Train 110 KB dictionary
509    /// let dict = ZstdCompressor::train(&samples, 110 * 1024)?;
510    /// println!("Trained dictionary: {} bytes", dict.len());
511    ///
512    /// // Use dictionary for compression
513    /// let compressor = ZstdCompressor::new(3, Some(dict));
514    /// # Ok(())
515    /// # }
516    /// ```
517    ///
518    /// ## Training from File Samples
519    ///
520    /// ```no_run
521    /// use hexz_core::algo::compression::zstd::ZstdCompressor;
522    /// use std::fs::File;
523    /// use std::io::Read;
524    ///
525    /// # fn main() -> Result<(), Box<dyn std::error::Error>> {
526    /// // Read samples from disk (e.g., sampled from VM disk image)
527    /// let mut samples = Vec::new();
528    /// let mut file = File::open("disk.raw")?;
529    /// let file_size = file.metadata()?.len();
530    /// let block_size = 65536;
531    /// let sample_count = 50;
532    /// let step = file_size / sample_count;
533    ///
534    /// for i in 0..sample_count {
535    ///     let mut buffer = vec![0u8; block_size];
536    ///     let offset = i * step;
537    ///     // Seek to offset and read block
538    ///     // (simplified, real code needs error handling)
539    ///     samples.push(buffer);
540    /// }
541    ///
542    /// let dict = ZstdCompressor::train(&samples, 110 * 1024)?;
543    /// println!("Trained from {} samples: {} bytes", samples.len(), dict.len());
544    /// # Ok(())
545    /// # }
546    /// ```
547    ///
548    /// ## Validating Dictionary Quality
549    ///
550    /// ```no_run
551    /// use hexz_core::algo::compression::{Compressor, zstd::ZstdCompressor};
552    ///
553    /// # fn main() -> Result<(), Box<dyn std::error::Error>> {
554    /// let samples: Vec<Vec<u8>> = vec![vec![42u8; 32768]; 30];
555    /// let dict = ZstdCompressor::train(&samples, 64 * 1024)?;
556    ///
557    /// // Compare compression ratios
558    /// let without_dict = ZstdCompressor::new(3, None);
559    /// let with_dict = ZstdCompressor::new(3, Some(dict));
560    ///
561    /// let test_data = vec![42u8; 32768];
562    /// let compressed_raw = without_dict.compress(&test_data)?;
563    /// let compressed_dict = with_dict.compress(&test_data)?;
564    ///
565    /// let improvement = (compressed_raw.len() as f64 / compressed_dict.len() as f64 - 1.0) * 100.0;
566    /// println!("Dictionary improved compression by {:.1}%", improvement);
567    ///
568    /// // If improvement < 10%, dictionary may not be beneficial
569    /// # Ok(())
570    /// # }
571    /// ```
572    ///
573    /// # When Dictionary Training Fails or Performs Poorly
574    ///
575    /// Dictionary training may produce poor results if:
576    /// - **Samples are unrepresentative**: Training on zeros, compressing real data
577    /// - **Data is random/encrypted**: No patterns exist to learn
578    /// - **Samples are too few**: Less than 10 samples or less than 100x dict size
579    /// - **Data is already highly compressible**: Text/logs may not benefit
580    /// - **Blocks are too large**: 256 KB+ blocks have enough context without dictionary
581    ///
582    /// If dictionary compression performs worse than raw compression, fall back to
583    /// `ZstdCompressor::new(level, None)`.
584    ///
585    /// # Memory Usage During Training
586    ///
587    /// Temporary memory allocated during training:
588    /// - **Input buffer**: Sum of all sample sizes (e.g., 10 MB for 50 × 200 KB samples)
589    /// - **Working memory**: ~10x `max_size` (e.g., ~1.1 MB for 110 KB dict)
590    /// - **Output dictionary**: `max_size` (e.g., 110 KB)
591    ///
592    /// Total peak memory: input_size + 10×max_size. For typical usage (10 MB samples,
593    /// 110 KB dict), peak memory is ~12 MB.
594    pub fn train(samples: &[Vec<u8>], max_size: usize) -> Result<Vec<u8>> {
595        zstd::dict::from_samples(samples, max_size)
596            .map_err(|e| Error::Compression(format!("Failed to train dict: {}", e)))
597    }
598
599    /// Reads decompressed bytes from a zstd decoder into the provided buffer.
600    ///
601    /// This is an internal helper function that drains a streaming decoder (with or without
602    /// dictionary) into a contiguous output buffer. It handles partial reads gracefully and
603    /// returns the total number of bytes read.
604    ///
605    /// # Parameters
606    ///
607    /// * `reader` - A mutable reference to any type implementing `Read` (typically a
608    ///   `zstd::stream::read::Decoder`)
609    /// * `out` - The output buffer to fill with decompressed bytes
610    ///
611    /// # Returns
612    ///
613    /// Returns `Ok(usize)` containing the number of bytes written to `out`. This may be
614    /// less than `out.len()` if the decoder reaches EOF before the buffer is full.
615    ///
616    /// # Errors
617    ///
618    /// Returns `Error::Compression` if the underlying `Read` operation fails due to:
619    /// - Corrupted compressed data
620    /// - I/O errors reading from the source
621    /// - Decompression algorithm errors
622    ///
623    /// # Implementation Notes
624    ///
625    /// This function loops until either:
626    /// - The output buffer is completely filled (`total == out.len()`)
627    /// - The decoder returns 0 bytes (EOF condition)
628    ///
629    /// Each `read()` call may return fewer bytes than requested, so we accumulate
630    /// bytes until one of the terminal conditions is met.
631    fn read_into_buf<R: Read>(reader: &mut R, out: &mut [u8]) -> Result<usize> {
632        let mut total = 0;
633        while total < out.len() {
634            let n = reader
635                .read(&mut out[total..])
636                .map_err(|e| Error::Compression(e.to_string()))?;
637            if n == 0 {
638                break;
639            }
640            total += n;
641        }
642        // Check if there's more data that didn't fit in the buffer
643        if total == out.len() {
644            let mut extra = [0u8; 1];
645            let n = reader
646                .read(&mut extra)
647                .map_err(|e| Error::Compression(e.to_string()))?;
648            if n > 0 {
649                return Err(Error::Compression(format!(
650                    "Decompressed data exceeds output buffer size ({})",
651                    out.len()
652                )));
653            }
654        }
655        Ok(total)
656    }
657}
658
659impl Compressor for ZstdCompressor {
660    /// Compresses a block of data using Zstandard compression.
661    ///
662    /// This method compresses `data` using the compression level and dictionary
663    /// configured during construction. The output is a self-contained compressed
664    /// block in zstd frame format.
665    ///
666    /// # Parameters
667    ///
668    /// * `data` - The uncompressed input data to compress. Can be any size from 0 bytes
669    ///   to multiple gigabytes, though blocks of 64 KB to 1 MB are typical in Hexz.
670    ///
671    /// # Returns
672    ///
673    /// Returns `Ok(Vec<u8>)` containing the compressed data. The compressed size depends on:
674    /// - Input data compressibility (random data: ~100%, structured data: 20-50%)
675    /// - Compression level (higher levels = smaller output, slower compression)
676    /// - Dictionary usage (can reduce output by 10-40% for small blocks)
677    ///
678    /// # Errors
679    ///
680    /// Returns `Error::Compression` if:
681    /// - Internal zstd encoder initialization fails (rare, typically OOM)
682    /// - Compression process fails (extremely rare with valid input)
683    ///
684    /// # Dictionary Behavior
685    ///
686    /// - **With dictionary**: Uses streaming encoder with pre-parsed dictionary for
687    ///   maximum throughput. The dictionary is **not** embedded in the output; the
688    ///   decompressor must have the same dictionary.
689    /// - **Without dictionary**: Uses simple one-shot encoding with zstd's default
690    ///   dictionary learning from the input itself.
691    ///
692    /// # Performance
693    ///
694    /// Approximate throughput on modern hardware (AMD Ryzen 9 5950X):
695    /// - Level 1: ~450 MB/s
696    /// - Level 3: ~350 MB/s (default)
697    /// - Level 9: ~85 MB/s
698    /// - Level 19: ~28 MB/s
699    ///
700    /// Dictionary overhead: ~5% slower than raw compression due to initialization.
701    ///
702    /// # Examples
703    ///
704    /// ```
705    /// use hexz_core::algo::compression::{Compressor, zstd::ZstdCompressor};
706    ///
707    /// let compressor = ZstdCompressor::new(3, None);
708    /// let data = b"Hello, world! Compression test data.";
709    ///
710    /// let compressed = compressor.compress(data).unwrap();
711    /// println!("Compressed {} bytes to {} bytes", data.len(), compressed.len());
712    ///
713    /// // Compressed data is self-contained and can be stored/transmitted
714    /// ```
715    ///
716    /// # Thread Safety
717    ///
718    /// This method can be called concurrently from multiple threads on the same
719    /// `ZstdCompressor` instance. Each call creates an independent encoder.
720    fn compress(&self, data: &[u8]) -> Result<Vec<u8>> {
721        if let Some(dict) = &self.encoder_dict {
722            let mut encoder = zstd::stream::write::Encoder::with_prepared_dictionary(
723                Vec::with_capacity(data.len()),
724                dict,
725            )
726            .map_err(|e| Error::Compression(e.to_string()))?;
727
728            encoder
729                .write_all(data)
730                .map_err(|e| Error::Compression(e.to_string()))?;
731            encoder
732                .finish()
733                .map_err(|e| Error::Compression(e.to_string()))
734        } else {
735            zstd::stream::encode_all(Cursor::new(data), self.level)
736                .map_err(|e| Error::Compression(e.to_string()))
737        }
738    }
739
740    /// Decompresses a Zstandard-compressed block into a new buffer.
741    ///
742    /// This method reverses the compression performed by `compress()`, restoring the
743    /// original uncompressed data. The decompressed output is allocated dynamically
744    /// based on the compressed frame's metadata.
745    ///
746    /// # Parameters
747    ///
748    /// * `data` - The compressed input data in zstd frame format. Must have been
749    ///   compressed by a compatible `ZstdCompressor` (same dictionary, any level).
750    ///
751    /// # Returns
752    ///
753    /// Returns `Ok(Vec<u8>)` containing the decompressed data. The output size is
754    /// determined by the compressed frame's content size field (embedded during
755    /// compression).
756    ///
757    /// # Errors
758    ///
759    /// Returns `Error::Compression` if:
760    /// - `data` is not valid zstd-compressed data (corrupted or wrong format)
761    /// - `data` was compressed with a dictionary, but this compressor has no dictionary
762    /// - `data` was compressed without a dictionary, but this compressor has a dictionary
763    /// - `data` was compressed with a different dictionary than this compressor
764    /// - Internal decompression fails (checksum mismatch, corrupted data)
765    ///
766    /// # Dictionary Compatibility
767    ///
768    /// **Critical**: Dictionary-compressed data MUST be decompressed with the exact same
769    /// dictionary bytes. The zstd format includes a dictionary ID checksum; mismatched
770    /// dictionaries will cause decompression to fail with an error.
771    ///
772    /// | Compressed With | Decompressed With | Result          |
773    /// |-----------------|-------------------|-----------------|
774    /// | No dictionary   | No dictionary     | Success         |
775    /// | Dictionary A    | Dictionary A      | Success         |
776    /// | No dictionary   | Dictionary A      | Error           |
777    /// | Dictionary A    | No dictionary     | Error           |
778    /// | Dictionary A    | Dictionary B      | Error           |
779    ///
780    /// # Performance
781    ///
782    /// Decompression speed is independent of compression level (level affects only
783    /// compression time). Typical throughput on modern hardware:
784    /// - Without dictionary: ~1100 MB/s
785    /// - With dictionary: ~950 MB/s (10% overhead from dictionary lookups)
786    ///
787    /// Decompression is roughly 3x faster than compression at level 3.
788    ///
789    /// # Memory Allocation
790    ///
791    /// This method allocates a new `Vec<u8>` to hold the decompressed output. For
792    /// hot paths where the decompressed size is known, consider using `decompress_into()`
793    /// to reuse buffers and avoid allocations.
794    ///
795    /// # Examples
796    ///
797    /// ```
798    /// use hexz_core::algo::compression::{Compressor, zstd::ZstdCompressor};
799    ///
800    /// let compressor = ZstdCompressor::new(3, None);
801    /// let original = b"Test data for compression";
802    ///
803    /// let compressed = compressor.compress(original).unwrap();
804    /// let decompressed = compressor.decompress(&compressed).unwrap();
805    ///
806    /// assert_eq!(original.as_slice(), decompressed.as_slice());
807    /// ```
808    ///
809    /// # Thread Safety
810    ///
811    /// This method can be called concurrently from multiple threads on the same
812    /// `ZstdCompressor` instance. Each call creates an independent decoder.
813    fn compress_into(&self, data: &[u8], out: &mut Vec<u8>) -> Result<()> {
814        out.clear();
815        if let Some(dict) = &self.encoder_dict {
816            let mut encoder =
817                zstd::stream::write::Encoder::with_prepared_dictionary(std::mem::take(out), dict)
818                    .map_err(|e| Error::Compression(e.to_string()))?;
819
820            encoder
821                .write_all(data)
822                .map_err(|e| Error::Compression(e.to_string()))?;
823            *out = encoder
824                .finish()
825                .map_err(|e| Error::Compression(e.to_string()))?;
826        } else {
827            let compressed = zstd::stream::encode_all(Cursor::new(data), self.level)
828                .map_err(|e| Error::Compression(e.to_string()))?;
829            *out = compressed;
830        }
831        Ok(())
832    }
833
834    fn decompress(&self, data: &[u8]) -> Result<Vec<u8>> {
835        const MAX_DECOMPRESSED: u64 = 128 * 1024 * 1024; // 128 MB
836
837        if let Some(dict) = &self.decoder_dict {
838            // Pre-allocate output buffer using frame content size when available,
839            // capped to prevent OOM from crafted frame headers.
840            let frame_size = zstd::zstd_safe::get_frame_content_size(data)
841                .ok()
842                .flatten()
843                .unwrap_or(data.len() as u64 * 2);
844            if frame_size > MAX_DECOMPRESSED {
845                return Err(Error::Compression(format!(
846                    "claimed decompressed size ({frame_size} bytes) exceeds limit ({MAX_DECOMPRESSED} bytes)"
847                )));
848            }
849            let prealloc_cap = frame_size as usize;
850
851            let mut decoder =
852                zstd::stream::read::Decoder::with_prepared_dictionary(Cursor::new(data), dict)
853                    .map_err(|e| Error::Compression(e.to_string()))?;
854
855            let mut out = Vec::with_capacity(prealloc_cap);
856            decoder
857                .read_to_end(&mut out)
858                .map_err(|e| Error::Compression(e.to_string()))?;
859            Ok(out)
860        } else {
861            // Check frame content size for the non-dictionary path too.
862            if let Some(frame_size) = zstd::zstd_safe::get_frame_content_size(data).ok().flatten() {
863                if frame_size > MAX_DECOMPRESSED {
864                    return Err(Error::Compression(format!(
865                        "claimed decompressed size ({frame_size} bytes) exceeds limit ({MAX_DECOMPRESSED} bytes)"
866                    )));
867                }
868            }
869            // decode_all is a highly optimized single-call API — faster than
870            // Decoder::new() + read_to_end() for the non-dictionary path.
871            zstd::stream::decode_all(Cursor::new(data))
872                .map_err(|e| Error::Compression(e.to_string()))
873        }
874    }
875
876    /// Decompresses a Zstandard-compressed block into a caller-provided buffer.
877    ///
878    /// This is a zero-allocation variant of `decompress()` that writes decompressed
879    /// data directly into a pre-allocated buffer. This is ideal for hot paths where
880    /// the decompressed size is known and buffers can be reused across multiple
881    /// decompression operations.
882    ///
883    /// # Parameters
884    ///
885    /// * `data` - The compressed input data in zstd frame format. Must have been
886    ///   compressed by a compatible `ZstdCompressor` (same dictionary, any level).
887    /// * `out` - The output buffer to receive decompressed bytes. Must be large enough
888    ///   to hold the entire decompressed payload.
889    ///
890    /// # Returns
891    ///
892    /// Returns `Ok(usize)` containing the number of bytes written to `out`. This is
893    /// always ≤ `out.len()`.
894    ///
895    /// # Errors
896    ///
897    /// Returns `Error::Compression` if:
898    /// - `data` is not valid zstd-compressed data
899    /// - Dictionary mismatch (same rules as `decompress()`)
900    /// - `out` is too small to hold the decompressed data (buffer overflow protection)
901    /// - Internal decompression fails (checksum mismatch, corrupted data)
902    ///
903    /// # Buffer Sizing
904    ///
905    /// The output buffer must be large enough to hold the full decompressed payload.
906    /// If the buffer is too small, decompression will fail with an error rather than
907    /// truncating output.
908    ///
909    /// To determine the required size:
910    /// - If you compressed the data, you know the original size
911    /// - If reading from Hexz snapshots, the block size is in the index
912    /// - The zstd frame header contains the content size (can be parsed)
913    ///
914    /// # Performance
915    ///
916    /// This method avoids heap allocation of the output buffer, making it suitable for
917    /// high-throughput scenarios:
918    ///
919    /// - **With reused buffer**: 0 allocations per decompression
920    /// - **Throughput**: Same as `decompress()` (~1000 MB/s)
921    /// - **Latency**: ~5% lower than `decompress()` due to eliminated allocation
922    ///
923    /// Recommended usage pattern for hot paths:
924    /// ```
925    /// use hexz_core::algo::compression::{Compressor, zstd::ZstdCompressor};
926    ///
927    /// let compressor = ZstdCompressor::new(3, None);
928    /// let original = vec![42u8; 65536]; // 64 KB data
929    /// let compressed = compressor.compress(&original).unwrap();
930    /// let mut reusable_buffer = vec![0u8; 65536]; // 64 KB buffer
931    ///
932    /// // Reuse buffer for multiple decompressions
933    /// for _ in 0..1000 {
934    ///     let size = compressor.decompress_into(&compressed, &mut reusable_buffer).unwrap();
935    ///     // Process reusable_buffer[..size]
936    /// }
937    /// ```
938    ///
939    /// # Examples
940    ///
941    /// ## Basic Usage
942    ///
943    /// ```
944    /// use hexz_core::algo::compression::{Compressor, zstd::ZstdCompressor};
945    ///
946    /// let compressor = ZstdCompressor::new(3, None);
947    /// let original = vec![42u8; 1024];
948    ///
949    /// let compressed = compressor.compress(&original).unwrap();
950    ///
951    /// // Decompress into pre-allocated buffer
952    /// let mut output = vec![0u8; 1024];
953    /// let size = compressor.decompress_into(&compressed, &mut output).unwrap();
954    ///
955    /// assert_eq!(size, 1024);
956    /// assert_eq!(output, original);
957    /// ```
958    ///
959    /// ## Buffer Too Small
960    ///
961    /// ```no_run
962    /// use hexz_core::algo::compression::{Compressor, zstd::ZstdCompressor};
963    ///
964    /// let compressor = ZstdCompressor::new(3, None);
965    /// let original = vec![42u8; 1024];
966    /// let compressed = compressor.compress(&original).unwrap();
967    ///
968    /// // Provide insufficient buffer
969    /// let mut small_buffer = vec![0u8; 512];
970    /// let result = compressor.decompress_into(&compressed, &mut small_buffer);
971    ///
972    /// // Result depends on zstd behavior with undersized buffers
973    /// ```
974    ///
975    /// # Thread Safety
976    ///
977    /// This method can be called concurrently from multiple threads on the same
978    /// `ZstdCompressor` instance, provided each thread uses its own output buffer.
979    fn decompress_into(&self, data: &[u8], out: &mut [u8]) -> Result<usize> {
980        if let Some(dict) = &self.decoder_dict {
981            let mut decoder =
982                zstd::stream::read::Decoder::with_prepared_dictionary(Cursor::new(data), dict)
983                    .map_err(|e| Error::Compression(e.to_string()))?;
984
985            Self::read_into_buf(&mut decoder, out)
986        } else {
987            let mut decoder = zstd::stream::read::Decoder::new(Cursor::new(data))
988                .map_err(|e| Error::Compression(e.to_string()))?;
989
990            Self::read_into_buf(&mut decoder, out)
991        }
992    }
993}
994
995#[cfg(test)]
996mod tests {
997    use super::*;
998
999    #[test]
1000    #[cfg_attr(miri, ignore)]
1001    fn test_compress_decompress_basic() {
1002        let compressor = ZstdCompressor::new(3, None);
1003        let data = b"Hello, world! This is test data for compression.";
1004
1005        let compressed = compressor.compress(data).expect("Compression failed");
1006        // Small data might not compress well due to header overhead, just verify it works
1007        assert!(
1008            !compressed.is_empty(),
1009            "Compressed data should not be empty"
1010        );
1011
1012        let decompressed = compressor
1013            .decompress(&compressed)
1014            .expect("Decompression failed");
1015        assert_eq!(data.as_slice(), decompressed.as_slice());
1016    }
1017
1018    #[test]
1019    #[cfg_attr(miri, ignore)]
1020    fn test_compress_empty_data() {
1021        let compressor = ZstdCompressor::new(3, None);
1022        let data = b"";
1023
1024        let compressed = compressor.compress(data).expect("Compression failed");
1025        let decompressed = compressor
1026            .decompress(&compressed)
1027            .expect("Decompression failed");
1028
1029        assert_eq!(data.as_slice(), decompressed.as_slice());
1030    }
1031
1032    #[test]
1033    #[cfg_attr(miri, ignore)]
1034    fn test_compress_small_data() {
1035        let compressor = ZstdCompressor::new(3, None);
1036        let data = b"x";
1037
1038        let compressed = compressor.compress(data).expect("Compression failed");
1039        let decompressed = compressor
1040            .decompress(&compressed)
1041            .expect("Decompression failed");
1042
1043        assert_eq!(data.as_slice(), decompressed.as_slice());
1044    }
1045
1046    #[test]
1047    #[cfg_attr(miri, ignore)]
1048    fn test_compress_large_data() {
1049        let compressor = ZstdCompressor::new(3, None);
1050        let data = vec![42u8; 1_000_000]; // 1 MB
1051
1052        let compressed = compressor.compress(&data).expect("Compression failed");
1053        assert!(compressed.len() < data.len(), "Data should be compressed");
1054
1055        let decompressed = compressor
1056            .decompress(&compressed)
1057            .expect("Decompression failed");
1058        assert_eq!(data, decompressed);
1059    }
1060
1061    #[test]
1062    #[cfg_attr(miri, ignore)]
1063    fn test_compress_repeating_pattern() {
1064        let compressor = ZstdCompressor::new(3, None);
1065        let data = vec![0xAB; 10_000];
1066
1067        let compressed = compressor.compress(&data).expect("Compression failed");
1068        assert!(
1069            compressed.len() < data.len() / 10,
1070            "Repeating pattern should compress well"
1071        );
1072
1073        let decompressed = compressor
1074            .decompress(&compressed)
1075            .expect("Decompression failed");
1076        assert_eq!(data, decompressed);
1077    }
1078
1079    #[test]
1080    #[cfg_attr(miri, ignore)]
1081    fn test_different_compression_levels() {
1082        let data = vec![42u8; 10_000];
1083
1084        for level in [1, 3, 5, 9] {
1085            let compressor = ZstdCompressor::new(level, None);
1086            let compressed = compressor
1087                .compress(&data)
1088                .unwrap_or_else(|_| panic!("Level {} failed", level));
1089            let decompressed = compressor
1090                .decompress(&compressed)
1091                .unwrap_or_else(|_| panic!("Level {} decompress failed", level));
1092            assert_eq!(data, decompressed, "Level {} roundtrip failed", level);
1093        }
1094    }
1095
1096    #[test]
1097    #[cfg_attr(miri, ignore)]
1098    fn test_compression_levels_ratio() {
1099        let data = b"The quick brown fox jumps over the lazy dog. ".repeat(100);
1100
1101        let level1 = ZstdCompressor::new(1, None);
1102        let level9 = ZstdCompressor::new(9, None);
1103
1104        let compressed1 = level1.compress(&data).unwrap();
1105        let compressed9 = level9.compress(&data).unwrap();
1106
1107        // Higher level should produce smaller output (or equal for already compressed data)
1108        assert!(compressed9.len() <= compressed1.len());
1109    }
1110
1111    #[test]
1112    #[cfg_attr(miri, ignore)]
1113    fn test_dictionary_training() {
1114        let samples: Vec<Vec<u8>> = (0..20)
1115            .map(|i| vec![((i * 13) % 256) as u8; 1024])
1116            .collect();
1117
1118        let dict = ZstdCompressor::train(&samples, 1024).expect("Training failed");
1119        assert!(!dict.is_empty(), "Dictionary should not be empty");
1120        assert!(dict.len() <= 1024, "Dictionary should not exceed max size");
1121    }
1122
1123    #[test]
1124    #[cfg_attr(miri, ignore)]
1125    fn test_compression_with_dictionary() {
1126        let samples: Vec<Vec<u8>> = (0..20)
1127            .map(|_| b"Sample data with repeated patterns and structures".to_vec())
1128            .collect();
1129
1130        let dict = ZstdCompressor::train(&samples, 2048).expect("Training failed");
1131        let compressor = ZstdCompressor::new(3, Some(dict));
1132
1133        let data = b"Sample data with repeated patterns and structures";
1134        let compressed = compressor
1135            .compress(data)
1136            .expect("Compression with dict failed");
1137        let decompressed = compressor
1138            .decompress(&compressed)
1139            .expect("Decompression with dict failed");
1140
1141        assert_eq!(data.as_slice(), decompressed.as_slice());
1142    }
1143
1144    #[test]
1145    #[cfg_attr(miri, ignore)]
1146    fn test_dictionary_improves_compression() {
1147        let samples: Vec<Vec<u8>> = (0..20)
1148            .map(|_| {
1149                let mut data = Vec::with_capacity(1024);
1150                data.extend_from_slice(b"HEADER:");
1151                data.extend_from_slice(&vec![42u8; 1000]);
1152                data.extend_from_slice(b"FOOTER");
1153                data
1154            })
1155            .collect();
1156
1157        let dict = ZstdCompressor::train(&samples, 2048).expect("Training failed");
1158
1159        let without_dict = ZstdCompressor::new(3, None);
1160        let with_dict = ZstdCompressor::new(3, Some(dict));
1161
1162        let test_data = {
1163            let mut data = Vec::with_capacity(1024);
1164            data.extend_from_slice(b"HEADER:");
1165            data.extend_from_slice(&vec![42u8; 1000]);
1166            data.extend_from_slice(b"FOOTER");
1167            data
1168        };
1169
1170        let compressed_no_dict = without_dict.compress(&test_data).unwrap();
1171        let compressed_with_dict = with_dict.compress(&test_data).unwrap();
1172
1173        // Dictionary should improve compression for structured data
1174        // (though improvement might be minimal for this simple test case)
1175        assert!(compressed_with_dict.len() <= compressed_no_dict.len());
1176    }
1177
1178    #[test]
1179    #[cfg_attr(miri, ignore)]
1180    fn test_decompress_into_buffer() {
1181        let compressor = ZstdCompressor::new(3, None);
1182        let data = vec![99u8; 1024];
1183
1184        let compressed = compressor.compress(&data).unwrap();
1185
1186        let mut output = vec![0u8; 1024];
1187        let size = compressor
1188            .decompress_into(&compressed, &mut output)
1189            .expect("decompress_into failed");
1190
1191        assert_eq!(size, 1024);
1192        assert_eq!(output, data);
1193    }
1194
1195    #[test]
1196    #[cfg_attr(miri, ignore)]
1197    fn test_decompress_into_larger_buffer() {
1198        let compressor = ZstdCompressor::new(3, None);
1199        let data = vec![88u8; 512];
1200
1201        let compressed = compressor.compress(&data).unwrap();
1202
1203        let mut output = vec![0u8; 2048]; // Larger than needed
1204        let size = compressor
1205            .decompress_into(&compressed, &mut output)
1206            .expect("decompress_into failed");
1207
1208        assert_eq!(size, 512);
1209        assert_eq!(&output[..512], data.as_slice());
1210    }
1211
1212    #[test]
1213    #[cfg_attr(miri, ignore)]
1214    fn test_decompress_into_with_dictionary() {
1215        let samples: Vec<Vec<u8>> = (0..15).map(|_| vec![77u8; 512]).collect();
1216
1217        let dict = ZstdCompressor::train(&samples, 1024).unwrap();
1218        let compressor = ZstdCompressor::new(3, Some(dict));
1219
1220        let data = vec![77u8; 512];
1221        let compressed = compressor.compress(&data).unwrap();
1222
1223        let mut output = vec![0u8; 512];
1224        let size = compressor
1225            .decompress_into(&compressed, &mut output)
1226            .expect("decompress_into with dict failed");
1227
1228        assert_eq!(size, 512);
1229        assert_eq!(output, data);
1230    }
1231
1232    #[test]
1233    #[cfg_attr(miri, ignore)]
1234    fn test_dictionary_mismatch() {
1235        // Create samples that will actually benefit from dictionaries
1236        let samples1: Vec<Vec<u8>> = (0..20)
1237            .map(|_| b"PREFIX1:data:SUFFIX1".repeat(50))
1238            .collect();
1239        let samples2: Vec<Vec<u8>> = (0..20)
1240            .map(|_| b"PREFIX2:data:SUFFIX2".repeat(50))
1241            .collect();
1242
1243        let dict1 = ZstdCompressor::train(&samples1, 2048).unwrap();
1244        let dict2 = ZstdCompressor::train(&samples2, 2048).unwrap();
1245
1246        let compressor1 = ZstdCompressor::new(3, Some(dict1));
1247        let compressor2 = ZstdCompressor::new(3, Some(dict2));
1248
1249        let data = b"PREFIX1:data:SUFFIX1".repeat(50);
1250        let compressed = compressor1.compress(&data).unwrap();
1251
1252        // Decompressing with wrong dictionary may fail or produce corrupted data
1253        // Zstd behavior varies, so just verify we can detect a difference
1254        let result = compressor2.decompress(&compressed);
1255        // Either it fails, or the data is corrupted if it succeeds
1256        if let Ok(decompressed) = result {
1257            // If it succeeded, data might be corrupted (not guaranteed to fail)
1258            let _ = decompressed; // Just verify no panic
1259        }
1260        // Test passes as long as no panic occurs
1261    }
1262
1263    #[test]
1264    #[cfg_attr(miri, ignore)]
1265    fn test_no_dict_vs_dict_compatibility() {
1266        // Use trained dictionary for better testing
1267        let samples: Vec<Vec<u8>> = (0..20)
1268            .map(|_| b"HEADER:payload:FOOTER".repeat(100))
1269            .collect();
1270        let dict = ZstdCompressor::train(&samples, 2048).unwrap();
1271
1272        let with_dict = ZstdCompressor::new(3, Some(dict));
1273        let without_dict = ZstdCompressor::new(3, None);
1274
1275        let data = b"HEADER:payload:FOOTER".repeat(100);
1276
1277        // Compress with dictionary
1278        let compressed_with_dict = with_dict.compress(&data).unwrap();
1279
1280        // Try to decompress with no dictionary
1281        // Zstd behavior: may fail or succeed depending on implementation details
1282        let _result = without_dict.decompress(&compressed_with_dict);
1283
1284        // Compress without dictionary
1285        let compressed_no_dict = without_dict.compress(&data).unwrap();
1286
1287        // Try to decompress with dictionary
1288        // Zstd is generally backward compatible - may work
1289        let result = with_dict.decompress(&compressed_no_dict);
1290        if let Ok(decompressed) = result {
1291            // If it works, verify data integrity
1292            assert_eq!(decompressed, data);
1293        }
1294    }
1295
1296    #[test]
1297    #[cfg_attr(miri, ignore)]
1298    fn test_compressor_debug_format() {
1299        let compressor_no_dict = ZstdCompressor::new(5, None);
1300        let debug_str = format!("{:?}", compressor_no_dict);
1301
1302        assert!(debug_str.contains("ZstdCompressor"));
1303        assert!(debug_str.contains("level"));
1304        assert!(debug_str.contains("5"));
1305        assert!(debug_str.contains("has_dict"));
1306        assert!(debug_str.contains("false"));
1307    }
1308
1309    #[test]
1310    #[cfg_attr(miri, ignore)]
1311    fn test_compressor_debug_format_with_dict() {
1312        let dict = vec![1u8; 512];
1313        let compressor_with_dict = ZstdCompressor::new(3, Some(dict));
1314        let debug_str = format!("{:?}", compressor_with_dict);
1315
1316        assert!(debug_str.contains("ZstdCompressor"));
1317        assert!(debug_str.contains("level"));
1318        assert!(debug_str.contains("3"));
1319        assert!(debug_str.contains("has_dict"));
1320        assert!(debug_str.contains("true"));
1321    }
1322
1323    #[test]
1324    #[cfg_attr(miri, ignore)]
1325    fn test_train_with_empty_samples() {
1326        let samples: Vec<Vec<u8>> = vec![];
1327        let result = ZstdCompressor::train(&samples, 1024);
1328
1329        // Training with empty samples should fail
1330        assert!(result.is_err());
1331    }
1332
1333    #[test]
1334    #[cfg_attr(miri, ignore)]
1335    fn test_train_with_small_samples() {
1336        let samples: Vec<Vec<u8>> = vec![vec![1u8; 10]];
1337        let result = ZstdCompressor::train(&samples, 1024);
1338
1339        // Training with too little data might fail or produce poor dict
1340        // Just verify it doesn't panic
1341        let _ = result;
1342    }
1343
1344    #[test]
1345    #[cfg_attr(miri, ignore)]
1346    fn test_multiple_compressions_same_compressor() {
1347        let compressor = ZstdCompressor::new(3, None);
1348
1349        for i in 0..10 {
1350            let data = vec![i as u8; 1000];
1351            let compressed = compressor.compress(&data).unwrap();
1352            let decompressed = compressor.decompress(&compressed).unwrap();
1353            assert_eq!(data, decompressed);
1354        }
1355    }
1356
1357    #[test]
1358    #[cfg_attr(miri, ignore)]
1359    fn test_buffer_reuse_pattern() {
1360        let compressor = ZstdCompressor::new(3, None);
1361        let data = vec![55u8; 4096];
1362        let compressed = compressor.compress(&data).unwrap();
1363
1364        let mut reusable_buffer = vec![0u8; 4096];
1365
1366        // Reuse buffer multiple times
1367        for _ in 0..5 {
1368            let size = compressor
1369                .decompress_into(&compressed, &mut reusable_buffer)
1370                .unwrap();
1371            assert_eq!(size, 4096);
1372            assert_eq!(reusable_buffer, data);
1373        }
1374    }
1375
1376    #[test]
1377    #[cfg_attr(miri, ignore)]
1378    fn test_various_data_patterns() {
1379        let compressor = ZstdCompressor::new(3, None);
1380
1381        // All zeros
1382        let zeros = vec![0u8; 1000];
1383        let compressed = compressor.compress(&zeros).unwrap();
1384        assert!(compressed.len() < 100, "Zeros should compress very well");
1385        assert_eq!(compressor.decompress(&compressed).unwrap(), zeros);
1386
1387        // Alternating pattern
1388        let alternating: Vec<u8> = (0..1000)
1389            .map(|i| if i % 2 == 0 { 0xAA } else { 0x55 })
1390            .collect();
1391        let compressed = compressor.compress(&alternating).unwrap();
1392        assert_eq!(compressor.decompress(&compressed).unwrap(), alternating);
1393
1394        // Sequential bytes
1395        let sequential: Vec<u8> = (0..=255).cycle().take(1000).collect();
1396        let compressed = compressor.compress(&sequential).unwrap();
1397        assert_eq!(compressor.decompress(&compressed).unwrap(), sequential);
1398    }
1399
1400    #[test]
1401    #[cfg_attr(miri, ignore)]
1402    fn test_compression_preserves_data_integrity() {
1403        let compressor = ZstdCompressor::new(3, None);
1404
1405        // Test with various byte values
1406        for byte_value in [0u8, 1, 127, 128, 255] {
1407            let data = vec![byte_value; 1000];
1408            let compressed = compressor.compress(&data).unwrap();
1409            let decompressed = compressor.decompress(&compressed).unwrap();
1410            assert_eq!(data, decompressed, "Failed for byte value {}", byte_value);
1411        }
1412    }
1413
1414    #[test]
1415    #[cfg_attr(miri, ignore)]
1416    fn test_high_compression_level() {
1417        let compressor = ZstdCompressor::new(19, None);
1418        let data = b"High compression level test data with some patterns ".repeat(50);
1419
1420        let compressed = compressor.compress(&data).unwrap();
1421        let decompressed = compressor.decompress(&compressed).unwrap();
1422
1423        assert_eq!(data, decompressed);
1424    }
1425
1426    #[test]
1427    #[cfg_attr(miri, ignore)]
1428    fn test_max_compression_level() {
1429        let compressor = ZstdCompressor::new(22, None);
1430        let data = vec![123u8; 5000];
1431
1432        let compressed = compressor.compress(&data).unwrap();
1433        let decompressed = compressor.decompress(&compressed).unwrap();
1434
1435        assert_eq!(data, decompressed);
1436    }
1437}
hexz_core/algo/compression/zstd.rs

hexz_core/algo/compression/
zstd.rs