hexz_ops/
write.rs

1//! Low-level write operations for Hexz archives.
2//!
3//! This module provides the foundational building blocks for writing compressed,
4//! encrypted, and deduplicated blocks to archive files. These functions implement
5//! the core write semantics used by higher-level pack operations while remaining
6//! independent of the packing workflow.
7//!
8//! # Module Purpose
9//!
10//! The write operations module serves as the bridge between the high-level packing
11//! pipeline and the raw file I/O layer. It encapsulates the logic for:
12//!
13//! - **Block Writing**: Transform raw chunks into compressed, encrypted blocks
14//! - **Deduplication**: Detect and eliminate redundant blocks via content hashing
15//! - **Zero Optimization**: Handle sparse data efficiently without storage
16//! - **Metadata Generation**: Create `BlockInfo` descriptors for index building
17//!
18//! # Design Philosophy
19//!
20//! These functions are designed to be composable, stateless, and easily testable.
21//! They operate on raw byte buffers and writers without knowledge of the broader
22//! packing context (progress reporting, stream management, index organization).
23//!
24//! This separation enables:
25//! - Unit testing of write logic in isolation
26//! - Reuse in different packing strategies (single-stream, multi-threaded, streaming)
27//! - Clear separation of concerns (write vs. orchestration)
28//!
29//! # Write Operation Semantics
30//!
31//! ## Block Transformation Pipeline
32//!
33//! Each block undergoes a multi-stage transformation before being written:
34//!
//! ```text
//! Raw Chunk (input)
//!      ↓
//! ┌────────────────┐
//! │ Compression    │ → Compress using LZ4 or Zstd
//! └────────────────┘   (reduces size, increases CPU)
//!      ↓
//! ┌────────────────┐
//! │ Encryption     │ → Optional AES-256-GCM with block_idx nonce
//! └────────────────┘   (confidentiality + integrity)
//!      ↓
//! ┌────────────────┐
//! │ Checksum       │ → CRC32 of final data (fast integrity check)
//! └────────────────┘
//!      ↓
//! ┌────────────────┐
//! │ Deduplication  │ → BLAKE3 hash of the raw chunk, map lookup (skip write if duplicate)
//! └────────────────┘   (disabled for encrypted data)
//!      ↓
//! ┌────────────────┐
//! │ Write          │ → Append to output file at current offset
//! └────────────────┘
//!      ↓
//! BlockInfo (metadata: offset, length, checksum, hash)
//! ```
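//!
//! The checksum stage can be illustrated with a dependency-free CRC-32. This is a
//! minimal sketch, not the module's code: `write_block` calls `crc32fast::hash`, and
//! the bitwise `crc32` helper below is a hypothetical stand-in that assumes the
//! standard CRC-32 (IEEE 802.3) parameters.
//!
//! ```rust
//! /// Bitwise CRC-32 (reflected polynomial 0xEDB88320, init and final XOR 0xFFFFFFFF).
//! fn crc32(data: &[u8]) -> u32 {
//!     let mut crc = 0xFFFF_FFFFu32;
//!     for &byte in data {
//!         crc ^= u32::from(byte);
//!         for _ in 0..8 {
//!             // All-ones mask when the low bit is set, zero otherwise.
//!             let mask = (crc & 1).wrapping_neg();
//!             crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
//!         }
//!     }
//!     !crc
//! }
//!
//! fn main() {
//!     // Standard CRC-32 check value for the ASCII string "123456789".
//!     assert_eq!(crc32(b"123456789"), 0xCBF4_3926);
//!     // A single flipped bit in the stored bytes changes the checksum.
//!     assert_ne!(crc32(b"block data"), crc32(b"block dat`"));
//! }
//! ```
//!
//! Because the checksum covers the final (compressed and possibly encrypted) bytes,
//! a reader can verify a block before spending time on decryption or decompression.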
60//!
61//! ## Write Behavior and Atomicity
62//!
63//! ### Single Block Writes
64//!
//! Individual block writes via [`write_block`] are not atomic. They inherit
//! whatever guarantees the underlying writer and file system provide:
//!
//! - **Buffered writes**: With a `File` or `BufWriter<File>`, data passes through the OS page cache
//! - **No flush or fsync**: This module never flushes or syncs; durability is the caller's responsibility
//! - **Partial write handling**: The writer's `write_all` either writes the whole block or returns an error
//! - **Crash behavior**: A partial block may be left in the file if the process crashes mid-write
72//!
73//! ### Deduplication State
74//!
75//! The deduplication map is maintained externally (by the caller). This design allows:
76//! - **Flexibility**: Caller controls when/if to enable deduplication
77//! - **Memory control**: Map lifetime and size managed by orchestration layer
78//! - **Consistency**: Map updates are immediately visible to subsequent writes
79//!
80//! ### Offset Management
81//!
//! The `current_offset` parameter is advanced by the number of bytes written after each
//! successful write. This ensures:
84//! - **Sequential allocation**: Blocks are laid out contiguously in file
85//! - **No gaps**: Every byte between header and master index is utilized
86//! - **Predictable layout**: Physical offset increases monotonically
87//!
88//! ## Block Allocation Strategy
89//!
90//! Blocks are allocated sequentially in the order they are written:
91//!
92//! ```text
93//! File Layout:
94//! ┌──────────────┬──────────┬──────────┬──────────┬─────────────┐
95//! │ Header (512B)│ Block 0  │ Block 1  │ Block 2  │ Index Pages │
96//! └──────────────┴──────────┴──────────┴──────────┴─────────────┘
97//!  ↑             ↑          ↑          ↑
98//!  0             512        512+len0   512+len0+len1
99//!
100//! current_offset advances after each write:
101//! - Initial: 512 (after header)
102//! - After Block 0: 512 + len0
103//! - After Block 1: 512 + len0 + len1
104//! - After Block 2: 512 + len0 + len1 + len2
105//! ```
106//!
107//! ### Deduplication Impact
108//!
109//! When deduplication detects a duplicate block:
110//! - **No physical write**: Block is not written to disk
111//! - **Offset reuse**: `BlockInfo` references the existing block's offset
112//! - **Space savings**: Multiple logical blocks share one physical block
113//! - **Transparency**: Readers cannot distinguish between deduplicated and unique blocks
114//!
115//! Example with deduplication:
116//!
117//! ```text
118//! Logical Blocks: [A, B, A, C, B]
119//! Physical Blocks: [A, B, C]
120//!                   ↑  ↑     ↑
121//!                   │  │     └─ Block 3 (unique)
122//!                   │  └─ Block 1 (unique)
123//!                   └─ Block 0 (unique)
124//!
125//! BlockInfo for logical block 2: offset = offset_of(A), length = len(A)
126//! BlockInfo for logical block 4: offset = offset_of(B), length = len(B)
127//! ```
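//!
//! The offset bookkeeping above can be sketched with standard-library types only.
//! This is an illustration under stated assumptions: `place_block` is a hypothetical
//! helper, the map is keyed by the block bytes themselves rather than a BLAKE3
//! digest, and a `Vec<u8>` stands in for the data region of the archive file.
//!
//! ```rust
//! use std::collections::HashMap;
//!
//! const HEADER_LEN: u64 = 512;
//!
//! /// Returns the physical offset for `block`, appending it only if it is new.
//! fn place_block(
//!     out: &mut Vec<u8>,
//!     current_offset: &mut u64,
//!     seen: &mut HashMap<Vec<u8>, u64>,
//!     block: &[u8],
//! ) -> u64 {
//!     if let Some(&existing) = seen.get(block) {
//!         return existing; // duplicate: no write, offset unchanged
//!     }
//!     let off = *current_offset;
//!     out.extend_from_slice(block);
//!     *current_offset += block.len() as u64;
//!     seen.insert(block.to_vec(), off);
//!     off
//! }
//!
//! fn main() {
//!     let (a, b, c) = (vec![b'A'; 100], vec![b'B'; 40], vec![b'C'; 7]);
//!     let logical = [&a, &b, &a, &c, &b];
//!     let (mut out, mut offset, mut seen) = (Vec::new(), HEADER_LEN, HashMap::new());
//!     let offsets: Vec<u64> = logical
//!         .iter()
//!         .map(|blk| place_block(&mut out, &mut offset, &mut seen, blk))
//!         .collect();
//!     assert_eq!(offsets, vec![512, 612, 512, 652, 612]);
//!     assert_eq!(offset, 512 + 100 + 40 + 7); // only A, B, C were stored
//! }
//! ```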
128//!
129//! ## Buffer Management
130//!
//! This module does not own any buffers. All buffers are:
//!
//! - **Caller-allocated**: Input chunks are provided by the caller
//! - **Scratch buffers**: Compression and encryption output goes into the `compress_buf`
//!   and `encrypt_buf` fields of [`WriteContext`], which the caller reuses across calls
//! - **No internal pooling**: Memory is released when the caller drops its buffers
//!   (Rust has no garbage collector)
//!
//! For high-performance scenarios, callers should consider:
//! - Reusing chunk buffers across iterations
//! - Keeping one `WriteContext` per worker so the scratch buffers retain their capacity
//! - Batching writes to amortize per-write overhead
141//!
142//! ## Flush Behavior
143//!
144//! Functions in this module do NOT flush data to disk. Flushing is the caller's
145//! responsibility and typically occurs:
146//!
147//! - After writing all blocks and indices (in [`pack_archive`](crate::pack::pack_archive))
148//! - Before closing the output file
149//! - Never during block writing (to maximize write batching)
150//!
151//! This design allows the OS to batch writes for optimal I/O performance.
152//!
153//! # Error Handling and Recovery
154//!
155//! ## Error Categories
156//!
157//! Write operations can fail for several reasons:
158//!
159//! ### I/O Errors
160//!
161//! - **Disk full**: No space for compressed block (`ENOSPC`)
162//! - **Permission denied**: Writer lacks write permission (`EACCES`)
163//! - **Device error**: Hardware failure, I/O timeout (`EIO`)
164//!
165//! These surface as `Error::Io` wrapping the underlying `std::io::Error`.
166//!
167//! ### Compression Errors
168//!
169//! - **Compression failure**: Compressor returns error (rare, usually indicates bug)
170//! - **Incompressible data**: Not an error; stored with expansion
171//!
172//! These surface as `Error::Compression`.
173//!
174//! ### Encryption Errors
175//!
176//! - **Cipher initialization failure**: Invalid state (should not occur in practice)
177//! - **Encryption failure**: Crypto operation fails (indicates library bug)
178//!
179//! These surface as `Error::Encryption`.
180//!
181//! ## Error Recovery
182//!
183//! Write operations provide **no automatic recovery**. On error:
184//!
185//! - **Function returns immediately**: No cleanup or rollback
186//! - **File state undefined**: Partial data may be written
187//! - **Caller responsibility**: Must handle error and clean up
188//!
189//! Typical error handling pattern in pack operations:
190//!
191//! ```text
//! match write_block(...) {
//!     Ok(info) => {
//!         // Success: Add info to index, continue
//!     }
//!     Err(e) => {
//!         // Failure: Log error, delete partial output file, return error to caller
//!         std::fs::remove_file(output)?;
//!         return Err(e);
//!     }
//! }
202//! ```
203//!
204//! ## Partial Write Handling
205//!
//! The underlying `Write::write_all` method keeps writing until the whole block is
//! written or an error occurs. It is not atomic:
207//!
208//! - **Success**: Entire block written, offset updated
209//! - **Failure**: Partial write may occur, but error is returned
210//! - **No retry**: Caller must handle retries if desired
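//!
//! The failure case can be reproduced with a writer that runs out of space partway
//! through a block. A minimal sketch: `LimitedWriter` and `append_block` are
//! hypothetical names used only for illustration, and `append_block` mirrors the
//! "write, then advance the offset" order rather than reproducing `write_block`.
//!
//! ```rust
//! use std::io::{self, Write};
//!
//! /// In-memory writer that accepts at most `capacity` bytes in total.
//! struct LimitedWriter {
//!     data: Vec<u8>,
//!     capacity: usize,
//! }
//!
//! impl Write for LimitedWriter {
//!     fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
//!         let room = self.capacity - self.data.len();
//!         if room == 0 {
//!             return Err(io::Error::new(io::ErrorKind::Other, "no space left"));
//!         }
//!         let n = room.min(buf.len());
//!         self.data.extend_from_slice(&buf[..n]);
//!         Ok(n)
//!     }
//!     fn flush(&mut self) -> io::Result<()> {
//!         Ok(())
//!     }
//! }
//!
//! /// Appends `block` and advances `offset` only if the whole block was written.
//! fn append_block<W: Write>(out: &mut W, offset: &mut u64, block: &[u8]) -> io::Result<u64> {
//!     let start = *offset;
//!     out.write_all(block)?;
//!     *offset += block.len() as u64;
//!     Ok(start)
//! }
//!
//! fn main() {
//!     let mut out = LimitedWriter { data: Vec::new(), capacity: 10 };
//!     let mut offset = 0u64;
//!     assert_eq!(append_block(&mut out, &mut offset, &[1u8; 6]).unwrap(), 0);
//!     // The second block does not fit: 4 bytes land in the output before the error.
//!     assert!(append_block(&mut out, &mut offset, &[2u8; 6]).is_err());
//!     assert_eq!(out.data.len(), 10); // partial block is present
//!     assert_eq!(offset, 6); // offset still marks the end of the last full block
//! }
//! ```
//!
//! After such a failure the tracked offset and the physical end of the output
//! disagree, which is why the caller is expected to discard the partial archive.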
211//!
212//! # Performance Characteristics
213//!
214//! ## Write Throughput
215//!
216//! Block write performance is dominated by compression:
217//!
//! - **LZ4**: ~2 GB/s (minimal overhead)
//! - **Zstd level 3**: ~200-500 MB/s (depends on data)
//! - **Encryption**: ~1-2 GB/s (hardware AES-NI)
//! - **BLAKE3 hashing**: ~3.2 GB/s (for deduplication)
222//!
223//! Typical bottleneck: Compression CPU time.
224//!
225//! ## Deduplication Overhead
226//!
227//! BLAKE3 hashing adds ~5-10% overhead to write operations:
228//!
229//! - **Hash computation**: ~3200 MB/s throughput (BLAKE3 tree-hashed)
230//! - **Hash table lookup**: O(1) average, ~50-100 ns per lookup
231//! - **Memory usage**: ~48 bytes per unique block
232//!
233//! For datasets with <10% duplication, deduplication overhead may exceed savings.
234//! Consider disabling dedup for unique data.
235//!
236//! ## Zero Block Detection
237//!
//! [`is_zero_chunk`] can be compiled to SIMD comparisons on modern CPUs:
239//!
240//! - **Throughput**: ~10-20 GB/s (memory bandwidth limited)
241//! - **Overhead**: Negligible (~5-10 cycles per 64-byte cache line)
242//!
243//! Zero detection is always worth enabling for sparse data.
244//!
245//! # Memory Usage
246//!
247//! Per-block memory allocation:
248//!
249//! - **Input chunk**: Caller-provided (typically 64 KiB)
250//! - **Compression output**: ~1.5× chunk size worst case (incompressible data)
251//! - **Encryption output**: `compression_size` + 28 bytes (AES-GCM overhead)
252//! - **Dedup hash**: 32 bytes (BLAKE3 digest)
253//!
//! Total scratch space per write: ~100-150 KiB for 64 KiB chunks (held in the caller's
//! reusable buffers rather than freed after each block).
255//!
256//! # Examples
257//!
258//! See individual function documentation for usage examples.
259//!
260//! # Future Enhancements
261//!
262//! Potential improvements to write operations:
263//!
264//! - **Buffer pooling**: Reuse compression/encryption buffers to reduce allocation overhead
265//! - **Async I/O**: Use `tokio` or `io_uring` for overlapped writes
266//! - **Parallel writes**: Write multiple blocks concurrently (requires coordination)
267//! - **Write-ahead logging**: Enable atomic commits for crash safety

use hexz_common::Result;
use std::io::Write;

use hexz_core::algo::compression::Compressor;
use hexz_core::algo::dedup::hash_table::StandardHashTable;
use hexz_core::algo::encryption::Encryptor;
use hexz_core::algo::hashing::ContentHasher;
use hexz_core::format::index::BlockInfo;

/// Reusable context for block write operations.
///
/// Bundles the compressor, encryptor, hasher, and scratch buffers needed by
/// [`write_block`] so that callers do not have to pass many individual arguments.
pub struct WriteContext<'a> {
    /// Compressor used to compress block data.
    pub compressor: &'a dyn Compressor,
    /// Optional encryptor for per-block encryption.
    pub encryptor: Option<&'a dyn Encryptor>,
    /// Content hasher for deduplication.
    pub hasher: &'a dyn ContentHasher,
    /// Scratch buffer for hash output.
    pub hash_buf: &'a mut [u8; 32],
    /// Scratch buffer for compressed data.
    pub compress_buf: &'a mut Vec<u8>,
    /// Scratch buffer for encrypted data.
    pub encrypt_buf: &'a mut Vec<u8>,
}

impl std::fmt::Debug for WriteContext<'_> {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_struct("WriteContext")
            .field("encryptor", &self.encryptor.as_ref().map(|_| ".."))
            .finish_non_exhaustive()
    }
}

305/// Writes a compressed and optionally encrypted block to the output stream.
306///
307/// This function implements the complete block transformation pipeline: compression,
308/// optional encryption, checksum computation, deduplication, and physical write.
309/// It returns a `BlockInfo` descriptor suitable for inclusion in an index page.
310///
311/// # Transformation Pipeline
312///
/// 1. **Compression**: Compress the raw chunk using the provided compressor (LZ4 or Zstd)
/// 2. **Encryption** (optional): Encrypt compressed data with AES-256-GCM, deriving the nonce from `block_idx`
/// 3. **Checksum**: Compute CRC32 of the final data for integrity verification
/// 4. **Deduplication** (optional, skipped when encrypting):
///    - Compute the BLAKE3 hash of the uncompressed chunk
///    - Check `dedup_map` for an existing block with the same hash
///    - If found: Reuse the existing offset, skip the write
///    - If new: Record the offset in `dedup_map`, then write the block
/// 5. **Write**: Append the final data to the output at `current_offset`
/// 6. **Metadata**: Create and return `BlockInfo` with offset, length, checksum, and hash
323///
324/// # Parameters
325///
/// - `out`: Output writer implementing `Write` trait
///   - Typically a `File` or `BufWriter<File>`
///   - Each block is written with a single `write_all` call (complete write or error, not atomic)
///
/// - `chunk`: Uncompressed chunk data (raw bytes)
///   - Typical size: 16 KiB - 256 KiB (configurable)
///   - Should not be empty (behavior for zero-length chunks is unspecified)
333///
334/// - `block_idx`: Global block index (zero-based)
335///   - Used as encryption nonce (must be unique per archive)
336///   - Monotonically increases across all streams
337///   - Must not reuse indices within same encrypted archive (breaks security)
338///
339/// - `current_offset`: Mutable reference to current physical file offset
340///   - Updated after successful write: `*current_offset += bytes_written`
341///   - Not updated on error (file state undefined)
342///   - Not updated for deduplicated blocks (reuses existing offset)
343///
/// - `dedup_map`: Optional deduplication hash table
///   - `Some(&mut map)`: Enable dedup, use this map
///   - `None`: Disable dedup, always write
///   - Ignored if `ctx.encryptor.is_some()` (encryption prevents dedup)
///   - Maps BLAKE3 hash → physical offset of first occurrence
///
/// - `ctx`: Reusable [`WriteContext`] bundling the components and scratch buffers below
///
/// - `ctx.compressor`: Compression algorithm implementation
///   - Typically `Lz4Compressor` or `ZstdCompressor`
///   - Must implement [`Compressor`] trait
///
/// - `ctx.encryptor`: Optional encryption implementation
///   - `Some(enc)`: Encrypt compressed data with AES-256-GCM
///   - `None`: Store compressed data unencrypted
///   - Must implement [`Encryptor`] trait
///
/// - `ctx.hasher`: Content hasher for deduplication
///   - Typically `Blake3Hasher`
///   - Must implement [`ContentHasher`] trait
///   - Used only when `dedup_map` is `Some` and `ctx.encryptor` is `None`
///
/// - `ctx.hash_buf`: Reusable 32-byte buffer for hash output
///   - Avoids allocation on every hash computation
///   - Written only when dedup runs, but copied into `BlockInfo::hash` on every call,
///     so the returned hash is whatever the buffer last held when dedup is disabled
///
/// - `ctx.compress_buf`, `ctx.encrypt_buf`: Reusable output buffers for the
///   compression and encryption stages
367///
368/// # Returns
369///
/// - `Ok(BlockInfo)`: Block written successfully, metadata returned
///   - `offset`: Physical byte offset where block starts
///   - `length`: Compressed (and encrypted) size in bytes
///   - `logical_len`: Original uncompressed size
///   - `checksum`: CRC32 of final data (compressed + encrypted)
///   - `hash`: Copy of `ctx.hash_buf` (BLAKE3 of the uncompressed chunk when dedup ran)
375///
376/// - `Err(Error::Io)`: I/O error during write
377///   - Disk full, permission denied, device error
378///   - File state undefined (partial write may have occurred)
379///
380/// - `Err(Error::Compression)`: Compression failed
381///   - Rare; usually indicates library bug or corrupted input
382///
383/// - `Err(Error::Encryption)`: Encryption failed
384///   - Rare; usually indicates crypto library bug
385///
386/// # Examples
387///
388/// ## Basic Usage (No Encryption, No Dedup)
389///
390/// ```no_run
391/// use hexz_ops::write::{WriteContext, write_block};
392/// use hexz_core::algo::compression::Lz4Compressor;
393/// use hexz_core::algo::hashing::blake3::Blake3Hasher;
394/// use hexz_core::algo::dedup::hash_table::StandardHashTable;
395/// use std::fs::File;
396///
397/// # fn main() -> Result<(), Box<dyn std::error::Error>> {
398/// let mut out = File::create("output.hxz")?;
399/// let mut offset = 512u64; // After header
400/// let chunk = vec![0x42; 65536]; // 64 KiB of data
401/// let compressor = Lz4Compressor::new();
402/// let hasher = Blake3Hasher;
403/// let mut hash_buf = [0u8; 32];
404///
405/// let mut compress_buf = Vec::new();
406/// let mut encrypt_buf = Vec::new();
407///
408/// let mut ctx = WriteContext {
409///     compressor: &compressor,
410///     encryptor: None,
411///     hasher: &hasher,
412///     hash_buf: &mut hash_buf,
413///     compress_buf: &mut compress_buf,
414///     encrypt_buf: &mut encrypt_buf,
415/// };
416///
417/// let info = write_block(
418///     &mut out,
419///     &chunk,
420///     0,              // block_idx
421///     &mut offset,
422///     None::<&mut StandardHashTable>, // No dedup
423///     &mut ctx,
424/// )?;
425///
426/// println!("Block written at offset {}, size {}", info.offset, info.length);
427/// # Ok(())
428/// # }
429/// ```
430///
431/// ## With Deduplication
432///
433/// ```no_run
434/// use hexz_ops::write::{WriteContext, write_block};
435/// use hexz_core::algo::compression::Lz4Compressor;
436/// use hexz_core::algo::hashing::blake3::Blake3Hasher;
437/// use hexz_core::algo::dedup::hash_table::StandardHashTable;
438/// use std::fs::File;
439///
440/// # fn main() -> Result<(), Box<dyn std::error::Error>> {
441/// let mut out = File::create("output.hxz")?;
442/// let mut offset = 512u64;
443/// let mut dedup_map = StandardHashTable::new();
444/// let compressor = Lz4Compressor::new();
445/// let hasher = Blake3Hasher;
446/// let mut hash_buf = [0u8; 32];
447/// let mut compress_buf = Vec::new();
448/// let mut encrypt_buf = Vec::new();
449///
450/// let mut ctx = WriteContext {
451///     compressor: &compressor,
452///     encryptor: None,
453///     hasher: &hasher,
454///     hash_buf: &mut hash_buf,
455///     compress_buf: &mut compress_buf,
456///     encrypt_buf: &mut encrypt_buf,
457/// };
458///
459/// // Write first block
460/// let chunk1 = vec![0xAA; 65536];
461/// let info1 = write_block(
462///     &mut out,
463///     &chunk1,
464///     0,
465///     &mut offset,
466///     Some(&mut dedup_map),
467///     &mut ctx,
468/// )?;
469/// println!("Block 0: offset={}, written", info1.offset);
470///
471/// // Write duplicate block (same content)
472/// let chunk2 = vec![0xAA; 65536];
473/// let info2 = write_block(
474///     &mut out,
475///     &chunk2,
476///     1,
477///     &mut offset,
478///     Some(&mut dedup_map),
479///     &mut ctx,
480/// )?;
481/// println!("Block 1: offset={}, deduplicated (no write)", info2.offset);
482/// assert_eq!(info1.offset, info2.offset); // Same offset, block reused
483/// # Ok(())
484/// # }
485/// ```
486///
487/// ## With Encryption
488///
489/// ```no_run
490/// use hexz_ops::write::{WriteContext, write_block};
491/// use hexz_core::algo::compression::Lz4Compressor;
492/// use hexz_core::algo::encryption::AesGcmEncryptor;
493/// use hexz_core::algo::hashing::blake3::Blake3Hasher;
494/// use hexz_common::crypto::KeyDerivationParams;
495/// use hexz_core::algo::dedup::hash_table::StandardHashTable;
496/// use std::fs::File;
497///
498/// # fn main() -> Result<(), Box<dyn std::error::Error>> {
499/// let mut out = File::create("output.hxz")?;
500/// let mut offset = 512u64;
501/// let compressor = Lz4Compressor::new();
502/// let hasher = Blake3Hasher;
503/// let mut hash_buf = [0u8; 32];
504///
505/// // Initialize encryptor
506/// let params = KeyDerivationParams::default();
507/// let encryptor = AesGcmEncryptor::new(
508///     b"strong_password",
509///     &params.salt,
510///     params.iterations,
511/// )?;
512///
513/// let mut compress_buf = Vec::new();
514/// let mut encrypt_buf = Vec::new();
515///
516/// let mut ctx = WriteContext {
517///     compressor: &compressor,
518///     encryptor: Some(&encryptor),
519///     hasher: &hasher,
520///     hash_buf: &mut hash_buf,
521///     compress_buf: &mut compress_buf,
522///     encrypt_buf: &mut encrypt_buf,
523/// };
524///
525/// let chunk = vec![0x42; 65536];
526/// let info = write_block(
527///     &mut out,
528///     &chunk,
529///     0,
530///     &mut offset,
531///     None::<&mut StandardHashTable>, // Dedup disabled (encryption prevents it)
532///     &mut ctx,
533/// )?;
534///
535/// println!("Encrypted block: offset={}, length={}", info.offset, info.length);
536/// # Ok(())
537/// # }
538/// ```
539///
540/// # Performance
541///
/// - **Compression**: Dominates runtime (~2 GB/s LZ4, ~200-500 MB/s Zstd level 3)
/// - **Encryption**: ~1-2 GB/s (hardware AES-NI)
/// - **Hashing**: ~3.2 GB/s (BLAKE3 for dedup)
/// - **I/O**: Typically not the bottleneck (buffered writes, ~3 GB/s sequential)
546///
547/// # Deduplication Effectiveness
548///
/// Deduplication is most effective when:
/// - **Aligned duplicates**: Identical content at identical block boundaries → same hash
/// - **Unencrypted**: Encryption produces unique ciphertext per block (different nonces)
/// - **Redundant data**: Duplicate files, repeated patterns, copy-on-write filesystems
///
/// Deduplication is ineffective when:
/// - **Shifted data**: With fixed-size blocks, inserting a few bytes moves every later
///   boundary, so shifted copies no longer match (content-defined chunking reduces this effect)
/// - **Compressed input**: Pre-compressed data has low redundancy
/// - **Unique data**: No duplicate blocks to detect
558///
559/// # Security Considerations
560///
561/// ## Block Index as Nonce
562///
563/// When encrypting, `block_idx` is used as part of the AES-GCM nonce. **CRITICAL**:
564/// - Never reuse `block_idx` values within the same encrypted archive
565/// - Nonce reuse breaks AES-GCM security (allows plaintext recovery)
566/// - Each logical block must have a unique index
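///
/// One way to derive a unique 96-bit nonce from a block index is sketched below.
/// The layout is an assumption made for illustration (four zero bytes followed by the
/// big-endian index); the actual layout is defined by the [`Encryptor`]
/// implementation and is not shown here. `nonce_for_block` is a hypothetical helper.
///
/// ```rust
/// /// Builds a 12-byte AES-GCM nonce: 4 zero bytes, then `block_idx` as big-endian u64.
/// fn nonce_for_block(block_idx: u64) -> [u8; 12] {
///     let mut nonce = [0u8; 12];
///     nonce[4..].copy_from_slice(&block_idx.to_be_bytes());
///     nonce
/// }
///
/// fn main() {
///     use std::collections::HashSet;
///     // Distinct indices give distinct nonces; reusing an index repeats the nonce.
///     let nonces: HashSet<[u8; 12]> = (0..1000u64).map(nonce_for_block).collect();
///     assert_eq!(nonces.len(), 1000);
///     assert_eq!(nonce_for_block(7), nonce_for_block(7));
///     assert_eq!(nonce_for_block(1), [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]);
/// }
/// ```
///
/// Uniqueness only has to hold per key, which is why the rule is scoped to a single
/// encrypted archive.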
567///
568/// ## Deduplication and Encryption
569///
/// Deduplication is automatically disabled when encrypting because:
/// - Each block has a unique nonce → unique ciphertext, even for identical plaintext
/// - Ciphertext written for one `block_idx` therefore cannot stand in for another logical block
/// - The dedup hash is not computed on the encrypted path, so no CPU is spent on hashing
574///
575/// # Thread Safety
576///
/// This function is **not thread-safe** with respect to the output writer:
/// - `out`, `current_offset`, and `ctx` are `&mut` borrows, so concurrent calls on the
///   same state require external locking
/// - Concurrent calls with different writers to the same file will corrupt the file
///
/// For parallel writing, use separate output files or implement external synchronization.
///
/// The `dedup_map` must also be externally synchronized for concurrent access.
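///
/// External synchronization can be as simple as one lock around the writer and its
/// offset. A minimal sketch using only the standard library: `SharedOutput` and
/// `append` are hypothetical names, and a `Vec<u8>` stands in for the output file.
///
/// ```rust
/// use std::sync::{Arc, Mutex};
/// use std::thread;
///
/// /// Writer and offset guarded together so they can never get out of step.
/// struct SharedOutput {
///     data: Vec<u8>,
///     offset: u64,
/// }
///
/// /// Appends `block` under the lock and returns the offset it was written at.
/// fn append(shared: &Mutex<SharedOutput>, block: &[u8]) -> u64 {
///     let mut guard = shared.lock().unwrap();
///     let start = guard.offset;
///     guard.data.extend_from_slice(block);
///     guard.offset += block.len() as u64;
///     start
/// }
///
/// fn main() {
///     let shared = Arc::new(Mutex::new(SharedOutput { data: Vec::new(), offset: 0 }));
///     let handles: Vec<_> = (0..4u8)
///         .map(|id| {
///             let shared = Arc::clone(&shared);
///             thread::spawn(move || append(&shared, &[id; 16]))
///         })
///         .collect();
///     let mut starts: Vec<u64> = handles.into_iter().map(|h| h.join().unwrap()).collect();
///     starts.sort_unstable();
///     // Each block occupies its own 16-byte slot regardless of thread scheduling.
///     assert_eq!(starts, vec![0, 16, 32, 48]);
///     assert_eq!(shared.lock().unwrap().data.len(), 64);
/// }
/// ```
///
/// In a design like this, compression could run outside the lock so that only the
/// append itself is serialized.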
pub fn write_block<W: Write>(
    out: &mut W,
    chunk: &[u8],
    block_idx: u64,
    current_offset: &mut u64,
    dedup_map: Option<&mut StandardHashTable>,
    ctx: &mut WriteContext<'_>,
) -> Result<BlockInfo> {
    // Compress the chunk into reusable buffer
    ctx.compressor.compress_into(chunk, ctx.compress_buf)?;

    // Encrypt if requested, using reusable buffer
    let final_data: &[u8] = if let Some(enc) = ctx.encryptor {
        enc.encrypt_into(ctx.compress_buf, block_idx, ctx.encrypt_buf)?;
        ctx.encrypt_buf
    } else {
        ctx.compress_buf
    };

    let checksum = crc32fast::hash(final_data);
    let chunk_len = chunk.len() as u32;
    let final_len = final_data.len() as u32;

    // Handle deduplication (only if not encrypting)
    let offset = if ctx.encryptor.is_some() {
        // No dedup for encrypted data
        let off = *current_offset;
        out.write_all(final_data)?;
        *current_offset += final_len as u64;
        off
    } else if let Some(map) = dedup_map {
        // Hash directly into the fixed-size buffer (no runtime bounds check).
        // Hash the UNCOMPRESSED data for consistent deduplication across compression algorithms.
        *ctx.hash_buf = ctx.hasher.hash_fixed(chunk);

        if let Some(existing_offset) = map.get(ctx.hash_buf) {
            // Block already exists, reuse it; no copy needed on hit
            existing_offset
        } else {
            // New block: copy hash_buf only on miss (insert needs owned key)
            let off = *current_offset;
            _ = map.insert(*ctx.hash_buf, off);
            out.write_all(final_data)?;
            *current_offset += final_len as u64;
            off
        }
    } else {
        // No dedup, just write
        let off = *current_offset;
        out.write_all(final_data)?;
        *current_offset += final_len as u64;
        off
    };

    Ok(BlockInfo {
        offset,
        length: final_len,
        logical_len: chunk_len,
        checksum,
        hash: *ctx.hash_buf,
    })
}

647/// Creates a zero-block descriptor without writing data to disk.
648///
649/// Zero blocks (all-zero chunks) are a special case optimized for space efficiency.
650/// Instead of compressing and storing zeros, we create a metadata-only descriptor
651/// that signals to the reader to return zeros without performing any I/O.
652///
653/// # Sparse Data Optimization
654///
655/// Many VM disk images and memory dumps contain large regions of zeros:
656/// - **Unallocated disk space**: File systems often zero-initialize blocks
657/// - **Memory pages**: Unused or zero-initialized memory
658/// - **Sparse files**: Holes in sparse file systems
659///
660/// Storing these zeros (even compressed) wastes space:
661/// - **LZ4-compressed zeros**: ~100 bytes per 64 KiB block (~0.15% of original)
662/// - **Uncompressed zeros**: 64 KiB per block (100%)
663/// - **Metadata-only**: 20 bytes per block (~0.03%)
664///
665/// The metadata approach saves 99.97% of space for zero blocks.
666///
667/// # Descriptor Format
668///
/// Zero blocks are identified by a special `BlockInfo` signature:
/// - `offset = 0`: Invalid physical offset (data region starts at ≥512)
/// - `length = 0`: No physical storage
/// - `logical_len = N`: Original zero block size in bytes
/// - `checksum = 0`: No checksum needed (zeros are deterministic)
/// - `hash = [0; 32]`: No content hash is stored
674///
675/// Readers recognize this pattern and synthesize zeros without I/O.
676///
677/// # Parameters
678///
679/// - `logical_len`: Size of the zero block in bytes
680///   - Typically matches `block_size` (e.g., 65536 for 64 KiB blocks)
681///   - Can vary with content-defined chunking
682///   - Must be > 0 (zero-length blocks are invalid)
683///
684/// # Returns
685///
/// `BlockInfo` descriptor with zero-block semantics:
/// - `offset = 0`
/// - `length = 0`
/// - `logical_len = logical_len`
/// - `checksum = 0`
/// - `hash = [0; 32]`
691///
692/// # Examples
693///
694/// ## Detecting and Creating Zero Blocks
695///
696/// ```
697/// use hexz_ops::write::{is_zero_chunk, create_zero_block};
698/// use hexz_core::format::index::BlockInfo;
699///
700/// let chunk = vec![0u8; 65536]; // 64 KiB of zeros
701///
702/// if is_zero_chunk(&chunk) {
703///     let info = create_zero_block(chunk.len() as u32);
704///     assert_eq!(info.offset, 0);
705///     assert_eq!(info.length, 0);
706///     assert_eq!(info.logical_len, 65536);
707///     println!("Zero block: No storage required!");
708/// }
709/// ```
710///
711/// ## Usage in Packing Loop
712///
713/// ```no_run
714/// # use hexz_ops::write::{is_zero_chunk, create_zero_block, write_block, WriteContext};
715/// # use hexz_core::algo::compression::Lz4Compressor;
716/// # use hexz_core::algo::hashing::blake3::Blake3Hasher;
717/// # use hexz_core::algo::dedup::hash_table::StandardHashTable;
718/// # use std::fs::File;
719/// # fn main() -> Result<(), Box<dyn std::error::Error>> {
720/// # let mut out = File::create("output.hxz")?;
721/// # let mut offset = 512u64;
722/// # let compressor = Lz4Compressor::new();
723/// # let hasher = Blake3Hasher;
724/// # let mut hash_buf = [0u8; 32];
725/// # let mut compress_buf = Vec::new();
726/// # let mut encrypt_buf = Vec::new();
727/// # let chunks: Vec<Vec<u8>> = vec![];
728/// let mut ctx = WriteContext {
729///     compressor: &compressor, encryptor: None, hasher: &hasher,
730///     hash_buf: &mut hash_buf, compress_buf: &mut compress_buf, encrypt_buf: &mut encrypt_buf,
731/// };
732/// for (idx, chunk) in chunks.iter().enumerate() {
733///     let info = if is_zero_chunk(chunk) {
734///         create_zero_block(chunk.len() as u32)
735///     } else {
736///         write_block(&mut out, chunk, idx as u64, &mut offset, None::<&mut StandardHashTable>, &mut ctx)?
737///     };
738///     // Add info to index page...
739/// }
740/// # Ok(())
741/// # }
742/// ```
743///
744/// # Performance
745///
746/// - **Time complexity**: O(1) (no I/O, no computation)
747/// - **Space complexity**: O(1) (fixed-size struct)
/// - **Typical savings**: 99.97% vs. uncompressed zeros (~80% vs. LZ4-compressed zeros)
749///
750/// # Reader Behavior
751///
752/// When a reader encounters a zero block (offset=0, length=0):
753/// 1. Recognize zero-block pattern from metadata
754/// 2. Allocate buffer of size `logical_len`
755/// 3. Fill buffer with zeros (optimized memset)
756/// 4. Return buffer to caller
757///
758/// No decompression, decryption, or checksum verification is performed.
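///
/// The reader-side steps can be sketched as follows. This is an illustration, not the
/// reader's actual code: `Descriptor` is a hypothetical stand-in for the relevant
/// `BlockInfo` fields, and `read_block` covers only the zero-block branch.
///
/// ```rust
/// /// Hypothetical subset of `BlockInfo` that is relevant to zero-block detection.
/// struct Descriptor {
///     offset: u64,
///     length: u32,
///     logical_len: u32,
/// }
///
/// /// A zero block has no physical location and no stored bytes.
/// fn is_zero_block(d: &Descriptor) -> bool {
///     d.offset == 0 && d.length == 0
/// }
///
/// /// Returns the block contents, or `None` where a real reader would do I/O.
/// fn read_block(d: &Descriptor) -> Option<Vec<u8>> {
///     if is_zero_block(d) {
///         // Steps 2-4: allocate `logical_len` zero bytes and hand them back.
///         Some(vec![0u8; d.logical_len as usize])
///     } else {
///         None // a real reader would read, verify, decrypt and decompress here
///     }
/// }
///
/// fn main() {
///     let zero = Descriptor { offset: 0, length: 0, logical_len: 65536 };
///     let data = read_block(&zero).expect("zero block needs no I/O");
///     assert_eq!(data.len(), 65536);
///     assert!(data.iter().all(|&b| b == 0));
///
///     let stored = Descriptor { offset: 512, length: 120, logical_len: 65536 };
///     assert!(read_block(&stored).is_none());
/// }
/// ```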
759///
760/// # Interaction with Deduplication
761///
762/// Zero blocks do not participate in deduplication:
763/// - They are never written to disk → no physical offset → no dedup entry
764/// - Each zero block gets its own metadata descriptor
765/// - This is fine: Metadata is cheap (20 bytes), and all zero blocks have same content
766///
767/// # Interaction with Encryption
768///
769/// Zero blocks work correctly with encryption:
770/// - They are detected **before** compression/encryption
771/// - Encrypted archives still use zero-block optimization
772/// - Readers synthesize zeros without decryption
773///
774/// This is safe because zeros are public information (no confidentiality lost).
775///
776/// # Validation
777///
778/// **IMPORTANT**: This function does NOT validate that the original chunk was actually
779/// all zeros. The caller is responsible for calling [`is_zero_chunk`] first.
780///
781/// If a non-zero chunk is incorrectly marked as a zero block, readers will return
782/// zeros instead of the original data (silent data corruption).
pub const fn create_zero_block(logical_len: u32) -> BlockInfo {
    BlockInfo {
        offset: 0,
        length: 0,
        logical_len,
        checksum: 0,
        hash: [0u8; 32],
    }
}

/// Convenience wrapper for `write_block` that allocates the hasher and scratch buffers internally.
///
/// This is a simpler API for tests and one-off writes. For hot paths (like archive
/// packing loops), use `write_block` directly with a reused [`WriteContext`].
#[cfg(test)]
fn write_block_simple<W: Write>(
    out: &mut W,
    chunk: &[u8],
    block_idx: u64,
    current_offset: &mut u64,
    dedup_map: Option<&mut StandardHashTable>,
    compressor: &dyn Compressor,
    encryptor: Option<&dyn Encryptor>,
) -> Result<BlockInfo> {
    use hexz_core::algo::hashing::blake3::Blake3Hasher;
    let hasher = Blake3Hasher;
    let mut hash_buf = [0u8; 32];
    let mut compress_buf = Vec::new();
    let mut encrypt_buf = Vec::new();
    let mut ctx = WriteContext {
        compressor,
        encryptor,
        hasher: &hasher,
        hash_buf: &mut hash_buf,
        compress_buf: &mut compress_buf,
        encrypt_buf: &mut encrypt_buf,
    };
    write_block(out, chunk, block_idx, current_offset, dedup_map, &mut ctx)
}

823/// Checks if a chunk consists entirely of zero bytes.
824///
825/// This function efficiently detects all-zero chunks to enable sparse block optimization.
826/// Zero chunks are common in VM images (unallocated space), memory dumps (zero-initialized
827/// pages), and sparse files.
828///
/// # Algorithm
///
/// Uses Rust's iterator `all()` combinator, which:
/// - Short-circuits on the first non-zero byte (early exit)
/// - May be compiled to SIMD instructions, depending on compiler version, target
///   features, and optimization level (autovectorization is not guaranteed)
/// - Processes 16-64 bytes per vector instruction when vectorized
///   (SSE2: 16, AVX2: 32, AVX-512: 64)
///
/// # Parameters
///
/// - `chunk`: Byte slice to check
///   - Empty slices return `true` (vacuous truth)
///   - Typical size: 16 KiB - 256 KiB (configurable block size)
///
/// # Returns
///
/// - `true`: All bytes are zero (sparse block, use [`create_zero_block`])
/// - `false`: At least one non-zero byte (normal block, compress and write)
///
/// # Performance
///
/// Approximate figures; actual numbers depend on hardware and on whether the loop
/// is vectorized:
///
/// - **SIMD-optimized**: ~10-20 GB/s (memory bandwidth limited)
/// - **Scalar fallback**: ~1-2 GB/s (without SIMD)
/// - **Typical overhead**: <1% of total packing time
///
/// The check is always worth performing given the massive space savings for zero blocks.
///
/// # Examples
///
/// ## Basic Usage
///
/// ```
/// use hexz_ops::write::is_zero_chunk;
///
/// let zeros = vec![0u8; 65536];
/// assert!(is_zero_chunk(&zeros));
///
/// let data = vec![0u8, 1u8, 0u8];
/// assert!(!is_zero_chunk(&data));
///
/// let empty: &[u8] = &[];
/// assert!(is_zero_chunk(empty)); // Empty is considered "all zeros"
/// ```
///
/// ## Packing Loop Integration
///
/// ```no_run
/// # use hexz_ops::write::{is_zero_chunk, create_zero_block, write_block, WriteContext};
/// # use hexz_core::algo::compression::Lz4Compressor;
/// # use hexz_core::algo::hashing::blake3::Blake3Hasher;
/// # use hexz_core::format::index::BlockInfo;
/// # use hexz_core::algo::dedup::hash_table::StandardHashTable;
/// # use std::fs::File;
/// # fn main() -> Result<(), Box<dyn std::error::Error>> {
/// # let mut out = File::create("output.hxz")?;
/// # let mut offset = 512u64;
/// # let compressor = Lz4Compressor::new();
/// # let hasher = Blake3Hasher;
/// # let mut hash_buf = [0u8; 32];
/// # let mut compress_buf = Vec::new();
/// # let mut encrypt_buf = Vec::new();
/// # let mut index_blocks = Vec::new();
/// # let chunks: Vec<Vec<u8>> = vec![];
/// let mut ctx = WriteContext {
///     compressor: &compressor, encryptor: None, hasher: &hasher,
///     hash_buf: &mut hash_buf, compress_buf: &mut compress_buf, encrypt_buf: &mut encrypt_buf,
/// };
/// for (idx, chunk) in chunks.iter().enumerate() {
///     let info = if is_zero_chunk(chunk) {
///         create_zero_block(chunk.len() as u32)
///     } else {
///         write_block(&mut out, chunk, idx as u64, &mut offset, None::<&mut StandardHashTable>, &mut ctx)?
///     };
///     index_blocks.push(info);
/// }
/// # Ok(())
/// # }
/// ```
///
/// ## Benchmarking Zero Detection
///
/// ```
/// use hexz_ops::write::is_zero_chunk;
/// use std::hint::black_box;
/// use std::time::Instant;
///
/// let chunk = vec![0u8; 64 * 1024 * 1024]; // 64 MiB
/// let start = Instant::now();
///
/// for _ in 0..100 {
///     // black_box keeps the optimizer from hoisting or eliding the call
///     black_box(is_zero_chunk(black_box(&chunk)));
/// }
///
/// let elapsed = start.elapsed();
/// let throughput = (64.0 * 100.0) / elapsed.as_secs_f64(); // MiB/s
/// println!("Zero detection: {:.1} GiB/s", throughput / 1024.0);
/// ```
///
/// # SIMD Optimization
///
/// On x86-64 with AVX2, a vectorized version of this loop looks roughly like the
/// following (illustrative only; the bounds check and tail handling are omitted):
///
/// ```text
/// vpxor    ymm0, ymm0, ymm0    ; Zero register
/// loop:
///   vmovdqu  ymm1, [rsi]        ; Load 32 bytes
///   vpcmpeqb ymm2, ymm1, ymm0   ; Compare with zero
///   vpmovmskb eax, ymm2         ; Extract comparison mask
///   cmp      eax, 0xFFFFFFFF    ; All zeros?
///   jne      found_nonzero      ; Early exit if not
///   add      rsi, 32            ; Advance pointer
///   jmp      loop               ; Next 32 bytes
/// ```
///
/// This processes 32 bytes per iteration (~1-2 cycles on modern CPUs).
///
/// # Edge Cases
///
/// - **Empty chunks**: Return `true` (vacuous truth, no non-zero bytes)
/// - **Single byte**: Works correctly, no special handling needed
/// - **Unaligned chunks**: SIMD code handles unaligned loads transparently
///
/// # Alternative Implementations
///
/// Other possible implementations (not currently used):
///
/// 1. **Manual SIMD**: Use `std::arch` for explicit SIMD (faster but less portable)
/// 2. **Chunked comparison**: Process in 8-byte chunks with `u64` casts (faster scalar)
/// 3. **Bitmap scan**: Use CPU's `bsf`/`tzcnt` to skip zero regions (complex)
///
/// The current implementation relies on the compiler to optimize the loop, which works
/// well in practice and maintains portability.
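///
/// For reference, alternative 2 could be sketched as follows. This is only an
/// illustration: `is_zero_chunk_u64` is a hypothetical name, it is not used anywhere in
/// this crate, and it has not been benchmarked against the iterator version. It copies
/// each 8-byte group into a `u64` instead of casting pointers, so it needs no `unsafe`.
///
/// ```rust
/// // Hypothetical variant of `is_zero_chunk` that tests eight bytes at a time.
/// fn is_zero_chunk_u64(chunk: &[u8]) -> bool {
///     let mut words = chunk.chunks_exact(8);
///     // An 8-byte group is all zeros exactly when its u64 value is 0,
///     // regardless of byte order.
///     let body_is_zero = words.by_ref().all(|w| {
///         let mut buf = [0u8; 8];
///         buf.copy_from_slice(w);
///         u64::from_ne_bytes(buf) == 0
///     });
///     // The remainder holds the trailing 0..=7 bytes that did not fill a word.
///     body_is_zero && words.remainder().iter().all(|&b| b == 0)
/// }
/// ```
///
/// Like the iterator version, this returns `true` for an empty slice and stops at the
/// first non-zero word.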
///
/// # Correctness
///
/// This function is pure and infallible:
/// - No side effects (read-only operation)
/// - No panics (iterator `all()` is safe for all inputs)
/// - No undefined behavior (all byte patterns are valid)
pub fn is_zero_chunk(chunk: &[u8]) -> bool {
    chunk.iter().all(|&b| b == 0)
}

#[cfg(test)]
mod tests {
    use super::*;
    use hexz_core::algo::compression::{Lz4Compressor, ZstdCompressor};
    use hexz_core::algo::encryption::AesGcmEncryptor;
    use std::io::Cursor;

    /// Convenience wrapper that calls `write_block_simple` with no dedup map.
    fn write_block_no_dedup<W: Write>(
        out: &mut W,
        chunk: &[u8],
        block_idx: u64,
        current_offset: &mut u64,
        compressor: &dyn Compressor,
        encryptor: Option<&dyn Encryptor>,
    ) -> Result<BlockInfo> {
        write_block_simple(
            out,
            chunk,
            block_idx,
            current_offset,
            None::<&mut StandardHashTable>,
            compressor,
            encryptor,
        )
    }

    #[test]
    fn test_is_zero_chunk_all_zeros() {
        let chunk = vec![0u8; 1024];
        assert!(is_zero_chunk(&chunk));
    }

    #[test]
    fn test_is_zero_chunk_with_nonzero() {
        let mut chunk = vec![0u8; 1024];
        chunk[512] = 1; // Single non-zero byte
        assert!(!is_zero_chunk(&chunk));
    }

    #[test]
    fn test_is_zero_chunk_all_nonzero() {
        let chunk = vec![0xFFu8; 1024];
        assert!(!is_zero_chunk(&chunk));
    }

    #[test]
    fn test_is_zero_chunk_empty() {
        let chunk: Vec<u8> = vec![];
        assert!(is_zero_chunk(&chunk)); // Vacuous truth
    }

    #[test]
    fn test_is_zero_chunk_single_zero() {
        let chunk = vec![0u8];
        assert!(is_zero_chunk(&chunk));
    }

    #[test]
    fn test_is_zero_chunk_single_nonzero() {
        let chunk = vec![1u8];
        assert!(!is_zero_chunk(&chunk));
    }

    #[test]
    fn test_create_zero_block() {
        let logical_len = 65536;
        let info = create_zero_block(logical_len);

        assert_eq!(info.offset, 0);
        assert_eq!(info.length, 0);
        assert_eq!(info.logical_len, logical_len);
        assert_eq!(info.checksum, 0);
    }

    #[test]
    fn test_create_zero_block_various_sizes() {
        for size in [1, 16, 1024, 4096, 65536, 1_048_576] {
            let info = create_zero_block(size);
            assert_eq!(info.offset, 0);
            assert_eq!(info.length, 0);
            assert_eq!(info.logical_len, size);
            assert_eq!(info.checksum, 0);
        }
    }

    #[test]
    fn test_write_block_basic_lz4() {
        let mut output = Cursor::new(Vec::new());
        let mut offset = 512u64; // Start after header
        let chunk = vec![0xAAu8; 4096];
        let compressor = Lz4Compressor::new();

        let result = write_block_no_dedup(&mut output, &chunk, 0, &mut offset, &compressor, None);

        assert!(result.is_ok());
        let info = result.unwrap();

        // Verify offset updated
        assert!(offset > 512);

        // Verify block info
        assert_eq!(info.offset, 512);
        assert!(info.length > 0); // Compressed data written
        assert_eq!(info.logical_len, 4096);
        assert!(info.checksum != 0);

        // Verify data was written
        let written = output.into_inner();
        assert_eq!(written.len(), (offset - 512) as usize);
    }

    #[test]
    fn test_write_block_basic_zstd() {
        let mut output = Cursor::new(Vec::new());
        let mut offset = 512u64;
        let chunk = vec![0xAAu8; 4096];
        let compressor = ZstdCompressor::new(3, None);

        let result = write_block_no_dedup(&mut output, &chunk, 0, &mut offset, &compressor, None);

        assert!(result.is_ok());
        let info = result.unwrap();

        assert_eq!(info.offset, 512);
        assert!(info.length > 0);
        assert_eq!(info.logical_len, 4096);
    }

    #[test]
    fn test_write_block_incompressible_data() {
        let mut output = Cursor::new(Vec::new());
        let mut offset = 512u64;

        // Varied byte pattern. Note that it repeats every 256 bytes, so it is
        // not truly incompressible; the test only checks that the write succeeds.
        let chunk: Vec<u8> = (0..4096).map(|i| ((i * 7 + 13) % 256) as u8).collect();
        let compressor = Lz4Compressor::new();

        let result = write_block_no_dedup(&mut output, &chunk, 0, &mut offset, &compressor, None);

        assert!(result.is_ok());
        let info = result.unwrap();

        // Even "incompressible" data might compress slightly or expand
        // Just verify it executed successfully
        assert_eq!(info.logical_len, chunk.len() as u32);
        assert!(info.length > 0);
    }

    #[test]
    fn test_write_block_with_dedup_unique_blocks() {
        let mut output = Cursor::new(Vec::new());
        let mut offset = 512u64;
        let mut dedup_map = StandardHashTable::new();
        let compressor = Lz4Compressor::new();

        // Write first block
        let chunk1 = vec![0xAAu8; 4096];
        let info1 = write_block_simple(
            &mut output,
            &chunk1,
            0,
            &mut offset,
            Some(&mut dedup_map),
            &compressor,
            None,
        )
        .unwrap();

        let offset_after_block1 = offset;

        // Write second unique block
        let chunk2 = vec![0xBBu8; 4096];
        let info2 = write_block_simple(
            &mut output,
            &chunk2,
            1,
            &mut offset,
            Some(&mut dedup_map),
            &compressor,
            None,
        )
        .unwrap();

        // Both blocks should be written
        assert_eq!(info1.offset, 512);
        assert_eq!(info2.offset, offset_after_block1);
        assert!(offset > offset_after_block1);

        // Dedup map should have 2 entries
        assert_eq!(dedup_map.len(), 2);
    }

    #[test]
    fn test_write_block_with_dedup_duplicate_blocks() {
        let mut output = Cursor::new(Vec::new());
        let mut offset = 512u64;
        let mut dedup_map = StandardHashTable::new();
        let compressor = Lz4Compressor::new();

        // Write first block
        let chunk1 = vec![0xAAu8; 4096];
        let info1 = write_block_simple(
            &mut output,
            &chunk1,
            0,
            &mut offset,
            Some(&mut dedup_map),
            &compressor,
            None,
        )
        .unwrap();

        let offset_after_block1 = offset;

        // Write duplicate block (same content)
        let chunk2 = vec![0xAAu8; 4096];
        let info2 = write_block_simple(
            &mut output,
            &chunk2,
            1,
            &mut offset,
            Some(&mut dedup_map),
            &compressor,
            None,
        )
        .unwrap();

        // Second block should reuse first block's offset
        assert_eq!(info1.offset, info2.offset);
        assert_eq!(info1.length, info2.length);
        assert_eq!(info1.checksum, info2.checksum);

        // Offset should not advance (no write)
        assert_eq!(offset, offset_after_block1);

        // Dedup map should have 1 entry (deduplicated)
        assert_eq!(dedup_map.len(), 1);
    }

    #[test]
    fn test_write_block_with_encryption() {
        let mut output = Cursor::new(Vec::new());
        let mut offset = 512u64;
        let chunk = vec![0xAAu8; 4096];
        let compressor = Lz4Compressor::new();

        // Create encryptor
        let salt = [0u8; 32];
        let encryptor = AesGcmEncryptor::new(b"test_password", &salt, 100_000).unwrap();

        let result = write_block_no_dedup(
            &mut output,
            &chunk,
            0,
            &mut offset,
            &compressor,
            Some(&encryptor),
        );

        assert!(result.is_ok());
        let info = result.unwrap();

        // Encrypted data should be larger than compressed (adds GCM tag)
        assert!(info.length > 16); // At least tag overhead
        assert_eq!(info.logical_len, 4096);
    }

    #[test]
    fn test_write_block_encryption_disables_dedup() {
        let mut output = Cursor::new(Vec::new());
        let mut offset = 512u64;
        let mut dedup_map = StandardHashTable::new();
        let compressor = Lz4Compressor::new();
        let salt = [0u8; 32];
        let encryptor = AesGcmEncryptor::new(b"test_password", &salt, 100_000).unwrap();

        // Write first encrypted block
        let chunk1 = vec![0xAAu8; 4096];
        let info1 = write_block_simple(
            &mut output,
            &chunk1,
            0,
            &mut offset,
            Some(&mut dedup_map),
            &compressor,
            Some(&encryptor),
        )
        .unwrap();

        let offset_after_block1 = offset;

        // Write second encrypted block (same content, different nonce)
        let chunk2 = vec![0xAAu8; 4096];
        let info2 = write_block_simple(
            &mut output,
            &chunk2,
            1,
            &mut offset,
            Some(&mut dedup_map),
            &compressor,
            Some(&encryptor),
        )
        .unwrap();

        // Both blocks should be written (no dedup with encryption)
        assert_eq!(info1.offset, 512);
        assert_eq!(info2.offset, offset_after_block1);
        assert!(offset > offset_after_block1);

        // Dedup map should be empty (encryption disables dedup)
        assert_eq!(dedup_map.len(), 0);
    }

    #[test]
    fn test_write_block_multiple_sequential() {
        let mut output = Cursor::new(Vec::new());
        let mut offset = 512u64;
        let compressor = Lz4Compressor::new();

        let mut expected_offset = 512u64;

        // Write 10 blocks sequentially
        for i in 0..10 {
            let chunk = vec![i as u8; 4096];
            let info = write_block_no_dedup(&mut output, &chunk, i, &mut offset, &compressor, None)
                .unwrap();

            assert_eq!(info.offset, expected_offset);
            expected_offset += info.length as u64;
        }

        assert_eq!(offset, expected_offset);
    }

    #[test]
    fn test_write_block_preserves_logical_length() {
        let mut output = Cursor::new(Vec::new());
        let mut offset = 512u64;
        let compressor = Lz4Compressor::new();

        for size in [128, 1024, 4096, 65536] {
            let chunk = vec![0xAAu8; size];
            let info = write_block_no_dedup(&mut output, &chunk, 0, &mut offset, &compressor, None)
                .unwrap();

            assert_eq!(info.logical_len, size as u32);
        }
    }

    #[test]
    fn test_write_block_checksum_differs() {
        let mut output1 = Cursor::new(Vec::new());
        let mut output2 = Cursor::new(Vec::new());
        let mut offset1 = 512u64;
        let mut offset2 = 512u64;
        let compressor = Lz4Compressor::new();

        let chunk1 = vec![0xAAu8; 4096];
        let chunk2 = vec![0xBBu8; 4096];

        let info1 = write_block_no_dedup(&mut output1, &chunk1, 0, &mut offset1, &compressor, None)
            .unwrap();

        let info2 = write_block_no_dedup(&mut output2, &chunk2, 0, &mut offset2, &compressor, None)
            .unwrap();

        // Different input data should produce different checksums
        assert_ne!(info1.checksum, info2.checksum);
    }

    #[test]
    fn test_write_block_empty_chunk() {
        let mut output = Cursor::new(Vec::new());
        let mut offset = 512u64;
        let chunk: Vec<u8> = vec![];
        let compressor = Lz4Compressor::new();

        let result = write_block_no_dedup(&mut output, &chunk, 0, &mut offset, &compressor, None);

        // Should handle empty chunk
        assert!(result.is_ok());
        let info = result.unwrap();
        assert_eq!(info.logical_len, 0);
    }

    #[test]
    fn test_write_block_large_block() {
        let mut output = Cursor::new(Vec::new());
        let mut offset = 512u64;
        let chunk = vec![0xAAu8; 1024 * 1024]; // 1 MiB
        let compressor = Lz4Compressor::new();

        let result = write_block_no_dedup(&mut output, &chunk, 0, &mut offset, &compressor, None);

        assert!(result.is_ok());
        let info = result.unwrap();
        assert_eq!(info.logical_len, 1024 * 1024);
        // Highly compressible data should compress well
        assert!(info.length < info.logical_len);
    }

    #[test]
    fn test_integration_zero_detection_and_write() {
        let mut output = Cursor::new(Vec::new());
        let mut offset = 512u64;
        let compressor = Lz4Compressor::new();

        let zero_chunk = vec![0u8; 4096];
        let data_chunk = vec![0xAAu8; 4096];

        // Process zero chunk
        let zero_info = if is_zero_chunk(&zero_chunk) {
            create_zero_block(zero_chunk.len() as u32)
        } else {
            write_block_no_dedup(&mut output, &zero_chunk, 0, &mut offset, &compressor, None)
                .unwrap()
        };

        // Process data chunk
        let data_info = if is_zero_chunk(&data_chunk) {
            create_zero_block(data_chunk.len() as u32)
        } else {
            write_block_no_dedup(&mut output, &data_chunk, 1, &mut offset, &compressor, None)
                .unwrap()
        };

        // Zero block should not be written
        assert_eq!(zero_info.offset, 0);
        assert_eq!(zero_info.length, 0);

        // Data block should be written
        assert_eq!(data_info.offset, 512);
        assert!(data_info.length > 0);
    }
}