Analyze archive structure and optimize CDC parameters using DCAM.
This command performs offline analysis of disk images to scientifically determine optimal content-defined chunking (CDC) parameters. It uses DCAM (Deduplication Change-Estimation Analytical Model), a mathematical framework that predicts deduplication effectiveness without performing full chunking, enabling fast parameter optimization.
§DCAM Algorithm Overview
DCAM estimates deduplication efficiency in four steps:
- Baseline Pass: Chunks a sample with LBFS parameters (8 KiB average)
- Change Probability: Calculates c (the fraction of unique data)
- Greedy Search: Tests parameter combinations to minimize deduped size
- Prediction: Uses analytical model to estimate full-file deduplication
The key insight is that deduplication effectiveness depends on:
- f: Fingerprint bits (determines average chunk size = 2^f)
- m: Minimum chunk size (prevents pathologically small chunks)
- c: Change probability (intrinsic data characteristic)
DCAM predicts the deduplication ratio without actually deduplicating the entire file, making optimization practical for large disk images.
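As a minimal illustration of these relationships (the struct and method names below are hypothetical, not part of this crate's API), f and m determine the derived chunk sizes directly, while c is not tunable at all: it is measured from the baseline pass and fed into the prediction.

/// Hypothetical sketch of the tunable CDC parameters described above.
struct CdcParams {
    f: u32, // fingerprint bits
    m: u64, // minimum chunk size in bytes
}

impl CdcParams {
    /// Average chunk size is 2^f bytes (f = 13 gives the 8 KiB LBFS default).
    fn avg_chunk_size(&self) -> u64 {
        1u64 << self.f
    }

    /// Maximum chunk size z = 2^(f+3) bytes, i.e. 8x the average.
    fn max_chunk_size(&self) -> u64 {
        1u64 << (self.f + 3)
    }

    /// Search-space constraint: the minimum chunk must stay below the maximum.
    fn is_valid(&self) -> bool {
        self.m < self.max_chunk_size()
    }
}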
§Greedy Search Algorithm
The find_optimal_parameters function implements a hill-climbing search:
Algorithm:
current = baseline parameters (f=13, m=256)
best_ratio = predict_ratio(current)
improved = true
while improved:
    improved = false
    for each neighbor of current (f±1, m×2, m÷2):
        ratio = predict_ratio(neighbor)
        if ratio < best_ratio:
            current = neighbor
            best_ratio = ratio
            improved = true
    endfor
endwhile
return current

Search Space:
- f: [8, 20] → average chunk size [256 B, 1 MiB]
- m: [64, 16384] → minimum chunk size [64 B, 16 KiB]
- Constraint: m < z, where z = 2^(f+3) is the maximum chunk size
Termination:
- Converges when no neighbor improves the ratio
- Maximum 100 iterations (typically converges in 5-15)
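The pseudocode above corresponds roughly to the following Rust sketch. It is illustrative only: the signature and the predict_ratio closure standing in for the DCAM prediction are assumptions, not the crate's actual implementation.

/// First-improvement hill climbing over (f, m), mirroring the pseudocode above.
/// `predict_ratio` stands in for the DCAM prediction (lower is better).
fn find_optimal_parameters(predict_ratio: impl Fn(u32, u64) -> f64) -> (u32, u64) {
    const F_RANGE: (u32, u32) = (8, 20);
    const M_RANGE: (u64, u64) = (64, 16_384);
    const MAX_ITERATIONS: usize = 100;

    // Baseline parameters: f = 13 (8 KiB average), m = 256.
    let (mut f, mut m) = (13u32, 256u64);
    let mut best_ratio = predict_ratio(f, m);

    for _ in 0..MAX_ITERATIONS {
        let mut improved = false;
        // Neighbors of the current point: f ± 1, m × 2, m ÷ 2.
        let neighbors = [
            (f.saturating_sub(1), m),
            (f + 1, m),
            (f, m * 2),
            (f, m / 2),
        ];
        for &(nf, nm) in &neighbors {
            // Enforce the search-space bounds and the constraint m < 2^(f+3).
            if nf < F_RANGE.0 || nf > F_RANGE.1 || nm < M_RANGE.0 || nm > M_RANGE.1 {
                continue;
            }
            if nm >= 1u64 << (nf + 3) {
                continue;
            }
            let ratio = predict_ratio(nf, nm);
            if ratio < best_ratio {
                f = nf;
                m = nm;
                best_ratio = ratio;
                improved = true;
            }
        }
        // Converged: no neighbor improves the predicted deduplicated size.
        if !improved {
            break;
        }
    }
    (f, m)
}

The returned pair maps onto the CLI flags used in the workflow below: the recommended average chunk size is 2^f (--avg-chunk) and the recommended minimum chunk size is m (--min-chunk).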
§Sampling Strategy
To reduce analysis time, at most 512 MiB of the image is sampled:
- For files > 513 MiB: Skip first 1 MiB (to avoid partition tables/headers)
- For files ≤ 513 MiB: Analyze entire file
This sampling is sufficient because deduplication characteristics are typically uniform across a disk image (same filesystem, similar files).
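A minimal sketch of this rule, assuming the 512 MiB window is read contiguously after the 1 MiB header skip (the function name is hypothetical):

/// Returns the (offset, length) of the region of the image to analyze.
fn sample_region(file_len: u64) -> (u64, u64) {
    const MIB: u64 = 1024 * 1024;
    const SAMPLE_LEN: u64 = 512 * MIB;
    const HEADER_SKIP: u64 = MIB;

    if file_len > 513 * MIB {
        // Large image: skip the first 1 MiB (partition tables/headers),
        // then sample the next 512 MiB.
        (HEADER_SKIP, SAMPLE_LEN)
    } else {
        // Small image: analyze the entire file.
        (0, file_len)
    }
}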
§Use Cases
- Pre-Snapshot Optimization: Determine optimal CDC parameters before packing
- Workload Characterization: Understand data redundancy patterns
- Compression Tuning: Compare fixed vs. variable block effectiveness
- Research: Validate DCAM model predictions on real-world data
§Recommended Workflow
# 1. Analyze disk image
hexz analyze disk.img
# Output: Recommends f=14 (16 KiB avg), m=1024 (1 KiB min)
# 2. Pack with recommended parameters
hexz pack --disk disk.img --output snapshot.st --cdc \
--min-chunk 1024 --avg-chunk 16384 --max-chunk 32768
# 3. Verify compression ratio
hexz info snapshot.st
# Output: Compression ratio should match DCAM prediction

§Performance Characteristics
- Sampling Time: ~2-5 seconds for 512 MiB sample
- Baseline Pass: Single CDC chunking at ~200 MB/s
- Greedy Search: 10-30 iterations × DCAM prediction (~1 μs each)
- Total Time: Typically 5-10 seconds for large disk images
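As a sanity check on these figures, chunking the 512 MiB sample (≈ 537 MB) at ~200 MB/s works out to roughly 537 / 200 ≈ 2.7 seconds, which is consistent with the quoted 2-5 second sampling time once I/O is included.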
Functions§
- run
- Executes the analyze command to optimize CDC parameters using DCAM.