Module analyze

Analyze archive structure and optimize CDC parameters using DCAM.

This command performs offline analysis of disk images to scientifically determine optimal content-defined chunking (CDC) parameters. It uses DCAM (Deduplication Change-Estimation Analytical Model), a mathematical framework that predicts deduplication effectiveness without performing full chunking, enabling fast parameter optimization.

§DCAM Algorithm Overview

DCAM estimates deduplication efficiency by:

  1. Baseline Pass: Chunks a sample with LBFS parameters (8 KiB average)
  2. Change Probability: Calculates c (fraction of unique data)
  3. Greedy Search: Tests parameter combinations to minimize deduped size
  4. Prediction: Uses analytical model to estimate full-file deduplication
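
To make step 2 concrete, the change probability can be estimated from the baseline pass by summing the bytes of chunks whose content hash appears for the first time. The sketch below is illustrative only: it assumes the baseline chunker yields (hash, length) pairs, and the function name is not part of this module's API.

use std::collections::HashSet;

// Illustrative only: estimate the change probability `c` (the fraction of
// unique data) from the chunks produced by the 8 KiB-average baseline pass,
// represented here as (content hash, length) pairs.
fn estimate_change_probability(chunks: &[(u64, usize)]) -> f64 {
    let mut seen = HashSet::new();
    let mut total_bytes = 0usize;
    let mut unique_bytes = 0usize;
    for &(hash, len) in chunks {
        total_bytes += len;
        // Only the first occurrence of a chunk contributes unique bytes.
        if seen.insert(hash) {
            unique_bytes += len;
        }
    }
    if total_bytes == 0 {
        0.0
    } else {
        unique_bytes as f64 / total_bytes as f64
    }
}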

The key insight is that deduplication effectiveness depends on:

  • f: Fingerprint bits (determines average chunk size = 2^f)
  • m: Minimum chunk size (prevents pathologically small chunks)
  • c: Change probability (intrinsic data characteristic)

DCAM predicts the deduplication ratio without actually deduplicating the entire file, making optimization practical for large disk images.
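
The mapping from these parameters to concrete chunk-size bounds can be captured in a small parameter type, sketched below. The type and method names are illustrative rather than the module's public API; the maximum chunk size z = 2^(f+3) anticipates the constraint used by the greedy search in the next section.

/// Illustrative CDC parameter pair explored during optimization.
#[derive(Clone, Copy, Debug, PartialEq)]
struct CdcParams {
    /// Fingerprint bits: the average chunk size is 2^f bytes.
    f: u32,
    /// Minimum chunk size in bytes.
    m: u64,
}

impl CdcParams {
    fn avg_chunk_size(&self) -> u64 {
        1u64 << self.f
    }

    /// Maximum chunk size z = 2^(f+3), i.e. 8x the average.
    fn max_chunk_size(&self) -> u64 {
        1u64 << (self.f + 3)
    }

    /// The search constraint m < z from the section below.
    fn is_valid(&self) -> bool {
        self.m < self.max_chunk_size()
    }
}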

§Greedy Search Algorithm

The find_optimal_parameters function implements a hill-climbing search:

Algorithm:

current = baseline parameters (f=13, m=256)
best_ratio = predict_ratio(current)
improved = true

while improved:
  improved = false
  for each neighbor of current (f±1, m×2, m÷2):
    if neighbor is outside the search space: skip
    ratio = predict_ratio(neighbor)
    if ratio < best_ratio:
      current = neighbor
      best_ratio = ratio
      improved = true
  endfor
endwhile

return current

Search Space:

  • f: [8, 20] → average chunk size [256 B, 1 MiB]
  • m: [64, 16384] → minimum chunk size [64 B, 16 KiB]
  • Constraint: m < z where z = 2^(f+3) (max chunk size)

Termination:

  • Converges when no neighbor improves the ratio
  • Maximum 100 iterations (typically converges in 5-15)
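
A sketch of this hill-climbing loop in Rust, building on the illustrative CdcParams type above. predict_ratio stands in for the DCAM analytical model, whose formula is not reproduced here; the function and parameter names are assumptions, not the module's actual API.

/// Hill-climbing over (f, m), minimizing the predicted deduplicated-size ratio.
fn find_optimal_parameters(predict_ratio: impl Fn(CdcParams) -> f64) -> CdcParams {
    let mut current = CdcParams { f: 13, m: 256 }; // LBFS-style baseline
    let mut best_ratio = predict_ratio(current);

    // Hard iteration cap; the search usually converges in 5-15 iterations.
    for _ in 0..100 {
        let mut improved = false;

        // Neighbors of the current point: f ± 1, m × 2, m ÷ 2.
        let neighbors = [
            CdcParams { f: current.f + 1, ..current },
            CdcParams { f: current.f.saturating_sub(1), ..current },
            CdcParams { m: current.m * 2, ..current },
            CdcParams { m: current.m / 2, ..current },
        ];

        for n in neighbors {
            let in_range = (8..=20).contains(&n.f) && (64..=16384).contains(&n.m);
            if !in_range || !n.is_valid() {
                continue; // outside the f/m bounds or violates m < 2^(f+3)
            }
            let ratio = predict_ratio(n);
            if ratio < best_ratio {
                current = n;
                best_ratio = ratio;
                improved = true;
            }
        }

        if !improved {
            break; // no neighbor improves the ratio: converged
        }
    }
    current
}

A caller would wrap the model's prediction in the closure, e.g. find_optimal_parameters(|p| model.predict(p)), where model.predict is a hypothetical handle to the DCAM estimate for the sampled data.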

§Sampling Strategy

To reduce analysis time, at most 512 MiB of the image is sampled:

  • For files > 513 MiB: skip the first 1 MiB (to avoid partition tables/headers) and sample 512 MiB
  • For files ≤ 513 MiB: analyze the entire file

This sampling is sufficient because deduplication characteristics are typically uniform across a disk image (same filesystem, similar files).
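
A minimal sketch of that sampling decision, under the assumption that the 512 MiB sample is taken as one contiguous window immediately after the skipped header region; the constant and function names are illustrative.

const MIB: u64 = 1024 * 1024;
const SAMPLE_SIZE: u64 = 512 * MIB;
const HEADER_SKIP: u64 = MIB;

/// Returns the (offset, length) of the region to analyze for a file of
/// `file_len` bytes. Names and the contiguous-window assumption are
/// illustrative.
fn sample_window(file_len: u64) -> (u64, u64) {
    if file_len > 513 * MIB {
        // Large image: skip partition tables/headers, then sample 512 MiB.
        (HEADER_SKIP, SAMPLE_SIZE)
    } else {
        // Small image: analyze the entire file.
        (0, file_len)
    }
}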

§Use Cases

  • Pre-Snapshot Optimization: Determine optimal CDC parameters before packing
  • Workload Characterization: Understand data redundancy patterns
  • Compression Tuning: Compare fixed vs. variable block effectiveness
  • Research: Validate DCAM model predictions on real-world data

Example workflow:

# 1. Analyze disk image
hexz analyze disk.img
# Output: Recommends f=14 (16 KiB avg), m=1024 (1 KiB min)

# 2. Pack with recommended parameters
hexz pack --disk disk.img --output snapshot.st --cdc \
  --min-chunk 1024 --avg-chunk 16384 --max-chunk 32768

# 3. Verify compression ratio
hexz info snapshot.st
# Output: Compression ratio should match DCAM prediction

§Performance Characteristics

  • Sampling Time: ~2-5 seconds for 512 MiB sample
  • Baseline Pass: Single CDC chunking at ~200 MB/s
  • Greedy Search: 10-30 iterations × DCAM prediction (~1 μs each)
  • Total Time: Typically 5-10 seconds for large disk images

§Functions

run: Executes the analyze command to optimize CDC parameters using DCAM.