Function run 

pub fn run(input: PathBuf) -> Result<()>

Executes the analyze command, which uses DCAM to recommend content-defined chunking (CDC) parameters for the input file.

Reads a sample of the input file, runs a baseline CDC chunking pass to measure its deduplication characteristics, and calculates the change probability c from that pass. It then uses a greedy search to find the optimal chunking parameters (fingerprint bits f and minimum chunk size m), and prints the baseline parameters, the recommended parameters, and the predicted deduplication ratio.

§Arguments

  • input - Path to the disk image file (raw, qcow2, or any binary file)

§Output Format

Analyzing disk.img using DCAM...
Reading 512.0 MB sample for analysis...
Running Baseline CDC Pass (Avg Chunk: 8KB)...
  Processed: 512.0 MB
  Unique:    384.0 MB (75.0%)
  Chunks:    65536
  Estimated Change Prob (c): 0.750000

Optimizing parameters using DCAM...

--- Optimization Results ---
Parameter                 | Baseline (LBFS) | Recommended
--------------------------|-----------------|----------------
Fingerprint Bits (f)      | 13              | 14
Min Chunk Size (m)        | 256             | 1024
Avg Chunk Size            | 8.0 KB          | 16.0 KB

--- Predictions ---
Predicted Ratio: 0.7234
Est. Final Size: 7.2 GB
Est. Savings:    2.8 GB
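
The numbers in the sample output can be related with a little arithmetic. The sketch below is not the crate's code: the helper names are hypothetical, it assumes the reported average chunk size is 2^f bytes (which matches 13 -> 8.0 KB and 14 -> 16.0 KB in the table), and it assumes the predicted ratio is stored size divided by original size. The 10 GB total is inferred from 7.2 GB + 2.8 GB and is not stated in the output.

```rust
/// Reported average chunk size in bytes for `fingerprint_bits` (f).
/// Assumption: a boundary is declared with probability 2^-f per byte,
/// so the expected spacing between boundaries is 2^f bytes.
fn avg_chunk_size(fingerprint_bits: u32) -> u64 {
    1u64 << fingerprint_bits
}

/// Returns (estimated final size, estimated savings) in bytes.
/// Assumption: `predicted_ratio` is stored size / original size.
fn estimate_sizes(total_bytes: u64, predicted_ratio: f64) -> (u64, u64) {
    let final_size = (total_bytes as f64 * predicted_ratio).round() as u64;
    (final_size, total_bytes.saturating_sub(final_size))
}
```

With a 10 GB input and a ratio of 0.7234, estimate_sizes gives roughly 7.2 GB stored and 2.8 GB saved, in line with the Predictions block.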

§Algorithm Details

The function implements these steps:

  1. File Reading: Opens input file and reads up to 512 MiB sample
  2. Header Skipping: For large files, skips first 1 MiB to avoid partition metadata
  3. Baseline Chunking: Runs FastCDC with LBFS parameters (f=13, m=256)
  4. Statistics Collection: Counts total bytes, unique bytes, and chunks
  5. Change Probability: Calculates c = unique_bytes / total_bytes
  6. Greedy Optimization: Calls find_optimal_parameters to search parameter space
  7. Prediction: Uses DCAM model to estimate deduplication ratio
  8. Result Display: Prints comparison table and predicted savings
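
Steps 1, 2, 4, and 5 can be sketched with the standard library alone. This is an illustration under stated assumptions, not the function's actual implementation: sample_window and change_probability are hypothetical names, the threshold for a "large" file is assumed to be anything bigger than the sample plus the skipped header, and chunks are compared by content rather than by a hash digest.

```rust
use std::collections::HashSet;

const SAMPLE_BYTES: u64 = 512 * 1024 * 1024; // step 1: read up to 512 MiB
const HEADER_SKIP: u64 = 1024 * 1024; // step 2: skip the first 1 MiB

/// Returns the (offset, length) of the region to sample.
/// Assumption: a file is "large" when it exceeds sample + header skip.
fn sample_window(file_len: u64) -> (u64, u64) {
    if file_len > SAMPLE_BYTES + HEADER_SKIP {
        (HEADER_SKIP, SAMPLE_BYTES)
    } else {
        (0, file_len.min(SAMPLE_BYTES))
    }
}

/// Steps 4 and 5: c = unique_bytes / total_bytes over the baseline chunks.
/// A chunk counts as unique the first time its content is seen.
fn change_probability(chunks: &[&[u8]]) -> f64 {
    let mut seen: HashSet<&[u8]> = HashSet::new();
    let mut total_bytes = 0u64;
    let mut unique_bytes = 0u64;
    for chunk in chunks {
        total_bytes += chunk.len() as u64;
        if seen.insert(*chunk) {
            unique_bytes += chunk.len() as u64;
        }
    }
    if total_bytes == 0 {
        0.0 // no data, so avoid dividing by zero
    } else {
        unique_bytes as f64 / total_bytes as f64
    }
}
```

For four equal-sized chunks where one repeats an earlier chunk, change_probability returns 0.75, the same value of c shown in the sample output.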

§Errors

Returns an error if:

  • Input file cannot be opened (file not found, permission denied)
  • File metadata cannot be read
  • File read operations fail (I/O error)
  • CDC analysis fails (invalid data, algorithm error)

Note: Empty files are handled gracefully with an early return.
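
The first two error cases and the empty-file early return could look like the sketch below. open_non_empty is a hypothetical helper written for illustration, and it uses std::io::Result because the error type behind the function's Result<()> is not shown here.

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Opens `path` and returns its length in bytes.
/// Returns `Ok(None)` for an empty file so the caller can return early
/// instead of treating it as an error.
fn open_non_empty(path: &Path) -> io::Result<Option<(File, u64)>> {
    // Fails if the file is missing or permission is denied.
    let file = File::open(path)?;
    // Fails if the file metadata cannot be read.
    let len = file.metadata()?.len();
    if len == 0 {
        return Ok(None);
    }
    Ok(Some((file, len)))
}
```

A caller would match on the result: propagate Err with ?, return Ok(()) on None, and continue with the analysis on Some.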

§Examples

use std::path::PathBuf;
use hexz_cli::cmd::data::analyze;

// Analyze a disk image
analyze::run(PathBuf::from("vm-disk.img"))?;

// Analyze a large backup file
analyze::run(PathBuf::from("/backup/system.tar"))?;