pub struct CdcStats {
pub total_bytes: u64,
pub unique_bytes: u64,
pub chunk_count: u64,
pub unique_chunk_count: u64,
}
Statistics from a CDC deduplication analysis run.
This structure captures metrics from a content-defined chunking analysis pass, providing insights into chunk distribution, deduplication effectiveness, and potential storage savings.
§Use Cases
- DCAM Model Input: Feeds into analytical formulas to predict optimal parameters
- Capacity Planning: Estimates actual storage requirements after deduplication
- Performance Analysis: Evaluates chunk size distribution and variance
- Comparison Studies: Benchmarks different parameter configurations
§Invariants
The following relationships always hold for valid statistics:
- unique_chunk_count <= chunk_count (can't have more unique than total)
- unique_bytes <= chunk_count * max_chunk_size (bounded by chunk constraints)
- unique_bytes >= unique_chunk_count * min_chunk_size (minimum size bound)
- total_bytes may be 0 (not tracked in streaming mode)
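A minimal sketch of how these invariants could be checked. Neither the function nor the min_chunk_size/max_chunk_size parameters are part of the crate API; the caller would supply the bounds from the chunker configuration:
fn check_invariants(stats: &CdcStats, min_chunk_size: u64, max_chunk_size: u64) -> bool {
    // Can't have more unique chunks than total chunks.
    stats.unique_chunk_count <= stats.chunk_count
        // Unique bytes are bounded above by total chunks times the max chunk size...
        && stats.unique_bytes <= stats.chunk_count * max_chunk_size
        // ...and below by unique chunks times the min chunk size.
        && stats.unique_bytes >= stats.unique_chunk_count * min_chunk_size
    // total_bytes is deliberately not checked: it may legitimately be 0.
}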
§Calculating Deduplication Metrics
§Deduplication Ratio
The deduplication ratio represents what fraction of data is unique:
dedup_ratio = unique_bytes / total_bytes_before_dedup
Where total_bytes_before_dedup (the original data size) can be estimated from:
total_bytes_before_dedup ≈ unique_bytes × (chunk_count / unique_chunk_count)
Interpretation:
- Ratio of 0.5 means 50% of data is unique → 50% storage savings
- Ratio of 1.0 means 100% unique (no duplicates) → 0% savings
- Ratio of 0.3 means 30% unique → 70% savings
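As a sketch only (these helpers are not part of the crate API and assume unique_chunk_count > 0), the estimate and ratio above can be expressed directly in terms of the fields:
fn estimated_original_bytes(stats: &CdcStats) -> u64 {
    // Scale unique bytes by the ratio of total to unique chunks.
    // Note: for very large datasets this multiplication can overflow u64;
    // widen to u128 if that is a concern.
    stats.unique_bytes * stats.chunk_count / stats.unique_chunk_count
}

fn dedup_ratio(stats: &CdcStats) -> f64 {
    // Fraction of the estimated original data that is unique (0.0..=1.0).
    stats.unique_bytes as f64 / estimated_original_bytes(stats) as f64
}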
§Deduplication Factor
An alternative metric is the deduplication factor (multiplicative savings):
dedup_factor = chunk_count / unique_chunk_count
Interpretation:
- Factor of 1.0x means no deduplication occurred
- Factor of 2.0x means the data deduplicates to half its original size (50% savings)
- Factor of 10.0x means 90% of chunks were duplicates
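A corresponding sketch for the factor, again as a hypothetical free function rather than a method provided by the crate (assumes unique_chunk_count > 0):
fn dedup_factor(stats: &CdcStats) -> f64 {
    // Total chunks seen per unique chunk stored; 1.0 means no duplicates.
    stats.chunk_count as f64 / stats.unique_chunk_count as f64
}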
§Average Chunk Size
The realized average chunk size (may differ from 2^f for small datasets):
avg_chunk_size = unique_bytes / unique_chunk_count
§Examples
let stats = CdcStats {
    total_bytes: 0, // Not tracked in streaming mode (estimated below)
    unique_bytes: 50_000_000,
    chunk_count: 10_000,
    unique_chunk_count: 3_000,
};
// Estimate original size
let estimated_original = stats.unique_bytes * stats.chunk_count / stats.unique_chunk_count;
assert_eq!(estimated_original, 166_666_666); // ~167 MB original
// Calculate deduplication ratio (fraction unique)
let dedup_ratio = stats.unique_bytes as f64 / estimated_original as f64;
assert!((dedup_ratio - 0.3).abs() < 0.01); // ~30% unique
// Calculate storage savings (percent eliminated)
let savings_percent = (1.0 - dedup_ratio) * 100.0;
assert!((savings_percent - 70.0).abs() < 1.0); // ~70% savings
// Calculate deduplication factor (compression ratio)
let dedup_factor = stats.chunk_count as f64 / stats.unique_chunk_count as f64;
assert!((dedup_factor - 3.33).abs() < 0.01); // ~3.33x compression
// Average chunk size
let avg_chunk_size = stats.unique_bytes / stats.unique_chunk_count;
println!("Average unique chunk size: {} bytes", avg_chunk_size);Fields§
§total_bytes: u64
Total bytes processed from the input stream.
Note: This field is currently not tracked during streaming analysis (always set to 0) to avoid memory overhead. It may be populated in future versions if needed for enhanced metrics.
To estimate total bytes, use:
total_bytes ≈ unique_bytes * (chunk_count / unique_chunk_count)
§unique_bytes: u64
Number of unique bytes after deduplication.
This is the sum of sizes of all unique chunks (first occurrence only). Represents the actual storage space required if deduplication is applied.
§Interpretation
- Lower values indicate higher redundancy in the dataset
- Equals total data size if no duplicates exist
- Bounded by unique_chunk_count * min_chunk_size and unique_chunk_count * max_chunk_size
§chunk_count: u64
Total number of chunks identified by FastCDC.
Includes both unique and duplicate chunks. This represents how many chunks would be created if the data were processed through the chunking pipeline.
§Expected Values
For a dataset of size N bytes with average chunk size 2^f:
chunk_count ≈ N / (2^f)
Example: a 1 GB file with f=14 (16 KB average) → ~65,536 chunks
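That figure can be reproduced with a quick calculation (a sketch; f here is the configured average-chunk-size exponent, so the average chunk size is 2^f bytes):
let n: u64 = 1 << 30; // 1 GiB of input data
let f: u32 = 14;      // average chunk size 2^14 = 16 KiB
let expected_chunks = n / (1u64 << f);
assert_eq!(expected_chunks, 65_536);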
§unique_chunk_count: u64
Number of unique chunks after deduplication.
This counts only distinct chunks, identified by hash comparison (see the Hash Collision Note below). Duplicate chunks are not counted in this total.
§Interpretation
- unique_chunk_count == chunk_count → No deduplication (all chunks unique)
- unique_chunk_count << chunk_count → High deduplication (many duplicates)
- Ratio unique_chunk_count / chunk_count indicates dedup effectiveness
§Hash Collision Note
Uses the first 8 bytes of the BLAKE3 digest as a 64-bit hash, providing negligible collision probability (birthday bound at ~2^32 ≈ 4 billion unique chunks).
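One way such a key can be derived, shown as an illustrative sketch using the blake3 crate directly (the little-endian byte order chosen here is an assumption, not necessarily what the analysis pass uses internally):
fn chunk_key(chunk: &[u8]) -> u64 {
    // Hash the chunk contents and keep the first 8 bytes as a 64-bit key.
    let hash = blake3::hash(chunk);
    let mut first8 = [0u8; 8];
    first8.copy_from_slice(&hash.as_bytes()[..8]);
    u64::from_le_bytes(first8)
}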
Trait Implementations§
Auto Trait Implementations§
impl Freeze for CdcStats
impl RefUnwindSafe for CdcStats
impl Send for CdcStats
impl Sync for CdcStats
impl Unpin for CdcStats
impl UnsafeUnpin for CdcStats
impl UnwindSafe for CdcStats
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.