Struct CdcStats 

pub struct CdcStats {
    pub total_bytes: u64,
    pub unique_bytes: u64,
    pub chunk_count: u64,
    pub unique_chunk_count: u64,
}

Statistics from a CDC deduplication analysis run.

This structure captures the byte and chunk counts from a content-defined chunking analysis pass. From these counts you can derive deduplication effectiveness, the realized average chunk size, and potential storage savings.

§Use Cases

  • DCAM Model Input: Feeds into analytical formulas to predict optimal parameters
  • Capacity Planning: Estimates actual storage requirements after deduplication
  • Performance Analysis: Compares the realized average chunk size against the configured target
  • Comparison Studies: Benchmarks different parameter configurations

§Invariants

The following relationships always hold for valid statistics:

  • unique_chunk_count <= chunk_count (can’t have more unique than total)
  • unique_bytes <= unique_chunk_count * max_chunk_size (each unique chunk is at most max_chunk_size)
  • unique_bytes >= unique_chunk_count * min_chunk_size (minimum size bound)
  • total_bytes may be 0 (not tracked in streaming mode)
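The invariants above can be checked mechanically. The following is a minimal sketch, not part of the crate's API: `check_invariants` is a hypothetical helper, `min_chunk_size` and `max_chunk_size` are assumed to be the chunker's configured bounds, and the struct is a local mirror of `CdcStats` so the snippet compiles on its own. If the chunker emits a short trailing chunk at end of input, the lower bound may need to be relaxed.

```rust
/// Local mirror of the documented struct so this sketch is self-contained.
#[derive(Debug, Clone, Copy)]
pub struct CdcStats {
    pub total_bytes: u64,
    pub unique_bytes: u64,
    pub chunk_count: u64,
    pub unique_chunk_count: u64,
}

/// Hypothetical helper: returns `Ok(())` if the documented invariants hold,
/// otherwise a description of the first violation found.
pub fn check_invariants(
    stats: &CdcStats,
    min_chunk_size: u64, // assumed chunker configuration
    max_chunk_size: u64, // assumed chunker configuration
) -> Result<(), String> {
    if stats.unique_chunk_count > stats.chunk_count {
        return Err("unique_chunk_count exceeds chunk_count".to_string());
    }
    // Widen to u128 so the products cannot overflow.
    let unique_bytes = u128::from(stats.unique_bytes);
    let unique_chunks = u128::from(stats.unique_chunk_count);
    if unique_bytes > unique_chunks * u128::from(max_chunk_size) {
        return Err("unique_bytes exceeds the max_chunk_size bound".to_string());
    }
    if unique_bytes < unique_chunks * u128::from(min_chunk_size) {
        return Err("unique_bytes is below the min_chunk_size bound".to_string());
    }
    Ok(())
}
```

The check does not inspect `total_bytes`, since that field may legitimately be 0.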

§Calculating Deduplication Metrics

§Deduplication Ratio

The deduplication ratio represents what fraction of data is unique:

dedup_ratio = unique_bytes / total_bytes_before_dedup

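Because total_bytes is not tracked, total_bytes_before_dedup (the original data size) is estimated by assuming that duplicate chunks have the same average size as unique chunks:

total_bytes_before_dedup ≈ unique_bytes × (chunk_count / unique_chunk_count)

With this estimate, the ratio reduces to unique_chunk_count / chunk_count, which is the reciprocal of the deduplication factor described below.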

Interpretation:

  • Ratio of 0.5 means 50% of data is unique → 50% storage savings
  • Ratio of 1.0 means 100% unique (no duplicates) → 0% savings
  • Ratio of 0.3 means 30% unique → 70% savings

§Deduplication Factor

An alternative metric is the deduplication factor (multiplicative savings):

dedup_factor = chunk_count / unique_chunk_count

Interpretation:

  • Factor of 1.0x means no deduplication occurred
  • Factor of 2.0x means half of all chunks were duplicates (~50% savings)
  • Factor of 10.0x means 90% of chunks were duplicates

§Average Chunk Size

The realized average chunk size (may differ from 2^f for small datasets):

avg_chunk_size = unique_bytes / unique_chunk_count

§Examples

let stats = CdcStats {
    total_bytes: 0, // Not tracked (estimated below)
    unique_bytes: 50_000_000,
    chunk_count: 10_000,
    unique_chunk_count: 3_000,
};

// Estimate original size
let estimated_original = stats.unique_bytes * stats.chunk_count / stats.unique_chunk_count;
assert_eq!(estimated_original, 166_666_666); // ~167 MB original

// Calculate deduplication ratio (fraction unique)
let dedup_ratio = stats.unique_bytes as f64 / estimated_original as f64;
assert!((dedup_ratio - 0.3).abs() < 0.01); // ~30% unique

// Calculate storage savings (percent eliminated)
let savings_percent = (1.0 - dedup_ratio) * 100.0;
assert!((savings_percent - 70.0).abs() < 1.0); // ~70% savings

// Calculate deduplication factor (compression ratio)
let dedup_factor = stats.chunk_count as f64 / stats.unique_chunk_count as f64;
assert!((dedup_factor - 3.33).abs() < 0.01); // ~3.33x compression

// Average chunk size
let avg_chunk_size = stats.unique_bytes / stats.unique_chunk_count;
println!("Average unique chunk size: {} bytes", avg_chunk_size);
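The example above divides by unique_chunk_count, which panics on integer division if a run produced no chunks (for example, on empty input). A minimal sketch of zero-safe helpers follows. The function names are assumptions for illustration rather than part of the crate's API, and the struct is a local mirror of `CdcStats` so the snippet is self-contained.

```rust
/// Local mirror of the documented struct so this sketch is self-contained.
#[derive(Debug, Clone, Copy)]
pub struct CdcStats {
    pub total_bytes: u64,
    pub unique_bytes: u64,
    pub chunk_count: u64,
    pub unique_chunk_count: u64,
}

/// Hypothetical helper: deduplication factor, or `None` when no chunks were seen.
pub fn dedup_factor(stats: &CdcStats) -> Option<f64> {
    if stats.unique_chunk_count == 0 {
        return None;
    }
    Some(stats.chunk_count as f64 / stats.unique_chunk_count as f64)
}

/// Hypothetical helper: realized average unique chunk size in bytes
/// (truncated), or `None` when no chunks were seen.
pub fn avg_unique_chunk_size(stats: &CdcStats) -> Option<u64> {
    // `checked_div` returns `None` instead of panicking on a zero divisor.
    stats.unique_bytes.checked_div(stats.unique_chunk_count)
}
```

Returning `Option` leaves it to the caller to decide how an empty run should be reported.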

Fields§

§total_bytes: u64

Total bytes processed from the input stream.

Note: This field is currently not tracked during streaming analysis (always set to 0) to avoid memory overhead. It may be populated in future versions if needed for enhanced metrics.

To estimate total bytes, use:

total_bytes ≈ unique_bytes * (chunk_count / unique_chunk_count)
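When this estimate is computed with integers, the order of operations matters: dividing chunk_count by unique_chunk_count first truncates the factor. The sketch below multiplies first in 128-bit arithmetic. `estimate_total_bytes` is a hypothetical helper name, not a function provided by the crate.

```rust
/// Hypothetical helper: estimates the pre-deduplication size in bytes.
/// Returns `None` if `unique_chunk_count` is zero or the result does not fit in `u64`.
pub fn estimate_total_bytes(
    unique_bytes: u64,
    chunk_count: u64,
    unique_chunk_count: u64,
) -> Option<u64> {
    if unique_chunk_count == 0 {
        return None;
    }
    // Multiply before dividing, in u128, to avoid both truncation and overflow.
    let scaled =
        u128::from(unique_bytes) * u128::from(chunk_count) / u128::from(unique_chunk_count);
    if scaled > u128::from(u64::MAX) {
        None
    } else {
        Some(scaled as u64)
    }
}
```

For unique_bytes = 50_000_000, chunk_count = 10_000, and unique_chunk_count = 3_000, this yields 166_666_666, whereas dividing first would give 50_000_000 × 3 = 150_000_000.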
§unique_bytes: u64

Number of unique bytes after deduplication.

This is the sum of sizes of all unique chunks (first occurrence only). Represents the actual storage space required if deduplication is applied.

§Interpretation

  • Lower values indicate higher redundancy in the dataset
  • Equals total data size if no duplicates exist
  • Bounded by unique_chunk_count * min_chunk_size and unique_chunk_count * max_chunk_size
§chunk_count: u64

Total number of chunks identified by FastCDC.

Includes both unique and duplicate chunks. This represents how many chunks would be created if the data were processed through the chunking pipeline.

§Expected Values

For a dataset of size N bytes with average chunk size 2^f:

chunk_count ≈ N / (2^f)

Example: 1 GiB file with f=14 (16 KiB average) → ~65,536 chunks
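The approximation above amounts to a right shift by f. A minimal sketch, assuming the average chunk size is exactly 2^f bytes; `expected_chunk_count` is a hypothetical helper name used only for illustration:

```rust
/// Hypothetical helper: expected number of chunks for `total_bytes` of input
/// when the target average chunk size is 2^f bytes.
/// Returns `None` if 2^f does not fit in a `u64` (f >= 64).
pub fn expected_chunk_count(total_bytes: u64, f: u32) -> Option<u64> {
    // `checked_shl` returns `None` when the shift amount is out of range.
    let avg_chunk_size = 1u64.checked_shl(f)?;
    Some(total_bytes / avg_chunk_size)
}
```

For a 1 GiB input (2^30 bytes) and f = 14, this gives 2^16 = 65,536. The realized chunk_count will typically differ somewhat, since chunk boundaries depend on content.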

§unique_chunk_count: u64

Number of unique chunks after deduplication.

This counts only distinct chunks, identified by comparing a 64-bit hash of each chunk's content (see the Hash Collision Note below). Duplicate chunks are not counted in this total.

§Interpretation

  • unique_chunk_count == chunk_count → No deduplication (all unique)
  • unique_chunk_count << chunk_count → High deduplication (many duplicates)
  • Ratio unique_chunk_count / chunk_count indicates dedup effectiveness (lower means more duplicates)

§Hash Collision Note

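Uses the first 8 bytes of BLAKE3 as a 64-bit hash. Collision probability is negligible for datasets well below the birthday bound of ~2^32 ≈ 4.3 billion unique chunks. A collision would cause two distinct chunks to be counted as one, slightly understating unique_chunk_count and unique_bytes.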
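To put the birthday bound in perspective, the probability of at least one collision among n uniformly distributed 64-bit hashes can be approximated as 1 − exp(−n(n−1) / 2^65). The sketch below evaluates that standard approximation. `collision_probability` is a hypothetical helper, and the uniformity of truncated BLAKE3 output is assumed.

```rust
/// Hypothetical helper: approximate probability that at least two of
/// `unique_chunks` distinct chunks share the same 64-bit hash,
/// assuming hashes are uniformly distributed.
pub fn collision_probability(unique_chunks: u64) -> f64 {
    let n = unique_chunks as f64;
    let space = 2f64.powi(64); // number of possible 64-bit hash values
    // Birthday approximation: p ≈ 1 - exp(-n(n-1) / (2 * space)).
    let exponent = -n * (n - 1.0) / (2.0 * space);
    // `exp_m1` computes e^x - 1 accurately for tiny x, so negate it to get 1 - e^x.
    -exponent.exp_m1()
}
```

Under this approximation, one million unique chunks give a probability on the order of 10^-8, while 2^32 unique chunks give roughly 0.39.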

Trait Implementations§

impl Debug for CdcStats

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.
