pub struct CdcStats {
pub total_bytes: u64,
pub unique_bytes: u64,
pub chunk_count: u64,
pub unique_chunk_count: u64,
}
Statistics from a CDC deduplication analysis run.
This structure captures metrics from a content-defined chunking analysis pass, providing insights into chunk distribution, deduplication effectiveness, and potential storage savings.
§Use Cases
- DCAM Model Input: Feeds into analytical formulas to predict optimal parameters
- Capacity Planning: Estimates actual storage requirements after deduplication
- Performance Analysis: Evaluates chunk size distribution and variance
- Comparison Studies: Benchmarks different parameter configurations
§Invariants
The following relationships always hold for valid statistics:
- unique_chunk_count <= chunk_count (can't have more unique than total)
- unique_bytes <= chunk_count * max_chunk_size (bounded by chunk constraints)
- unique_bytes >= unique_chunk_count * min_chunk_size (minimum size bound)
- total_bytes may be 0 (not tracked in streaming mode)
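A minimal sketch of how these invariants could be checked. Neither the function nor the min_chunk_size/max_chunk_size parameters are part of the crate API; the caller would supply the bounds from the chunker configuration:
fn check_invariants(stats: &CdcStats, min_chunk_size: u64, max_chunk_size: u64) -> bool {
    // Can't have more unique chunks than total chunks.
    stats.unique_chunk_count <= stats.chunk_count
        // Unique bytes are bounded above by total chunks times the max chunk size...
        && stats.unique_bytes <= stats.chunk_count * max_chunk_size
        // ...and below by unique chunks times the min chunk size.
        && stats.unique_bytes >= stats.unique_chunk_count * min_chunk_size
    // total_bytes is deliberately not checked: it may legitimately be 0.
}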
§Calculating Deduplication Metrics
§Deduplication Ratio
The deduplication ratio represents what fraction of data is unique:
dedup_ratio = unique_bytes / total_bytes_before_dedup
Where total_bytes_before_dedup (the original data size) can be estimated from:
total_bytes_before_dedup ≈ unique_bytes × (chunk_count / unique_chunk_count)
Interpretation:
- Ratio of 0.5 means 50% of data is unique → 50% storage savings
- Ratio of 1.0 means 100% unique (no duplicates) → 0% savings
- Ratio of 0.3 means 30% unique → 70% savings
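As a sketch only (these helpers are not part of the crate API and assume unique_chunk_count > 0), the estimate and ratio above can be expressed directly in terms of the fields:
fn estimated_original_bytes(stats: &CdcStats) -> u64 {
    // Scale unique bytes by the ratio of total to unique chunks.
    // Note: for very large datasets this multiplication can overflow u64;
    // widen to u128 if that is a concern.
    stats.unique_bytes * stats.chunk_count / stats.unique_chunk_count
}

fn dedup_ratio(stats: &CdcStats) -> f64 {
    // Fraction of the estimated original data that is unique (0.0..=1.0).
    stats.unique_bytes as f64 / estimated_original_bytes(stats) as f64
}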
§Deduplication Factor
An alternative metric is the deduplication factor (multiplicative savings):
dedup_factor = chunk_count / unique_chunk_count
Interpretation:
- Factor of 1.0x means no deduplication occurred
- Factor of 2.0x means the data deduplicates to half its original size (50% savings)
- Factor of 10.0x means 90% of chunks were duplicates
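A corresponding sketch for the factor, again as a hypothetical free function rather than a method provided by the crate (assumes unique_chunk_count > 0):
fn dedup_factor(stats: &CdcStats) -> f64 {
    // Total chunks seen per unique chunk stored; 1.0 means no duplicates.
    stats.chunk_count as f64 / stats.unique_chunk_count as f64
}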
§Average Chunk Size
The realized average chunk size (may differ from 2^f for small datasets):
avg_chunk_size = unique_bytes / unique_chunk_count
§Examples
let stats = CdcStats {
    total_bytes: 0, // Not tracked in streaming mode (estimated below)
    unique_bytes: 50_000_000,
    chunk_count: 10_000,
    unique_chunk_count: 3_000,
};
// Estimate original size
let estimated_original = stats.unique_bytes * stats.chunk_count / stats.unique_chunk_count;
assert_eq!(estimated_original, 166_666_666); // ~167 MB original
// Calculate deduplication ratio (fraction unique)
let dedup_ratio = stats.unique_bytes as f64 / estimated_original as f64;
assert!((dedup_ratio - 0.3).abs() < 0.01); // ~30% unique
// Calculate storage savings (percent eliminated)
let savings_percent = (1.0 - dedup_ratio) * 100.0;
assert!((savings_percent - 70.0).abs() < 1.0); // ~70% savings
// Calculate deduplication factor (compression ratio)
let dedup_factor = stats.chunk_count as f64 / stats.unique_chunk_count as f64;
assert!((dedup_factor - 3.33).abs() < 0.01); // ~3.33x compression
// Average chunk size
let avg_chunk_size = stats.unique_bytes / stats.unique_chunk_count;
println!("Average unique chunk size: {} bytes", avg_chunk_size);Fields§
§total_bytes: u64
Total bytes processed from the input stream.
Note: This field is currently not tracked during streaming analysis (always set to 0) to avoid memory overhead. It may be populated in future versions if needed for enhanced metrics.
To estimate total bytes, use:
total_bytes ≈ unique_bytes * (chunk_count / unique_chunk_count)
§unique_bytes: u64
Number of unique bytes after deduplication.
This is the sum of sizes of all unique chunks (first occurrence only). Represents the actual storage space required if deduplication is applied.
§Interpretation
- Lower values indicate higher redundancy in the dataset
- Equals total data size if no duplicates exist
- Bounded by unique_chunk_count * min_chunk_size and unique_chunk_count * max_chunk_size
§chunk_count: u64
Total number of chunks identified by FastCDC.
Includes both unique and duplicate chunks. This represents how many chunks would be created if the data were processed through the chunking pipeline.
§Expected Values
For a dataset of size N bytes with average chunk size 2^f:
chunk_count ≈ N / (2^f)
Example: a 1 GB file with f=14 (16 KB average) → ~65,536 chunks
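That figure can be reproduced with a quick calculation (a sketch; f here is the configured average-chunk-size exponent, so the average chunk size is 2^f bytes):
let n: u64 = 1 << 30; // 1 GiB of input data
let f: u32 = 14;      // average chunk size 2^14 = 16 KiB
let expected_chunks = n / (1u64 << f);
assert_eq!(expected_chunks, 65_536);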
§unique_chunk_count: u64
Number of unique chunks after deduplication.
This counts only distinct chunks, identified by hash comparison (see the Hash Collision Note below). Duplicate chunks are not counted in this total.
§Interpretation
- unique_chunk_count == chunk_count → No deduplication (all chunks unique)
- unique_chunk_count << chunk_count → High deduplication (many duplicates)
- Ratio unique_chunk_count / chunk_count indicates dedup effectiveness
§Hash Collision Note
Uses the first 8 bytes of the BLAKE3 digest as a 64-bit hash, providing negligible collision probability (birthday bound at ~2^32 ≈ 4 billion unique chunks).
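One way such a key can be derived, shown as an illustrative sketch using the blake3 crate directly (the little-endian byte order chosen here is an assumption, not necessarily what the analysis pass uses internally):
fn chunk_key(chunk: &[u8]) -> u64 {
    // Hash the chunk contents and keep the first 8 bytes as a 64-bit key.
    let hash = blake3::hash(chunk);
    let mut first8 = [0u8; 8];
    first8.copy_from_slice(&hash.as_bytes()[..8]);
    u64::from_le_bytes(first8)
}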
Trait Implementations§
Auto Trait Implementations§
impl Freeze for CdcStats
impl RefUnwindSafe for CdcStats
impl Send for CdcStats
impl Sync for CdcStats
impl Unpin for CdcStats
impl UnsafeUnpin for CdcStats
impl UnwindSafe for CdcStats
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> Instrument for T
fn instrument(self, span: Span) -> Instrumented<Self>
fn in_current_span(self) -> Instrumented<Self>
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.