
Crate datasynth_fingerprint


DataSynth Fingerprint - Privacy-preserving synthetic data fingerprinting.

This crate provides functionality for:

  • Extracting statistical fingerprints from real data
  • Applying privacy mechanisms (differential privacy, k-anonymity)
  • Storing fingerprints in .dsf files
  • Synthesizing generator configurations from fingerprints
  • Evaluating fidelity of generated data

§Overview

A fingerprint captures the statistical properties of a dataset without storing any individual records, enabling privacy-preserving synthetic data generation.

Real Data → Extract → .dsf File → Generate → Synthetic Data → Evaluate

§Quick Start

§Basic Extraction and Storage

use datasynth_fingerprint::{
    extraction::{FingerprintExtractor, ExtractionConfig},
    io::{FingerprintReader, FingerprintWriter},
    models::PrivacyLevel,
};
use std::path::Path;

// Extract fingerprint from CSV data with standard privacy
let extractor = FingerprintExtractor::new(PrivacyLevel::Standard);
let fingerprint = extractor.extract_from_csv(Path::new("data.csv"))?;

// Write to .dsf file
let writer = FingerprintWriter::new();
writer.write_to_file(&fingerprint, Path::new("output.dsf"))?;

// Read back from .dsf file
let reader = FingerprintReader::new();
let loaded = reader.read_from_file(Path::new("output.dsf"))?;

// Check privacy audit
println!("Epsilon spent: {}", loaded.epsilon_spent());

§Signed Fingerprints

use datasynth_fingerprint::io::{SigningKey, DsfSigner, DsfVerifier};

// Generate a signing key
let key = SigningKey::generate("my-org-key");

// Sign when writing
let signer = DsfSigner::new(key.clone());
writer.write_to_file_signed(&fingerprint, Path::new("signed.dsf"), &signer)?;

// Verify when reading
let verifier = DsfVerifier::new(key);
let verified = reader.read_from_file_verified(Path::new("signed.dsf"), &verifier)?;

§Streaming Extraction for Large Files

use datasynth_fingerprint::extraction::{FingerprintExtractor, ExtractionConfig};
use std::path::Path;

// Configure for streaming (memory-efficient for large files)
let config = ExtractionConfig {
    streaming: true,
    stream_batch_size: 100_000,
    ..ExtractionConfig::default()
};

let extractor = FingerprintExtractor::with_config(config);
let fingerprint = extractor.extract_streaming_csv(Path::new("large_data.csv"))?;

§Config Synthesis

use datasynth_fingerprint::synthesis::{ConfigSynthesizer, SynthesisOptions};

let options = SynthesisOptions {
    scale: 2.0,              // Generate 2x original row count
    seed: Some(42),          // Reproducible generation
    preserve_correlations: true,
    inject_anomalies: true,
};

let synthesizer = ConfigSynthesizer::with_options(options);
let result = synthesizer.synthesize_full(&fingerprint, 42)?;

// result.config_patch - configuration values for generators
// result.copula_generators - for preserving correlations

§Fidelity Evaluation

use datasynth_fingerprint::evaluation::FidelityEvaluator;

let evaluator = FidelityEvaluator::new();
let report = evaluator.evaluate(&original_fingerprint, &synthetic_fingerprint)?;

println!("Overall fidelity: {:.2}", report.overall_score);
println!("Statistical fidelity: {:.2}", report.statistical_fidelity);
println!("Correlation fidelity: {:.2}", report.correlation_fidelity);

§DSF File Format

A .dsf (DataSynth Fingerprint) file is a ZIP archive containing:

File                Format  Description
manifest.json       JSON    Version, checksums, privacy config, optional signature
schema.yaml         YAML    Tables, columns, types, relationships
statistics.yaml     YAML    Distributions, percentiles, Benford analysis
correlations.yaml   YAML    Correlation matrices, copulas (optional)
integrity.yaml      YAML    FK relationships, cardinality (optional)
rules.yaml          YAML    Balance constraints, approval thresholds (optional)
anomalies.yaml      YAML    Anomaly rates, type distribution (optional)
privacy_audit.json  JSON    Privacy decisions, epsilon spent
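
As a rough illustration of the layout above, the sketch below checks an archive listing for the entries the table does not mark as optional. Treating those four entries as mandatory is an assumption drawn from the table, and `REQUIRED` and `missing_required` are hypothetical names, not part of the crate's API. Reading the actual ZIP listing is left out.

```rust
/// Entries the table lists without an "(optional)" marker.
/// Treating them as mandatory is an assumption of this sketch.
const REQUIRED: [&str; 4] = [
    "manifest.json",
    "schema.yaml",
    "statistics.yaml",
    "privacy_audit.json",
];

/// Return the required entries that are absent from an archive listing.
/// `entries` is the list of file names found inside the .dsf archive.
fn missing_required(entries: &[&str]) -> Vec<&'static str> {
    REQUIRED
        .iter()
        .copied()
        .filter(|name| !entries.contains(name))
        .collect()
}
```

A caller would pass the names obtained from whatever ZIP reader it uses; an empty result means every non-optional entry is present.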

§Privacy Levels

The crate supports four privacy levels with different tradeoffs:

Level                   Epsilon  K   Description
PrivacyLevel::Minimal   5.0      3   Low privacy, high utility
PrivacyLevel::Standard  1.0      5   Balanced (default)
PrivacyLevel::High      0.5      10  Higher privacy for sensitive data
PrivacyLevel::Maximum   0.1      20  Maximum privacy, reduced utility
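
To make the table concrete, here is a minimal sketch that encodes the four documented (epsilon, k) pairs. The local `Level` enum and its methods are hypothetical stand-ins for illustration; the real `PrivacyLevel` type lives in `datasynth_fingerprint::models` and may expose these values differently.

```rust
/// Hypothetical mirror of the four documented privacy levels.
/// This is not the crate's `PrivacyLevel`; it only restates the table.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Level {
    Minimal,
    Standard,
    High,
    Maximum,
}

impl Level {
    /// Epsilon budget from the table. A smaller epsilon means more
    /// noise is added and privacy is stronger.
    fn epsilon(self) -> f64 {
        match self {
            Level::Minimal => 5.0,
            Level::Standard => 1.0,
            Level::High => 0.5,
            Level::Maximum => 0.1,
        }
    }

    /// K-anonymity threshold from the table. A larger k suppresses
    /// more rare categorical values.
    fn k(self) -> u32 {
        match self {
            Level::Minimal => 3,
            Level::Standard => 5,
            Level::High => 10,
            Level::Maximum => 20,
        }
    }
}
```

Moving down the table, epsilon shrinks while k grows, so both mechanisms tighten together.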

§Privacy Mechanisms

The fingerprinting process applies multiple privacy mechanisms:

  • Differential Privacy: Laplace noise is calibrated to the sensitivity of each statistic and drawn against a configurable epsilon budget. Composition tracking accounts for the epsilon spent across statistics so the total stays within that budget.

  • K-Anonymity: Categorical values appearing fewer than k times are suppressed to prevent re-identification of rare values.

  • Outlier Handling: Extreme values are winsorized at configurable percentiles to prevent leakage of unusual records.

  • Privacy Audit Trail: Every privacy decision (noise addition, suppression, generalization) is logged in the fingerprint’s privacy_audit field.
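
The first and third mechanisms above can be sketched in a few lines. This is a textbook formulation under stated assumptions, not the crate's implementation: the function names are hypothetical, the uniform draw `u` is passed in by the caller so the sketch needs no RNG dependency, and the winsorization bounds are assumed to have been derived from the configured percentiles beforehand.

```rust
/// Laplace scale b = sensitivity / epsilon for one statistic.
fn laplace_scale(sensitivity: f64, epsilon: f64) -> f64 {
    assert!(epsilon > 0.0, "epsilon must be positive");
    sensitivity / epsilon
}

/// Inverse-CDF Laplace sample from a uniform draw `u` in (-0.5, 0.5).
/// The caller supplies `u` from its own random number generator.
fn laplace_noise(u: f64, scale: f64) -> f64 {
    -scale * u.signum() * (1.0 - 2.0 * u.abs()).ln()
}

/// Winsorize: clamp every value into [lo, hi] so that extreme
/// records cannot dominate a statistic. Requires lo <= hi.
fn winsorize(values: &[f64], lo: f64, hi: f64) -> Vec<f64> {
    values.iter().map(|v| v.clamp(lo, hi)).collect()
}
```

A noisy statistic would then be `true_value + laplace_noise(u, laplace_scale(sensitivity, epsilon))`. Halving epsilon doubles the scale, which is why the stricter privacy levels trade away utility.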

§Supported Data Sources

Source      Method                    Notes
CSV         extract_from_csv()        Auto-infers column types
Parquet     extract_from_parquet()    Preserves type information
JSON/JSONL  extract_from_json()       Array or newline-delimited
Directory   extract_from_directory()  Multi-table fingerprints
Memory      DataSource::Memory        For in-memory data
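
When the input type is not known in advance, a caller might pick the extraction method from the path. The sketch below returns the method name from the table; the helper `method_for` and the file-extension mapping are assumptions for illustration, and the crate may well offer its own dispatch.

```rust
use std::path::Path;

/// Name of the extraction method the table associates with a path.
/// `is_dir` is supplied by the caller so the sketch does not touch
/// the filesystem. The extension mapping is an assumption.
fn method_for(path: &Path, is_dir: bool) -> Option<&'static str> {
    if is_dir {
        return Some("extract_from_directory");
    }
    // Compare extensions case-insensitively; paths without a
    // UTF-8 extension yield None.
    match path.extension()?.to_str()?.to_ascii_lowercase().as_str() {
        "csv" => Some("extract_from_csv"),
        "parquet" => Some("extract_from_parquet"),
        "json" | "jsonl" => Some("extract_from_json"),
        _ => None,
    }
}
```

Returning `None` for unknown extensions leaves the decision to the caller instead of guessing a format.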

§Module Overview

§models - Data Structures

Core data structures for fingerprints:

§io - File I/O

Reading and writing .dsf files:

§extraction - Data Extraction

Extract fingerprints from data sources:

§privacy - Privacy Mechanisms

Privacy-preserving transformations:

  • Laplace noise for differential privacy
  • K-anonymity suppression
  • Privacy budget tracking
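
The last two items can be illustrated with a small sketch. It assumes basic sequential composition, where the epsilons of successive queries simply add up; the source does not say which composition rule the crate uses. `suppress_rare` and `Budget` are hypothetical names, not the module's API.

```rust
use std::collections::HashMap;

/// Count categorical values and drop those seen fewer than `k` times.
fn suppress_rare<'a>(values: &[&'a str], k: usize) -> HashMap<&'a str, usize> {
    let mut counts: HashMap<&'a str, usize> = HashMap::new();
    for v in values {
        *counts.entry(*v).or_insert(0) += 1;
    }
    counts.retain(|_, n| *n >= k);
    counts
}

/// Hypothetical budget tracker under basic sequential composition.
struct Budget {
    total: f64,
    spent: f64,
}

impl Budget {
    fn new(total: f64) -> Self {
        Budget { total, spent: 0.0 }
    }

    /// Reserve `epsilon` for one query. Returns false and spends
    /// nothing if the request would exceed the total budget.
    fn try_spend(&mut self, epsilon: f64) -> bool {
        if self.spent + epsilon > self.total {
            return false;
        }
        self.spent += epsilon;
        true
    }
}
```

A caller would check `try_spend` before releasing each noisy statistic and skip or coarsen the statistic when it returns false.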

§synthesis - Config Synthesis

Convert fingerprints to generator configurations:

§evaluation - Fidelity Evaluation

Evaluate synthetic data quality:

§CLI Integration

The fingerprint crate integrates with the datasynth-data CLI:

# Extract fingerprint from data
datasynth-data fingerprint extract \
    --input ./real_data/ \
    --output ./fingerprint.dsf \
    --privacy-level standard

# Validate fingerprint file
datasynth-data fingerprint validate ./fingerprint.dsf

# Generate from fingerprint
datasynth-data generate \
    --fingerprint ./fingerprint.dsf \
    --output ./synthetic/ \
    --scale 1.0

# Evaluate fidelity
datasynth-data fingerprint evaluate \
    --fingerprint ./fingerprint.dsf \
    --synthetic ./synthetic/

§Re-exports

pub use error::FingerprintError;
pub use error::FingerprintResult;
pub use io::FingerprintReader;
pub use io::FingerprintValidator;
pub use io::FingerprintWriter;
pub use models::Fingerprint;
pub use models::Manifest;
pub use models::PrivacyLevel;
pub use models::PrivacyMetadata;
pub use models::SchemaFingerprint;

§Modules

aggregation   Industry-level aggregation of behavioral priors.
certificates  Synthetic data certificates for proving privacy guarantees.
error         Error types for the fingerprint crate.
evaluation    Fidelity evaluation for synthetic data.
extraction    Extraction engine for fingerprinting.
federated     Federated fingerprint extraction and aggregation.
io            I/O operations for fingerprint files.
models        Fingerprint data models.
privacy       Privacy mechanisms for fingerprint extraction.
synthesis     Config synthesis from fingerprints.