datasynth-fingerprint
Privacy-preserving synthetic data fingerprinting for DataSynth.
Overview
The datasynth-fingerprint crate provides functionality to:
- Extract statistical fingerprints from real data while preserving privacy
- Store fingerprints in portable .dsf (DataSynth Fingerprint) files
- Synthesize generator configurations that produce matching synthetic data
- Evaluate fidelity between synthetic data and source fingerprints
Quick Start
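use datasynth_fingerprint::{/* ... */};
// Extract fingerprint from real data
let extractor = FingerprintExtractor::new(/* ... */);
let fingerprint = extractor.extract_from_csv(/* ... */)?;
// Save to .dsf file
let writer = FingerprintWriter::new(/* ... */);
writer.write_to_file(/* ... */)?;
// Later: Load and synthesize config
let reader = FingerprintReader::new(/* ... */);
let fingerprint = reader.read_from_file(/* ... */)?;
let synthesizer = ConfigSynthesizer::new(/* ... */);
let config_patch = synthesizer.synthesize(/* ... */)?;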
Privacy Features
The crate implements multiple privacy-preserving mechanisms:
Differential Privacy
- Laplace noise is added to statistics based on epsilon budget
- Configurable privacy levels: Minimal, Standard, High, Maximum
use datasynth_fingerprint::{/* ... */};
use PrivacyLevel;
let config = ExtractionConfig::with_privacy_level(/* ... */);
let extractor = FingerprintExtractor::with_config(config);
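To illustrate how Laplace noise scales with the epsilon budget, here is a minimal sketch of the standard Laplace mechanism. It is not the crate's implementation: the function names, the `sensitivity` parameter, and the choice to pass the uniform sample `u` explicitly (so the result is reproducible) are all assumptions made for this example.

```rust
/// Map a uniform sample `u` in (0, 1) to Laplace(0, scale) noise
/// using the inverse CDF. Hypothetical helper, not part of the crate.
fn laplace_from_uniform(u: f64, scale: f64) -> f64 {
    let centered = u - 0.5;
    -scale * centered.signum() * (1.0 - 2.0 * centered.abs()).ln()
}

/// Add Laplace noise with scale = sensitivity / epsilon to a statistic.
/// A smaller epsilon gives a larger scale and therefore more noise.
fn privatize(value: f64, sensitivity: f64, epsilon: f64, u: f64) -> f64 {
    value + laplace_from_uniform(u, sensitivity / epsilon)
}
```

In a real pipeline `u` would come from a secure random number generator rather than being supplied by the caller.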
K-Anonymity
- Rare categorical values are suppressed if they appear fewer than k times
- Default k=5 for Standard privacy level
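The suppression rule can be sketched as a filter over category counts. This is an illustrative sketch only; `suppress_rare` and its return shape are hypothetical and do not describe the crate's internals.

```rust
use std::collections::HashMap;

/// Keep categories seen at least `k` times. Returns the kept counts and
/// the total number of rows whose category was suppressed.
/// Hypothetical helper for illustration.
fn suppress_rare(counts: &HashMap<String, u64>, k: u64) -> (HashMap<String, u64>, u64) {
    let mut kept = HashMap::new();
    let mut suppressed = 0;
    for (value, &count) in counts {
        if count >= k {
            kept.insert(value.clone(), count);
        } else {
            suppressed += count;
        }
    }
    (kept, suppressed)
}
```

With k=5, a category that appears twice would be dropped from the fingerprint and its two rows counted as suppressed.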
Privacy Audit Trail
- All privacy decisions are logged in the fingerprint
- Tracks epsilon spent, suppressions, generalizations
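One way to picture an audit trail that tracks epsilon spending and suppressions is a small log with a fixed budget. The `AuditLog` type below is a hypothetical sketch, not the crate's `PrivacyAudit` structure, and the action string format is invented for the example.

```rust
/// Hypothetical audit log: a fixed epsilon budget plus a list of actions.
#[derive(Debug)]
struct AuditLog {
    epsilon_budget: f64,
    epsilon_spent: f64,
    actions: Vec<String>,
}

impl AuditLog {
    fn new(epsilon_budget: f64) -> Self {
        Self { epsilon_budget, epsilon_spent: 0.0, actions: Vec::new() }
    }

    /// Record a noise addition, refusing it if the budget would be exceeded.
    fn spend(&mut self, target: &str, epsilon: f64) -> Result<(), String> {
        if self.epsilon_spent + epsilon > self.epsilon_budget {
            return Err(format!("epsilon budget exceeded for {target}"));
        }
        self.epsilon_spent += epsilon;
        self.actions.push(format!("noise:{target}:epsilon={epsilon}"));
        Ok(())
    }

    /// Record that `suppressed` rows were dropped for a column.
    fn record_suppression(&mut self, target: &str, suppressed: u64) {
        self.actions.push(format!("suppress:{target}:count={suppressed}"));
    }
}
```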
Supported Data Sources
CSV Files
let fingerprint = extractor.extract_from_csv(/* ... */)?;
Parquet Files
let source = DataSource::Parquet(/* ... */);
let fingerprint = extractor.extract(/* ... */)?;
JSON/JSONL Files
// JSON array format
let source = DataSource::Json(/* ... */);
// JSONL (newline-delimited) format
let source = DataSource::Json(/* ... */);
Directories (Multi-table)
// Extract from all supported files in a directory
let fingerprint = extractor.extract_from_directory(/* ... */)?;
Streaming Extraction (Large Files)
// Memory-efficient extraction for large CSV files
let fingerprint = extractor.extract_streaming_csv(/* ... */)?;
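Memory-efficient extraction generally relies on single-pass (online) statistics that never hold the full column in memory. The sketch below uses Welford's algorithm for the running mean and variance. `RunningStats` is a hypothetical type for illustration, and whether the crate's StreamingNumericStats uses this exact method is an assumption.

```rust
/// Single-pass mean and variance via Welford's algorithm.
/// Hypothetical type, not the crate's StreamingNumericStats.
#[derive(Debug, Default)]
struct RunningStats {
    count: u64,
    mean: f64,
    m2: f64, // running sum of squared deviations from the mean
}

impl RunningStats {
    /// Fold one observation into the running statistics in O(1) memory.
    fn push(&mut self, x: f64) {
        self.count += 1;
        let delta = x - self.mean;
        self.mean += delta / self.count as f64;
        self.m2 += delta * (x - self.mean);
    }

    /// Sample variance, or None with fewer than two observations.
    fn variance(&self) -> Option<f64> {
        if self.count < 2 {
            None
        } else {
            Some(self.m2 / (self.count - 1) as f64)
        }
    }
}
```

Each CSV row would call `push` once per numeric column, so memory use stays constant regardless of file size.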
Fingerprint Components
A fingerprint contains:
| Component | Description |
|---|---|
| manifest | Metadata, version, checksums, privacy config |
| schema | Table structures, column types, relationships |
| statistics | Distributions, percentiles, Benford analysis |
| correlations | Correlation matrices, copulas (optional) |
| integrity | Unique constraints, foreign keys (optional) |
| rules | Business rules, balance equations (optional) |
| anomalies | Anomaly patterns and rates (optional) |
| privacy_audit | Privacy actions and epsilon tracking |
DSF File Format
The .dsf format is a ZIP archive containing:
fingerprint.dsf
├── manifest.json # Version, checksums, privacy config
├── schema.yaml # Table and column definitions
├── statistics.yaml # Distribution parameters
├── correlations.yaml # Correlation matrices (optional)
├── integrity.yaml # Integrity constraints (optional)
├── rules.yaml # Business rules (optional)
├── anomalies.yaml # Anomaly profiles (optional)
└── privacy_audit.json # Privacy audit trail
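Given the layout above, a basic structural check is to confirm that every entry not marked optional is present in the archive. The sketch below works on a plain list of entry names so that it needs no ZIP library. `missing_required` is a hypothetical helper and is not the crate's validate_dsf().

```rust
/// Entries not marked optional in the layout above.
const REQUIRED: [&str; 4] = [
    "manifest.json",
    "schema.yaml",
    "statistics.yaml",
    "privacy_audit.json",
];

/// Return the required entry names that are absent from `entries`.
/// Hypothetical helper for illustration.
fn missing_required(entries: &[&str]) -> Vec<&'static str> {
    REQUIRED
        .iter()
        .copied()
        .filter(|name| !entries.contains(name))
        .collect()
}
```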
Digital Signatures
DSF files can be signed for authenticity verification:
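use datasynth_fingerprint::{/* ... */};
// Generate a signing key
let key = SigningKey::generate(/* ... */);
// Sign when writing
let signer = DsfSigner::new(/* ... */);
writer.write_to_file_signed(/* ... */)?;
// Verify when reading
let verifier = DsfVerifier::new(/* ... */);
let fingerprint = reader.read_from_file_verified(/* ... */)?;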
Config Synthesis
Convert fingerprints to generator configurations:
use datasynth_fingerprint::{/* ... */};
let options = SynthesisOptions { /* ... */ };
let synthesizer = ConfigSynthesizer::with_options(options);
let result = synthesizer.synthesize_full(/* ... */)?;
// result.config_patch - configuration values to apply
// result.copula_generators - for preserving correlations
Fidelity Evaluation
Evaluate how well synthetic data matches the original fingerprint:
use FidelityEvaluator;
let evaluator = FidelityEvaluator::new(/* ... */);
let report = evaluator.evaluate(/* ... */)?;
println!(/* ... */);
println!(/* ... */);
println!(/* ... */);
Statistical Distance Metrics
The fidelity evaluator computes per-column distance metrics:
| Metric | Description |
|---|---|
| KS Statistic | Kolmogorov-Smirnov two-sample test statistic |
| Wasserstein-1 | Earth Mover's Distance via inverse CDF integration (9 percentile knots) |
| JS Divergence | Jensen-Shannon divergence from percentile-bin PMFs (bounded by ln(2)) |
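The three metrics can be sketched with the standard library alone. This is a hedged illustration of the textbook definitions, not the evaluator's code: the function names are hypothetical, the KS version works on raw samples, and the Wasserstein-1 version takes the percentile levels as a parameter because the source does not say which nine knots are used. It integrates only between the first and last knot.

```rust
/// Two-sample KS statistic: largest gap between the empirical CDFs.
/// Assumes both samples are non-empty.
fn ks_statistic(a: &[f64], b: &[f64]) -> f64 {
    let (mut a, mut b) = (a.to_vec(), b.to_vec());
    a.sort_by(|x, y| x.total_cmp(y));
    b.sort_by(|x, y| x.total_cmp(y));
    let (mut i, mut j, mut d) = (0usize, 0usize, 0.0_f64);
    while i < a.len() && j < b.len() {
        let x = a[i].min(b[j]);
        while i < a.len() && a[i] <= x { i += 1; }
        while j < b.len() && b[j] <= x { j += 1; }
        let gap = (i as f64 / a.len() as f64 - j as f64 / b.len() as f64).abs();
        d = d.max(gap);
    }
    d
}

/// Wasserstein-1 as the trapezoidal integral of |Q_a(u) - Q_b(u)|
/// over the given percentile levels (quantile values qa, qb).
fn wasserstein1(levels: &[f64], qa: &[f64], qb: &[f64]) -> f64 {
    let mut total = 0.0;
    for i in 1..levels.len() {
        let left = (qa[i - 1] - qb[i - 1]).abs();
        let right = (qa[i] - qb[i]).abs();
        total += 0.5 * (left + right) * (levels[i] - levels[i - 1]);
    }
    total
}

/// Jensen-Shannon divergence (natural log) between two PMFs of equal length.
fn js_divergence(p: &[f64], q: &[f64]) -> f64 {
    fn kl(p: &[f64], m: &[f64]) -> f64 {
        p.iter()
            .zip(m)
            .filter(|(pi, _)| **pi > 0.0)
            .map(|(pi, mi)| pi * (pi / mi).ln())
            .sum()
    }
    let m: Vec<f64> = p.iter().zip(q).map(|(a, b)| 0.5 * (a + b)).collect();
    0.5 * kl(p, &m) + 0.5 * kl(q, &m)
}
```

For two PMFs with disjoint support, `js_divergence` reaches its upper bound of ln(2), and for identical inputs all three functions return 0.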
Distribution CDFs
The distribution fitter supports CDF computation for fitted distributions:
| Distribution | CDF Method |
|---|---|
| Normal | Standard error function |
| LogNormal | Transform to normal CDF |
| Gamma | Regularized incomplete gamma (Lanczos + Lentz CF) |
| Pareto | 1 - (x_m/x)^alpha |
| PointMass | Step function |
| Mixture | Weighted sum of component CDFs |
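The closed-form rows of this table are simple enough to sketch directly. The example below covers Pareto, PointMass, and Mixture only. The `Dist` enum and `cdf` function are hypothetical names for illustration and do not reflect the crate's DistributionFitter types.

```rust
/// Hypothetical distribution type covering three rows of the table.
enum Dist {
    Pareto { x_m: f64, alpha: f64 },
    PointMass(f64),
    /// (weight, component) pairs; weights are assumed to sum to 1.
    Mixture(Vec<(f64, Dist)>),
}

/// Evaluate the CDF of `d` at `x`.
fn cdf(d: &Dist, x: f64) -> f64 {
    match d {
        // 1 - (x_m / x)^alpha for x >= x_m, and 0 below the scale x_m.
        Dist::Pareto { x_m, alpha } => {
            if x < *x_m { 0.0 } else { 1.0 - (x_m / x).powf(*alpha) }
        }
        // Step function that jumps from 0 to 1 at the point mass.
        Dist::PointMass(v) => if x >= *v { 1.0 } else { 0.0 },
        // Weighted sum of the component CDFs.
        Dist::Mixture(parts) => parts.iter().map(|(w, c)| w * cdf(c, x)).sum(),
    }
}
```

A mixture of a point mass at zero and a Pareto tail is a common shape for monetary columns with many zero entries, which is why the recursive `Mixture` case is included.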
Privacy Levels
| Level | Epsilon | K | Use Case |
|---|---|---|---|
| Minimal | 5.0 | 3 | Low privacy requirements |
| Standard | 1.0 | 5 | Balanced (default) |
| High | 0.5 | 10 | Sensitive data |
| Maximum | 0.1 | 20 | Highly sensitive data |
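The table can be read as a simple mapping from level to parameters. The sketch below encodes it in a hypothetical `Level` enum (deliberately not named PrivacyLevel, since the crate's actual type may differ) and shows how a lower epsilon translates into a larger Laplace noise scale for a given sensitivity, which is itself an assumed parameter.

```rust
/// Hypothetical mirror of the privacy level table.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Level {
    Minimal,
    Standard,
    High,
    Maximum,
}

impl Level {
    /// Epsilon budget from the table.
    fn epsilon(self) -> f64 {
        match self {
            Level::Minimal => 5.0,
            Level::Standard => 1.0,
            Level::High => 0.5,
            Level::Maximum => 0.1,
        }
    }

    /// K-anonymity threshold from the table.
    fn k(self) -> u32 {
        match self {
            Level::Minimal => 3,
            Level::Standard => 5,
            Level::High => 10,
            Level::Maximum => 20,
        }
    }

    /// Laplace noise scale for a statistic with the given sensitivity.
    fn laplace_scale(self, sensitivity: f64) -> f64 {
        sensitivity / self.epsilon()
    }
}
```

For a statistic with sensitivity 1, Maximum yields a noise scale of 10 against 0.2 for Minimal, so stricter levels trade fidelity for privacy.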
API Reference
Core Types
- Fingerprint - Root fingerprint structure
- SchemaFingerprint - Table and column schemas
- StatisticsFingerprint - Numeric and categorical statistics
- CorrelationFingerprint - Correlation matrices and copulas
- PrivacyAudit - Privacy action tracking
Extraction
- FingerprintExtractor - Main extraction coordinator
- DataSource - Data source types (CSV, Parquet, JSON, Directory, Memory)
- ExtractionConfig - Extraction configuration
- StreamingNumericStats/StreamingCategoricalStats - Online statistics
I/O
- FingerprintWriter - Write .dsf files
- FingerprintReader - Read .dsf files
- SigningKey/DsfSigner/DsfVerifier - Digital signatures
- validate_dsf() - Validate .dsf file integrity
Synthesis
- ConfigSynthesizer - Convert fingerprints to configs
- ConfigPatch - Configuration patch values
- CopulaGenerator - Generate correlated samples
- DistributionFitter - Fit distributions to data
Evaluation
- FidelityEvaluator - Compare fingerprints
- FidelityReport - Evaluation results
CLI Integration
The fingerprint crate integrates with the datasynth-data CLI:
# Extract fingerprint from data
# Validate fingerprint file
# Generate from fingerprint
# Evaluate fidelity
License
Same as the parent DataSynth project.