datasynth-core 4.2.1

Core domain models, traits, and distributions for synthetic enterprise data generation
Documentation

datasynth-core

Core domain models, traits, and distributions for synthetic accounting data generation.

Overview

datasynth-core provides the foundational building blocks for the SyntheticData workspace:

  • Domain Models: Journal entries, chart of accounts, master data, documents, anomalies
  • Statistical Distributions: Line item sampling, amount generation, temporal patterns
  • Core Traits: Generator and Sink interfaces for extensibility
  • Template System: File-based templates for regional/sector customization
  • Infrastructure: UUID factory, memory guard, GL account constants

Key Components

Domain Models (models/)

Module Description
journal_entry.rs Journal entry header and balanced line items
chart_of_accounts.rs Hierarchical GL accounts with account types
master_data.rs Enhanced vendors, customers with payment behavior
documents.rs Purchase orders, invoices, goods receipts, payments
temporal.rs Bi-temporal data model for audit trails
anomaly.rs Anomaly types and labels for ML training
internal_control.rs SOX 404 control definitions

Enterprise Process Chain Models (v0.6.0)

Module Description
sourcing/ SourcingProject, RfxEvent, SupplierBid, ProcurementContract, CatalogItem and related procurement models
bank_reconciliation.rs Bank reconciliation statements and matching rules
financial_statements.rs Income statement, balance sheet, cash flow statement models
payroll.rs Payroll runs, pay stubs, deductions, tax withholdings
time_entry.rs Time tracking entries, approval workflows
expense_report.rs Expense reports, line items, receipt matching
production_order.rs Manufacturing production orders and operations
quality_inspection.rs Quality inspection lots, results, defect codes
cycle_count.rs Inventory cycle count programs and variances
sales_quote.rs Sales quotations and quote-to-order conversion
management_kpi.rs Management KPIs and scorecard metrics
budget.rs Budget plans, line items, variance analysis

UUID Factory Extensions (v0.6.0)

The UUID factory (uuid_factory.rs) has been extended with 18 new GeneratorType discriminators (0x28-0x39) covering sourcing, HR, manufacturing, financial reporting, and sales/KPI/budget entities. This ensures collision-free deterministic UUID generation across all new model types.

Statistical Distributions (distributions/)

Module Description
LineItemSampler Empirical distribution (60.68% two-line, 88% even counts)
AmountSampler Log-normal with round-number bias, Benford compliance (legacy path)
AdvancedAmountSampler Enum over LogNormal / Gaussian mixture + Pareto heavy-tailed (v3.4.4+)
LogNormalMixtureSampler / GaussianMixtureSampler Multi-component mixture models
BivariateCopulaSampler Gaussian / Clayton / Gumbel / Frank / Student-t (Gaussian wired at runtime in v3.5.4; others v4.1.0)
ConditionalSampler Breakpoint-based distribution selection; input_field from calendar context
TemporalSampler Seasonality patterns with industry integration
BenfordSampler First-digit distribution following P(d) = log10(1 + 1/d)
IndustryAmountProfile Pre-configured mixtures: retail / manufacturing / financial_services / healthcare / technology
DriftController Regime changes + economic cycles + parameter drifts
TemporalContext (v3.4.1+) Multi-year holiday calendar + business-day calculator bundle
StatisticalValidationReport (v3.5.1+) Benford / chi² / KS goodness-of-fit runners

LLM + Template Infrastructure

Component Description
llm::LlmProvider Trait for LLM backends (Mock, HttpLlmProvider via llm feature)
llm::HttpLlmProvider OpenAI-compatible HTTP client; OpenRouter / Anthropic / OpenAI
templates::TemplateProvider Abstraction for name/description pools
templates::DefaultTemplateProvider Embedded arrays (v4.0 — v4.1.4 migrates to YAML-as-SoT)
templates::LlmTemplateProvider (v4.0.0+) Runtime LLM-backed provider wrapping a base; opt-in per category with in-memory cache
templates::TemplateLoader YAML/JSON load + save + merge; load_from_yaml_str for compile-time bundling

Infrastructure

Component Description
uuid_factory.rs Deterministic FNV-1a hash-based UUID generation, 18 generator-type discriminators
memory_guard.rs / disk_guard.rs / cpu_monitor.rs Resource monitors
resource_guard.rs Unified resource orchestration with graceful degradation
accounts.rs Centralized GL control account numbers
fraud_bias.rs Weekend / round-dollar / off-hours / post-close bias applied to every is_fraud=true entry
templates/ YAML/JSON template loading, merging, LlmTemplateProvider

Usage

use datasynth_core::models::{JournalEntry, JournalEntryLine};
use datasynth_core::distributions::AmountSampler;

// Create a balanced journal entry
let mut entry = JournalEntry::new(header);
entry.add_line(JournalEntryLine::debit("1100", amount, "AR Invoice"));
entry.add_line(JournalEntryLine::credit("4000", amount, "Revenue"));

// Sample realistic amounts
let sampler = AmountSampler::new(seed);
let amount = sampler.sample_benford_compliant(1000.0, 100000.0);

License

Apache-2.0 - See LICENSE for details.