synthdb 0.1.2

Production-grade synthetic data generator for PostgreSQL. Zero config. Single binary.
synthdb-0.1.2 is not a library.

SynthDB

Production-grade synthetic data generator for PostgreSQL
Zero-config. Single binary. Referentially intact.

Built with Rust License: MIT Open Collective

SynthDB generates realistic, referentially-consistent synthetic data for PostgreSQL from your existing schema — fast, local, and privacy-preserving.

Why use SynthDB?

  • Populate staging or CI environments with realistic data without copying production.
  • Preserve foreign-key relationships automatically.
  • Sample real distributions from existing schemas to produce believable values.
  • Run entirely locally — no data leaves your environment.

Quick Start

  1. Export your schema (no rows):
pg_dump -h prod-db.example -U user -d my_app -s > schema.sql
  1. Create a target database and load the schema:
createdb my_staging_db
psql -d my_staging_db < schema.sql
  1. Generate synthetic data:
synthdb clone \
  --url "postgres://user:pass@localhost:5432/my_staging_db" \
  --rows 5000 \
  --output seed.sql
  1. Apply the generated data:
psql -d my_staging_db < seed.sql

Result: a fully-seeded staging environment with realistic users, orders, items, etc. — all referentially intact.

Highlights

  • Zero-config: sensible defaults for most schemas.
  • Referential integrity: builds a dependency DAG and inserts tables in safe order.
  • Smart heuristics: infers types and semantic meanings from column names (emails, phones, SKUs, timestamps).
  • Value sampling: optionally sample from existing distinct values to preserve realistic distributions.
  • Fast: implemented in Rust for native performance.
  • Privacy-first: works locally — no telemetry or data exfiltration by default.

Features

  • Automatic topological ordering based on foreign keys
  • Column "Vibe Engine" for semantic data generation (email, phone, sku, status, timestamps)
  • Optional sampling from existing distinct values to mimic real distributions
  • Single-binary distribution; cross-platform builds via Cargo
  • Pluggable generator heuristics for custom column behaviors

Install

From source (Cargo):

git clone https://github.com/synthdb/synthdb.git
cd synthdb
cargo install --path .

Pre-built binaries may be published on the Releases page.

Usage (CLI)

Common command: clone (inspect schema and generate INSERT statements)

synthdb clone \
  --url "postgres://user:pass@localhost:5432/my_staging_db" \
  --rows 10000 \
  --output seed.sql \
  --sample-percent 20 \
  --concurrency 4

Key flags

  • --url (required) — Postgres connection string for the target schema to inspect
  • --rows — total rows to generate per primary table (defaults vary by mode)
  • --output — path to write SQL INSERTs (stdout if omitted)
  • --sample-percent — percent of distinct values to sample from existing columns (0–100)
  • --concurrency — number of worker threads for generation and sampling
  • --schema — target schema name (defaults to public)
  • --dry-run — analyze schema and print plan without generating data
  • --help — see all options

How it works (high level)

  1. Introspect: read tables, columns, types, and constraints via pg_catalog and information_schema.
  2. Build DAG: create a dependency graph from foreign keys and topologically sort tables.
  3. Profile (optional): sample distinct values and distributions from columns.
  4. Generate: use heuristics and distributions to synthesize values per column.
  5. Emit: produce INSERT statements in an order that satisfies foreign keys.

Column heuristics (Vibe Engine) SynthDB uses column names and types to infer generators:

  • email → realistic emails (local + domain)
  • phone → formatted phone numbers
  • sku, code → uppercase alphanumeric with dashes
  • status → small enumerations (active, pending, failed, etc.)
  • created_at, updated_at → time-decayed timestamps

Heuristics are pluggable — see Contributing for how to add or tune generators.

Sampling real distributions When enabled:

  • SELECT DISTINCT (with safe limits) to collect candidate values
  • Build categorical distributions and sample accordingly This preserves realistic proportions for categorical columns (e.g., product_category, country).

Performance notes

  • Generation is CPU-bound. Increase --concurrency for faster runs.
  • Sampling large distinct sets can be heavy on the DB; use --sample-percent to control load.

Limitations (v0.1)

  • PostgreSQL only (MySQL/SQLite planned for v0.2)
  • Requires an existing schema (does not create tables)
  • Produces SQL INSERT statements only (no truncation or migrations)
  • Very large sampling may be slow or memory-intensive — use limiting flags

Security & Privacy

  • SynthDB is designed for local use. It does not exfiltrate database contents.
  • Prefer role-limited DB users for introspection when running against production-like hosts.

Troubleshooting

  • Permission denied during introspection: ensure the DB user can read pg_catalog and information_schema.
  • Out of memory during sampling: reduce --sample-percent or run on a larger machine.
  • Referential cycles: if your schema has cycles without deferred constraints, run --dry-run to inspect the plan and handle cycles manually.

Examples

  • CI-friendly: generate small dataset and output to stdout
synthdb clone --url "$CI_DATABASE_URL" --rows 1000 > /tmp/ci_seed.sql
  • Generate without sampling real values:
synthdb clone --url "postgres://..." --rows 2000 --sample-percent 0

Development & Contributing Contributions welcome! Typical workflow:

git checkout -b feature/my-feature
# implement and test
git commit -m "Add feature"
git push origin feature/my-feature
# open a pull request

Guidelines

  • Write tests for new generators and sampling logic
  • Keep CLI flags backward-compatible
  • Document heuristics and new flags in README

Suggested repository files to add (if you want me to generate them):

  • CONTRIBUTING.md (contribution guidelines, testing, code style)
  • PR_TEMPLATE.md (pull request template)
  • CODE_OF_CONDUCT.md

Roadmap

  • v0.2: MySQL & SQLite support
  • Native write mode: write directly to a target DB (in addition to SQL files)
  • GUI and VSCode extension for schema previews
  • More generator plugins and community templates

Support & Sponsorship If SynthDB saves you time, consider sponsoring development:

License Distributed under the MIT License — see LICENSE.

Contact