synthdb-0.1.2 is not a library.

SynthDB

Production-grade synthetic data generator for PostgreSQL
Zero-config. Single binary. Referentially intact.

SynthDB generates realistic, referentially-consistent synthetic data for PostgreSQL from your existing schema — fast, local, and privacy-preserving.

Why use SynthDB?

Populate staging or CI environments with realistic data without copying production.
Preserve foreign-key relationships automatically.
Sample real distributions from existing schemas to produce believable values.
Run entirely locally — no data leaves your environment.

Quick Start

Export your schema (no rows):

pg_dump -h prod-db.example -U user -d my_app -s > schema.sql

Create a target database and load the schema:

createdb my_staging_db
psql -d my_staging_db < schema.sql

Generate synthetic data:

synthdb clone \
  --url "postgres://user:pass@localhost:5432/my_staging_db" \
  --rows 5000 \
  --output seed.sql

Apply the generated data:

psql -d my_staging_db < seed.sql

Result: a fully-seeded staging environment with realistic users, orders, items, etc. — all referentially intact.

Highlights

Zero-config: sensible defaults for most schemas.
Referential integrity: builds a dependency DAG and inserts tables in safe order.
Smart heuristics: infers types and semantic meanings from column names (emails, phones, SKUs, timestamps).
Value sampling: optionally sample from existing distinct values to preserve realistic distributions.
Fast: implemented in Rust for native performance.
Privacy-first: works locally — no telemetry or data exfiltration by default.

Features

Automatic topological ordering based on foreign keys
Column "Vibe Engine" for semantic data generation (email, phone, sku, status, timestamps)
Optional sampling from existing distinct values to mimic real distributions
Single-binary distribution; cross-platform builds via Cargo
Pluggable generator heuristics for custom column behaviors

Install

From source (Cargo):

git clone https://github.com/synthdb/synthdb.git
cd synthdb
cargo install --path .

Pre-built binaries may be published on the Releases page.

Usage (CLI)

Common command: clone (inspect schema and generate INSERT statements)

synthdb clone \
  --url "postgres://user:pass@localhost:5432/my_staging_db" \
  --rows 10000 \
  --output seed.sql \
  --sample-percent 20 \
  --concurrency 4

Key flags

--url (required) — Postgres connection string for the target schema to inspect
--rows — total rows to generate per primary table (defaults vary by mode)
--output — path to write SQL INSERTs (stdout if omitted)
--sample-percent — percent of distinct values to sample from existing columns (0–100)
--concurrency — number of worker threads for generation and sampling
--schema — target schema name (defaults to public)
--dry-run — analyze schema and print plan without generating data
--help — see all options

How it works (high level)

Introspect: read tables, columns, types, and constraints via pg_catalog and information_schema.
Build DAG: create a dependency graph from foreign keys and topologically sort tables.
Profile (optional): sample distinct values and distributions from columns.
Generate: use heuristics and distributions to synthesize values per column.
Emit: produce INSERT statements in an order that satisfies foreign keys.

Column heuristics (Vibe Engine) SynthDB uses column names and types to infer generators:

email → realistic emails (local + domain)
phone → formatted phone numbers
sku, code → uppercase alphanumeric with dashes
status → small enumerations (active, pending, failed, etc.)
created_at, updated_at → time-decayed timestamps

Heuristics are pluggable — see Contributing for how to add or tune generators.

Sampling real distributions When enabled:

SELECT DISTINCT (with safe limits) to collect candidate values
Build categorical distributions and sample accordingly This preserves realistic proportions for categorical columns (e.g., product_category, country).

Performance notes

Generation is CPU-bound. Increase --concurrency for faster runs.
Sampling large distinct sets can be heavy on the DB; use --sample-percent to control load.

Limitations (v0.1)

PostgreSQL only (MySQL/SQLite planned for v0.2)
Requires an existing schema (does not create tables)
Produces SQL INSERT statements only (no truncation or migrations)
Very large sampling may be slow or memory-intensive — use limiting flags

Security & Privacy

SynthDB is designed for local use. It does not exfiltrate database contents.
Prefer role-limited DB users for introspection when running against production-like hosts.

Troubleshooting

Permission denied during introspection: ensure the DB user can read pg_catalog and information_schema.
Out of memory during sampling: reduce --sample-percent or run on a larger machine.
Referential cycles: if your schema has cycles without deferred constraints, run --dry-run to inspect the plan and handle cycles manually.

Examples

CI-friendly: generate small dataset and output to stdout

synthdb clone --url "$CI_DATABASE_URL" --rows 1000 > /tmp/ci_seed.sql

Generate without sampling real values:

synthdb clone --url "postgres://..." --rows 2000 --sample-percent 0

Development & Contributing Contributions welcome! Typical workflow:

git checkout -b feature/my-feature
# implement and test
git commit -m "Add feature"
git push origin feature/my-feature
# open a pull request

Guidelines

Write tests for new generators and sampling logic
Keep CLI flags backward-compatible
Document heuristics and new flags in README

Suggested repository files to add (if you want me to generate them):

CONTRIBUTING.md (contribution guidelines, testing, code style)
PR_TEMPLATE.md (pull request template)
CODE_OF_CONDUCT.md

Roadmap

v0.2: MySQL & SQLite support
Native write mode: write directly to a target DB (in addition to SQL files)
GUI and VSCode extension for schema previews
More generator plugins and community templates

Support & Sponsorship If SynthDB saves you time, consider sponsoring development:

Open Collective: https://opencollective.com/synthdb

License Distributed under the MIT License — see LICENSE.

Contact

Issues & PRs: https://github.com/synthdb/synthdb/issues
Sponsorship: https://opencollective.com/synthdb

synthdb 0.1.2

SynthDB