sql-splitter 1.13.0

High-performance CLI tool for splitting large SQL dump files into individual table files

Split large SQL dump files into individual table files. Fast, memory-efficient, multi-dialect.

  • 600+ MB/s throughput on modern hardware
  • MySQL, PostgreSQL, SQLite, MSSQL support (including COPY FROM stdin, GO batches)
  • Compressed files — gzip, bzip2, xz, zstd auto-detected
  • Streaming architecture — handles files larger than RAM
  • 5x faster than shell-based alternatives

Installation

From crates.io

cargo install sql-splitter

From source

git clone https://github.com/helgesverre/sql-splitter
cd sql-splitter
make install  # Installs binary + shell completions + man pages

Or download pre-built binaries from GitHub Releases.

Man pages (optional)

After installation, view documentation with man sql-splitter or man sql-splitter-diff.

For cargo install users, install man pages manually:

git clone https://github.com/helgesverre/sql-splitter
cd sql-splitter
make install-man

Usage

# MySQL/MariaDB dump (default)
sql-splitter split dump.sql -o tables/

# PostgreSQL pg_dump
sql-splitter split pg_dump.sql -o tables/ --dialect=postgres

# SQLite dump
sql-splitter split sqlite.sql -o tables/ --dialect=sqlite

# MSSQL/T-SQL dump (SSMS "Generate Scripts", sqlcmd)
sql-splitter split mssql_dump.sql -o tables/ --dialect=mssql

# Compressed files (auto-detected)
sql-splitter split backup.sql.gz -o tables/
sql-splitter split backup.sql.zst -o tables/

# Split specific tables only
sql-splitter split dump.sql --tables users,posts,orders

# Schema only (CREATE TABLE, indexes, etc.)
sql-splitter split dump.sql -o schema/ --schema-only

# Data only (INSERT/COPY statements)
sql-splitter split dump.sql -o data/ --data-only

# Merge split files back into single dump
sql-splitter merge tables/ -o restored.sql

# Merge specific tables only
sql-splitter merge tables/ -o partial.sql --tables users,orders

# Merge with transaction wrapper
sql-splitter merge tables/ -o restored.sql --transaction

# Analyze without splitting
sql-splitter analyze dump.sql

# Convert between SQL dialects
sql-splitter convert mysql_dump.sql -o postgres_dump.sql --to postgres
sql-splitter convert pg_dump.sql -o mysql_dump.sql --to mysql
sql-splitter convert dump.sql -o sqlite_dump.sql --to sqlite
sql-splitter convert mssql_dump.sql -o mysql_dump.sql --to mysql

# Convert with explicit source dialect
sql-splitter convert dump.sql --from postgres --to mysql -o output.sql
sql-splitter convert dump.sql --from mssql --to postgres -o output.sql

# Validate SQL dump integrity
sql-splitter validate dump.sql

# Validate with strict mode (warnings = errors)
sql-splitter validate dump.sql --strict

# Validate with JSON output for CI
sql-splitter validate dump.sql --json

# Batch operations with glob patterns
sql-splitter validate "dumps/*.sql" --fail-fast
sql-splitter analyze "**/*.sql"
sql-splitter split "*.sql" -o output/
sql-splitter convert "*.sql" --to postgres -o converted/

# Compare two SQL dumps for changes
sql-splitter diff old.sql new.sql

# Diff with schema-only or data-only
sql-splitter diff old.sql new.sql --schema-only
sql-splitter diff old.sql new.sql --data-only

# Diff with JSON or SQL migration output
sql-splitter diff old.sql new.sql --format json -o diff.json
sql-splitter diff old.sql new.sql --format sql -o migration.sql

# Diff with verbose PK samples and ignore timestamp columns
sql-splitter diff old.sql new.sql --verbose --ignore-columns "*.updated_at,*.created_at"

# Override primary key for tables without PK
sql-splitter diff old.sql new.sql --primary-key logs:timestamp+message

# Redact sensitive data using inline patterns
sql-splitter redact dump.sql -o safe.sql --null "*.ssn" --hash "*.email" --fake "*.name"

# Redact using YAML config file
sql-splitter redact dump.sql -o safe.sql --config redact.yaml

# Generate redaction config by analyzing input file
sql-splitter redact dump.sql --generate-config -o redact.yaml

# Reproducible redaction with seed
sql-splitter redact dump.sql -o safe.sql --null "*.password" --seed 42

# Generate ERD (Entity-Relationship Diagram)
sql-splitter graph dump.sql -o schema.html        # Interactive HTML (default)
sql-splitter graph dump.sql -o schema.dot         # Graphviz DOT format
sql-splitter graph dump.sql -o schema.mmd         # Mermaid erDiagram
sql-splitter graph dump.sql -o schema.json        # JSON with full schema details

# Graph with filtering
sql-splitter graph dump.sql --tables "user*,order*" -o filtered.html
sql-splitter graph dump.sql --exclude "log*,audit*" -o clean.html
sql-splitter graph dump.sql --cycles-only         # Only tables in circular dependencies

# Focus on specific table and its relationships
sql-splitter graph dump.sql --table orders --transitive  # Show all dependencies
sql-splitter graph dump.sql --table users --reverse      # Show all dependents

# Reorder SQL dump in topological FK order
sql-splitter order dump.sql -o ordered.sql        # Safe import order
sql-splitter order dump.sql --check               # Check for cycles
sql-splitter order dump.sql --reverse             # Reverse (for DROP operations)

# Query SQL dumps with DuckDB analytics engine
sql-splitter query dump.sql "SELECT COUNT(*) FROM users"
sql-splitter query dump.sql "SELECT * FROM orders WHERE total > 100" -f json
sql-splitter query dump.sql "SELECT * FROM users LIMIT 10" -o results.csv -f csv
sql-splitter query dump.sql --interactive         # Start REPL session
sql-splitter query huge.sql "SELECT ..." --disk   # Use disk mode for large files

# Query with caching for repeated queries
sql-splitter query dump.sql "SELECT ..." --cache  # Cache imported database
sql-splitter query --list-cache                   # Show cached databases
sql-splitter query --clear-cache                  # Clear all cached databases

# Generate shell completions (auto-installed with make install)
sql-splitter completions bash >> ~/.bashrc
sql-splitter completions zsh >> ~/.zshrc
sql-splitter completions fish >> ~/.config/fish/completions/sql-splitter.fish

Shell Completions

Shell completions are automatically installed when using make install. For manual installation:

# Install for current shell only
make install-completions

# Install for all shells (bash, zsh, fish)
make install-completions-all

Why sql-splitter?

sql-splitter is a dump-first, CLI-first tool designed for automation and CI/CD pipelines.

What it's optimized for

| Strength | Description |
|----------|-------------|
| One tool for the workflow | Split → sample → shard → convert → merge in a single binary (see the example below) |
| Works on dump files | No running database or JDBC connection needed (unlike mydumper, Jailer, Condenser) |
| Streaming architecture | 10GB+ dumps with constant memory, 600+ MB/s throughput |
| Multi-dialect + conversion | MySQL, PostgreSQL, SQLite, and MSSQL, including COPY FROM stdin → INSERT |
| FK-aware operations | Sampling and tenant sharding preserve referential integrity |
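
A rough sketch of that end-to-end workflow is shown below; file names, table names, and the tenant_id column are placeholders, not output from a real run:

# 1. Split a large dump into per-table files
sql-splitter split prod_dump.sql -o tables/

# 2. Create a small, FK-consistent sample for local development
sql-splitter sample prod_dump.sql -o dev_sample.sql --percent 5 --preserve-relations --seed 42

# 3. Extract a single tenant's rows
sql-splitter shard prod_dump.sql -o tenant_acme.sql --tenant-column tenant_id --tenant-value acme

# 4. Convert the sample to PostgreSQL
sql-splitter convert dev_sample.sql -o dev_sample_pg.sql --to postgres

# 5. Merge selected per-table files back into one dump, wrapped in a transaction
sql-splitter merge tables/ -o restored.sql --tables users,orders --transaction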

When another tool might be better

  • mydumper — Parallel snapshots from live MySQL/MariaDB databases
  • Jailer — Rich GUI-based FK subsetting with JDBC across 12+ databases
  • sqlglot — Query-level transpilation and AST manipulation (31 dialects)
  • DuckDB — Complex analytical queries over SQL/CSV/JSON/Parquet

See docs/COMPETITIVE_ANALYSIS.md for detailed comparisons.

Options

Split Options

| Flag | Description | Default |
|------|-------------|---------|
| -o, --output | Output directory | output |
| -d, --dialect | SQL dialect: mysql, postgres, sqlite, mssql | auto-detect |
| -t, --tables | Only split these tables (comma-separated) | |
| -p, --progress | Show progress bar | |
| --dry-run | Preview without writing files | |
| --schema-only | Only DDL statements (CREATE, ALTER, DROP) | |
| --data-only | Only DML statements (INSERT, COPY) | |
| --fail-fast | Stop on first error (for glob patterns) | |
| --json | Output results as JSON | |

Input can be a file path or glob pattern (e.g., *.sql, dumps/**/*.sql).

Merge Options

| Flag | Description | Default |
|------|-------------|---------|
| -o, --output | Output SQL file | stdout |
| -d, --dialect | SQL dialect for headers/footers | mysql |
| -t, --tables | Only merge these tables (comma-separated) | all |
| -e, --exclude | Exclude these tables (comma-separated) | |
| --transaction | Wrap in BEGIN/COMMIT transaction | |
| --no-header | Skip header comments | |
| -p, --progress | Show progress bar | |
| --dry-run | Preview without writing files | |
| --json | Output results as JSON | |

Analyze Options

| Flag | Description | Default |
|------|-------------|---------|
| -d, --dialect | SQL dialect: mysql, postgres, sqlite, mssql | auto-detect |
| -p, --progress | Show progress bar | |
| --fail-fast | Stop on first error (for glob patterns) | |
| --json | Output results as JSON | |

Input can be a file path or glob pattern (e.g., *.sql, dumps/**/*.sql).

Convert Options

| Flag | Description | Default |
|------|-------------|---------|
| -o, --output | Output SQL file or directory (required for glob) | stdout |
| --from | Source dialect: mysql, postgres, sqlite, mssql | auto-detect |
| --to | Target dialect: mysql, postgres, sqlite, mssql | required |
| --strict | Fail on any unsupported feature | |
| -p, --progress | Show progress bar | |
| --dry-run | Preview without writing files | |
| --fail-fast | Stop on first error (for glob patterns) | |
| --json | Output results as JSON | |

Input can be a file path or glob pattern (e.g., *.sql, dumps/**/*.sql).

Supported conversions (6 dialect pairs, both directions):

  • MySQL ↔ PostgreSQL (including COPY → INSERT)
  • MySQL ↔ SQLite
  • MySQL ↔ MSSQL
  • PostgreSQL ↔ SQLite
  • PostgreSQL ↔ MSSQL
  • SQLite ↔ MSSQL

Features:

  • 50+ data type mappings
  • AUTO_INCREMENT ↔ SERIAL ↔ INTEGER PRIMARY KEY ↔ IDENTITY
  • PostgreSQL COPY → INSERT with NULL and escape handling
  • Session command stripping
  • Warnings for unsupported features (ENUM, arrays, triggers)
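
As an informal illustration of what those mappings mean in practice (the comments are hand-written expectations based on the mappings above, not captured tool output):

# MySQL input column:        id INT AUTO_INCREMENT PRIMARY KEY
# Expected PostgreSQL form:  id SERIAL PRIMARY KEY
sql-splitter convert mysql_dump.sql -o postgres_dump.sql --from mysql --to postgres

# --strict aborts instead of warning when the dump uses features the target dialect lacks (e.g., ENUM)
sql-splitter convert mysql_dump.sql -o sqlite_dump.sql --to sqlite --strict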

Validate Options

| Flag | Description | Default |
|------|-------------|---------|
| -d, --dialect | SQL dialect: mysql, postgres, sqlite, mssql | auto-detect |
| --strict | Treat warnings as errors (exit 1) | |
| --json | Output results as JSON | |
| --max-rows-per-table | Max rows per table for PK/FK checks (0 = no limit) | 1,000,000 |
| --no-limit | Disable row limit for PK/FK checks | |
| --no-fk-checks | Skip PK/FK data integrity checks | |
| -p, --progress | Show progress bar | |
| --fail-fast | Stop on first error (for glob patterns) | |

Input can be a file path or glob pattern (e.g., *.sql, dumps/**/*.sql).

Validation checks:

  • SQL syntax validation (parser errors)
  • DDL/DML consistency (INSERTs reference existing tables)
  • Encoding validation (UTF-8)
  • Duplicate primary key detection (all dialects)
  • FK referential integrity (all dialects)
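
For example, a CI job might combine --strict and --json so that warnings fail the build and the report can be archived (file names are illustrative):

if ! sql-splitter validate "dumps/*.sql" --strict --fail-fast --json > validate-report.json; then
  echo "SQL dump validation failed; see validate-report.json"
  exit 1
fi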

Sample Options

| Flag | Description | Default |
|------|-------------|---------|
| -o, --output | Output SQL file | stdout |
| -d, --dialect | SQL dialect: mysql, postgres, sqlite, mssql | auto-detect |
| --percent | Sample percentage (1-100) | |
| --rows | Sample fixed number of rows per table | |
| --preserve-relations | Preserve FK relationships | |
| -t, --tables | Only sample these tables (comma-separated) | all |
| -e, --exclude | Exclude these tables (comma-separated) | |
| --root-tables | Explicit root tables for sampling | |
| --include-global | Global table handling: none, lookups, all | lookups |
| --seed | Random seed for reproducibility | random |
| -c, --config | YAML config file for per-table settings | |
| --max-total-rows | Maximum total rows to sample (0 = no limit) | |
| --no-limit | Disable row limit | |
| --strict-fk | Fail if any FK integrity issues detected | |
| --no-schema | Exclude CREATE TABLE statements from output | |
| -p, --progress | Show progress bar | |
| --dry-run | Preview without writing files | |
| --json | Output results as JSON | |
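
Illustrative sample invocations (table names are placeholders):

# 10% of every table, preserving FK relationships, reproducible via seed
sql-splitter sample dump.sql -o sample.sql --percent 10 --preserve-relations --seed 42

# Fixed row count per table, limited to a few tables, keeping lookup tables intact
sql-splitter sample dump.sql -o smoke.sql --rows 1000 --tables users,orders --include-global lookups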

Shard Options

| Flag | Description | Default |
|------|-------------|---------|
| -o, --output | Output SQL file or directory | stdout |
| -d, --dialect | SQL dialect: mysql, postgres, sqlite, mssql | auto-detect |
| --tenant-column | Column name for tenant identification | auto-detect |
| --tenant-value | Single tenant value to extract | |
| --tenant-values | Multiple tenant values (comma-separated) | |
| --root-tables | Explicit root tables with tenant column | |
| --include-global | Global table handling: none, lookups, all | lookups |
| -c, --config | YAML config file for table classification | |
| --max-selected-rows | Maximum rows to select (0 = no limit) | |
| --no-limit | Disable row limit | |
| --strict-fk | Fail if any FK integrity issues detected | |
| --no-schema | Exclude CREATE TABLE statements from output | |
| -p, --progress | Show progress bar | |
| --dry-run | Preview without writing files | |
| --json | Output results as JSON | |
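
Illustrative shard invocations (the tenant column and values are placeholders):

# Extract one tenant, letting the tool detect the tenant column
sql-splitter shard dump.sql -o acme.sql --tenant-value acme

# Extract several tenants with an explicit tenant column, keeping shared lookup tables
sql-splitter shard dump.sql -o region_eu.sql --tenant-column tenant_id --tenant-values acme,globex --include-global lookups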

Diff Options

| Flag | Description | Default |
|------|-------------|---------|
| -o, --output | Output file | stdout |
| -d, --dialect | SQL dialect: mysql, postgres, sqlite, mssql | auto-detect |
| --schema-only | Compare schema only, skip data | |
| --data-only | Compare data only, skip schema | |
| --format | Output format: text, json, sql | text |
| -t, --tables | Only compare these tables (comma-separated) | all |
| -e, --exclude | Exclude these tables (comma-separated) | |
| --max-pk-entries | Max PK entries to track (0 = no limit) | 10,000,000 |
| -v, --verbose | Show sample PK values for added/removed/modified rows | |
| --primary-key | Override PK for tables (format: table:col1+col2) | auto-detect |
| --ignore-order | Ignore column order differences in schema comparison | |
| --ignore-columns | Ignore columns matching glob patterns (e.g., *.updated_at) | |
| --allow-no-pk | Compare tables without PK using all columns as key | |
| -p, --progress | Show progress bar | |

What diff detects:

  • Tables added/removed/modified (columns, types, nullability)
  • Primary key changes
  • Foreign key changes
  • Index changes (CREATE INDEX, inline INDEX/KEY)
  • Rows added/removed/modified (via PK-based comparison)

Output formats:

  • text: Human-readable summary with optional PK samples
  • json: Structured data for automation (includes warnings)
  • sql: Migration script with ALTER/CREATE INDEX/DROP INDEX statements

Graph Options

| Flag | Description | Default |
|------|-------------|---------|
| -o, --output | Output file (html, dot, mmd, json, png, svg, pdf) | stdout |
| --format | Output format: html, dot, mermaid, json | auto-detect |
| -d, --dialect | SQL dialect: mysql, postgres, sqlite, mssql | auto-detect |
| --layout | Layout direction: lr (left-right), tb (top-bottom) | lr |
| -t, --tables | Only include tables matching glob patterns | all |
| -e, --exclude | Exclude tables matching glob patterns | |
| --table | Focus on a specific table | |
| --transitive | Show all dependencies of focused table | |
| --reverse | Show all tables that depend on focused table | |
| --max-depth | Limit traversal depth | unlimited |
| --cycles-only | Only show tables in circular dependencies | |
| --render | Render DOT to PNG/SVG/PDF using Graphviz | |
| -p, --progress | Show progress bar | |
| --json | Output as JSON | |

Output formats:

  • html: Interactive diagram with dark/light theme and a button to copy the Mermaid source
  • dot: Graphviz DOT with ERD-style tables (columns, types, PK/FK markers)
  • mermaid: Mermaid erDiagram syntax
  • json: Full schema with tables, columns, relationships, and stats
  • png/svg/pdf: Rendered image (requires Graphviz dot command)
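
For example, rendering straight to an image (this assumes Graphviz's dot is installed and on PATH):

sql-splitter graph dump.sql -o schema.svg --render
sql-splitter graph dump.sql -o schema.png --render --layout tb --exclude "log*,audit*"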

Order Options

| Flag | Description | Default |
|------|-------------|---------|
| -o, --output | Output SQL file | stdout |
| -d, --dialect | SQL dialect: mysql, postgres, sqlite, mssql | auto-detect |
| --check | Check for cycles and report order (don't write) | |
| --dry-run | Show topological order without writing | |
| --reverse | Reverse order (children before parents, for DROP) | |

Redact Options

| Flag | Description | Default |
|------|-------------|---------|
| -o, --output | Output SQL file | stdout |
| -d, --dialect | SQL dialect: mysql, postgres, sqlite, mssql | auto-detect |
| -c, --config | YAML config file for redaction rules | |
| --generate-config | Analyze input and generate annotated YAML config | |
| --null | Columns to set to NULL (glob patterns, comma-separated) | |
| --hash | Columns to hash with SHA256 (glob patterns) | |
| --fake | Columns to replace with fake data (glob patterns) | |
| --mask | Columns to partially mask (format: pattern=column) | |
| --constant | Column=value pairs for constant replacement | |
| --seed | Random seed for reproducible redaction | random |
| --locale | Locale for fake data (en, de_de, fr_fr, etc.) | en |
| -t, --tables | Only redact specific tables (comma-separated) | all |
| -e, --exclude | Exclude specific tables (comma-separated) | |
| --strict | Fail on warnings (e.g., unsupported locale) | |
| -p, --progress | Show progress bar | |
| --dry-run | Preview without writing files | |
| --json | Output results as JSON | |
| --validate | Validate config only, don't process | |

Redaction strategies:

  • null: Replace value with NULL
  • constant: Replace with fixed value
  • hash: SHA256 hash (deterministic, preserves FK relationships)
  • mask: Partial masking with pattern (*=asterisk, X=keep, #=random digit)
  • fake: Generate realistic fake data (25+ generators)
  • shuffle: Redistribute values within column (preserves distribution)
  • skip: No redaction (passthrough)

Fake data generators:

email, name, first_name, last_name, phone, address, city, state, zip, country, company, job_title, username, url, ip, ipv6, uuid, date, datetime, credit_card, iban, ssn, lorem, paragraph, sentence
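
A typical config-driven flow might look like the following; the YAML contents are whatever --generate-config produces, so only the commands are shown here:

# 1. Let the tool propose redaction rules from the dump
sql-splitter redact dump.sql --generate-config -o redact.yaml

# 2. After editing redact.yaml by hand, check it without processing anything
sql-splitter redact dump.sql --config redact.yaml --validate

# 3. Apply it, with locale-aware fake data and a fixed seed for reproducibility
sql-splitter redact dump.sql -o safe.sql --config redact.yaml --locale de_de --seed 7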

Query Options

| Flag | Description | Default |
|------|-------------|---------|
| -f, --format | Output format: table, json, jsonl, csv, tsv | table |
| -o, --output | Write output to file instead of stdout | stdout |
| -d, --dialect | SQL dialect: mysql, postgres, sqlite, mssql | auto-detect |
| -i, --interactive | Start interactive REPL session | |
| --disk | Use disk-based storage (for large dumps >2GB) | auto |
| --cache | Cache imported database for repeated queries | |
| -t, --tables | Only import specific tables (comma-separated) | all |
| --memory-limit | Memory limit for DuckDB (e.g., "4GB") | |
| --timing | Show query execution time | |
| -p, --progress | Show import progress | |
| --list-cache | List cached databases | |
| --clear-cache | Clear all cached databases | |
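
For instance, a one-off aggregate over a large dump might combine the flags above (the query and file names are placeholders):

sql-splitter query huge.sql "SELECT status, COUNT(*) FROM orders GROUP BY status" \
  --disk --memory-limit "4GB" --cache --timing -f csv -o order_status.csv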

REPL commands:

  • .tables — List all tables
  • .schema [table] — Show schema (all tables or specific table)
  • .describe <table> — Describe a specific table
  • .format <fmt> — Set output format (table, json, csv, tsv)
  • .count <table> — Count rows in a table
  • .sample <table> [n] — Show sample rows (default: 10)
  • .export <file> <query> — Export query results to file
  • .exit — Exit the REPL

Performance

See BENCHMARKS.md for detailed comparisons.

Testing

# Unit tests
cargo test

# Verify against real-world SQL dumps (MySQL, PostgreSQL, WordPress, etc.)
make verify-realworld

AI Agent Integration

sql-splitter includes documentation optimized for AI agents:

Install the skill in Claude Code / Amp:

amp skill add helgesverre/sql-splitter

License

MIT — see LICENSE.md