# Changelog
All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [Unreleased]
## [1.13.0] - 2026-01-28
### Added
- **Expanded benchmark suite with 4 new competitor tools**:
- `mysqldbsplit` (PHP) - The **fastest tool** in the suite at 1238 MB/s throughput
- `mysqldump-splitter` (Rust/Scoopit) - Hierarchical output with gzip support
- `mysql-dump-splitter` (Go/Bekkema) - Include/exclude table filtering
- `extract-mysql-dump` (Python/agroff) - Multi-database extraction
- A total of 10 tools are now benchmarked in the Docker suite
- **JSON schema generation**: Auto-generate JSON schemas from Rust types using `schemars`
- New `schema` subcommand to export schemas for all JSON output formats
- Schemas available for: analyze, convert, graph, merge, redact, sample, shard, split, validate
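A minimal sketch of exporting and exercising one of these schemas. Only the `schema` subcommand itself is documented above; the positional `analyze` argument and redirect-to-file usage are assumptions:

```bash
# Assumed invocation: export the JSON schema for the analyze command's output.
sql-splitter schema analyze > analyze.schema.json
# analyze --json (documented in 1.9.0) should produce output conforming to it.
sql-splitter analyze dump.sql --json > analyze.json
```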
- **Website OG image generation**: Dynamic Open Graph images using Satori
- Automatic OG image generation for documentation pages
- IndexNow integration for faster search engine indexing
- **Profiling and benchmark recipes**: New justfile recipes for performance analysis
- `just profile-medium` / `just profile-large` for CPU profiling
- `just bench-baseline` for criterion benchmarks
### Changed
- **Performance**: Use stack-allocated buffers in parser to reduce heap allocations
- **Error handling**: Improved error handling in validation, sample, and compression modules
- **Dependencies**: Updated multiple dependencies (chrono 0.4.43, jsonschema 0.40, schemars 1.2, clap, serde_json, flate2, indicatif, dirs, rustyline, rand, fake)
### Fixed
- **CI disk space**: Resolved disk space exhaustion in GitHub Actions builds
- **Compression**: Return `Result` from `wrap_reader` for better error handling
## [1.12.6] - 2025-12-27
### Added
- **Full INSERT/COPY statement rewriting for redact command**: The `redact` command now actively modifies SQL data
- Previously, redaction strategies were identified but statements passed through unchanged
- INSERT statements are now parsed, values redacted according to matching strategies, and reconstructed
- PostgreSQL COPY data blocks are now parsed and redacted with proper `\N` NULL markers
- Multi-row INSERT statements are fully supported
- All 7 redaction strategies now work: null, constant, hash, mask, fake, shuffle, skip
- **Dialect-aware value formatting for redact command**: Proper SQL escaping per dialect
- MySQL: Backtick quoting, backslash escape sequences (`\'`, `\n`, `\r`, `\t`)
- PostgreSQL: Double-quote identifiers, `''` escaping, `\N` for NULL in COPY format
- SQLite: Double-quote identifiers, `''` escaping
- MSSQL: Square bracket identifiers `[column]`, `N'...'` for Unicode strings
- **ValueRewriter module**: New `src/redactor/rewriter.rs` for INSERT/COPY transformation
- Converts ParsedValue to RedactValue, applies strategy, formats back to SQL
- Handles PostgreSQL's two-step COPY pattern (header + data block)
- Proper stats tracking during actual redaction
- **20 new redact integration tests**:
- 7 MSSQL-specific tests: null, hash, fake, dry-run, bracket quoting, Unicode strings, reproducibility
- 13 cross-dialect tests: MySQL INSERT, PostgreSQL COPY, SQLite, multi-row, escaping, skip tables
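With the rewriter in place, a run like the following (flags as documented under 1.10.0) now actually transforms matching values instead of passing statements through:

```bash
# INSERT values and COPY data blocks matching these patterns are rewritten in the output.
sql-splitter redact dump.sql -o safe.sql \
  --null "*.ssn" --hash "*.email" --fake "*.name" --seed 42
```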
### Changed
- **Redact command now reports accurate statistics**: Row and column counts reflect actual redaction performed
## [1.12.5] - 2025-12-27
### Added
- **MSSQL production-style output for test data generator**: New CLI flags for production-style MSSQL output
- `--go-separator`: Use GO batch separators instead of semicolons
- `--schema-prefix`: Use `[dbo].` schema prefix on all table references
- `--named-constraints`: Use named CONSTRAINT syntax (e.g., `CONSTRAINT [PK_users] PRIMARY KEY CLUSTERED`)
- `--mssql-production`: Enable all three options at once for production-like output
- `RenderConfig::mssql_production()` method for programmatic use
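A sketch of the combined flag, equivalent to passing the three individual flags listed above:

```bash
# Emit production-style MSSQL: GO separators, [dbo]. prefixes, named constraints.
gen-fixtures --dialect mssql --scale small --mssql-production
```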
- **9 new MSSQL generator integration tests**: Comprehensive tests for generated MSSQL fixtures
- Tests for analyze, split, validate, merge roundtrip, sample command
- Tests for deterministic generation and production-style output
- All 38 MSSQL integration tests now pass
### Changed
- **Updated TEST_DATA_GENERATOR.md documentation**: Added MSSQL dialect examples and updated dialect count
## [1.12.4] - 2025-12-27
### Added
- **MSSQL dialect support for test data generator**: The `gen-fixtures` CLI and `test_data_gen` crate now support MSSQL dialect
- Square bracket `[identifier]` quoting for table and column names
- `INT IDENTITY(1,1)` for auto-increment primary keys
- `NVARCHAR(n)` and `NVARCHAR(MAX)` for Unicode string columns
- `N'...'` prefix for Unicode string literals
- `DATETIME2` for timestamp columns
- `BIT` for boolean columns
- `DECIMAL(10,2)` for monetary values
- Proper `SET ANSI_NULLS ON`, `SET QUOTED_IDENTIFIER ON`, `SET NOCOUNT ON` header
- Both streaming mode (`--rows`) and multi-tenant schema mode (`--scale`) supported
- Usage: `gen-fixtures --dialect mssql --scale small`
## [1.12.3] - 2025-12-27
### Fixed
- **SQLite AUTOINCREMENT stripping for DuckDB**: SQLite dumps with `AUTOINCREMENT` now import correctly
- Previously, tables with `INTEGER PRIMARY KEY AUTOINCREMENT` failed to create in DuckDB
- Now strips `AUTOINCREMENT`, `WITHOUT ROWID`, and `STRICT` table modifiers
- SQLite dumps now import all tables (previously most tables failed to import)
### Added
- **New SQLite DuckDB tests**: Added 2 tests for SQLite-specific syntax stripping
- `test_sqlite_autoincrement_must_succeed` - Verifies AUTOINCREMENT is stripped
- `test_sqlite_without_rowid_must_succeed` - Verifies WITHOUT ROWID is stripped
## [1.12.2] - 2025-12-27
### Fixed
- **MySQL syntax stripping for DuckDB**: Comprehensive stripping of MySQL-specific syntax that was causing import failures
- **UNIQUE KEY constraints**: Now strips `UNIQUE KEY 'name' (col1, col2)` index definitions from CREATE TABLE
- **Regular indexes (KEY)**: Strips `KEY 'name' (col1, col2)`, `FULLTEXT KEY`, and `SPATIAL KEY` index definitions
- **GENERATED columns**: Strips `GENERATED ALWAYS AS (expr) STORED/VIRTUAL` computed columns
- **FOREIGN KEY constraints**: Strips FK constraints entirely (DuckDB enforces them which breaks batch loading)
- **Type conversion regex**: Fixed bug where column names containing type substrings (e.g., 'internal_note' matching 'INT') were incorrectly converted
### Changed
- **Improved real-world MySQL dump compatibility**: Verified against production dumps with complex schemas
- taskflow_production.sql (17MB, 62 tables) - 0 warnings, imports successfully
- db2sheets_prod.sql (17MB, 18 tables) - 0 warnings, imports successfully
- boatflow_latest_2.sql (128MB, 52 tables) - 4 warnings (data quality issues only: invalid `0000-00-00 00:00:00` timestamps)
### Added
- **New strict DuckDB tests**: Added 8 tests verifying MySQL syntax stripping works correctly
- `test_mysql_unique_key_constraint_strict` - Verifies UNIQUE KEY is stripped
- `test_mysql_key_index_constraint_strict` - Verifies KEY indexes are stripped
- `test_mysql_fulltext_key_constraint_strict` - Verifies FULLTEXT KEY is stripped
- `test_mysql_generated_column_strict` - Verifies GENERATED columns are stripped
- `test_mysql_foreign_key_constraint_strict` - Verifies FK constraints are stripped
- `test_mysql_combined_constraints_strict` - Verifies all constraints together
- `test_mysql_column_name_with_type_substring` - Verifies 'internal_note' isn't converted to 'BIGINTernal_note'
- `test_mysql_type_conversion_preserves_identifiers` - Ensures quoted identifiers are preserved
## [1.12.1] - 2025-12-27
### Fixed
- **PostgreSQL COPY performance**: Fixed severe performance regression in `query` command when processing PostgreSQL dumps with COPY data blocks
- Processing time for pagila-data.sql reduced from 15-27 seconds to ~0.03 seconds
- Full pagila dump now processes in ~1.9 seconds (previously could hang or take minutes)
- Root cause: the statement parser was not correctly handling COPY blocks preceded by comments, causing data accumulation issues
- **Single-column COPY data**: Fixed bug where PostgreSQL COPY data with single columns (no tabs) was incorrectly identified as non-COPY data
- `looks_like_copy_data()` now correctly handles single-column tables
- Previously caused COPY data to be lost and subsequent SQL statements to fail
- **COPY schema prefix handling**: Fixed "table does not exist" errors when COPY statements used schema-qualified names (`COPY public.tablename`)
- DuckDB tables are created without schema prefix, COPY now correctly strips `public.` prefix
### Added
- **Expanded test coverage**: Added 16 new PostgreSQL COPY edge case tests
- `test_postgres_copy_with_schema_prefix` - Tests `COPY public.tablename` pattern
- `test_postgres_copy_with_comments` - Tests COPY preceded by pg_dump-style comments
- `test_postgres_copy_only_keyword` - Tests `COPY ONLY` for inherited tables
- `test_postgres_copy_quoted_identifiers` - Tests quoted table/column names
- `test_postgres_copy_empty_values` - Tests empty strings vs NULL handling
- `test_postgres_copy_wide_table` - Tests many-column tables
- `test_postgres_copy_escape_sequences` - Tests `\n`, `\t`, `\\` handling
- `test_postgres_copy_followed_by_statements` - Tests COPY followed by regular SQL
- `test_postgres_copy_multiple_with_comments` - Tests multiple COPY blocks with comments
- `test_postgres_copy_semicolon_in_comment` - Tests semicolons in line comments
- `test_postgres_copy_block_comment_semicolon` - Tests semicolons in block comments
- `test_postgres_copy_all_null_row` - Tests rows with all NULL values
- `test_postgres_copy_unicode` - Tests unicode data preservation
- **Stress tests**: Added large dump stress tests
- `test_postgres_copy_stress_50k_rows` - 50,000 row COPY import
- `test_postgres_multi_table_stress` - 10 tables × 1,000 rows each
- `test_mysql_insert_stress_10k_rows` - 10,000 rows via multi-value INSERTs
- **Memory profiling**: Added `query` and `shard` commands to `scripts/profile-memory.sh`
### Changed
- **Documentation**: Updated AGENTS.md with latest profiling benchmarks and query performance by dialect
## [1.12.0] - 2025-12-26
### Added
- **Query command**: Run SQL analytics on dump files using embedded DuckDB
- `sql-splitter query dump.sql "SELECT COUNT(*) FROM users"` - Execute single query
- `sql-splitter query dump.sql --interactive` - Start REPL session for multiple queries
- `sql-splitter query dump.sql "SELECT * FROM orders" -f json -o results.json` - Export results
- **No external dependencies**: DuckDB is bundled and compiled into the sql-splitter binary
- **Multi-dialect support**: Imports MySQL, PostgreSQL, and SQLite dumps
- **Memory management**: Auto-switches to disk mode for dumps >2GB
- **Persistent caching**: `--cache` saves imported databases for repeated queries
- **Output formats**: table (default), json, jsonl, csv, tsv
- **REPL commands**: `.tables`, `.schema`, `.describe`, `.count`, `.sample`, `.format`, `.export`
- **Table filtering**: `--tables` to import specific tables only
- **Memory limit**: `--memory-limit 4GB` to control DuckDB memory usage
- **Query timing**: `--timing` flag shows execution time
- **Cache management**: `--list-cache` and `--clear-cache` for cache control
- Supports compressed input files (.gz, .bz2, .xz, .zst)
- Command alias: `qy`
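The query flags compose; a sketch that caches the import, restricts it to two tables, and exports timed results as JSON (the particular combination and argument order are illustrative):

```bash
sql-splitter query dump.sql --cache --tables users,orders --timing \
  "SELECT status, COUNT(*) AS n FROM orders GROUP BY status" \
  -f json -o order_status.json
```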
- **DuckDB module**: Reusable query engine infrastructure
- `QueryEngine`: Manages DuckDB connections (in-memory or disk-based)
- `DumpLoader`: Imports SQL dumps with type conversion and dialect handling
- `TypeConverter`: Maps MySQL/PostgreSQL/SQLite types to DuckDB types
- `QueryResultFormatter`: Formats results as table, JSON, CSV, or TSV
- `CacheManager`: Persistent caching with SHA256-based cache keys
## [1.11.0] - 2025-12-26
### Added
- **Graph command**: Generate Entity-Relationship Diagrams (ERD) from SQL dumps
- `sql-splitter graph dump.sql -o schema.html` - Interactive HTML ERD (default)
- `sql-splitter graph dump.sql -o schema.dot` - Graphviz DOT format with full column details
- `sql-splitter graph dump.sql -o schema.mmd --format mermaid` - Mermaid erDiagram syntax
- `sql-splitter graph dump.sql -o schema.json --format json` - JSON with tables, columns, relationships
- **ERD features**: Shows tables with columns, types, PK/FK markers, nullability
- **Filtering**: `--tables` and `--exclude` with glob patterns, `--cycles-only` for circular deps
- **Focus mode**: `--table TABLE --transitive` (dependencies) or `--reverse` (dependents)
- **HTML viewer**: Dark/light theme toggle, copy Mermaid code button, sql-splitter branding
- **Cycle detection**: Tarjan's SCC algorithm finds self-references and circular FK chains
- `--render` to auto-render DOT to PNG/SVG/PDF using Graphviz
- `--layout lr|tb` for horizontal or vertical layout
- Handles large schemas (tested with 281 tables, 3104 columns)
- Command alias: `gr`
- **Order command**: Output SQL dump with tables in topological FK order
- `sql-splitter order dump.sql -o ordered.sql` - Rewrite dump in safe import order
- `sql-splitter order dump.sql --check` - Verify ordering and detect cycles
- `sql-splitter order dump.sql --dry-run` - Show order without rewriting
- `sql-splitter order dump.sql --reverse` - Children before parents (for DROP operations)
- Detects and reports circular dependencies
- Command alias: `ord`
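The graph and order commands pair naturally; a workflow sketch that inspects circular dependencies before rewriting a dump in FK-safe order:

```bash
# Visualize only circular dependencies, then reorder if the check passes.
sql-splitter graph dump.sql -o cycles.html --cycles-only
sql-splitter order dump.sql --check && \
  sql-splitter order dump.sql -o ordered.sql
```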
- **Graph module**: Reusable ERD and graph analysis infrastructure
- `GraphView` with full column info (name, type, PK, FK, nullable, references)
- `TableInfo`, `ColumnInfo`, `EdgeInfo` structs for ERD rendering
- Cycle detection using Tarjan's strongly connected components
- Output formatters: DOT (ERD tables), Mermaid (erDiagram), JSON, HTML
- Integration with existing `SchemaGraph` infrastructure
## [1.10.0] - 2025-12-21
### Added
- **Redact command**: Anonymize sensitive data (PII) in SQL dumps
- `sql-splitter redact dump.sql -o safe.sql --config redact.yaml` - Apply redaction rules from YAML config
- `sql-splitter redact dump.sql -o safe.sql --null "*.ssn" --hash "*.email" --fake "*.name"` - Inline patterns
- `sql-splitter redact dump.sql --generate-config -o redact.yaml` - Auto-generate config by analyzing schema
- **7 redaction strategies**: null, constant, hash, mask, shuffle, fake, skip
- **25+ fake data generators**: email, name, first_name, last_name, phone, address, city, zip, company, ip, uuid, date, credit_card, ssn, lorem, and more
- **Glob pattern matching**: Match columns with patterns like `*.email`, `users.ssn`, `*.password`
- **YAML config support**: Define rules in a config file with defaults, rules, and skip_tables
- **Multi-locale fake data**: Support for 8 locales (en, de_de, fr_fr, zh_cn, zh_tw, ja_jp, pt_br, ar_sa)
- **Reproducible redaction**: `--seed` flag for deterministic fake data
- **Streaming architecture**: Constant memory usage (~87MB for 10GB files)
- **High throughput**: ~230 MB/s on large files
- `--dry-run` flag to preview redaction without writing
- `--json` flag for machine-readable output
- `--strict` flag to fail on warnings
- `--tables` and `--exclude` flags for table filtering
- Supports all 3 dialects (MySQL, PostgreSQL, SQLite)
- Supports compressed input files (.gz, .bz2, .xz, .zst)
- Man page: `man sql-splitter-redact`
- Command alias: `rd`
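A typical workflow sketch using the flags above: generate a starter config by analyzing the schema, then apply it deterministically:

```bash
sql-splitter redact dump.sql --generate-config -o redact.yaml
# Review/edit redact.yaml, then:
sql-splitter redact dump.sql -o safe.sql --config redact.yaml --seed 42
```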
### Changed
- Updated verify-realworld.sh to test redact command on all 25+ real-world dumps
- Updated profile-memory.sh to profile redact command
## [1.9.2] - 2025-12-21
### Added
- **CLI UX improvements**: Enhanced help output and discoverability
- **Help headings**: Options grouped by category (Input/Output, Filtering, Mode, Behavior, Limits, Output)
- **After-help examples**: Each command shows 2-5 practical usage examples
- **Command aliases**: Short aliases for common commands (`sp`, `an`, `mg`, `sa`, `sh`, `cv`, `val`, `df`)
- **Value hints**: Improved shell completion suggestions for file/directory paths
- **Improved completions command**: Detailed installation instructions for bash, zsh, fish, PowerShell
- **Man page generation**: `make man` generates man pages for all commands
- Uses `clap_mangen` for automatic generation from CLI definitions
- Generates `sql-splitter.1` and per-command pages like `sql-splitter-diff.1`
- `make install` now installs man pages automatically
- `make install-man` for standalone man page installation
### Changed
- Updated CLI about text: "High-performance CLI for splitting, merging, converting, and analyzing SQL dump files"
- All commands now print help when run without required arguments
## [1.9.1] - 2025-12-21
### Added
- **Diff command enhanced features**: Extended schema and data comparison capabilities
- `--verbose` flag: Show sample PK values for added/removed/modified rows (up to 100 samples)
- `--primary-key` flag: Override PK for data comparison (format: `table:col1+col2,table2:col`)
- `--ignore-order` flag: Ignore column order differences in schema comparison
- `--ignore-columns` flag: Exclude columns from comparison using glob patterns (e.g., `*.updated_at`)
- `--allow-no-pk` flag: Compare tables without primary key using all columns as key
- **Index diff**: Detect added/removed indexes in schema comparison (inline and standalone CREATE INDEX)
- **Warnings system**: Emit warnings for tables without PK, invalid PK override columns, etc.
- Supports index types: BTREE, HASH, GIN (PostgreSQL), FULLTEXT (MySQL)
- **40 new integration tests** for enhanced diff features across all 3 dialects
- **Added `glob` dependency** (0.3) for column ignore pattern matching
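A sketch combining the new flags; table and column names are illustrative, and the `--primary-key` value follows the `table:col1+col2` format documented above:

```bash
# Composite-PK override, ignore volatile timestamp columns, sample changed rows.
sql-splitter diff old.sql new.sql \
  --primary-key "order_items:order_id+product_id" \
  --ignore-columns "*.updated_at" --verbose
```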
### Changed
- `diff --format sql` now generates `CREATE INDEX` and `DROP INDEX` statements for index changes
## [1.9.0] - 2025-12-21
### Added
- **Diff command**: Compare two SQL dumps and report schema + data differences
- `sql-splitter diff old.sql new.sql` - Compare two SQL dump files
- Schema comparison: tables added/removed, columns added/removed/modified, PK changes, FK changes
- Data comparison: rows added/removed/modified per table with memory-bounded PK tracking
- Output formats: text (human-readable), json (structured), sql (migration script)
- `--schema-only` flag to compare schema only, skip data
- `--data-only` flag to compare data only, skip schema
- `--tables` flag to compare only specific tables
- `--exclude` flag to exclude specific tables
- `--max-pk-entries` flag to limit memory usage for large tables (default: 10M entries ~160MB)
- Supports all 3 dialects (MySQL, PostgreSQL, SQLite)
- Progress bar support with `--progress` flag
- Supports compressed input files (.gz, .bz2, .xz, .zst)
- **17 new integration tests** for diff command (schema diff, data diff, filters, output formats, PostgreSQL COPY)
- **`--json` flag for all commands**: Machine-readable JSON output for automation and CI/CD pipelines
- `split --json`: Output split statistics, table names, and throughput as JSON
- `analyze --json`: Output table statistics, INSERT counts, and byte sizes as JSON
- `merge --json`: Output merge statistics and table list as JSON
- `sample --json`: Output sample mode, statistics, and per-table breakdown as JSON
- `shard --json`: Output tenant info, statistics, and table breakdown as JSON
- `convert --json`: Output conversion statistics and warnings as JSON
- Multi-file operations return aggregated JSON with per-file results
- All JSON output uses `serde_json::to_string_pretty()` for readability
- Progress bars automatically suppressed in JSON mode
- **11 new integration tests** for JSON output validation across all commands
### Fixed
- **PostgreSQL COPY data parsing**: Fixed a bug where data rows in PostgreSQL COPY ... FROM stdin format were not correctly parsed for PK/FK validation and data diff operations
- The parser returns COPY header and data as separate statements; consumers now correctly handle this
- Affects: `diff --data-only` and `validate` commands with PostgreSQL dumps using COPY format
- Now correctly detects duplicate PKs, FK violations, and row changes in COPY data blocks
## [1.8.2] - 2025-12-21
### Changed
- **Sample command memory optimization**: Reduced memory usage by ~98.5% for large files
- 2.9 GB file: 8.2 GB → 114 MB peak RSS
- Uses streaming approach with temp files instead of in-memory accumulation
- Introduced `PkHashSet` (64-bit hashes) for compact primary key tracking
- Both `--percent` and `--rows` modes benefit from optimization
### Added
- **Memory profiling infrastructure**: New profiling script and Makefile targets
- `make profile` - Profile with medium dataset (~50MB)
- `make profile-large` - Profile with large dataset (~250MB)
- `make profile-mega` - Stress test with ~2GB dataset (100 tables × 100k rows)
- `scripts/profile-memory.sh` - Automated profiling for all commands
- Size presets: tiny (1MB), small (5MB), medium (50MB), large (250MB), xlarge (500MB), huge (1GB), mega (2GB)
## [1.8.1] - 2025-12-21
### Added
- **Glob pattern support**: All file-based commands now accept glob patterns for batch processing
- `sql-splitter validate "dumps/*.sql"` - validate multiple files
- `sql-splitter analyze "**/*.sql"` - analyze files recursively
- `sql-splitter split "*.sql" -o output/` - split multiple files to subdirectories
- `sql-splitter convert "*.sql" --to postgres -o converted/` - batch conversion
- Supports `*`, `**`, `?`, and `[abc]` patterns
- Aggregated JSON output for validate command with multi-file results
- **`--fail-fast` flag**: Stop processing on first error when using glob patterns
- Available on: validate, analyze, split, convert commands
- **`--no-limit` flag**: Disable row limits for memory-intensive operations
- `validate --no-limit` - no limit on rows for PK/FK checks
- `sample --no-limit` - disable explosion guard
- `shard --no-limit` - disable memory guard
- Also supports `--max-rows-per-table=0` as equivalent
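A batch sketch combining globs with the new flags (the redirect assumes the aggregated JSON goes to stdout):

```bash
# Validate every dump recursively; stop at the first failing file.
sql-splitter validate "dumps/**/*.sql" --fail-fast --json > report.json
```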
- **Agent Skill**: Added SKILL.md following [agentskills.io](https://agentskills.io) specification
- Supports Amp, Claude Code, VS Code/GitHub Copilot, Cursor, Goose, Letta, OpenCode
- Install via `amp skill add helgesverre/sql-splitter` or manual copy
- Universal installer: `npx ai-agent-skills install sql-splitter --agent <agent>`
- **llms.txt**: Added LLM-friendly documentation with Agent Skill installation instructions
### Changed
- **Multi-dialect PK/FK validation**: Extended PK duplicate detection and FK referential integrity checking to all dialects (MySQL, PostgreSQL, SQLite). Previously MySQL-only.
- PostgreSQL: Supports both INSERT statements and COPY ... FROM stdin format
- SQLite: Reuses MySQL INSERT parser for validation
- 6 new dialect-specific tests for PostgreSQL and SQLite PK/FK validation
- **Improved progress tracking**: Consolidated byte-based progress across all commands
- **Simplified skill structure**: Single `skills/sql-splitter/` directory (removed duplicate)
## [1.8.0] - 2025-12-21
### Added
- **Validate command**: Check SQL dump integrity with comprehensive validation
- `sql-splitter validate dump.sql` - Validate a SQL dump file
- `--dialect` flag to specify dialect (mysql, postgres, sqlite - auto-detected if not specified)
- `--strict` flag to treat warnings as errors (non-zero exit on any warning)
- `--json` flag to output results as JSON for programmatic parsing
- `--max-rows-per-table` flag to limit memory usage for PK/FK checks (default: 1,000,000)
- `--no-fk-checks` flag to disable heavy data integrity checks
- `--progress` flag to show validation progress
- Supports compressed input files (.gz, .bz2, .xz, .zst)
- **Validation checks**:
- **SQL syntax validation**: Detects parser errors and malformed statements
- **DDL/DML consistency**: Finds INSERTs referencing tables with no CREATE TABLE
- **Encoding validation**: Warns on invalid UTF-8 byte sequences
- **Duplicate primary key detection** (MySQL only): Finds rows with duplicate PKs
- **FK referential integrity** (MySQL only): Detects FK violations where child references missing parent
- **Output formats**:
- Human-readable text output with colored severity levels
- JSON output for CI/automation integration
- Detailed summary with per-check status
- 32 integration tests for validate command including:
- Basic validation tests (valid dump, missing tables, duplicate PK, FK violations)
- test_data_gen fixture tests for realistic multi-table scenarios
- Split→Merge→Validate roundtrip tests for all 3 dialects
- Edge case tests (empty files, comments-only, composite PKs/FKs, self-referential FKs)
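A CI-oriented sketch: under `--strict` the exit code is non-zero on any warning, so the JSON report can be surfaced only on failure:

```bash
sql-splitter validate dump.sql --strict --json > validation.json \
  || { cat validation.json; exit 1; }
```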
### Changed
- Added `serde_json` dependency for JSON output support
## [1.7.0] - 2025-12-21
### Added
- **Convert command**: Convert SQL dumps between MySQL, PostgreSQL, and SQLite dialects
- Supports all 6 conversion pairs:
- MySQL → PostgreSQL, MySQL → SQLite
- PostgreSQL → MySQL, PostgreSQL → SQLite
- SQLite → MySQL, SQLite → PostgreSQL
- `--from` flag to specify source dialect (auto-detected if not specified)
- `--to` flag to specify target dialect (required)
- `--strict` flag to fail on any unsupported feature
- `--dry-run` flag to preview conversion statistics
- `--progress` flag to show conversion progress
- Supports compressed input files (.gz, .bz2, .xz, .zst)
- **Identifier quoting conversion**: Backticks ↔ double quotes between dialects
- **String escape normalization**: MySQL `\'` → PostgreSQL/SQLite `''`
- **Data type mapping** (bidirectional):
- MySQL → PostgreSQL: `TINYINT(1)` → `BOOLEAN`, `AUTO_INCREMENT` → `SERIAL`, `LONGTEXT` → `TEXT`, `BLOB` → `BYTEA`, `DATETIME` → `TIMESTAMP`, `JSON` → `JSONB`
- MySQL → SQLite: All types → SQLite's affinity types (INTEGER, REAL, TEXT, BLOB)
- PostgreSQL → MySQL: `SERIAL` → `AUTO_INCREMENT`, `BYTEA` → `LONGBLOB`, `BOOLEAN` → `TINYINT(1)`, `JSONB` → `JSON`, `UUID` → `VARCHAR(36)`, `TIMESTAMPTZ` → `DATETIME`
- PostgreSQL → SQLite: `SERIAL` → `INTEGER`, `BYTEA` → `BLOB`, `DOUBLE PRECISION` → `REAL`
- SQLite → MySQL: `REAL` → `DOUBLE`
- SQLite → PostgreSQL: `REAL` → `DOUBLE PRECISION`, `BLOB` → `BYTEA`
- **Session command handling**: Strips dialect-specific session commands during conversion
- MySQL: `SET NAMES`, `SET FOREIGN_KEY_CHECKS`, `LOCK TABLES`, conditional comments
- PostgreSQL: `SET client_encoding`, `SET search_path`, etc.
- SQLite: `PRAGMA` statements
- **Warning system**: Reports unsupported features
- MySQL: ENUM, SET, UNSIGNED, FULLTEXT indexes
- PostgreSQL: Array types, INHERITS, PARTITION BY (to SQLite)
- 37 new unit tests for converter module
- 15 new integration tests for convert command
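A sketch of a typical convert run: preview statistics with `--dry-run`, then convert with the source dialect auto-detected (the `-o` output flag is assumed per the batch examples under 1.8.1):

```bash
sql-splitter convert dump.sql --to postgres --dry-run
sql-splitter convert dump.sql --to postgres -o dump_pg.sql
```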
### Changed
- **Test organization overhaul**: Extracted all 235 inline tests from source files to dedicated test files in `tests/` directory
- Created 10 new unit test files: `parser_unit_test.rs`, `convert_unit_test.rs`, `sample_unit_test.rs`, `shard_unit_test.rs`, `schema_unit_test.rs`, `splitter_unit_test.rs`, `merger_unit_test.rs`, `analyzer_unit_test.rs`, `writer_unit_test.rs`, `cmd_unit_test.rs`
- Moved `parser_edge_cases_test.rs` (83 tests) from src/ to tests/
- Source files now contain only production code (zero inline tests)
- **Convert command enhancements**:
- PostgreSQL COPY → INSERT conversion with batched inserts (100 rows per INSERT)
- NULL marker handling (`\N` → NULL) and escape sequence conversion
- PostgreSQL type cast stripping (::regclass, ::text, ::character varying)
- Schema prefix removal (public.table → table)
- DEFAULT now() → DEFAULT CURRENT_TIMESTAMP conversion
- nextval() sequence removal
- PostgreSQL-only feature filtering with warnings (CREATE DOMAIN/TYPE/FUNCTION/SEQUENCE)
- TIMESTAMP WITH TIME ZONE → DATETIME conversion
- Block comment handling at statement start
- Test suite expanded from 359 to 429 tests
## [1.6.0] - 2025-12-20
### Added
- **Shard command**: Extract tenant-specific data from multi-tenant SQL dumps
- `sql-splitter shard dump.sql -o tenant_5.sql --tenant-value 5`
- `--tenant-column` flag to specify tenant column (auto-detected if not specified)
- Automatic tenant column detection (looks for company_id, tenant_id, organization_id, etc.)
- FK-ordered tenant selection that follows FK chains to include dependent data
- Table classification: tenant-root, tenant-dependent, junction, lookup, system
- `--root-tables` flag for explicit root table specification
- `--include-global` flag to control lookup table handling (none, lookups, all)
- `--config` flag for YAML-based table classification overrides
- `--max-selected-rows` memory guard (default 10M rows)
- `--strict-fk` flag to fail on FK integrity issues
- `--no-schema` flag to exclude CREATE TABLE statements
- `--dry-run` flag to preview sharding statistics
- Supports MySQL, PostgreSQL (COPY format), and SQLite dialects
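A shard sketch pulling one tenant with lookup tables included, previewing before writing; combining `--dry-run` with `-o` here is illustrative:

```bash
sql-splitter shard dump.sql -o tenant_5.sql --tenant-value 5 \
  --include-global lookups --dry-run
# Drop --dry-run to write the shard for real.
```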
- **Shard YAML configuration**: Per-table classification overrides
- Tenant column and root table configuration
- Table role overrides (tenant-root, junction, lookup, system)
- Self-FK column specification for hierarchical tables (future use)
- 9 new unit tests for shard command
### Changed
- Extended shared infrastructure from v1.5.0 for tenant extraction use case
- Schema graph reused for FK dependency analysis
## [1.5.0] - 2025-12-20
### Added
- **Sample command**: Create reduced datasets from large SQL dumps for development/testing
- `sql-splitter sample dump.sql -o dev.sql --percent 10` - Sample 10% of rows
- `sql-splitter sample dump.sql -o dev.sql --rows 1000` - Sample up to 1000 rows per table
- `--preserve-relations` flag for FK-aware sampling that maintains referential integrity
- `--root-tables` flag to start sampling from specific tables
- `--include-global` flag to control lookup/global table handling (none, lookups, all)
- `--tables` and `--exclude` flags for table filtering
- `--seed` flag for reproducible sampling
- `--config` flag for YAML-based per-table sampling strategies
- `--max-total-rows` explosion guard
- `--strict-fk` flag to fail on FK integrity issues
- `--no-schema` flag to exclude CREATE TABLE statements
- `--dry-run` flag to preview sampling statistics
- Supports MySQL, PostgreSQL (COPY format), and SQLite dialects
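A sketch of FK-aware sampling, made reproducible via `--seed`:

```bash
sql-splitter sample dump.sql -o dev.sql --percent 10 \
  --preserve-relations --seed 42
```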
- **Schema graph module**: FK dependency analysis with topological sorting
- `src/schema/` module for DDL parsing and FK extraction
- Supports all three SQL dialects (backticks, double quotes, unquoted identifiers)
- Cycle detection for self-referential and circular FK relationships
- **Row parsing module**: Extract individual rows from INSERT and COPY statements
- `src/parser/mysql_insert.rs` for MySQL INSERT parsing with PK/FK extraction
- `src/parser/postgres_copy.rs` for PostgreSQL COPY format parsing
- **YAML configuration**: Per-table sampling strategies via config file
- Table classification (root, lookup, system, junction, normal)
- Per-table percent or row count overrides
- Skip specific tables
- 15 new integration tests for sample command across all dialects
- 40+ new unit tests for schema parsing, row parsing, and reservoir sampling
### Changed
- DDL parsing now supports PostgreSQL and SQLite identifier quoting
- Column type parsing extended with PostgreSQL types (SERIAL, TIMESTAMPTZ, BYTEA, etc.)
- Test suite expanded from 130 to 359 tests
### Fixed
- PostgreSQL COPY data now correctly converts to INSERT VALUES format in output
## [1.4.0] - 2025-12-20
### Added
- **Merge command**: Combine split SQL files back into a single dump file
- `sql-splitter merge tables/ -o restored.sql`
- `--tables` flag to merge specific tables only
- `--exclude` flag to skip certain tables
- `--transaction` flag to wrap output in BEGIN/COMMIT
- `--dialect` flag for MySQL, PostgreSQL, or SQLite output format
- Streaming concatenation with 256KB buffers
- Progress bar support with `--progress` flag
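A sketch restoring a split directory into a single transactional dump:

```bash
sql-splitter merge tables/ -o restored.sql --transaction --dialect mysql
```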
- **Test data generator**: New `test_data_gen` crate for generating synthetic SQL fixtures
- Deterministic, seed-based generation for reproducible tests
- Multi-tenant schema with 16 tables (FK chains, self-refs, junctions, global tables)
- Three scales: small (~500 rows), medium (~10K), large (~200K)
- MySQL, PostgreSQL (COPY + INSERT), and SQLite output
- CLI tool: `gen-fixtures --dialect mysql --scale small --seed 42`
- **New merger module**: Reusable `Merger` struct in `src/merger/` for programmatic use
- **Static test fixtures**: Edge case SQL files for MySQL, PostgreSQL, SQLite in `tests/fixtures/static/`
- 7 new tests for merge functionality
### Changed
- Project now uses Cargo workspace with `crates/test_data_gen/` subcrate
- Expanded test suite from 123 to 130 unit tests
## [1.3.1] - 2025-12-20
### Added
- **Animated progress bar**: New progress display using `indicatif` crate with:
- Animated spinner (⠋⠙⠹⠸⠼⠴⠦⠧⠇⠏)
- Visual progress bar with gradient characters (█▓▒░)
- Elapsed time, bytes processed, and percentage display
- Smooth 100ms tick rate animation
- Benchmark script now supports `--progress` flag to test progress bar performance
### Fixed
- Locale-related formatting issue in benchmark summary output
## [1.3.0] - 2025-12-20
### Added
- **Compressed file support**: Automatically decompress gzip (`.gz`), bzip2 (`.bz2`), xz (`.xz`), and zstd (`.zst`) files during processing — no manual decompression needed
- **Schema/Data filtering**: New `--schema-only` and `--data-only` flags to extract only DDL or DML statements
- `--schema-only`: Only CREATE TABLE, CREATE INDEX, ALTER TABLE, DROP TABLE
- `--data-only`: Only INSERT and COPY statements
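These filters compose with compressed input; a sketch extracting only DDL from a gzipped dump (showing the flag on `split` is an assumption about which commands accept it):

```bash
sql-splitter split dump.sql.gz -o ddl_only/ --schema-only
```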
- **Shell completions**: New `completions` subcommand generates completions for bash, zsh, fish, elvish, and PowerShell
- `sql-splitter completions bash >> ~/.bashrc`
- **Automatic shell completion installation**: `make install` now automatically installs shell completions for the detected shell
- New Makefile targets:
- `make install-completions`: Install completions for current shell only
- `make install-completions-all`: Install completions for bash, zsh, and fish
- New `scripts/install-completions.sh` script for flexible completion installation
- 4 new tests for compression and content filtering
### Changed
- Updated README with improved installation instructions and shell completions section
- Expanded test suite from 119 to 123 unit tests
## [1.2.1] - 2025-12-20
### Fixed
- **PostgreSQL dollar-quoting bug**: Fixed parser getting stuck when SQL files contained mixed dollar-quote tags (e.g., `$_$` followed by `$$`). The parser now validates that dollar-quote tags are syntactically valid (empty or identifier-like `[A-Za-z_][A-Za-z0-9_]*`)
- **DROP TABLE IF EXISTS**: Fixed table name extraction for `DROP TABLE IF EXISTS` statements
### Added
- Real-world SQL verification script (`scripts/verify-realworld.sh`) that tests against 25 public SQL dumps
- `make verify-realworld` target for running real-world verification tests
- 32 new edge case tests for PostgreSQL, MySQL, SQLite, and cross-dialect parsing
- Tests for dollar-quoting, schema-qualified tables, IF EXISTS/IF NOT EXISTS clauses
### Changed
- Expanded test suite from 87 to 119 unit tests
## [1.2.0] - 2025-12-20
### Added
- **Automatic dialect detection**: The `--dialect` flag is now optional. When omitted, sql-splitter automatically detects the SQL dialect by analyzing the first 8KB of the file
- Uses weighted scoring to identify PostgreSQL, MySQL/MariaDB, or SQLite formats
- Reports detection confidence level (high, medium, low)
- Detects MySQL/MariaDB by: header comments, conditional comments (`/*!40...`), `LOCK TABLES`, backticks
- Detects PostgreSQL by: pg_dump header, `COPY ... FROM stdin`, `search_path`, dollar-quoting, `CREATE EXTENSION`
- Detects SQLite by: header comments, `PRAGMA`, `BEGIN TRANSACTION`
- Defaults to MySQL when no markers are found
### Changed
- `--dialect` flag is now optional for both `split` and `analyze` commands
- Added 16 new tests for dialect detection
### Documentation
- Updated docs/DIALECT_AUTODETECTION.md with implementation details
## [1.1.0] - 2025-12-20
### Added
- **Multi-dialect support**: Now supports MySQL, PostgreSQL, and SQLite dump formats
- `--dialect=mysql` (default): MySQL/MariaDB mysqldump format
- `--dialect=postgres`: PostgreSQL pg_dump format with COPY FROM stdin support
- `--dialect=sqlite`: SQLite .dump format
- **PostgreSQL COPY statement support**: Properly handles `COPY table FROM stdin` blocks
- **Dollar-quoting support**: PostgreSQL dollar-quoted strings (`$$`, `$tag$`) are correctly parsed
- **Comprehensive benchmarks**: Added benchmark suite comparing against 6 competitor tools
- **Test data generator**: Python script for generating synthetic mysqldump files
### Changed
- Improved table name extraction for different identifier quoting styles
- Updated documentation with multi-dialect examples and benchmark results
### Fixed
- Correct handling of double-quote identifiers in PostgreSQL/SQLite
- Edge cases with escaped characters in different SQL dialects
## [1.0.0] - 2025-12-20
### Added
- Initial release of sql-splitter CLI tool (Rust rewrite of the Go version)
- **`split` command**: Split large SQL dump files into individual table files
- `--output, -o`: Specify output directory (default: `output`)
- `--verbose, -v`: Enable verbose output
- `--progress, -p`: Show progress during processing
- `--tables, -t`: Filter to split only specific tables (comma-separated)
- `--dry-run`: Preview what would be split without writing files
- **`analyze` command**: Analyze SQL files and display statistics
- `--progress, -p`: Show progress during analysis
- **High-performance streaming parser**
- 400-500 MB/s typical throughput (1.25x faster than Go version on large files)
- Memory-efficient: ~80 MB constant usage regardless of file size
- Handles strings with escaped characters and multi-line statements
- Adaptive buffer sizing based on file size
- **Concurrent writing**: Efficient multi-table writing with writer pools
- **Statement type support**: CREATE TABLE, INSERT INTO, CREATE INDEX, ALTER TABLE, DROP TABLE
- **Version flag**: `--version` to display version information
### Performance
- Streaming architecture handles files larger than available RAM
- Zero-copy parsing using byte slices where possible
- Pre-compiled regexes for fast pattern matching
- Optimized buffer sizes for CPU cache efficiency
- 1.25x faster than Go version on files >1GB
### Documentation
- Comprehensive README with usage examples
- AGENTS.md for AI assistant guidance
- Performance benchmarks and comparison with Go/PHP versions