sql-splitter 1.8.0

High-performance CLI tool for splitting large SQL dump files into individual table files
Documentation
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [1.8.0] - 2025-12-21

### Added

- **Validate command**: Check SQL dump integrity with comprehensive validation
  - `sql-splitter validate dump.sql` - Validate a SQL dump file
  - `--dialect` flag to specify dialect (mysql, postgres, sqlite - auto-detected if not specified)
  - `--strict` flag to treat warnings as errors (non-zero exit on any warning)
  - `--json` flag to output results as JSON for programmatic parsing
  - `--max-rows-per-table` flag to limit memory usage for PK/FK checks (default: 1,000,000)
  - `--no-fk-checks` flag to disable heavy data integrity checks
  - `--progress` flag to show validation progress
  - Supports compressed input files (.gz, .bz2, .xz, .zst)
- **Validation checks**:
  - **SQL syntax validation**: Detects parser errors and malformed statements
  - **DDL/DML consistency**: Finds INSERTs referencing tables with no CREATE TABLE
  - **Encoding validation**: Warns on invalid UTF-8 byte sequences
  - **Duplicate primary key detection** (MySQL only): Finds rows with duplicate PKs
  - **FK referential integrity** (MySQL only): Detects FK violations where child references missing parent
- **Output formats**:
  - Human-readable text output with colored severity levels
  - JSON output for CI/automation integration
  - Detailed summary with per-check status
- 32 integration tests for validate command including:
  - Basic validation tests (valid dump, missing tables, duplicate PK, FK violations)
  - test_data_gen fixture tests for realistic multi-table scenarios
  - Split→Merge→Validate roundtrip tests for all 3 dialects
  - Edge case tests (empty files, comments-only, composite PKs/FKs, self-referential FKs)

### Changed

- Added `serde_json` dependency for JSON output support

## [1.7.0] - 2025-12-21

### Added

- **Convert command**: Convert SQL dumps between MySQL, PostgreSQL, and SQLite dialects
  - Supports all 6 conversion pairs:
    - MySQL → PostgreSQL, MySQL → SQLite
    - PostgreSQL → MySQL, PostgreSQL → SQLite
    - SQLite → MySQL, SQLite → PostgreSQL
  - `--from` flag to specify source dialect (auto-detected if not specified)
  - `--to` flag to specify target dialect (required)
  - `--strict` flag to fail on any unsupported feature
  - `--dry-run` flag to preview conversion statistics
  - `--progress` flag to show conversion progress
  - Supports compressed input files (.gz, .bz2, .xz, .zst)
- **Identifier quoting conversion**: Backticks ↔ double quotes between dialects
- **String escape normalization**: MySQL `\'` → PostgreSQL/SQLite `''`
- **Data type mapping** (bidirectional):
  - MySQL → PostgreSQL: `TINYINT(1)``BOOLEAN`, `AUTO_INCREMENT``SERIAL`, `LONGTEXT``TEXT`, `BLOB``BYTEA`, `DATETIME``TIMESTAMP`, `JSON``JSONB`
  - MySQL → SQLite: All types → SQLite's affinity types (INTEGER, REAL, TEXT, BLOB)
  - PostgreSQL → MySQL: `SERIAL``AUTO_INCREMENT`, `BYTEA``LONGBLOB`, `BOOLEAN``TINYINT(1)`, `JSONB``JSON`, `UUID``VARCHAR(36)`, `TIMESTAMPTZ``DATETIME`
  - PostgreSQL → SQLite: `SERIAL``INTEGER`, `BYTEA``BLOB`, `DOUBLE PRECISION``REAL`
  - SQLite → MySQL: `REAL``DOUBLE`
  - SQLite → PostgreSQL: `REAL``DOUBLE PRECISION`, `BLOB``BYTEA`
- **Session command handling**: Strips dialect-specific session commands during conversion
  - MySQL: `SET NAMES`, `SET FOREIGN_KEY_CHECKS`, `LOCK TABLES`, conditional comments
  - PostgreSQL: `SET client_encoding`, `SET search_path`, etc.
  - SQLite: `PRAGMA` statements
- **Warning system**: Reports unsupported features
  - MySQL: ENUM, SET, UNSIGNED, FULLTEXT indexes
  - PostgreSQL: Array types, INHERITS, PARTITION BY (to SQLite)
- 37 new unit tests for converter module
- 15 new integration tests for convert command

### Changed

- **Test organization overhaul**: Extracted all 235 inline tests from source files to dedicated test files in `tests/` directory
  - Created 11 new unit test files: `parser_unit_test.rs`, `convert_unit_test.rs`, `sample_unit_test.rs`, `shard_unit_test.rs`, `schema_unit_test.rs`, `splitter_unit_test.rs`, `merger_unit_test.rs`, `analyzer_unit_test.rs`, `writer_unit_test.rs`, `cmd_unit_test.rs`
  - Moved `parser_edge_cases_test.rs` (83 tests) from src/ to tests/
  - Source files now contain only production code (zero inline tests)
- **Convert command enhancements**:
  - PostgreSQL COPY → INSERT conversion with batched inserts (100 rows per INSERT)
  - NULL marker handling (`\N` → NULL) and escape sequence conversion
  - PostgreSQL type cast stripping (::regclass, ::text, ::character varying)
  - Schema prefix removal (public.table → table)
  - DEFAULT now() → DEFAULT CURRENT_TIMESTAMP conversion
  - nextval() sequence removal
  - PostgreSQL-only feature filtering with warnings (CREATE DOMAIN/TYPE/FUNCTION/SEQUENCE)
  - TIMESTAMP WITH TIME ZONE → DATETIME conversion
  - Block comment handling at statement start
- Test suite expanded from 359 to 429 tests

## [1.6.0] - 2025-12-20

### Added

- **Shard command**: Extract tenant-specific data from multi-tenant SQL dumps
  - `sql-splitter shard dump.sql -o tenant_5.sql --tenant-value 5`
  - `--tenant-column` flag to specify tenant column (auto-detected if not specified)
  - Automatic tenant column detection (looks for company_id, tenant_id, organization_id, etc.)
  - FK-ordered tenant selection that follows FK chains to include dependent data
  - Table classification: tenant-root, tenant-dependent, junction, lookup, system
  - `--root-tables` flag for explicit root table specification
  - `--include-global` flag to control lookup table handling (none, lookups, all)
  - `--config` flag for YAML-based table classification overrides
  - `--max-selected-rows` memory guard (default 10M rows)
  - `--strict-fk` flag to fail on FK integrity issues
  - `--no-schema` flag to exclude CREATE TABLE statements
  - `--dry-run` flag to preview sharding statistics
  - Supports MySQL, PostgreSQL (COPY format), and SQLite dialects
- **Shard YAML configuration**: Per-table classification overrides
  - Tenant column and root table configuration
  - Table role overrides (tenant-root, junction, lookup, system)
  - Self-FK column specification for hierarchical tables (future use)
- 9 new unit tests for shard command

### Changed

- Extended shared infrastructure from v1.5.0 for tenant extraction use case
- Schema graph reused for FK dependency analysis

## [1.5.0] - 2025-12-20

### Added

- **Sample command**: Create reduced datasets from large SQL dumps for development/testing
  - `sql-splitter sample dump.sql -o dev.sql --percent 10` - Sample 10% of rows
  - `sql-splitter sample dump.sql -o dev.sql --rows 1000` - Sample up to 1000 rows per table
  - `--preserve-relations` flag for FK-aware sampling that maintains referential integrity
  - `--root-tables` flag to start sampling from specific tables
  - `--include-global` flag to control lookup/global table handling (none, lookups, all)
  - `--tables` and `--exclude` flags for table filtering
  - `--seed` flag for reproducible sampling
  - `--config` flag for YAML-based per-table sampling strategies
  - `--max-total-rows` explosion guard
  - `--strict-fk` flag to fail on FK integrity issues
  - `--no-schema` flag to exclude CREATE TABLE statements
  - `--dry-run` flag to preview sampling statistics
  - Supports MySQL, PostgreSQL (COPY format), and SQLite dialects
- **Schema graph module**: FK dependency analysis with topological sorting
  - `src/schema/` module for DDL parsing and FK extraction
  - Supports all three SQL dialects (backticks, double quotes, unquoted identifiers)
  - Cycle detection for self-referential and circular FK relationships
- **Row parsing module**: Extract individual rows from INSERT and COPY statements
  - `src/parser/mysql_insert.rs` for MySQL INSERT parsing with PK/FK extraction
  - `src/parser/postgres_copy.rs` for PostgreSQL COPY format parsing
- **YAML configuration**: Per-table sampling strategies via config file
  - Table classification (root, lookup, system, junction, normal)
  - Per-table percent or row count overrides
  - Skip specific tables
- 15 new integration tests for sample command across all dialects
- 40+ new unit tests for schema parsing, row parsing, and reservoir sampling

### Changed

- DDL parsing now supports PostgreSQL and SQLite identifier quoting
- Column type parsing extended with PostgreSQL types (SERIAL, TIMESTAMPTZ, BYTEA, etc.)
- Test suite expanded from 130 to 359 tests

### Fixed

- PostgreSQL COPY data now correctly converts to INSERT VALUES format in output

## [1.4.0] - 2025-12-20

### Added

- **Merge command**: Combine split SQL files back into a single dump file
  - `sql-splitter merge tables/ -o restored.sql`
  - `--tables` flag to merge specific tables only
  - `--exclude` flag to skip certain tables
  - `--transaction` flag to wrap output in BEGIN/COMMIT
  - `--dialect` flag for MySQL, PostgreSQL, or SQLite output format
  - Streaming concatenation with 256KB buffers
  - Progress bar support with `--progress` flag
- **Test data generator**: New `test_data_gen` crate for generating synthetic SQL fixtures
  - Deterministic, seed-based generation for reproducible tests
  - Multi-tenant schema with 16 tables (FK chains, self-refs, junctions, global tables)
  - Three scales: small (~500 rows), medium (~10K), large (~200K)
  - MySQL, PostgreSQL (COPY + INSERT), and SQLite output
  - CLI tool: `gen-fixtures --dialect mysql --scale small --seed 42`
- **New merger module**: Reusable `Merger` struct in `src/merger/` for programmatic use
- **Static test fixtures**: Edge case SQL files for MySQL, PostgreSQL, SQLite in `tests/fixtures/static/`
- 7 new tests for merge functionality

### Changed

- Project now uses Cargo workspace with `crates/test_data_gen/` subcrate
- Expanded test suite from 123 to 130 unit tests

## [1.3.1] - 2025-12-20

### Added

- **Animated progress bar**: New progress display using `indicatif` crate with:
  - Animated spinner (⠋⠙⠹⠸⠼⠴⠦⠧⠇⠏)
  - Visual progress bar with gradient characters (█▓▒░)
  - Elapsed time, bytes processed, and percentage display
  - Smooth 100ms tick rate animation
- Benchmark script now supports `--progress` flag to test progress bar performance

### Fixed

- Locale-related formatting issue in benchmark summary output

## [1.3.0] - 2025-12-20

### Added

- **Compressed file support**: Automatically decompress gzip (`.gz`), bzip2 (`.bz2`), xz (`.xz`), and zstd (`.zst`) files during processing — no manual decompression needed
- **Schema/Data filtering**: New `--schema-only` and `--data-only` flags to extract only DDL or DML statements
  - `--schema-only`: Only CREATE TABLE, CREATE INDEX, ALTER TABLE, DROP TABLE
  - `--data-only`: Only INSERT and COPY statements
- **Shell completions**: New `completions` subcommand generates completions for bash, zsh, fish, elvish, and PowerShell
  - `sql-splitter completions bash >> ~/.bashrc`
- **Automatic shell completion installation**: `make install` now automatically installs shell completions for the detected shell
- New Makefile targets:
  - `make install-completions`: Install completions for current shell only
  - `make install-completions-all`: Install completions for bash, zsh, and fish
- New `scripts/install-completions.sh` script for flexible completion installation
- 4 new tests for compression and content filtering

### Changed

- Updated README with improved installation instructions and shell completions section
- Expanded test suite from 119 to 123 unit tests

## [1.2.1] - 2025-12-20

### Fixed

- **PostgreSQL dollar-quoting bug**: Fixed parser getting stuck when SQL files contained mixed dollar-quote tags (e.g., `$_$` followed by `$$`). The parser now validates that dollar-quote tags are syntactically valid (empty or identifier-like `[A-Za-z_][A-Za-z0-9_]*`)
- **DROP TABLE IF EXISTS**: Fixed table name extraction for `DROP TABLE IF EXISTS` statements

### Added

- Real-world SQL verification script (`scripts/verify-realworld.sh`) that tests against 25 public SQL dumps
- `make verify-realworld` target for running real-world verification tests
- 32 new edge case tests for PostgreSQL, MySQL, SQLite, and cross-dialect parsing
- Tests for dollar-quoting, schema-qualified tables, IF EXISTS/IF NOT EXISTS clauses

### Changed

- Expanded test suite from 87 to 119 unit tests

## [1.0.0] - 2025-12-20

### Added

- Initial release of sql-splitter CLI tool (Rust rewrite of the Go version)
- **`split` command**: Split large SQL dump files into individual table files
  - `--output, -o`: Specify output directory (default: `output`)
  - `--verbose, -v`: Enable verbose output
  - `--progress, -p`: Show progress during processing
  - `--tables, -t`: Filter to split only specific tables (comma-separated)
  - `--dry-run`: Preview what would be split without writing files
- **`analyze` command**: Analyze SQL files and display statistics
  - `--progress, -p`: Show progress during analysis
- **High-performance streaming parser**
  - 400-500 MB/s typical throughput (1.25x faster than Go version on large files)
  - Memory-efficient: ~80 MB constant usage regardless of file size
  - Handles strings with escaped characters and multi-line statements
  - Adaptive buffer sizing based on file size
- **Concurrent writing**: Efficient multi-table writing with writer pools
- **Statement type support**: CREATE TABLE, INSERT INTO, CREATE INDEX, ALTER TABLE, DROP TABLE
- **Version flag**: `--version` to display version information

### Performance

- Streaming architecture handles files larger than available RAM
- Zero-copy parsing using byte slices where possible
- Pre-compiled regexes for fast pattern matching
- Optimized buffer sizes for CPU cache efficiency
- 1.25x faster than Go version on files >1GB

### Documentation

- Comprehensive README with usage examples
- AGENTS.md for AI assistant guidance
- Performance benchmarks and comparison with Go/PHP versions

## [1.2.0] - 2025-12-20

### Added

- **Automatic dialect detection**: The `--dialect` flag is now optional. When omitted, sql-splitter automatically detects the SQL dialect by analyzing the first 8KB of the file
  - Uses weighted scoring to identify PostgreSQL, MySQL/MariaDB, or SQLite formats
  - Reports detection confidence level (high, medium, low)
  - Detects MySQL/MariaDB by: header comments, conditional comments (`/*!40...`), `LOCK TABLES`, backticks
  - Detects PostgreSQL by: pg_dump header, `COPY ... FROM stdin`, `search_path`, dollar-quoting, `CREATE EXTENSION`
  - Detects SQLite by: header comments, `PRAGMA`, `BEGIN TRANSACTION`
  - Defaults to MySQL when no markers are found

### Changed

- `--dialect` flag is now optional for both `split` and `analyze` commands
- Added 16 new tests for dialect detection

### Documentation

- Updated docs/DIALECT_AUTODETECTION.md with implementation details

## [1.1.0] - 2025-12-20

### Added

- **Multi-dialect support**: Now supports MySQL, PostgreSQL, and SQLite dump formats
  - `--dialect=mysql` (default): MySQL/MariaDB mysqldump format
  - `--dialect=postgres`: PostgreSQL pg_dump format with COPY FROM stdin support
  - `--dialect=sqlite`: SQLite .dump format
- **PostgreSQL COPY statement support**: Properly handles `COPY table FROM stdin` blocks
- **Dollar-quoting support**: PostgreSQL dollar-quoted strings (`$$`, `$tag$`) are correctly parsed
- **Comprehensive benchmarks**: Added benchmark suite comparing against 6 competitor tools
- **Test data generator**: Python script for generating synthetic mysqldump files

### Changed

- Improved table name extraction for different identifier quoting styles
- Updated documentation with multi-dialect examples and benchmark results

### Fixed

- Correct handling of double-quote identifiers in PostgreSQL/SQLite
- Edge cases with escaped characters in different SQL dialects