# AGENTS.md

This file provides guidance for agents working with code in this repository.

## Project Overview

sql-splitter is a high-performance CLI tool written in Rust for splitting large SQL dump files into individual table files.

## Available Commands

### Makefile Commands

Run `make help` to see all available commands. Key commands:

| Command | Description |
|---------|-------------|
| `make build` | Debug build |
| `make release` | Release build |
| `make native` | Optimized build for current CPU (best performance) |
| `make test` | Run all tests |
| `make bench` | Run criterion benchmarks |
| `make profile` | Memory profile all commands (medium dataset) |
| `make profile-large` | Memory profile with large dataset (~250MB) |
| `make profile-mega` | Stress test profile (~2GB: 100 tables × 100k rows) |
| `make fmt` | Format code |
| `make check` | Check code without building |
| `make clippy` | Run clippy lints |
| `make clean` | Clean build artifacts |
| `make install` | Install locally (binary + shell completions) |
| `make verify-realworld` | Verify against real-world SQL dumps |
| `make website-deploy` | Deploy website to Vercel |
| `make docker-bench` | Run benchmarks in Docker (generates 100MB test data) |
| `make man` | Generate man pages in man/ directory |

### Building and Running

```bash
# Build (debug)
cargo build
# or: make build

# Build (release)
cargo build --release
# or: make release

# Build optimized for current CPU (best for benchmarking)
RUSTFLAGS="-C target-cpu=native" cargo build --release
# or: make native

# Run tests
cargo test
# or: make test

# Run benchmarks
cargo bench
# or: make bench

# Run clippy lints
cargo clippy -- -D warnings
# or: make clippy

# Format code
cargo fmt
# or: make fmt
```

### Main Commands

```bash
# Split a MySQL/MariaDB dump file (default)
./target/release/sql-splitter split large-dump.sql --output=tables

# Split a PostgreSQL pg_dump file
./target/release/sql-splitter split pg_dump.sql --output=tables --dialect=postgres

# Split a SQLite .dump file
./target/release/sql-splitter split sqlite.sql --output=tables --dialect=sqlite

# Analyze a SQL file to gather statistics
./target/release/sql-splitter analyze database.sql --progress

# Get help
./target/release/sql-splitter --help
./target/release/sql-splitter split --help
```

### Supported Dialects

| Dialect | Flag | Dump Tool | Key Features |
|---------|------|-----------|--------------|
| MySQL/MariaDB | `--dialect=mysql` (default) | mysqldump | Backtick quoting, backslash escapes |
| PostgreSQL | `--dialect=postgres` | pg_dump | Double-quote identifiers, COPY FROM stdin, dollar-quoting |
| SQLite | `--dialect=sqlite` | sqlite3 .dump | Double-quote identifiers |

## Architecture

### High-Level Design

```
BufReader (fill_buf) → Parser (Streaming) → WriterPool (BufWriter) → Table Files
    64KB Buffer          Statement Buffer       256KB Buffers per table
```

### Key Components

#### `src/parser/mod.rs` - Streaming SQL Parser

- Uses `BufReader::fill_buf()` + `consume()` pattern for streaming reads
- String-aware parsing: tracks escaped quotes and multi-line strings
- Manual table name extraction with regex fallback
- Pre-compiled static regexes via `once_cell::Lazy`

Key functions:
- `read_statement()`: Reads complete SQL statement (handles strings, escaping)
- `parse_statement()`: Identifies statement type and extracts table name
- `determine_buffer_size()`: Selects optimal buffer size based on file size
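
A minimal sketch of the `fill_buf()` + `consume()` streaming pattern described above (not the crate's actual `read_statement()`, which also tracks quote and escape state so a `;` inside a string literal does not end the statement):

```rust
use std::io::{BufRead, BufReader, Read};

// Sketch only: scan the reader's internal buffer for a statement terminator,
// copy just the bytes that belong to the current statement, then consume them.
fn read_statement<R: Read>(
    reader: &mut BufReader<R>,
    out: &mut Vec<u8>,
) -> std::io::Result<bool> {
    out.clear();
    loop {
        let buf = reader.fill_buf()?; // borrow the internal buffer without copying
        if buf.is_empty() {
            return Ok(!out.is_empty()); // EOF: return whatever is still pending
        }
        if let Some(pos) = buf.iter().position(|&b| b == b';') {
            out.extend_from_slice(&buf[..=pos]);
            reader.consume(pos + 1);
            return Ok(true);
        }
        let len = buf.len();
        out.extend_from_slice(buf);
        reader.consume(len); // the whole chunk belongs to the current statement
    }
}
```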

#### `src/writer/mod.rs` - Buffered File Writers

- `TableWriter`: Manages buffered writes to single table file
- `WriterPool`: HashMap of table writers using `ahash` for fast hashing
- 256KB `BufWriter` per table
- Auto-flush every 100 statements
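
A hedged sketch of the writer-pool idea described above; field and method names are illustrative, not the crate's actual API:

```rust
use std::fs::File;
use std::io::{BufWriter, Write};
use std::path::{Path, PathBuf};

use ahash::AHashMap;

const WRITE_BUF: usize = 256 * 1024; // 256KB buffer per table file
const FLUSH_EVERY: usize = 100; // statements between explicit flushes

// One lazily created buffered writer per table, keyed by table name.
struct WriterPool {
    writers: AHashMap<String, (BufWriter<File>, usize)>,
    out_dir: PathBuf,
}

impl WriterPool {
    fn new(out_dir: &Path) -> Self {
        Self {
            writers: AHashMap::new(),
            out_dir: out_dir.to_path_buf(),
        }
    }

    fn write_statement(&mut self, table: &str, stmt: &[u8]) -> std::io::Result<()> {
        if !self.writers.contains_key(table) {
            let file = File::create(self.out_dir.join(format!("{table}.sql")))?;
            let writer = BufWriter::with_capacity(WRITE_BUF, file);
            self.writers.insert(table.to_string(), (writer, 0));
        }
        let (writer, count) = self.writers.get_mut(table).expect("just inserted");
        writer.write_all(stmt)?;
        *count += 1;
        if *count % FLUSH_EVERY == 0 {
            writer.flush()?; // periodic flush keeps per-table memory bounded
        }
        Ok(())
    }
}
```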

#### `src/splitter/mod.rs` - Orchestration

- Coordinates parsing and writing
- Maintains processing statistics
- Routes statements to appropriate table writers
- Supports dry-run and table filtering
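
An illustrative routing loop for the orchestration described above; the closure parameters stand in for the parser and the writer pool, and all names here are assumptions:

```rust
fn route_statements<F, G>(
    mut next_statement: F,  // e.g. wraps the parser's read/parse step
    mut write_statement: G, // e.g. the writer pool's write method
    only_tables: Option<&[String]>,
    dry_run: bool,
) -> std::io::Result<u64>
where
    F: FnMut() -> std::io::Result<Option<(String, Vec<u8>)>>,
    G: FnMut(&str, &[u8]) -> std::io::Result<()>,
{
    let mut routed = 0;
    while let Some((table, stmt)) = next_statement()? {
        // Table filtering: skip statements for tables not in the allow-list.
        if let Some(filter) = only_tables {
            if !filter.iter().any(|t| t == &table) {
                continue;
            }
        }
        routed += 1;
        // Dry-run mode counts statements but never writes them.
        if !dry_run {
            write_statement(&table, &stmt)?;
        }
    }
    Ok(routed)
}
```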

#### `src/analyzer/mod.rs` - Statistical Analysis

- Counts INSERTs, CREATE TABLEs per table
- Calculates total bytes per table
- Optional progress tracking
- Sorts results by INSERT count
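
A small sketch of the per-table statistics the analyzer might accumulate; the types and field names are illustrative, not the crate's actual ones:

```rust
use ahash::AHashMap;

#[derive(Default)]
struct TableStats {
    inserts: u64,
    create_tables: u64,
    bytes: u64,
}

// "Sorts results by INSERT count": most-inserted tables first.
fn sorted_by_inserts(stats: AHashMap<String, TableStats>) -> Vec<(String, TableStats)> {
    let mut rows: Vec<_> = stats.into_iter().collect();
    rows.sort_by(|a, b| b.1.inserts.cmp(&a.1.inserts));
    rows
}
```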

## Performance Characteristics

### Key Optimizations

1. **Streaming I/O**: `fill_buf` + `consume` pattern for zero-copy reading
2. **Manual parsing**: Byte-level table name extraction before regex fallback
3. **Fast hashing**: `ahash::AHashMap` instead of default SipHash
4. **Pre-compiled regexes**: Static initialization via `once_cell::Lazy`
5. **Minimal allocations**: Work with `&[u8]` slices in hot path
6. **Buffered writes**: 256KB buffers with periodic flush
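
A short sketch of optimizations 4 and 5 (plus the regex fallback half of 2); the pattern shown is illustrative rather than the crate's exact regex:

```rust
use once_cell::sync::Lazy;
use regex::bytes::Regex;

// Pre-compiled fallback regex: compiled once, on first use, then reused.
static CREATE_TABLE_RE: Lazy<Regex> =
    Lazy::new(|| Regex::new(r"(?i)^\s*CREATE\s+TABLE\s+`?([^\s`(]+)`?").unwrap());

// Table-name extraction over &[u8], avoiding UTF-8 validation and allocation
// in the hot path; the manual byte-level scan would run before this fallback.
fn table_name(stmt: &[u8]) -> Option<&[u8]> {
    CREATE_TABLE_RE
        .captures(stmt)
        .and_then(|c| c.get(1))
        .map(|m| m.as_bytes())
}
```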

### Buffer Sizes

- File < 1GB: 64KB read buffer (optimal for CPU cache)
- File > 1GB: 256KB read buffer
- All tables: 256KB write buffers, flushed every 100 statements
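
A sketch of this heuristic using the thresholds stated above; the real `determine_buffer_size()` may differ in detail:

```rust
fn determine_buffer_size(file_size: u64) -> usize {
    const ONE_GB: u64 = 1024 * 1024 * 1024;
    if file_size < ONE_GB {
        64 * 1024 // smaller buffer stays cache-friendly for typical dumps
    } else {
        256 * 1024 // larger buffer amortizes read syscalls on big files
    }
}
```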

## Testing

```bash
# All tests
cargo test

# Specific module
cargo test parser::tests

# With output
cargo test -- --nocapture
```

## Benchmarking

```bash
# All benchmarks
cargo bench

# Specific benchmark
cargo bench -- read_statement
```

## Memory Profiling

Use GNU time to measure peak memory usage (Maximum Resident Set Size) for commands.

### Prerequisites

```bash
# macOS: Install GNU time
brew install gnu-time

# Linux: GNU time is typically at /usr/bin/time
```

### Quick Profiling

```bash
# Profile a single command with gtime (macOS) or /usr/bin/time (Linux)
gtime -v ./target/release/sql-splitter sample input.sql --percent 10 --output out.sql 2>&1 | grep "Maximum resident set size"

# Full metrics
gtime -v ./target/release/sql-splitter validate large-dump.sql --check-fk 2>&1 | tail -25
```

### Automated Profiling Script

Use the profiling script to profile all commands consistently:

```bash
# Profile all commands with medium-sized generated test data
./scripts/profile-memory.sh

# Profile with larger test data
./scripts/profile-memory.sh --size large

# Profile with a specific file
./scripts/profile-memory.sh --file /path/to/dump.sql

# Only generate test fixtures (don't run profiling)
./scripts/profile-memory.sh --generate-only --size xlarge
```

Size configurations:

| Size | Rows/Table | Tables | Approx File Size |
|------|------------|--------|------------------|
| tiny | 500 | 10 | ~0.5MB |
| small | 2,500 | 10 | ~2.5MB |
| medium | 25,000 | 10 | ~25MB |
| large | 125,000 | 10 | ~125MB |
| xlarge | 250,000 | 10 | ~250MB |
| huge | 500,000 | 10 | ~500MB |
| mega | 100,000 | 100 | ~1GB |
| giga | 1,000,000 | 100 | ~10GB (MySQL only) |

### Key Metrics

From GNU time output:
- **Maximum resident set size (kbytes)**: Peak memory usage
- **Elapsed (wall clock) time**: Total execution time
- **User time (seconds)**: CPU time in user mode

### Example Output (large size: ~125MB)

```
Command      Dialect      File Size     Peak RSS   Wall Time  Extra Args
------------------------------------------------------------
analyze      mysql        123.72 MB      6.85 MB     0:00.28
split        mysql        123.72 MB      9.93 MB     0:00.31
validate     mysql        123.72 MB     81.65 MB     0:01.50
validate     mysql        123.72 MB      9.60 MB     0:00.28  --no-fk-checks
sample       mysql        123.72 MB     15.90 MB     0:02.03
diff         mysql        123.72 MB    131.48 MB     0:02.98
redact       mysql        123.72 MB      9.09 MB     0:00.66
graph        mysql        123.72 MB      8.59 MB     0:00.31
order        mysql        123.72 MB    140.85 MB     0:00.41
query        mysql        123.72 MB     99.89 MB     0:36.51
shard        mysql        123.72 MB      5.65 MB     0:00.01
convert      mysql        123.72 MB     12.26 MB     0:00.84
```

### Query Command Performance by Dialect

The `query` command shows significant performance differences based on dialect:

| Dialect | File Size | Query Time | Peak RSS | Notes |
|---------|-----------|------------|----------|-------|
| PostgreSQL | 111 MB | 6.4s | 101 MB | Uses COPY (fast batch import) |
| MySQL | 124 MB | 36.5s | 100 MB | Uses INSERT (row-by-row parsing) |
| SQLite | 124 MB | 37.2s | 210 MB | Uses INSERT (row-by-row parsing) |

PostgreSQL dumps are ~6x faster to query because COPY blocks are efficiently batched.

### Memory Optimization Guidelines

When optimizing for memory:
1. Use streaming/chunked processing instead of loading all data
2. Use hash-based sets (`PkHashSet` with 64-bit hashes) instead of storing full values
3. Write intermediate results to temp files instead of accumulating in memory
4. Process tables sequentially in dependency order
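
A sketch of guideline 2, assuming a hypothetical `PkHashSet` that stores only 64-bit hashes of primary-key values rather than the values themselves:

```rust
use std::hash::{Hash, Hasher};

use ahash::{AHashSet, AHasher};

// Stand-in for the real type: hash collisions are possible but acceptable
// for membership checks, and memory stays at 8 bytes per key.
#[derive(Default)]
struct PkHashSet {
    hashes: AHashSet<u64>,
}

impl PkHashSet {
    fn hash_of(value: &[u8]) -> u64 {
        let mut hasher = AHasher::default();
        value.hash(&mut hasher);
        hasher.finish()
    }

    fn insert(&mut self, value: &[u8]) -> bool {
        self.hashes.insert(Self::hash_of(value))
    }

    fn contains(&self, value: &[u8]) -> bool {
        self.hashes.contains(&Self::hash_of(value))
    }
}
```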

## Key Implementation Details

- **Language**: Rust 2021 edition
- **CLI Framework**: clap v4 with derive macros
- **Regex**: `regex` crate with bytes API
- **HashMap**: `ahash::AHashMap` for performance
- **Buffer management**: `std::io::{BufReader, BufWriter}`
- **Statement types**: CREATE TABLE, INSERT INTO, CREATE INDEX, ALTER TABLE, DROP TABLE, COPY (PostgreSQL)
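
A trimmed-down, hypothetical example of the clap v4 derive style used for the CLI; the real argument set is much larger:

```rust
use std::path::PathBuf;

use clap::{Parser, Subcommand, ValueEnum};

#[derive(Parser)]
#[command(name = "sql-splitter", version)]
struct Cli {
    #[command(subcommand)]
    command: Command,
}

#[derive(Subcommand)]
enum Command {
    /// Split a dump into per-table files
    Split {
        input: PathBuf,
        #[arg(long, default_value = "tables")]
        output: PathBuf,
        #[arg(long, value_enum, default_value = "mysql")]
        dialect: Dialect,
    },
}

#[derive(Clone, Copy, ValueEnum)]
enum Dialect {
    Mysql,
    Postgres,
    Sqlite,
}

fn main() {
    let cli = Cli::parse();
    match cli.command {
        Command::Split { input, output, .. } => {
            println!("split {} -> {}", input.display(), output.display());
        }
    }
}
```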

## Release Process

Follow these steps to create a new release. **Both a git tag AND a GitHub release are required.**

### 1. Pre-release Checks

```bash
# Ensure all tests pass
cargo test

# Ensure it builds in release mode
cargo build --release

# Run lints
cargo clippy

# Verify formatting
cargo fmt --check

# Optional: Run Docker benchmarks to verify performance
./docker/run-benchmark.sh --generate 50
```

### 2. Update Version

1. Update version in `Cargo.toml`:
   ```toml
   version = "X.Y.Z"
   ```

2. Update `CHANGELOG.md`:
   - Add new version section with today's date: `## [X.Y.Z] - YYYY-MM-DD`
   - Move items from `[Unreleased]` to the new version section
   - Document all notable changes under Added/Changed/Fixed/Removed

### 3. Commit, Tag, and Push

```bash
# Stage all release files
git add Cargo.toml Cargo.lock CHANGELOG.md src/ docs/

# Commit with descriptive message
git commit -m "feat: <brief description> (vX.Y.Z)"

# Create annotated tag
git tag -a vX.Y.Z -m "Release vX.Y.Z

<Brief summary of changes>"

# Push commit and tag together
git push origin main --tags
```

### 4. Create GitHub Release (REQUIRED)

**Always create a GitHub release** - this makes the release visible on the releases page and generates release notes.

```bash
# Extract release notes for this version from CHANGELOG.md and create release
gh release create vX.Y.Z \
  --title "vX.Y.Z" \
  --notes "$(sed -n '/## \[X.Y.Z\]/,/## \[/p' CHANGELOG.md | head -n -1)" \
  --latest
```

Or manually at: https://github.com/HelgeSverre/sql-splitter/releases/new
- Select the tag `vX.Y.Z`
- Title: `vX.Y.Z`
- Description: Copy relevant section from CHANGELOG.md

### 5. Automatic crates.io Publish

**crates.io publishing is automatic** when you push a new tag. The GitHub Actions workflow handles this.

**Setup requirement**: Add the `CARGO_REGISTRY_TOKEN` secret to your GitHub repository:
1. Go to https://crates.io/settings/tokens and create a new token
2. Go to GitHub repo → Settings → Secrets and variables → Actions
3. Add new secret: `CARGO_REGISTRY_TOKEN` with your crates.io token

Manual publish (if needed):
```bash
cargo publish --dry-run  # Test first
cargo publish            # Publish
```

### 6. Post-release Verification

- [ ] GitHub release visible: https://github.com/HelgeSverre/sql-splitter/releases
- [ ] Tag visible: `git tag -l | grep vX.Y.Z`
- [ ] crates.io updated (if published): https://crates.io/crates/sql-splitter
- [ ] Website auto-deployed via Vercel (if applicable)

### Versioning Guidelines

Follow [Semantic Versioning](https://semver.org/):
- **MAJOR** (X.0.0): Breaking changes to CLI interface or output format
- **MINOR** (0.X.0): New features, new dialects, new commands
- **PATCH** (0.0.X): Bug fixes, performance improvements, documentation

### Quick Release Checklist

```
[ ] cargo test passes
[ ] cargo clippy clean
[ ] Version bumped in Cargo.toml
[ ] CHANGELOG.md updated
[ ] git commit + tag created
[ ] git push origin main --tags
[ ] gh release create vX.Y.Z (REQUIRED!)
[ ] Verify release at github.com/HelgeSverre/sql-splitter/releases
```

## Website llms.txt Maintenance

The file `website/llms.txt` provides LLM-friendly documentation following the [llmstxt.org](https://llmstxt.org) specification. This file helps AI tools understand how to use and install sql-splitter.

### When to Update llms.txt

Update `website/llms.txt` when:
- Adding new CLI commands or subcommands
- Adding/changing command-line flags or options
- Adding support for new SQL dialects
- Adding support for new compression formats
- Changing installation methods
- Updating performance benchmarks significantly
- Adding new major features

### llms.txt Format Requirements

The file must follow this structure (in order):
1. **H1 header**: Project name (`# sql-splitter`)
2. **Blockquote**: Brief summary with key capabilities
3. **Body sections**: Detailed info (no H2 headers yet)
4. **H2 sections**: File lists with URLs to documentation/source

Key guidelines:
- Keep content concise and actionable for LLMs
- Include complete CLI examples with common flags
- Document all supported options in tables
- Link to GitHub source files and documentation
- Use the "Optional" H2 section for secondary resources

### Example Update

When adding a new `--format` flag:

```markdown
## Commands

### split
...
Options:
- `--format <FORMAT>`: Output format: sql, json (default: sql)  # ADD THIS
...
```

## Agent Skills

The file `skills/sql-splitter/SKILL.md` provides a skill definition following the [Agent Skills](https://agentskills.io) specification. This enables AI coding assistants to automatically discover and use sql-splitter.

### Supported Tools

Agent Skills are supported by: Amp, Claude Code, VS Code / GitHub Copilot, Cursor, Goose, Letta, and OpenCode.

### When to Update SKILL.md

Update `skills/sql-splitter/SKILL.md` when:
- Adding new CLI commands
- Changing command patterns or workflows
- Adding new flags that affect common usage patterns
- Updating decision logic for when to use commands

### SKILL.md Format Requirements

The file follows the Agent Skills specification:
1. **YAML frontmatter**: name, description, license, compatibility
2. **Markdown body**: Step-by-step instructions, patterns, and examples

Key guidelines:
- Focus on **when to use** vs **when not to use**
- Provide step-by-step patterns for common workflows
- Include error handling guidance
- Keep instructions actionable and concise

### Installing the Skill

**Amp:**
```bash
amp skill add helgesverre/sql-splitter
```

**Claude Code:**
```bash
git clone https://github.com/helgesverre/sql-splitter.git /tmp/sql-splitter
cp -r /tmp/sql-splitter/skills/sql-splitter ~/.claude/skills/
```

**VS Code / GitHub Copilot:**
```bash
git clone https://github.com/helgesverre/sql-splitter.git /tmp/sql-splitter
cp -r /tmp/sql-splitter/skills/sql-splitter .github/skills/
```

**Cursor:**
```bash
git clone https://github.com/helgesverre/sql-splitter.git /tmp/sql-splitter
cp -r /tmp/sql-splitter/skills/sql-splitter .cursor/skills/
```

**Goose:**
```bash
git clone https://github.com/helgesverre/sql-splitter.git /tmp/sql-splitter
cp -r /tmp/sql-splitter/skills/sql-splitter ~/.config/goose/skills/
```

**Letta:**
```bash
git clone https://github.com/helgesverre/sql-splitter.git /tmp/sql-splitter
cp -r /tmp/sql-splitter/skills/sql-splitter .skills/
```

**OpenCode:**
```bash
git clone https://github.com/helgesverre/sql-splitter.git /tmp/sql-splitter
cp -r /tmp/sql-splitter/skills/sql-splitter ~/.opencode/skills/
```

**Universal Installer (via npx):**
```bash
npx ai-agent-skills install sql-splitter --agent <agent>
# Supported agents: claude, cursor, amp, vscode, goose, opencode
```

### Skill Directory Structure

```
sql-splitter/
└── skills/
    └── sql-splitter/
        └── SKILL.md
```

## Test Performance

### Test Suite Overview

The test suite includes 780+ tests across multiple test files:
- `duckdb_test.rs`: 150 tests for DuckDB query functionality
- Unit tests for parser, splitter, analyzer, converter, etc.
- Integration tests for various commands

### cargo test vs cargo-nextest

For this codebase, both runners perform similarly:

| Runner | Wall Clock | CPU Usage | Notes |
|--------|------------|-----------|-------|
| `cargo test` | ~20s | 342% | Default, in-process parallelism |
| `cargo nextest run` | ~19s | 755% | Per-test process isolation, better parallelism |

**Recommendation:** Use `cargo test` for development (simpler). Consider `cargo nextest` when:
- You need better test failure diagnostics (clearer output)
- You need test retries (`--retries`)
- You want maximum parallelization on many-core machines
- Tests have resource conflicts requiring process isolation