sql-splitter 1.8.0

High-performance CLI tool for splitting large SQL dump files into individual table files
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
# Shard Command Design

**Status**: ✅ Implemented (v1.6.0)  
**Date**: 2025-12-20  
**Source**: Oracle feasibility analysis + real-world schema investigation

## Overview

The `shard` command extracts tenant-specific data from multi-tenant SQL dumps. It resolves FK chains to include all data belonging to a tenant, even from tables that don't have a direct tenant column.

This is a common need in multi-tenant SaaS applications where:
- Most tables have a `company_id` or `tenant_id` column
- Some tables (pivot/junction, child tables) don't have the tenant column directly
- You need to extract one tenant's data for testing, debugging, or migration

**Key differentiator:** Most tools can filter by a column value, but they can't follow FK chains to include related data from tables without that column.

## Real-World Schema Analysis

Analysis of two production Laravel/MySQL multi-tenant applications revealed:

| Metric | App 1 | App 2 |
|--------|-------|-------|
| Total tables | 52 | 125 |
| Tables with `company_id` | 22 (42%) | 44 (35%) |
| Tables without `company_id` | 30 (58%) | 81 (65%) |

**Tables without direct tenant column fall into categories:**

| Category | Examples | How to Handle |
|----------|----------|---------------|
| **Child tables** | `invoice_items``invoices` | Follow FK to tenant-owned parent |
| **Junction/pivot** | `permission_role`, `role_user` | Include if either FK hits tenant data |
| **Self-referential** | `comments.parent_id`, `folders.parent_id` | Closure over parent chain |
| **Lookup/reference** | `permissions`, `roles`, `migrations` | Include fully or skip |
| **System/framework** | `failed_jobs`, `job_batches`, `cache` | Skip by default |

## Command Interface

```bash
# Extract single tenant by company_id
sql-splitter shard dump.sql -o tenant_5.sql \
  --tenant-column company_id \
  --tenant-value 5

# With auto-detection (looks for company_id, tenant_id, organization_id)
sql-splitter shard dump.sql -o tenant_5.sql --tenant-value 5

# Include global lookup tables
sql-splitter shard dump.sql -o tenant_5.sql \
  --tenant-value 5 \
  --include-global lookups

# Specify explicit root tables
sql-splitter shard dump.sql -o tenant_5.sql \
  --tenant-value 5 \
  --root-tables companies,users

# Use config file for complex schemas
sql-splitter shard dump.sql -o tenant_5.sql \
  --tenant-value 5 \
  --config shard.yaml

# Dry run to see what would be extracted
sql-splitter shard dump.sql --tenant-value 5 --dry-run

# Extract multiple tenants to separate files
sql-splitter shard dump.sql -o shards/ \
  --tenant-column company_id \
  --tenant-values 1,2,3,5,8

# Extract by hash partitioning (for parallel loading)
sql-splitter shard dump.sql -o chunks/ \
  --tenant-column id \
  --table companies \
  --hash --partitions 8
```

## CLI Options

| Flag | Description | Default |
|------|-------------|---------|
| `-o, --output` | Output file or directory | stdout |
| `--tenant-column` | Column name for tenant key | auto-detect |
| `--tenant-value` | Tenant key value to extract | required |
| `--tenant-values` | Multiple tenant values (comma-separated) ||
| `--root-tables` | Explicit tenant root tables | auto-detect |
| `--include-global` | Global table handling: `none`, `lookups`, `all` | lookups |
| `-d, --dialect` | SQL dialect | auto-detect |
| `-p, --progress` | Show progress bar | false |
| `--dry-run` | Show statistics without writing | false |
| `--config` | YAML config for table classification ||
| `--strict-fk` | Fail on FK integrity issues | false |
| `--include-schema` | Include CREATE TABLE statements | true |
| `--hash` | Use hash-based sharding | false |
| `--partitions` | Number of hash partitions ||

## How It Works

### High-Level Pipeline

```
┌─────────────────────────────────────────────────────────────────────────┐
│                           SHARD PIPELINE                                │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Phase 0: Schema Analysis                                               │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │ • Parse CREATE TABLE statements                                  │  │
│  │ • Extract columns, PKs, FKs                                      │  │
│  │ • Build dependency graph                                         │  │
│  │ • Classify tables: tenant-root, dependent, junction, global      │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  Phase 1: Internal Split                                                │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │ • Stream through dump file once                                  │  │
│  │ • Split into per-table temp files: tmp/<table>.data              │  │
│  │ • Write schema to tmp/schema.sql                                 │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  Phase 2: Tenant Selection (FK-ordered)                                 │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │ • Process tables in topological order (roots first)              │  │
│  │ • For each table:                                                │  │
│  │   - Stream through tmp/<table>.data                              │  │
│  │   - Check tenant column OR FK membership                         │  │
│  │   - Record selected PKs                                          │  │
│  │   - Emit matching rows to output                                 │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  Output: Concatenate schema.sql + selected data in table order          │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘
```

### Table Classification Algorithm

```rust
fn classify_table(
    table: &TableSchema,
    graph: &SchemaGraph,
    config: &ShardConfig,
) -> TableClassification {
    // 1. Check explicit config overrides
    if let Some(role) = config.table_role(&table.name) {
        return role;
    }
    
    // 2. Check for tenant column
    if table.has_column(&config.tenant_column) {
        return TableClassification::TenantRoot;
    }
    
    // 3. Check if reachable from tenant roots via FK chain
    if graph.is_reachable_from_tenant_roots(&table.name) {
        // Has FK path to a tenant table
        if table.is_junction_table() {
            return TableClassification::Junction;
        }
        return TableClassification::TenantDependent;
    }
    
    // 4. Check for known system/framework patterns
    if is_system_table(&table.name) {
        return TableClassification::System;
    }
    
    // 5. Default to lookup/global
    TableClassification::Lookup
}

fn is_system_table(name: &str) -> bool {
    matches!(name, 
        "migrations" | "failed_jobs" | "job_batches" | "jobs" |
        "password_resets" | "sessions" | "cache" | "cache_locks" |
        "telescope_entries" | "telescope_monitoring" |
        "pulse_entries" | "pulse_values" | "pulse_aggregates"
    )
}
```

### FK Chain Resolution

For a table without a tenant column (e.g., `invoice_items`):

```
invoice_items
    └── invoice_id → invoices.id
                         └── company_id = 5 ✓ (tenant match)
```

**Algorithm:**
1. `invoices` is processed first (has `company_id`, is tenant-root)
2. Selected invoice IDs are recorded: `{101, 205, 307, ...}`
3. `invoice_items` is processed later (dependent table)
4. For each row, check: `invoice_id ∈ selected_pks["invoices"]`?
5. If yes, include the row

### Handling Junction Tables

Junction tables like `permission_role`:

```sql
CREATE TABLE permission_role (
    permission_id INT REFERENCES permissions(id),
    role_id INT REFERENCES roles(id)
);
```

Where `roles` has `company_id` but `permissions` is global.

**Inclusion rule:** Include row if ANY FK points to a selected row in a tenant-owned table.

```rust
fn should_include_junction_row(
    row: &Row,
    table: &TableSchema,
    selected_pks: &HashMap<String, PkSet>,
) -> bool {
    for fk in &table.fk_edges {
        if let Some(pks) = selected_pks.get(&fk.to_table) {
            let fk_value = row.get_column(&fk.from_columns);
            if pks.contains(&fk_value) {
                return true;
            }
        }
    }
    false
}
```

### Self-Referential Tables

Tables like `comments.parent_id → comments.id` or `folders.parent_id → folders.id`:

```
comments (id=10, parent_id=5, company_id=1)
    └── parent: comments (id=5, parent_id=2, company_id=1)
        └── parent: comments (id=2, parent_id=NULL, company_id=1)
```

**Algorithm:** Ancestor closure

```rust
fn compute_self_referential_closure(
    table: &str,
    self_fk_column: &str,
    initial_pks: &PkSet,
    table_data: &Path,
) -> PkSet {
    let mut selected = initial_pks.clone();
    let mut needed_parents: PkSet = collect_parent_refs(&selected, self_fk_column);
    
    while !needed_parents.is_empty() {
        // Scan table file for rows with PK in needed_parents
        let newly_selected = scan_for_pks(table_data, &needed_parents);
        
        // Add to selected set
        selected.extend(newly_selected.iter().cloned());
        
        // Find their parents
        let new_parents = collect_parent_refs(&newly_selected, self_fk_column);
        needed_parents = new_parents.difference(&selected).cloned().collect();
    }
    
    selected
}
```

This may require multiple passes over the table's temp file, but only for self-referential tables.

---

## Configuration File

For complex schemas, use a YAML config:

```yaml
# shard.yaml

tenant:
  column: company_id
  # Optional: explicit root tables (otherwise auto-detected)
  root_tables:
    - companies
    - users

# Table classification overrides
tables:
  # System/framework tables (skip entirely)
  migrations:
    role: system
  failed_jobs:
    role: system
  job_batches:
    role: system
  telescope_entries:
    role: system
  
  # Lookup tables (include fully or skip)
  permissions:
    role: lookup
    include: true
  roles:
    role: lookup
    include: true  # Note: roles might have company_id in some apps
  
  # Junction tables (explicit classification)
  permission_role:
    role: junction
  role_user:
    role: junction
  taggables:
    role: junction
  
  # Large tables with special handling
  activity_log:
    role: tenant
    # Future: limit rows
    # max_rows: 1000
  
  # Self-referential tables (auto-detected, but can be explicit)
  comments:
    self_fk: parent_id
  folders:
    self_fk: parent_id

# Global table handling
include_global: lookups  # none | lookups | all
```

---

## Cross-Dialect Considerations

### MySQL / MariaDB

**Characteristics:**
- Backtick-quoted identifiers
- Backslash escapes in strings
- Multi-row INSERT: `INSERT INTO t VALUES (1,'a'), (2,'b'), ...`

**Value parsing:**
```rust
fn parse_mysql_insert_values(stmt: &[u8]) -> Vec<Row> {
    // Handle: strings with \' and \\, NULL, numbers, hex
    // Split by ), (
    // Watch for strings containing ), (
}
```

### PostgreSQL

**Characteristics:**
- Double-quote identifiers (optional)
- `''` for escaped quotes in strings
- COPY FROM stdin format (more common than INSERT)
- Schema-qualified names: `public.users`

**COPY parsing:**
```rust
fn parse_postgres_copy_data(data: &[u8]) -> Vec<Row> {
    // Tab-separated values
    // \N for NULL
    // \\ for backslash, \t for tab, \n for newline
    // Lines terminated by \n
    // Block terminated by \.\n
}
```

### SQLite

**Characteristics:**
- Double-quote or backtick identifiers
- `''` for escaped quotes
- INSERT statements (no COPY)
- Simpler overall

---

## Memory Model

### PK Tracking

```rust
/// Efficient PK value representation
pub enum PkValue {
    Int(i64),          // Most common: auto-increment IDs
    BigInt(i128),      // Large IDs, ULIDs as integers
    Text(Box<str>),    // UUIDs, string PKs
}

/// Composite PK support
pub type PkTuple = SmallVec<[PkValue; 2]>;

/// Per-table set of selected PKs
pub type PkSet = AHashSet<PkTuple>;

/// Sharding state
pub struct ShardState {
    /// Only track PKs for tables that are FK targets
    selected_pks: HashMap<String, PkSet>,
    
    /// Statistics
    stats: ShardStats,
}
```

### Memory Budget

| Scenario | Selected Rows | Memory |
|----------|---------------|--------|
| Small tenant | 10,000 | ~400 KB |
| Medium tenant | 100,000 | ~4 MB |
| Large tenant | 1,000,000 | ~40 MB |
| Very large | 10,000,000 | ~400 MB |

**Guardrails:**
```rust
const DEFAULT_MAX_SELECTED_ROWS: usize = 10_000_000;

if total_selected > config.max_selected_rows {
    return Err(anyhow!(
        "Selected {} rows exceeds limit of {}. \
         This tenant may own most of the data. \
         Use --max-selected-rows to increase limit.",
        total_selected, config.max_selected_rows
    ));
}
```

---

## Performance

### Targets

| Metric | Target | Notes |
|--------|--------|-------|
| Schema analysis | <5s for 100+ tables | Single pass, regex-based |
| Internal split | 200+ MB/s | Reuses existing split logic |
| Tenant selection | 100-150 MB/s | Row parsing overhead |
| Memory | O(selected_rows) | Not O(file_size) |

### Optimizations

1. **Single parse of original dump**: Schema + split in one pass
2. **Per-table temp files**: Process in any order (topological)
3. **Lazy value decoding**: Only parse PK/FK/tenant columns
4. **Fast hashing**: `ahash` for PK sets
5. **Streaming writes**: Buffer and flush incrementally

### Disk Usage

Temp files use approximately the same space as the original dump:
- 10 GB dump → ~10-12 GB temp files
- Cleaned up automatically after completion

---

## Edge Cases

### 1. Tenant Owns Most Data

If a single tenant owns 90% of the data:
- PK sets will be large
- Memory usage increases proportionally
- Consider: Is sharding the right approach? Maybe just filter on import.

### 2. Orphaned FK References (Original Data Issues)

```
⚠️ Warning: 15 rows in 'orders' reference non-existent user_id values
   This indicates orphaned data in the original dump.
```

With `--strict-fk`: Command fails.

### 3. Tables Without Primary Key

```
⚠️ Warning: Table 'legacy_audit' has no PRIMARY KEY
   Cannot track FK references for tenant selection.
   Including all rows (may include data from other tenants).
```

### 4. Circular FK References (Rare)

Detected via SCC analysis. Handled by processing entire cycle as a unit with relaxed inclusion rules.

### 5. Very Deep FK Chains

```
a → b → c → d → e → f → company_id
```

Handled correctly by topological ordering. Table `a` is processed last, after all intermediate tables have their PKs recorded.

---

## Output Format

### Single Tenant Output

```sql
-- Sharded from: production.sql
-- Date: 2025-12-20 14:30:00
-- Tenant column: company_id
-- Tenant value: 5
-- Dialect: mysql
--
-- Statistics:
--   Tables processed: 52
--   Tables with data: 38
--   Tables skipped: 7 (system)
--   Global tables included: 3 (lookups)
--   Total rows: 45,678 (from 1,234,567 original)
--   FK integrity: OK

-- Schema

CREATE TABLE `companies` (...);
CREATE TABLE `users` (...);
...

-- Data

INSERT INTO `companies` VALUES (5, 'Acme Corp', ...);

INSERT INTO `users` VALUES
(12, 'alice@acme.com', 5),
(15, 'bob@acme.com', 5),
...;

INSERT INTO `orders` VALUES ...;
```

### Multi-Tenant Output (Directory)

```
shards/
├── tenant_1.sql
├── tenant_2.sql
├── tenant_5.sql
├── tenant_8.sql
└── global/
    ├── schema.sql
    ├── permissions.sql
    └── roles.sql
```

---

## Example Workflows

### 1. Extract Single Tenant for Debugging

```bash
# Customer reported issue, need their data locally
sql-splitter shard prod-backup.sql.gz -o customer_acme.sql \
  --tenant-value 42 \
  --include-global lookups \
  -p  # Show progress
```

### 2. Create Tenant-Specific Test Fixtures

```bash
# Each test tenant gets own fixture
for tenant_id in 1 2 3; do
  sql-splitter shard dump.sql -o "fixtures/tenant_${tenant_id}.sql" \
    --tenant-value $tenant_id \
    --include-global none \
    --exclude activity_log,telescope_entries
done
```

### 3. Migrate Tenant to Separate Database

```bash
# Extract tenant data for migration
sql-splitter shard prod.sql.gz -o migrate_tenant_5.sql \
  --tenant-value 5 \
  --include-schema \
  --include-global all \
  --strict-fk  # Ensure data integrity

# Import to new database
mysql -u root new_tenant_db < migrate_tenant_5.sql
```

### 4. Hash-Based Parallel Loading

```bash
# Split data into 8 chunks for parallel import
sql-splitter shard dump.sql -o chunks/ \
  --table companies \
  --hash \
  --partitions 8

# Load in parallel
for i in {0..7}; do
  mysql -u root db < "chunks/partition_${i}.sql" &
done
wait
```

---

## Implementation Effort

| Component | Effort | Shared With |
|-----------|--------|-------------|
| CLI argument parsing | 2 hours ||
| Schema graph (FK parsing) | 6 hours | `sample --preserve-relations` |
| Table classification | 3 hours | `sample` |
| Internal split (temp files) | 2 hours ||
| PK tracking data structures | 2 hours | `sample` |
| Row parsing (INSERT/COPY) | 4 hours | `sample` |
| Tenant selection logic | 4 hours ||
| Junction table handling | 2 hours | `sample` |
| Self-FK closure | 3 hours | `sample` |
| Output generation | 2 hours ||
| Config file parsing | 2 hours | `sample` |
| Progress reporting | 1 hour | Existing |
| Multi-tenant sharding | 4 hours ||
| Hash-based sharding | 3 hours ||
| Testing | 8 hours ||
| **Total** | **~48 hours** | |

**Note:** ~20 hours of this is shared with `sample --preserve-relations`, so implementing both features reduces total effort.

---

## Future Enhancements

1. **Streaming multi-tenant**: Extract all tenants in single pass
2. **Key range partitioning**: `--ranges "2024-01-01,2024-06-01,2025-01-01"`
3. **Tenant mapping file**: JSON/YAML mapping tenant_id → shard
4. **Parallel processing**: Process independent tables concurrently
5. **Incremental sharding**: Add new data to existing shards
6. **Compression support**: Output to `.sql.gz`

---

## Risks & Mitigations

| Risk | Mitigation |
|------|------------|
| Tenant owns most data | Memory limit + clear error message |
| Complex FK patterns | Conservative cycle handling + warnings |
| Dialect variations | Comprehensive test fixtures for each dialect |
| Orphaned data in source | Detect and warn; `--strict-fk` for enforcement |
| Very large dumps | Temp file approach keeps memory bounded |

---

## Related

- [Sample Feature]SAMPLE_FEATURE.md — Shares FK graph and PK tracking
- [Split Command]../../src/cmd/split.rs — Base streaming architecture
- [Merge Feature]MERGE_FEATURE.md — Combine split files
- [Additional Ideas]ADDITIONAL_IDEAS.md — Hash sharding, key ranges