# Shard Command Design
**Status**: ✅ Implemented (v1.6.0)
**Date**: 2025-12-20
**Source**: Oracle feasibility analysis + real-world schema investigation
## Overview
The `shard` command extracts tenant-specific data from multi-tenant SQL dumps. It resolves FK chains to include all data belonging to a tenant, even from tables that don't have a direct tenant column.
This is a common need in multi-tenant SaaS applications where:
- Most tables have a `company_id` or `tenant_id` column
- Some tables (pivot/junction, child tables) don't have the tenant column directly
- You need to extract one tenant's data for testing, debugging, or migration
**Key differentiator:** Most tools can filter by a column value, but they can't follow FK chains to include related data from tables without that column.
## Real-World Schema Analysis
Analysis of two production Laravel/MySQL multi-tenant applications revealed:
| Total tables | 52 | 125 |
| Tables with `company_id` | 22 (42%) | 44 (35%) |
| Tables without `company_id` | 30 (58%) | 81 (65%) |
**Tables without direct tenant column fall into categories:**
| **Child tables** | `invoice_items` → `invoices` | Follow FK to tenant-owned parent |
| **Junction/pivot** | `permission_role`, `role_user` | Include if either FK hits tenant data |
| **Self-referential** | `comments.parent_id`, `folders.parent_id` | Closure over parent chain |
| **Lookup/reference** | `permissions`, `roles`, `migrations` | Include fully or skip |
| **System/framework** | `failed_jobs`, `job_batches`, `cache` | Skip by default |
## Command Interface
```bash
# Extract single tenant by company_id
sql-splitter shard dump.sql -o tenant_5.sql \
--tenant-column company_id \
--tenant-value 5
# With auto-detection (looks for company_id, tenant_id, organization_id)
sql-splitter shard dump.sql -o tenant_5.sql --tenant-value 5
# Include global lookup tables
sql-splitter shard dump.sql -o tenant_5.sql \
--tenant-value 5 \
--include-global lookups
# Specify explicit root tables
sql-splitter shard dump.sql -o tenant_5.sql \
--tenant-value 5 \
--root-tables companies,users
# Use config file for complex schemas
sql-splitter shard dump.sql -o tenant_5.sql \
--tenant-value 5 \
--config shard.yaml
# Dry run to see what would be extracted
sql-splitter shard dump.sql --tenant-value 5 --dry-run
# Extract multiple tenants to separate files
sql-splitter shard dump.sql -o shards/ \
--tenant-column company_id \
--tenant-values 1,2,3,5,8
# Extract by hash partitioning (for parallel loading)
sql-splitter shard dump.sql -o chunks/ \
--tenant-column id \
--table companies \
--hash --partitions 8
```
## CLI Options
| `-o, --output` | Output file or directory | stdout |
| `--tenant-column` | Column name for tenant key | auto-detect |
| `--tenant-value` | Tenant key value to extract | required |
| `--tenant-values` | Multiple tenant values (comma-separated) | — |
| `--root-tables` | Explicit tenant root tables | auto-detect |
| `--include-global` | Global table handling: `none`, `lookups`, `all` | lookups |
| `-d, --dialect` | SQL dialect | auto-detect |
| `-p, --progress` | Show progress bar | false |
| `--dry-run` | Show statistics without writing | false |
| `--config` | YAML config for table classification | — |
| `--strict-fk` | Fail on FK integrity issues | false |
| `--include-schema` | Include CREATE TABLE statements | true |
| `--hash` | Use hash-based sharding | false |
| `--partitions` | Number of hash partitions | — |
## How It Works
### High-Level Pipeline
```
┌─────────────────────────────────────────────────────────────────────────┐
│ SHARD PIPELINE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Phase 0: Schema Analysis │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ • Parse CREATE TABLE statements │ │
│ │ • Extract columns, PKs, FKs │ │
│ │ • Build dependency graph │ │
│ │ • Classify tables: tenant-root, dependent, junction, global │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ Phase 1: Internal Split │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ • Stream through dump file once │ │
│ │ • Split into per-table temp files: tmp/<table>.data │ │
│ │ • Write schema to tmp/schema.sql │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ Phase 2: Tenant Selection (FK-ordered) │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ • Process tables in topological order (roots first) │ │
│ │ • For each table: │ │
│ │ - Stream through tmp/<table>.data │ │
│ │ - Check tenant column OR FK membership │ │
│ │ - Record selected PKs │ │
│ │ - Emit matching rows to output │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ Output: Concatenate schema.sql + selected data in table order │
│ │
└─────────────────────────────────────────────────────────────────────────┘
```
### Table Classification Algorithm
```rust
fn classify_table(
table: &TableSchema,
graph: &SchemaGraph,
config: &ShardConfig,
) -> TableClassification {
// 1. Check explicit config overrides
if let Some(role) = config.table_role(&table.name) {
return role;
}
// 2. Check for tenant column
if table.has_column(&config.tenant_column) {
return TableClassification::TenantRoot;
}
// 3. Check if reachable from tenant roots via FK chain
if graph.is_reachable_from_tenant_roots(&table.name) {
// Has FK path to a tenant table
if table.is_junction_table() {
return TableClassification::Junction;
}
return TableClassification::TenantDependent;
}
// 4. Check for known system/framework patterns
if is_system_table(&table.name) {
return TableClassification::System;
}
// 5. Default to lookup/global
TableClassification::Lookup
}
fn is_system_table(name: &str) -> bool {
matches!(name,
"migrations" | "failed_jobs" | "job_batches" | "jobs" |
"password_resets" | "sessions" | "cache" | "cache_locks" |
"telescope_entries" | "telescope_monitoring" |
"pulse_entries" | "pulse_values" | "pulse_aggregates"
)
}
```
### FK Chain Resolution
For a table without a tenant column (e.g., `invoice_items`):
```
invoice_items
└── invoice_id → invoices.id
└── company_id = 5 ✓ (tenant match)
```
**Algorithm:**
1. `invoices` is processed first (has `company_id`, is tenant-root)
2. Selected invoice IDs are recorded: `{101, 205, 307, ...}`
3. `invoice_items` is processed later (dependent table)
4. For each row, check: `invoice_id ∈ selected_pks["invoices"]`?
5. If yes, include the row
### Handling Junction Tables
Junction tables like `permission_role`:
```sql
CREATE TABLE permission_role (
permission_id INT REFERENCES permissions(id),
role_id INT REFERENCES roles(id)
);
```
Where `roles` has `company_id` but `permissions` is global.
**Inclusion rule:** Include row if ANY FK points to a selected row in a tenant-owned table.
```rust
fn should_include_junction_row(
row: &Row,
table: &TableSchema,
selected_pks: &HashMap<String, PkSet>,
) -> bool {
for fk in &table.fk_edges {
if let Some(pks) = selected_pks.get(&fk.to_table) {
let fk_value = row.get_column(&fk.from_columns);
if pks.contains(&fk_value) {
return true;
}
}
}
false
}
```
### Self-Referential Tables
Tables like `comments.parent_id → comments.id` or `folders.parent_id → folders.id`:
```
comments (id=10, parent_id=5, company_id=1)
└── parent: comments (id=5, parent_id=2, company_id=1)
└── parent: comments (id=2, parent_id=NULL, company_id=1)
```
**Algorithm:** Ancestor closure
```rust
fn compute_self_referential_closure(
table: &str,
self_fk_column: &str,
initial_pks: &PkSet,
table_data: &Path,
) -> PkSet {
let mut selected = initial_pks.clone();
let mut needed_parents: PkSet = collect_parent_refs(&selected, self_fk_column);
while !needed_parents.is_empty() {
// Scan table file for rows with PK in needed_parents
let newly_selected = scan_for_pks(table_data, &needed_parents);
// Add to selected set
selected.extend(newly_selected.iter().cloned());
// Find their parents
let new_parents = collect_parent_refs(&newly_selected, self_fk_column);
needed_parents = new_parents.difference(&selected).cloned().collect();
}
selected
}
```
This may require multiple passes over the table's temp file, but only for self-referential tables.
---
## Configuration File
For complex schemas, use a YAML config:
```yaml
# shard.yaml
tenant:
column: company_id
# Optional: explicit root tables (otherwise auto-detected)
root_tables:
- companies
- users
# Table classification overrides
tables:
# System/framework tables (skip entirely)
migrations:
role: system
failed_jobs:
role: system
job_batches:
role: system
telescope_entries:
role: system
# Lookup tables (include fully or skip)
permissions:
role: lookup
include: true
roles:
role: lookup
include: true # Note: roles might have company_id in some apps
# Junction tables (explicit classification)
permission_role:
role: junction
role_user:
role: junction
taggables:
role: junction
# Large tables with special handling
activity_log:
role: tenant
# Future: limit rows
# max_rows: 1000
# Self-referential tables (auto-detected, but can be explicit)
comments:
self_fk: parent_id
folders:
self_fk: parent_id
# Global table handling
include_global: lookups # none | lookups | all
```
---
## Cross-Dialect Considerations
### MySQL / MariaDB
**Characteristics:**
- Backtick-quoted identifiers
- Backslash escapes in strings
- Multi-row INSERT: `INSERT INTO t VALUES (1,'a'), (2,'b'), ...`
**Value parsing:**
```rust
fn parse_mysql_insert_values(stmt: &[u8]) -> Vec<Row> {
// Handle: strings with \' and \\, NULL, numbers, hex
// Split by ), (
// Watch for strings containing ), (
}
```
### PostgreSQL
**Characteristics:**
- Double-quote identifiers (optional)
- `''` for escaped quotes in strings
- COPY FROM stdin format (more common than INSERT)
- Schema-qualified names: `public.users`
**COPY parsing:**
```rust
fn parse_postgres_copy_data(data: &[u8]) -> Vec<Row> {
// Tab-separated values
// \N for NULL
// \\ for backslash, \t for tab, \n for newline
// Lines terminated by \n
// Block terminated by \.\n
}
```
### SQLite
**Characteristics:**
- Double-quote or backtick identifiers
- `''` for escaped quotes
- INSERT statements (no COPY)
- Simpler overall
---
## Memory Model
### PK Tracking
```rust
/// Efficient PK value representation
pub enum PkValue {
Int(i64), // Most common: auto-increment IDs
BigInt(i128), // Large IDs, ULIDs as integers
Text(Box<str>), // UUIDs, string PKs
}
/// Composite PK support
pub type PkTuple = SmallVec<[PkValue; 2]>;
/// Per-table set of selected PKs
pub type PkSet = AHashSet<PkTuple>;
/// Sharding state
pub struct ShardState {
/// Only track PKs for tables that are FK targets
selected_pks: HashMap<String, PkSet>,
/// Statistics
stats: ShardStats,
}
```
### Memory Budget
| Small tenant | 10,000 | ~400 KB |
| Medium tenant | 100,000 | ~4 MB |
| Large tenant | 1,000,000 | ~40 MB |
| Very large | 10,000,000 | ~400 MB |
**Guardrails:**
```rust
const DEFAULT_MAX_SELECTED_ROWS: usize = 10_000_000;
if total_selected > config.max_selected_rows {
return Err(anyhow!(
"Selected {} rows exceeds limit of {}. \
This tenant may own most of the data. \
Use --max-selected-rows to increase limit.",
total_selected, config.max_selected_rows
));
}
```
---
## Performance
### Targets
| Schema analysis | <5s for 100+ tables | Single pass, regex-based |
| Internal split | 200+ MB/s | Reuses existing split logic |
| Tenant selection | 100-150 MB/s | Row parsing overhead |
| Memory | O(selected_rows) | Not O(file_size) |
### Optimizations
1. **Single parse of original dump**: Schema + split in one pass
2. **Per-table temp files**: Process in any order (topological)
3. **Lazy value decoding**: Only parse PK/FK/tenant columns
4. **Fast hashing**: `ahash` for PK sets
5. **Streaming writes**: Buffer and flush incrementally
### Disk Usage
Temp files use approximately the same space as the original dump:
- 10 GB dump → ~10-12 GB temp files
- Cleaned up automatically after completion
---
## Edge Cases
### 1. Tenant Owns Most Data
If a single tenant owns 90% of the data:
- PK sets will be large
- Memory usage increases proportionally
- Consider: Is sharding the right approach? Maybe just filter on import.
### 2. Orphaned FK References (Original Data Issues)
```
⚠️ Warning: 15 rows in 'orders' reference non-existent user_id values
This indicates orphaned data in the original dump.
```
With `--strict-fk`: Command fails.
### 3. Tables Without Primary Key
```
⚠️ Warning: Table 'legacy_audit' has no PRIMARY KEY
Cannot track FK references for tenant selection.
Including all rows (may include data from other tenants).
```
### 4. Circular FK References (Rare)
Detected via SCC analysis. Handled by processing entire cycle as a unit with relaxed inclusion rules.
### 5. Very Deep FK Chains
```
a → b → c → d → e → f → company_id
```
Handled correctly by topological ordering. Table `a` is processed last, after all intermediate tables have their PKs recorded.
---
## Output Format
### Single Tenant Output
```sql
-- Sharded from: production.sql
-- Date: 2025-12-20 14:30:00
-- Tenant column: company_id
-- Tenant value: 5
-- Dialect: mysql
--
-- Statistics:
-- Tables processed: 52
-- Tables with data: 38
-- Tables skipped: 7 (system)
-- Global tables included: 3 (lookups)
-- Total rows: 45,678 (from 1,234,567 original)
-- FK integrity: OK
-- Schema
CREATE TABLE `companies` (...);
CREATE TABLE `users` (...);
...
-- Data
INSERT INTO `companies` VALUES (5, 'Acme Corp', ...);
INSERT INTO `users` VALUES
(12, 'alice@acme.com', 5),
(15, 'bob@acme.com', 5),
...;
INSERT INTO `orders` VALUES ...;
```
### Multi-Tenant Output (Directory)
```
shards/
├── tenant_1.sql
├── tenant_2.sql
├── tenant_5.sql
├── tenant_8.sql
└── global/
├── schema.sql
├── permissions.sql
└── roles.sql
```
---
## Example Workflows
### 1. Extract Single Tenant for Debugging
```bash
# Customer reported issue, need their data locally
sql-splitter shard prod-backup.sql.gz -o customer_acme.sql \
--tenant-value 42 \
--include-global lookups \
-p # Show progress
```
### 2. Create Tenant-Specific Test Fixtures
```bash
# Each test tenant gets own fixture
for tenant_id in 1 2 3; do
sql-splitter shard dump.sql -o "fixtures/tenant_${tenant_id}.sql" \
--tenant-value $tenant_id \
--include-global none \
--exclude activity_log,telescope_entries
done
```
### 3. Migrate Tenant to Separate Database
```bash
# Extract tenant data for migration
sql-splitter shard prod.sql.gz -o migrate_tenant_5.sql \
--tenant-value 5 \
--include-schema \
--include-global all \
--strict-fk # Ensure data integrity
# Import to new database
mysql -u root new_tenant_db < migrate_tenant_5.sql
```
### 4. Hash-Based Parallel Loading
```bash
# Split data into 8 chunks for parallel import
sql-splitter shard dump.sql -o chunks/ \
--table companies \
--hash \
--partitions 8
# Load in parallel
for i in {0..7}; do
mysql -u root db < "chunks/partition_${i}.sql" &
done
wait
```
---
## Implementation Effort
| CLI argument parsing | 2 hours | — |
| Schema graph (FK parsing) | 6 hours | `sample --preserve-relations` |
| Table classification | 3 hours | `sample` |
| Internal split (temp files) | 2 hours | — |
| PK tracking data structures | 2 hours | `sample` |
| Row parsing (INSERT/COPY) | 4 hours | `sample` |
| Tenant selection logic | 4 hours | — |
| Junction table handling | 2 hours | `sample` |
| Self-FK closure | 3 hours | `sample` |
| Output generation | 2 hours | — |
| Config file parsing | 2 hours | `sample` |
| Progress reporting | 1 hour | Existing |
| Multi-tenant sharding | 4 hours | — |
| Hash-based sharding | 3 hours | — |
| Testing | 8 hours | — |
| **Total** | **~48 hours** | |
**Note:** ~20 hours of this is shared with `sample --preserve-relations`, so implementing both features reduces total effort.
---
## Future Enhancements
1. **Streaming multi-tenant**: Extract all tenants in single pass
2. **Key range partitioning**: `--ranges "2024-01-01,2024-06-01,2025-01-01"`
3. **Tenant mapping file**: JSON/YAML mapping tenant_id → shard
4. **Parallel processing**: Process independent tables concurrently
5. **Incremental sharding**: Add new data to existing shards
6. **Compression support**: Output to `.sql.gz`
---
## Risks & Mitigations
| Tenant owns most data | Memory limit + clear error message |
| Complex FK patterns | Conservative cycle handling + warnings |
| Dialect variations | Comprehensive test fixtures for each dialect |
| Orphaned data in source | Detect and warn; `--strict-fk` for enforcement |
| Very large dumps | Temp file approach keeps memory bounded |
---
## Related
- [Sample Feature](SAMPLE_FEATURE.md) — Shares FK graph and PK tracking
- [Split Command](../../src/cmd/split.rs) — Base streaming architecture
- [Merge Feature](MERGE_FEATURE.md) — Combine split files
- [Additional Ideas](ADDITIONAL_IDEAS.md) — Hash sharding, key ranges