# Sample Command Design
**Status**: ✅ Fully Implemented (v1.5.0)
**Date**: 2025-12-20
**Updated**: 2025-12-20 (Full implementation complete)
## Overview
The `sample` command creates reduced datasets from large SQL dumps for development, testing, and CI environments. It extracts a representative subset of data while optionally preserving referential integrity through dependency-aware FK chain resolution.
## Command Interface
```bash
# Sample 10% of each table
sql-splitter sample dump.sql -o dev.sql --percent 10
# Fixed row count per table
sql-splitter sample dump.sql -o dev.sql --rows 1000
# Preserve foreign key relationships (dependency-aware)
sql-splitter sample dump.sql -o dev.sql --rows 500 --preserve-relations
# Start from specific "root" tables and follow FK chains
sql-splitter sample dump.sql -o dev.sql --percent 5 --preserve-relations --root-tables orders
# Sample specific tables only
sql-splitter sample dump.sql -o dev.sql --percent 20 --tables users,posts,comments
# Exclude large tables
sql-splitter sample dump.sql -o dev.sql --percent 10 --exclude cache,logs,sessions
# Include global/lookup tables in full
sql-splitter sample dump.sql -o dev.sql --rows 100 --preserve-relations --include-global lookups
# Reproducible sampling with seed
sql-splitter sample dump.sql -o dev.sql --percent 5 --seed 42
# Different strategies per table
sql-splitter sample dump.sql -o dev.sql --config sample.yaml
# Output statistics only (dry run)
sql-splitter sample dump.sql --percent 10 --dry-run
```
## CLI Options
| `-o, --output` | Output SQL file path | stdout |
| `--percent` | Sample percentage (1-100) | — |
| `--rows` | Fixed row count per table | — |
| `--preserve-relations` | Maintain foreign key integrity via FK chain resolution | false |
| `--root-tables` | Explicit root tables for sampling (comma-separated) | auto-detect |
| `--include-global` | How to handle global/lookup tables: `none`, `lookups`, `all` | lookups |
| `--tables` | Only sample these tables (comma-separated) | all |
| `--exclude` | Exclude these tables (comma-separated) | none |
| `--seed` | Random seed for reproducibility | random |
| `--config` | YAML config for per-table settings | — |
| `-d, --dialect` | SQL dialect: `mysql`, `postgres`, `sqlite` | auto-detect |
| `-p, --progress` | Show progress bar | false |
| `--dry-run` | Show statistics without writing | false |
| `--include-schema` | Include CREATE TABLE statements | true |
| `--strict-fk` | Fail if any FK integrity issues detected | false |
## Sampling Modes
### 1. Percentage-Based (`--percent`)
Sample N% of rows from each table randomly.
```bash
sql-splitter sample dump.sql -o dev.sql --percent 10
```
**Behavior:**
- Each table gets ~10% of its rows
- Minimum 1 row per table (if table has data)
- Random selection using reservoir sampling
### 2. Fixed Row Count (`--rows`)
Sample exactly N rows from each table.
```bash
sql-splitter sample dump.sql -o dev.sql --rows 1000
```
**Behavior:**
- Each table gets up to 1000 rows
- Tables with fewer rows get all rows
- Random selection
### 3. Dependency-Aware Sampling (`--preserve-relations`)
**This is the key differentiating feature.** Most tools sample tables independently; maintaining a consistent relational slice across FK chains is rare and valuable.
```bash
sql-splitter sample dump.sql -o dev.sql --rows 100 --preserve-relations
```
**What it does:**
- Parses CREATE TABLE statements to build FK dependency graph
- Samples from "root" tables first
- Automatically includes all referenced parent rows
- Recursively follows FK chains to ensure referential integrity
- Result: A consistent subset with no broken foreign key references
**Example FK chain resolution:**
```
orders (100 rows sampled)
└── users (referenced user_id values automatically included)
└── products (referenced product_id values automatically included)
└── categories (referenced category_id values automatically included)
```
#### Starting from Specific Root Tables
```bash
# Sample 5% starting from orders, include all referenced data
sql-splitter sample dump.sql -o subset.sql --percent 5 --preserve-relations --root-tables orders
```
This is useful when you want to extract a coherent slice of data centered around specific entities (e.g., "give me 100 orders with all their related data").
### 4. Config-Based (Advanced)
Per-table sampling strategies via YAML config.
```yaml
# sample.yaml
default:
percent: 10
# Table classification for --preserve-relations
classification:
global:
- migrations
- permissions
- roles
system:
- failed_jobs
- job_batches
- cache
lookup:
- countries
- states
- currencies
tables:
users:
rows: 500
posts:
percent: 5
comments:
percent: 2
sessions:
skip: true
audit_logs:
skip: true
products:
rows: 100
# Future: conditional sampling
# where: "active = 1"
```
```bash
sql-splitter sample dump.sql -o dev.sql --config sample.yaml
```
---
## Dependency-Aware Sampling: Deep Dive
### The Problem
In multi-tenant and relational databases, tables have complex FK relationships:
```
invoice_items → invoices → customers → companies
↘ products → categories
```
Simple random sampling breaks these relationships:
- Sampled `invoice_items` may reference `invoices` that weren't sampled
- Import fails with FK constraint violations
- Manual fixing is tedious and error-prone
### The Solution: FK Chain Resolution
Build a dependency graph from schema DDL and use it to:
1. Determine sampling order (topological sort)
2. Track which PK values have been selected
3. Automatically include referenced rows
### Algorithm
```
Phase 0: Schema Analysis (streaming)
├── Parse CREATE TABLE statements
├── Extract column names, types, PK definitions
├── Parse FOREIGN KEY constraints (inline and ALTER TABLE)
├── Build FK dependency graph
└── Classify tables: root, dependent, junction, lookup, system
Phase 1: Internal Split (streaming)
├── Split dump into per-table temp files
└── Separate schema.sql from data files
Phase 2: Dependency-Ordered Selection
├── Process tables in FK-distance order from roots
├── For each table:
│ ├── Stream through table's data file
│ ├── Apply sampling logic (percent/rows)
│ ├── Check FK membership for dependent tables
│ ├── Record selected PK values
│ └── Emit selected rows to output
└── Concatenate schema + selected data
```
### Table Classification
| **Root** | Has no FK dependencies OR explicitly specified | Sample directly, seed for FK chain |
| **Dependent** | References root/dependent tables via FK | Include if FK points to selected row |
| **Junction** | Only FKs + maybe PK (pivot tables) | Include if any FK hits selected row |
| **Lookup** | Global reference data (countries, currencies) | Include fully with `--include-global lookups` |
| **System** | Framework tables (migrations, jobs) | Skip by default |
### PK Tracking Data Structure
```rust
/// Efficient primary key value representation
enum PkValue {
Int(i64),
BigInt(i128),
Text(Box<str>),
Uuid([u8; 16]),
}
/// Support for composite primary keys
type PkTuple = SmallVec<[PkValue; 2]>;
/// Per-table set of selected primary keys
type PkSet = AHashSet<PkTuple>;
/// State maintained during sampling
struct SamplingState {
/// Selected PKs per table (only for tables that are FK targets)
selected_pks: HashMap<String, PkSet>,
/// Statistics
rows_selected: HashMap<String, u64>,
rows_scanned: HashMap<String, u64>,
}
```
**Memory considerations:**
- Only track PKs for tables that are FK targets
- Most PKs are integers (~8-16 bytes each)
- 1 million selected rows ≈ 20-40 MB
- Configurable limit with `--max-selected-rows`
---
## Cross-Dialect FK Parsing
### MySQL / MariaDB
```sql
CREATE TABLE `invoice_items` (
`id` INT PRIMARY KEY AUTO_INCREMENT,
`invoice_id` INT NOT NULL,
`product_id` INT NOT NULL,
CONSTRAINT `fk_invoice` FOREIGN KEY (`invoice_id`) REFERENCES `invoices` (`id`),
CONSTRAINT `fk_product` FOREIGN KEY (`product_id`) REFERENCES `products` (`id`)
) ENGINE=InnoDB;
-- Or via ALTER TABLE
ALTER TABLE `invoice_items`
ADD CONSTRAINT `fk_invoice` FOREIGN KEY (`invoice_id`) REFERENCES `invoices` (`id`);
```
**Parsing notes:**
- Backtick-quoted identifiers
- Inline or ALTER TABLE constraints
- Named constraints with `CONSTRAINT name`
- ON DELETE/UPDATE actions (parse but ignore for sampling)
### PostgreSQL
```sql
CREATE TABLE invoice_items (
id SERIAL PRIMARY KEY,
invoice_id INTEGER NOT NULL REFERENCES invoices(id),
product_id INTEGER NOT NULL,
CONSTRAINT fk_product FOREIGN KEY (product_id) REFERENCES products(id)
);
-- Or via ALTER TABLE
ALTER TABLE invoice_items
ADD CONSTRAINT fk_invoice FOREIGN KEY (invoice_id) REFERENCES invoices(id);
```
**Parsing notes:**
- Double-quote identifiers (optional)
- Inline `REFERENCES table(col)` shorthand
- Schema-qualified names: `schema.table`
- Extract table name from `schema.table` → `table`
### SQLite
```sql
CREATE TABLE invoice_items (
id INTEGER PRIMARY KEY AUTOINCREMENT,
invoice_id INTEGER NOT NULL REFERENCES invoices(id),
product_id INTEGER NOT NULL,
FOREIGN KEY (product_id) REFERENCES products(id)
);
```
**Parsing notes:**
- Double-quote or backtick identifiers
- No named constraints (anonymous)
- No ALTER TABLE ADD CONSTRAINT (FKs must be inline)
- Simpler parsing overall
### FK Regex Patterns
```rust
// Inline REFERENCES (PostgreSQL, SQLite)
});
// CONSTRAINT ... FOREIGN KEY (all dialects)
});
// ALTER TABLE ... ADD CONSTRAINT (MySQL, PostgreSQL)
});
```
---
## Handling Edge Cases
### 1. Circular FK References
**Detection:** Use Tarjan's algorithm to find strongly connected components (SCCs).
**Handling:**
- If SCC size = 1 with self-FK: Special self-referential handling
- If SCC size > 1: Process all tables in cycle as a unit
```rust
// For cycles, use conservative inclusion
fn handle_cycle(tables: &[String], state: &mut SamplingState) {
// Include row if:
// 1. It matches sampling criteria, OR
// 2. Any FK within the cycle points to a selected row
// Warn user about cycle
}
```
**Example:** `comments.parent_id → comments.id`
For self-referential tables:
1. Sample base rows normally
2. Compute ancestor closure (include parents of selected rows)
3. May require multiple passes over the table's temp file
### 2. Tables Without Primary Key
```
⚠️ Warning: Table 'legacy_data' has no PRIMARY KEY
Cannot track FK references; sampling without relation preservation
```
- Skip PK tracking for this table
- Sample normally (random selection)
- Warn user about potential integrity issues
### 3. Composite Primary Keys
```rust
// Track multi-column PKs as tuples
let pk_tuple: PkTuple = smallvec![
PkValue::Int(company_id),
PkValue::Int(order_id),
];
selected_pks.insert(pk_tuple);
```
### 4. Junction Tables
Tables like `permission_role`, `user_roles`, `taggables`:
```sql
CREATE TABLE permission_role (
permission_id INT REFERENCES permissions(id),
role_id INT REFERENCES roles(id),
PRIMARY KEY (permission_id, role_id)
);
```
**Handling:**
- Detect: Table with only FK columns + composite PK
- Include row if ANY FK points to a selected row
- Don't require ALL FKs to match (would be too restrictive)
### 5. Very Small Tables
```rust
fn calculate_sample_size(total_rows: usize, percent: f64) -> usize {
let target = (total_rows as f64 * percent / 100.0).round() as usize;
// Always include at least 1 row if table has data
target.max(1).min(total_rows)
}
```
### 6. Empty Tables
- Include CREATE TABLE in schema
- Skip INSERT generation
- Note in statistics: "0 rows (empty table)"
### 7. Orphaned FK References
During sampling, detect when included rows reference non-included parents:
```rust
struct OrphanStats {
table: String,
fk_name: String,
orphan_count: u64,
}
// After processing, report:
// ⚠️ 15 rows in 'orders' reference non-included 'users' (original data may have orphans)
```
With `--strict-fk`: Fail the command if any orphans detected.
---
## INSERT/COPY Row Parsing
To check FK values and extract PKs, we need to parse individual rows from INSERT and COPY statements.
### MySQL INSERT Parsing
```sql
INSERT INTO `users` (`id`, `name`, `company_id`) VALUES
(1, 'Alice', 5),
(2, 'Bob', 3),
(3, 'Carol', 5);
```
**Parser approach:**
1. Extract column list (if present) to map indices
2. Parse each `(...)` value tuple
3. Handle: strings with escapes, NULLs, numeric types
4. Only decode columns needed for PK/FK checks
```rust
fn parse_insert_values(stmt: &[u8], schema: &TableSchema) -> impl Iterator<Item = Row> {
// Find VALUES keyword
// Iterate through (val1, val2, ...) groups
// Yield decoded rows (only pk/fk columns materialized)
}
```
### PostgreSQL COPY Parsing
```sql
COPY public.users (id, name, company_id) FROM stdin;
1 Alice 5
2 Bob 3
3 Carol 5
\.
```
**Parser approach:**
1. Extract column list from COPY header
2. Split data block by newlines
3. Split each line by `\t`
4. Handle: `\N` (NULL), backslash escapes
```rust
fn parse_copy_data(data: &[u8], schema: &TableSchema) -> impl Iterator<Item = Row> {
// Split by newlines (excluding final \.)
// Split each line by \t
// Decode values, handling \N and escapes
}
```
---
## Implementation Architecture
### Directory Structure
```
src/
├── cmd/
│ └── sample.rs # CLI handler
├── sampler/
│ ├── mod.rs # Public API
│ ├── config.rs # Config parsing (YAML)
│ ├── strategy.rs # Sampling strategies (percent, rows)
│ ├── reservoir.rs # Reservoir sampling algorithm
│ └── writer.rs # Output generation
├── schema/
│ ├── mod.rs # Schema graph types
│ ├── parser.rs # DDL parsing for FK extraction
│ ├── graph.rs # Dependency graph + topological sort
│ └── classifier.rs # Table classification logic
└── row/
├── mod.rs # Row parsing types
├── insert.rs # INSERT value parsing
└── copy.rs # COPY data parsing
```
### Key Types
```rust
/// Schema information for a table
pub struct TableSchema {
pub name: String,
pub columns: Vec<Column>,
pub pk_indices: Vec<usize>,
pub fk_edges: Vec<ForeignKeyEdge>,
pub has_tenant_column: Option<usize>, // For shard feature
}
pub struct Column {
pub name: String,
pub index: usize,
pub data_type: Option<String>,
}
pub struct ForeignKeyEdge {
pub constraint_name: Option<String>,
pub from_columns: Vec<usize>, // Indices in source table
pub to_table: String,
pub to_columns: Vec<String>, // Column names in target
}
/// Complete schema graph
pub struct SchemaGraph {
pub tables: HashMap<String, TableSchema>,
pub fk_graph: HashMap<String, Vec<String>>, // child -> parents
pub reverse_graph: HashMap<String, Vec<String>>, // parent -> children
}
impl SchemaGraph {
pub fn from_ddl(statements: &[&[u8]], dialect: SqlDialect) -> Self;
pub fn topological_order(&self) -> Vec<&str>;
pub fn find_cycles(&self) -> Vec<Vec<&str>>;
pub fn root_tables(&self) -> Vec<&str>;
pub fn classify_table(&self, name: &str) -> TableClassification;
}
```
### Reservoir Sampling
```rust
use rand::{Rng, SeedableRng};
use rand::rngs::StdRng;
pub struct ReservoirSampler<T> {
sample_size: usize,
reservoir: Vec<T>,
count: usize,
rng: StdRng,
}
impl<T> ReservoirSampler<T> {
pub fn new(sample_size: usize, seed: Option<u64>) -> Self {
let rng = match seed {
Some(s) => StdRng::seed_from_u64(s),
None => StdRng::from_entropy(),
};
Self {
sample_size,
reservoir: Vec::with_capacity(sample_size),
count: 0,
rng,
}
}
pub fn add(&mut self, item: T) {
self.count += 1;
if self.reservoir.len() < self.sample_size {
self.reservoir.push(item);
} else {
// Algorithm R: Replace with probability k/n
let j = self.rng.gen_range(0..self.count);
if j < self.sample_size {
self.reservoir[j] = item;
}
}
}
pub fn total_seen(&self) -> usize {
self.count
}
pub fn finish(self) -> Vec<T> {
self.reservoir
}
}
```
---
## Output Format
### Header Comment
```sql
-- Sampled from: production.sql
-- Date: 2025-12-20 12:00:00
-- Mode: 10% per table, preserve-relations
-- Seed: 42
-- Dialect: mysql
--
-- Statistics:
-- Tables sampled: 45
-- Tables skipped: 7 (system/excluded)
-- Total rows: 15,234 (from 152,340 original, 10.0%)
-- FK integrity: OK (no orphans)
```
### Schema Section
```sql
-- Schema statements
CREATE TABLE `users` (
`id` INT PRIMARY KEY AUTO_INCREMENT,
`email` VARCHAR(255),
`company_id` INT,
CONSTRAINT `fk_company` FOREIGN KEY (`company_id`) REFERENCES `companies` (`id`)
);
CREATE INDEX `idx_users_email` ON `users` (`email`);
```
### Data Section (Compact INSERTs)
```sql
-- Data: users (523 rows)
INSERT INTO `users` VALUES
(1, 'alice@example.com', 5),
(5, 'bob@example.com', 5),
(12, 'carol@example.com', 5);
-- Data: orders (1,847 rows)
INSERT INTO `orders` VALUES
...
```
---
## Performance
### Targets
| Throughput | 150-200 MB/s | Slower than split due to parsing overhead |
| Memory | O(selected_rows) | PK tracking for FK targets only |
| Startup | <100ms | Schema analysis phase |
### Optimizations
1. **Streaming architecture**: Never load full file into memory
2. **Lazy row parsing**: Only decode PK/FK columns, skip others
3. **Fast hashing**: `ahash` for PK sets
4. **Temp file I/O**: Per-table files enable random access order
5. **Pre-compiled regexes**: Static initialization via `once_cell`
### Memory Budget
```rust
const DEFAULT_MAX_SELECTED_ROWS: usize = 10_000_000; // ~200MB for int PKs
const PK_MEMORY_WARNING_THRESHOLD: usize = 5_000_000;
if selected_pks.len() > PK_MEMORY_WARNING_THRESHOLD {
eprintln!("⚠️ Large selection: {} PKs tracked, memory usage may be high",
selected_pks.len());
}
```
---
## Testing Strategy
### Unit Tests
- Reservoir sampling: uniform distribution, deterministic with seed
- FK regex parsing: all three dialects, edge cases
- Dependency graph: topological sort, cycle detection
- Row parsing: INSERT values, COPY data, escapes, NULLs
### Integration Tests
```rust
#[test]
fn test_sample_preserves_fk_integrity() {
// Create dump with FK relationships
// Sample with --preserve-relations
// Verify: all FK references in output exist
}
#[test]
fn test_sample_reproducible_with_seed() {
// Sample twice with same seed
// Verify: identical output
}
#[test]
fn test_sample_handles_cycles() {
// Create dump with self-referential table
// Sample with --preserve-relations
// Verify: parent rows included for selected children
}
```
### Property-Based Tests
- Output is always valid SQL for the dialect
- Row counts within expected bounds (±5% for reservoir sampling)
- No orphaned FK references when `--preserve-relations`
---
## Example Workflows
### 1. Create Dev Database Fixture
```bash
# Production: 50GB, Dev: ~500MB (1%)
sql-splitter sample prod.sql.gz -o dev-fixture.sql \
--percent 1 \
--preserve-relations \
--seed 12345
```
### 2. CI Test Data
```bash
# Small, fast, reproducible, FK-safe
sql-splitter sample dump.sql -o ci-data.sql \
--rows 100 \
--preserve-relations \
--seed 42
```
### 3. Demo Dataset Starting from Orders
```bash
# 50 orders with all related data (customers, products, etc.)
sql-splitter sample prod.sql -o demo.sql \
--rows 50 \
--preserve-relations \
--root-tables orders \
--include-global lookups \
--exclude audit_logs,sessions,cache
```
### 4. Advanced Config
```yaml
# team-sample.yaml
default:
percent: 5
classification:
system:
- failed_jobs
- job_batches
- telescope_entries
lookup:
- countries
- currencies
- permissions
tables:
users:
rows: 100 # Always need some users
analytics_events:
percent: 0.1 # Huge table, minimal sample
reference_data:
percent: 100 # Keep all lookup data
temp_tables:
skip: true
```
```bash
sql-splitter sample prod.sql -o team-dev.sql --config team-sample.yaml --preserve-relations
```
---
## Estimated Effort
| CLI and config parsing | 2 hours |
| Reservoir sampling | 2 hours |
| Basic percentage/row sampling | 3 hours |
| Schema graph + FK parsing (shared with shard) | 6 hours |
| Table classification logic | 2 hours |
| INSERT row parsing (MySQL) | 3 hours |
| COPY data parsing (PostgreSQL) | 2 hours |
| Dependency-ordered processing | 4 hours |
| PK tracking + FK membership checks | 3 hours |
| Cycle handling (self-FK, SCCs) | 3 hours |
| Output generation | 2 hours |
| YAML config support | 2 hours |
| Progress reporting | 1 hour |
| Testing (comprehensive) | 8 hours |
| **Total** | **~43 hours** |
**Note:** The schema graph and FK parsing components are shared with the `shard` command, reducing overall implementation effort when both features are built.
---
## Future Enhancements
1. **Conditional sampling**: `where: "created_at > '2024-01-01'"`
2. **Stratified sampling**: Ensure distribution across column values
3. **Weighted sampling**: Priority to recent/important rows
4. **Incremental sampling**: Add to existing sample
5. **Schema-only mode**: Just CREATE statements, no data
6. **Parallel processing**: Sample multiple tables concurrently (independent tables)
---
## Related
- [Split Command](../../src/cmd/split.rs) — Base streaming architecture
- [Shard Feature](SHARD_FEATURE.md) — Tenant-based extraction (shares FK graph)
- [Merge Feature](MERGE_FEATURE.md) — Combine split files
- [Redact Feature](IDEAS.md#6-redact-command--anonymize-sensitive-data) — Data anonymization