remnant
Summary
A command line tool for creating random sub-samples of data. remnant can sample rows from CSV files or from all tables in a PostgreSQL database, outputting the results as CSV or Parquet files. When sampling a database, it also generates a SQL rebuild script to recreate the schema and load the sampled data.
Installation
cargo build --release
Usage
remnant has two subcommands: csv for sampling CSV files and pg for sampling PostgreSQL databases.
CSV Sampling
Sample a percentage of rows from a CSV file:
remnant csv --file <FILE> --outfile <OUTFILE> [OPTIONS]
Options:
-f, --file <FILE> Input CSV file (required)
-o, --outfile <OUTFILE> Output CSV file (required)
-p, --percent <PERCENT> Percentage of rows to sample [default: 10]
-m, --max-records <MAX_RECORDS> Records for schema inference [default: 10000]
-s, --seed <SEED> Random seed for reproducibility
Example:
PostgreSQL Database Sampling
Sample a percentage of rows from every table in a PostgreSQL database:
remnant pg --connection-string <CONNECTION_STRING> --outdir <OUTDIR> [OPTIONS]
Options:
-c, --connection-string <CONNECTION_STRING> PostgreSQL connection string (required)
-o, --outdir <OUTDIR> Output directory for data files (required)
-p, --percent <PERCENT> Percentage of rows to sample [default: 10]
-s, --seed <SEED> Random seed for reproducibility
-f, --format <FORMAT> Output format: csv or parquet [default: csv]
Example:
# Sample 10% of each table as CSV files
# Sample 5% as Parquet files with a fixed seed
Output: The output directory will contain one data file per table (e.g., users.csv, orders.csv) plus a rebuild.sql script.
Rebuild Script
When using the pg subcommand, remnant automatically generates a rebuild.sql file in the output directory. This script contains everything needed to recreate the sampled database:
CREATE TABLEstatements with original column types, constraints, and defaults\COPYcommands to load data from the CSV files- Foreign key constraints (added after data load to avoid ordering issues)
- Index creation statements
To rebuild the database from the sampled data:
Note: Parquet format output includes comments in place of \COPY commands, since PostgreSQL does not natively support Parquet import.
Library Usage
remnant can also be used as a Rust library. Add it to your Cargo.toml:
[]
= "0.1"
CSV sampling
PostgreSQL sampling
use OutputFormat;
async
Core sampling function
If you already have a Polars DataFrame, you can sample it directly:
use *;
use sample_df;
let df = new.unwrap;
let sampled = sample_df.unwrap;
For full API documentation, run cargo doc --open.
Project Structure
src/
main.rs -- CLI argument parsing and subcommand dispatch
lib.rs -- Library crate root
sampling.rs -- Shared sampling logic (sample_df)
csv.rs -- CSV file subsetting
pg.rs -- PostgreSQL subsetting, schema introspection, and SQL generation
Running Tests
Unit Tests
cargo test
Unit tests cover sampling logic, CSV round-trips, type mapping, SQL generation, seed conversion, row count calculation, and file output (CSV/Parquet). No database required.
Integration Tests (PostgreSQL)
Integration tests require a running PostgreSQL instance. Set TEST_DATABASE_URL
and enable the integration feature:
These tests create and drop temporary tables (prefixed with remnant_test_)
automatically. Use a dedicated test database to avoid interfering with other data.