csvdb
Version-control your relational data like code.
Note: This is beta software. The API and file format may change. Use with caution in production.
SQLite and DuckDB files are binary — git can't diff them, reviewers can't read them, and merges are impossible. csvdb converts your database into a directory of plain-text CSV files + schema.sql, fully diffable and round-trip lossless. Convert back to SQLite, DuckDB, or Parquet when you need query performance.
# git diff myapp.csvdb/rates.csv
"date","rate"
"2024-01-01","4.50"
-"2024-04-01","4.25"
+"2024-04-01","3.75"
+"2024-07-01","3.50"
Every change is a readable, reviewable line in a PR. No binary blobs, no "file changed" with no context.
Use cases:
- Seed data and test fixtures committed alongside code
- Config and lookup tables reviewed in PRs before deploy
- CI integrity checks: csvdb checksum data.csvdb/ | grep $EXPECTED
- Migrating between SQLite, DuckDB, and Parquet without ETL scripts
- Manual edits in a spreadsheet or text editor, rebuild with one command
- Audit trail: git blame on any CSV row shows who changed it and when
Directory Layouts
A .csvdb directory contains:
mydb.csvdb/
csvdb.toml # format version, export settings
schema.sql # CREATE TABLE, CREATE INDEX, CREATE VIEW
users.csv # one file per table
orders.csv
A .parquetdb directory has the same structure with Parquet files instead of CSVs:
mydb.parquetdb/
csvdb.toml # format version, export settings
schema.sql # CREATE TABLE, CREATE INDEX, CREATE VIEW
users.parquet # one file per table
orders.parquet
The schema defines the structure. The data files hold the data. csvdb.toml records the format version and the settings used to produce the export.
Why csvdb
CSV format works with standard tools:
- Edit with any text editor or spreadsheet
- Diff and merge with git
- Process with awk, pandas, Excel
SQLite/DuckDB format provides fast access:
- Indexed lookups without scanning entire files
- Views for complex joins and computed columns
- Full SQL query support
- Single-file distribution
Parquet format provides columnar storage:
- Efficient compression and encoding
- Fast analytical queries
- Wide ecosystem support (Spark, pandas, DuckDB, etc.)
- Per-table .parquet files in a .parquetdb directory
csvdb lets you store data as CSV (human-readable, git-friendly) and convert to SQLite, DuckDB, or Parquet when you need query performance.
Installation
# Rust (via cargo)
# Python library (import csvdb)
# Standalone binary (via pip/pipx/uvx)
Quick Start
# Convert an existing SQLite database to csvdb
# Edit data
# Rebuild database
# Or export to Parquet
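A minimal sketch of that flow (paths are illustrative, and passing the source as a positional argument is an assumption based on the command reference below):
csvdb to-csvdb myapp.sqlite        # writes myapp.csvdb/
vi myapp.csvdb/users.csv           # edit data as plain CSV
csvdb to-sqlite myapp.csvdb/       # rebuild myapp.sqlite
csvdb to-parquetdb myapp.csvdb/    # or export to Parquet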
Commands
init — Create csvdb from raw CSV files
# From a directory of CSV files
# From a single CSV file
Creates a .csvdb directory by:
- Inferring schema from CSV headers and data types
- Detecting primary keys (columns named id or <table>_id)
- Detecting foreign keys (columns like user_id referencing users.id)
- Copying CSV files
Options:
- --no-pk-detection - Disable automatic primary key detection
- --no-fk-detection - Disable automatic foreign key detection
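For example, a sketch using the sample CSVs shipped in examples/ (positional source path is an assumption):
csvdb init examples/raw-csvs/                      # infer schema, detect keys
csvdb init examples/raw-csvs/ --no-fk-detection    # skip foreign key detection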
to-csvdb — Export database to csvdb
# From SQLite
# From DuckDB
# From Parquet
Creates mydb.csvdb/ containing:
- schema.sql - table definitions, indexes, views
- *.csv - one file per table, sorted by primary key
Supports multiple input formats:
- SQLite (.sqlite, .sqlite3, .db)
- DuckDB (.duckdb)
- parquetdb (.parquetdb directory)
- Parquet (.parquet single file)
Options:
- -o, --output <dir> - Custom output directory
- --order <mode> - Row ordering mode (see below)
- --null-mode <mode> - NULL representation in CSV (see below)
- --pipe - Write to temp directory, output only path (for piping)
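An illustrative invocation combining these options (app.sqlite is a placeholder; positional input is assumed):
csvdb to-csvdb app.sqlite --order pk --null-mode marker -o app.csvdb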
to-sqlite — Build SQLite database
Creates mydb.sqlite from a csvdb or parquetdb directory.
Options:
- --force - Overwrite existing output file
- --tables <list> - Only include these tables (comma-separated)
- --exclude <list> - Exclude these tables (comma-separated)
to-duckdb — Build DuckDB database
Creates mydb.duckdb from a csvdb or parquetdb directory.
Options:
- --force - Overwrite existing output file
- --tables <list> - Only include these tables (comma-separated)
- --exclude <list> - Exclude these tables (comma-separated)
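Sketches for both targets (mydb.csvdb/ and the table name scratch are placeholders):
csvdb to-sqlite mydb.csvdb/ --exclude scratch --force
csvdb to-duckdb mydb.csvdb/ --tables users,orders --force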
to-parquetdb — Convert any format to Parquet
# From SQLite
# From DuckDB
# From csvdb
# From a single Parquet file
Creates mydb.parquetdb/ containing:
- schema.sql - table definitions, indexes, views
- csvdb.toml - format version and export settings
- *.parquet - one Parquet file per table
Supports multiple input formats:
- SQLite (.sqlite, .sqlite3, .db)
- DuckDB (.duckdb)
- csvdb (.csvdb directory)
- parquetdb (.parquetdb directory)
- Parquet (.parquet single file)
Options:
- -o, --output <dir> - Custom output directory
- --order <mode> - Row ordering mode (see below)
- --null-mode <mode> - NULL representation (see below)
- --pipe - Write to temp directory, output only path (for piping)
- --force - Overwrite existing output directory
- --tables <list> - Only include these tables (comma-separated)
- --exclude <list> - Exclude these tables (comma-separated)
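For example (a sketch; app.sqlite is a placeholder input passed positionally):
csvdb to-parquetdb app.sqlite --tables users,orders -o app.parquetdb --force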
checksum — Verify data integrity
Computes a SHA-256 checksum of the database content. The checksum is:
- Format-independent: Same data produces same hash regardless of format
- Deterministic: Same data always produces same hash
- Content-based: Includes schema structure and all row data
Use checksums to verify roundtrip conversions:
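For example, a sketch that assumes checksum accepts any supported input (directory or database file):
csvdb to-sqlite app.csvdb/
csvdb checksum app.csvdb/    # prints a SHA-256 hash
csvdb checksum app.sqlite    # same hash if the roundtrip was lossless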
Primary Key Requirement
By default, every table must have an explicit primary key. Rows are sorted by primary key when exporting to CSV. By enforcing a stable row order, csvdb guarantees that identical data always produces identical CSV files, making git diffs meaningful and noise-free.
Tables Without Primary Keys
For tables without a primary key (event logs, append-only tables), use the --order option:
# Order by all columns (deterministic but may have issues with duplicates)
# Add a synthetic __csvdb_rowid column (best for event/log tables)
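For example (events.sqlite is a placeholder; positional input is assumed):
csvdb to-csvdb events.sqlite --order all-columns
csvdb to-csvdb events.sqlite --order add-synthetic-key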
Order Modes
| Mode | Description | Best For |
|---|---|---|
| pk (default) | Order by primary key | Tables with natural keys |
| all-columns | Order by all columns | Reference tables without PK |
| add-synthetic-key | Add __csvdb_rowid column | Event logs, append-only data |
NULL Handling
CSV has no native NULL concept. csvdb uses explicit conventions to preserve NULLs across database roundtrips.
By default, CSV files use \N (PostgreSQL convention) to represent NULL values:
"id","name","value"
"1","\N","42" # name is NULL
"2","","42" # name is empty string
"3","hello","\N" # value is NULL
This preserves the distinction between NULL and empty string through roundtrips:
- SQLite roundtrip: NULL and empty string are fully preserved
- DuckDB roundtrip: NULL is preserved; empty strings may become NULL due to a limitation in the Rust DuckDB driver
--null-mode
| Mode | NULL representation | Lossless? | Use case |
|---|---|---|---|
| marker (default) | \N | Yes | Roundtrip-safe, distinguishes NULL from empty string |
| empty | empty string | No | Simpler CSV, but cannot distinguish NULL from "" |
| literal | NULL | No | Human-readable, but cannot distinguish NULL from the string "NULL" |
Lossy modes print a warning to stderr. Use --pipe to suppress warnings.
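For example, exporting with a lossy mode (app.sqlite is a placeholder):
csvdb to-csvdb app.sqlite --null-mode empty    # lossy; prints a warning to stderr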
CSV Dialect
csvdb produces a strict, deterministic CSV dialect:
| Property | Value |
|---|---|
| Encoding | UTF-8 |
| Delimiter | , (comma) |
| Quote character | " (double quote) |
| Quoting | Always — every field is quoted, including headers |
| Quote escaping | Doubled ("") per RFC 4180 |
| Record terminator | \n (LF), not CRLF |
| Header row | Always present as the first row |
| Row ordering | Sorted by primary key (deterministic) |
| NULL representation | Configurable via --null-mode (see above) |
This is mostly RFC 4180 compliant, with one deliberate deviation: line endings use LF instead of CRLF. This produces cleaner git diffs and avoids mixed-endings issues on Unix systems.
Newlines embedded within field values are preserved as-is inside quoted fields. The Rust csv crate handles quoting and escaping automatically.
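For example, a field containing quotes and a newline is written like this (illustrative rows, not taken from the examples directory):
"id","note"
"1","She said ""hello""
across two lines"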
See FORMAT.md for the full normative format specification.
Gotchas
Things that may surprise you on day one:
- String-based sorting. PK sort is lexicographic on strings, not numeric: "10" sorts before "2". If you need numeric order, use a zero-padded string or an INTEGER primary key (integers sort correctly because shorter digit strings come first and same-length digit strings sort numerically).
- Schema inference is limited. csvdb init only infers three types: INTEGER, REAL, TEXT. It won't detect dates, booleans, or blobs. Edit schema.sql after init if you need richer types.
- PK detection stops tracking at 100k values. During init, uniqueness tracking for primary key candidates stops after 100,000 values. If the column was unique up to that point, it is still used as the PK.
- Float precision in checksums. Values are normalized to 10 decimal places for checksumming. 42.0 normalizes to 42 (integer-valued floats become integers). Very small precision differences across databases are absorbed.
- DuckDB empty string limitation. Empty strings in TEXT columns may become NULL when round-tripping through DuckDB due to a Rust driver limitation.
- Blob values are hex strings in CSV. BLOB data is stored as lowercase hex (e.g. cafe). It roundtrips correctly through SQLite and DuckDB.
- No duplicate PK validation during CSV read. Duplicate primary keys are not caught when reading CSV files; they cause an error at database INSERT time.
- DuckDB indexes are not exported. Index metadata is not available from DuckDB sources. Indexes defined in a csvdb schema.sql are preserved when converting between csvdb and SQLite, but not when the source is DuckDB.
- Views are not dependency-ordered. Views are written in alphabetical order. If view A references view B, you may need to manually reorder them in schema.sql.
- __csvdb_rowid is reserved. The column name __csvdb_rowid is used by the add-synthetic-key order mode. Don't use it in your own schemas.
Examples
The examples/ directory contains ready-to-use examples:
- examples/store.csvdb/ — A hand-written csvdb directory with two tables, an index, a view, and NULL values
- examples/raw-csvs/ — Plain CSV files for demonstrating csvdb init
See examples/README.md for usage instructions.
Workflows
Git-Tracked Data
Store data in git, rebuild databases as needed:
# Initial setup: export existing database
# Daily workflow: edit CSVs, commit, rebuild
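A sketch of that loop (app.sqlite and the table file are placeholders; positional arguments are assumed):
csvdb to-csvdb app.sqlite                 # initial export
git add app.csvdb/ && git commit -m "track data as csvdb"
vi app.csvdb/users.csv                    # edit
git commit -am "update users"
csvdb to-sqlite app.csvdb/ --force        # rebuild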
Deploy to Production
Use csvdb as the source of truth. Track schema and data in git, export to SQLite for deployment:
# Define your schema and seed data in csvdb format
# Edit data directly as CSV
# Commit to git — schema and data are versioned together
# Build SQLite for deployment
Changes go through normal code review. git diff shows exactly which rows changed. Rollback is git revert.
Data Review via Pull Request
Treat data changes like code changes:
# Edit the CSV
# Open PR — reviewers see the exact row-level diff
Because CSVs are sorted by primary key, the diff contains only actual changes — no noise from row reordering.
Piping Commands
Use --pipe for one-liner conversions:
# SQLite → DuckDB via pipe
|
# SQLite → Parquet via pipe
|
The --pipe flag:
- Writes to system temp directory
- Outputs only the path (no "Created:" prefix)
- Uses forward slashes for cross-platform compatibility
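One way to compose this (a sketch; it assumes the downstream command accepts the printed path as its positional input):
csvdb to-csvdb app.sqlite --pipe | xargs csvdb to-duckdb
csvdb to-csvdb app.sqlite --pipe | xargs csvdb to-parquetdb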
Database Migration
Convert between database formats:
# SQLite to DuckDB
# DuckDB to SQLite
# SQLite to Parquet
# Parquet to SQLite
# Verify no data loss
# Checksums match = data preserved
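For example (app.sqlite is a placeholder; the intermediate csvdb directory carries the data between engines):
csvdb to-csvdb app.sqlite          # SQLite -> csvdb
csvdb to-duckdb app.csvdb/         # csvdb -> DuckDB
csvdb to-parquetdb app.csvdb/      # csvdb -> Parquet
csvdb checksum app.sqlite          # hashes should match if nothing was lost
csvdb checksum app.duckdb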
Diff and Review Changes
Use git to review data changes:
# See what changed
# See changes to specific table
# Blame: who changed what
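Concretely, with a tracked mydb.csvdb/ directory (paths illustrative):
git diff -- mydb.csvdb/                  # see what changed
git diff -- mydb.csvdb/users.csv         # changes to one table
git blame mydb.csvdb/users.csv           # who changed each row, and when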
CI/CD Integration
Verify data integrity in CI:
#!/bin/bash
# Rebuild from csvdb source
csvdb to-sqlite data.csvdb/ --force

# Verify checksum matches expected
EXPECTED="a1b2c3d4..."
ACTUAL=$(csvdb checksum data.csvdb/)
[ "$ACTUAL" = "$EXPECTED" ] || exit 1
Python Bindings
csvdb provides native Python bindings via PyO3, giving you direct access to all csvdb functions without subprocess overhead.
Install
API
# Convert between formats
# Incremental export (only re-exports changed tables)
=
# result: {"path": "...", "added": [...], "updated": [...], "unchanged": [...], "removed": [...]}
# Checksum (format-independent, deterministic)
=
# SQL queries (read-only, returns list of dicts)
=
# Diff two databases
=
# Validate structure
=
# Initialize csvdb from raw CSV files
=
# Selective export
Development
Perl Bindings
csvdb provides Perl bindings via a C FFI shared library and FFI::Platypus.
Setup
# Build the shared library
# Install dependencies (macOS)
LDFLAGS="-L/opt/homebrew/opt/libffi/lib" \
CPPFLAGS="-I/opt/homebrew/opt/libffi/include" \
# Install dependencies (Linux)
Running Examples
API
;
(), ;
my $csvdb_path = (input => , force => 1);
my $sqlite_path = (input => , force => 1);
my $duckdb_path = (input => , force => 1);
my $hash = (input => );
my $csv = (path => , query => );
my $rc = (left => , right => );
my $rc = (input => );
Running Tests
Project Structure
csvdb/ # Core library + CLI binary
src/
main.rs # CLI (clap)
lib.rs
commands/
init.rs # CSV files -> csvdb (schema inference)
to_csv.rs # any format -> csvdb
to_sqlite.rs # any format -> SQLite
to_duckdb.rs # any format -> DuckDB
to_parquetdb.rs # any format -> parquetdb (Parquet)
checksum.rs # Format-independent checksums
validate.rs # Structural integrity checks
diff.rs # Compare two databases
sql.rs # Read-only SQL queries
core/
schema.rs # Parse/emit schema.sql, type normalization
table.rs # Row operations, PK handling
csv.rs # Deterministic CSV I/O
input.rs # Input format detection
csvdb-python/ # Python bindings (PyO3)
src/lib.rs
examples/
basic_usage.py
advanced_usage.py
csvdb-ffi/ # C FFI for Perl and other languages
src/lib.rs
perl/ # Perl module (FFI::Platypus)
lib/Csvdb.pm
examples/basic_usage.pl
tests/functional/ # Python functional tests
conftest.py
test_commands.py
test_performance.py
pyproject.toml
Development
Testing
# Rust unit tests
# Python functional tests (189 tests)
# Cross-platform (avoids .venv collision)
License
MIT