# csvdb
> **Note:** This is beta software. The API and file format may change. Use with caution in production.
A deterministic, git-friendly way to store relational data as CSV and convert it to SQLite or DuckDB when you need fast queries.
## csvdb Directory Layout
A `.csvdb` directory contains:
```
mydb.csvdb/
  csvdb.toml    # format version, export settings
  schema.sql    # CREATE TABLE, CREATE INDEX, CREATE VIEW
  users.csv     # one file per table
  orders.csv
```
The schema defines the structure. The CSVs hold the data. `csvdb.toml` records the format version and the settings used to produce the export.
## Why csvdb

The CSV format works with standard tools:
- Edit with any text editor or spreadsheet
- Diff and merge with git
- Process with awk, pandas, Excel
The SQLite/DuckDB formats provide fast access:
- Indexed lookups without scanning entire files
- Views for complex joins and computed columns
- Full SQL query support
- Single-file distribution
csvdb lets you store data as CSV (human-readable, git-friendly) and convert to SQLite or DuckDB when you need query performance.
## Installation
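csvdb is a Rust binary, so a standard Cargo build should work. This is a sketch; the crate may not be published, so installing from a checkout is the safe assumption:

```bash
# From a checkout of the repository
cargo install --path .

# Or build a release binary without installing
cargo build --release
```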
## Quick Start
```bash
# Convert an existing SQLite database to csvdb
csvdb to-csvdb mydb.sqlite

# Edit data
vi mydb.csvdb/users.csv

# Rebuild database
csvdb to-sqlite mydb.csvdb
```
## Commands
### `init` — Create csvdb from raw CSV files
Creates `raw_csvs.csvdb/` by:

- Inferring schema from CSV headers and data types
- Detecting primary keys (columns named `id` or `<table>_id`)
- Copying CSV files
Options:

- `--no-pk-detection`: Disable automatic primary key detection
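A sketch of a typical run, assuming `init` takes the directory of raw CSVs as its positional argument (the directory name is illustrative):

```bash
# Infer schema, detect primary keys, and create raw_csvs.csvdb/
csvdb init raw_csvs/

# Same, but without primary key detection
csvdb init raw_csvs/ --no-pk-detection
```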
### `to-csvdb` — Export database to csvdb
```bash
# From SQLite
csvdb to-csvdb mydb.sqlite

# From DuckDB
csvdb to-csvdb mydb.duckdb
```
Creates `mydb.csvdb/` containing:

- `schema.sql`: table definitions, indexes, views
- `*.csv`: one file per table, sorted by primary key
Supports multiple input formats:

- SQLite (`.sqlite`, `.sqlite3`, `.db`)
- DuckDB (`.duckdb`)
Options:

- `-o, --output <dir>`: Custom output directory
- `--order <mode>`: Row ordering mode (see below)
- `--null-mode <mode>`: NULL representation in CSV (see below)
- `--pipe`: Write to temp directory, output only the path (for piping)
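Combining the options, a hedged example (exact argument order assumed from the descriptions above):

```bash
# Export to a custom directory with explicit ordering and NULL handling
csvdb to-csvdb mydb.sqlite -o exported.csvdb --order pk --null-mode marker
```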
### `to-sqlite` — Build SQLite database from csvdb

Creates `mydb.sqlite` from the schema and CSV files.
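For example (invocation assumed from the description above):

```bash
csvdb to-sqlite mydb.csvdb   # writes mydb.sqlite
```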
### `to-duckdb` — Build DuckDB database from csvdb

Creates `mydb.duckdb` from the schema and CSV files.
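For example (invocation assumed from the description above):

```bash
csvdb to-duckdb mydb.csvdb   # writes mydb.duckdb
```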
### `checksum` — Verify data integrity
Computes a SHA-256 checksum of the database content. The checksum is:
- Format-independent: Same data produces same hash regardless of format
- Deterministic: Same data always produces same hash
- Content-based: Includes schema structure and all row data
Use checksums to verify roundtrip conversions:
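A sketch of such a check; the exact output of `checksum` is an assumption, but the hashes should match if the conversion was lossless:

```bash
csvdb checksum mydb.sqlite    # prints a SHA-256 hash
csvdb to-csvdb mydb.sqlite
csvdb to-duckdb mydb.csvdb
csvdb checksum mydb.duckdb    # same hash = roundtrip preserved the data
```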
## Primary Key Requirement
By default, every table must have an explicit primary key. Rows are sorted by primary key when exporting to CSV. By enforcing a stable row order, csvdb guarantees that identical data always produces identical CSV files, making git diffs meaningful and noise-free.
### Tables Without Primary Keys
For tables without a primary key (event logs, append-only tables), use the `--order` option:

```bash
# Order by all columns (deterministic but may have issues with duplicates)
csvdb to-csvdb mydb.sqlite --order all-columns

# Add a synthetic __csvdb_rowid column (best for event/log tables)
csvdb to-csvdb mydb.sqlite --order add-synthetic-key
```
### Order Modes

| Mode | Description | Best For |
|---|---|---|
| `pk` (default) | Order by primary key | Tables with natural keys |
| `all-columns` | Order by all columns | Reference tables without PK |
| `add-synthetic-key` | Add `__csvdb_rowid` column | Event logs, append-only data |
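For illustration, an event table exported with `add-synthetic-key` might look like this (all column names except `__csvdb_rowid` are made up):

```csv
"__csvdb_rowid","event","ts"
"1","login","2024-01-01T08:00:00Z"
"2","logout","2024-01-01T08:15:00Z"
```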
## NULL Handling
CSV has no native NULL concept. csvdb uses explicit conventions to preserve NULLs across database roundtrips.
By default, CSV files use `\N` (the PostgreSQL convention) to represent NULL values:
"id","name","value"
"1","\N","42" # name is NULL
"2","","42" # name is empty string
"3","hello","\N" # value is NULL
This preserves the distinction between NULL and empty string through roundtrips:
- SQLite roundtrip: NULL and empty string are fully preserved
- DuckDB roundtrip: NULL is preserved, but empty strings may become NULL due to a limitation in the Rust DuckDB driver
### `--null-mode`
| Mode | NULL representation | Lossless? | Use case |
|---|---|---|---|
| `marker` (default) | `\N` | Yes | Roundtrip-safe, distinguishes NULL from empty string |
| `empty` | empty string | No | Simpler CSV, but cannot distinguish NULL from `""` |
| `literal` | `NULL` | No | Human-readable, but cannot distinguish NULL from the string `"NULL"` |
Lossy modes print a warning to stderr. Use `--pipe` to suppress warnings.
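For example, a hedged invocation of a lossy mode:

```bash
# Warns on stderr; NULL and empty string both become empty fields
csvdb to-csvdb mydb.sqlite --null-mode empty
```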
## CSV Dialect
csvdb produces a strict, deterministic CSV dialect:
| Property | Value |
|---|---|
| Encoding | UTF-8 |
| Delimiter | `,` (comma) |
| Quote character | `"` (double quote) |
| Quoting | Always — every field is quoted, including headers |
| Quote escaping | Doubled (`""`) per RFC 4180 |
| Record terminator | `\n` (LF), not CRLF |
| Header row | Always present as the first row |
| Row ordering | Sorted by primary key (deterministic) |
| NULL representation | Configurable via `--null-mode` (see above) |
This is mostly RFC 4180 compliant, with one deliberate deviation: line endings use LF instead of CRLF. This produces cleaner git diffs and avoids mixed line-ending issues on Unix systems.
Newlines embedded within field values are preserved as-is inside quoted fields. The Rust csv crate handles quoting and escaping automatically.
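For example, a value containing a newline spans multiple physical lines inside its quotes:

```csv
"id","note"
"1","first line
second line"
```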
## Workflows

### Git-Tracked Data
Store data in git, rebuild databases as needed:
```bash
# Initial setup: export existing database
csvdb to-csvdb mydb.sqlite
git add mydb.csvdb/

# Daily workflow: edit CSVs, commit, rebuild
vi mydb.csvdb/users.csv
git commit -am "Update users"
csvdb to-sqlite mydb.csvdb
```
### Deploy to Production
Use csvdb as the source of truth. Track schema and data in git, export to SQLite for deployment:
```bash
# Define your schema and seed data in csvdb format
# (schema.sql plus one CSV per table; see the layout above)

# Edit data directly as CSV
vi mydb.csvdb/users.csv

# Commit to git — schema and data are versioned together
git add mydb.csvdb/ && git commit -m "Update seed data"

# Build SQLite for deployment
csvdb to-sqlite mydb.csvdb
```
Changes go through normal code review. `git diff` shows exactly which rows changed. Rollback is `git revert`.
### Data Review via Pull Request
Treat data changes like code changes:
```bash
# Edit the CSV
vi mydb.csvdb/users.csv

# Open PR — reviewers see the exact row-level diff
git commit -am "Update users" && git push
```
Because CSVs are sorted by primary key, the diff contains only actual changes — no noise from row reordering.
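A one-row edit therefore shows up as a minimal diff (values illustrative):

```diff
 "id","name","email"
-"42","alice","alice@old.example"
+"42","alice","alice@new.example"
```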
### Piping Commands

Use `--pipe` for one-liner conversions:
```bash
# SQLite → DuckDB via pipe
# (--pipe prints only the temp path; xargs passes it on — glue command assumed)
csvdb to-csvdb mydb.sqlite --pipe | xargs csvdb to-duckdb
```
The `--pipe` flag:
- Writes to system temp directory
- Outputs only the path (no "Created:" prefix)
- Uses forward slashes for cross-platform compatibility
### Database Migration
Convert between database formats:
```bash
# SQLite to DuckDB
csvdb to-csvdb mydb.sqlite
csvdb to-duckdb mydb.csvdb

# DuckDB to SQLite
csvdb to-csvdb analytics.duckdb
csvdb to-sqlite analytics.csvdb

# Verify no data loss
csvdb checksum mydb.sqlite
csvdb checksum mydb.duckdb
# Checksums match = data preserved
```
### Diff and Review Changes
Use git to review data changes:
```bash
# See what changed
git diff -- mydb.csvdb/

# See changes to specific table
git diff -- mydb.csvdb/users.csv

# Blame: who changed what
git blame mydb.csvdb/users.csv
```
### CI/CD Integration
Verify data integrity in CI:
```bash
#!/bin/bash
set -e

# Rebuild from csvdb source
csvdb to-sqlite mydb.csvdb

# Verify checksum matches expected
EXPECTED="a1b2c3d4..."
ACTUAL=$(csvdb checksum mydb.sqlite)
[ "$ACTUAL" = "$EXPECTED" ] || { echo "checksum mismatch" >&2; exit 1; }
```
## Project Structure
```
src/
  main.rs              # CLI (clap)
  lib.rs
  commands/
    init.rs            # CSV files -> csvdb (schema inference)
    to_csv.rs          # SQLite/DuckDB -> csvdb
    to_sqlite.rs       # csvdb -> SQLite
    to_duckdb.rs       # csvdb -> DuckDB
    checksum.rs        # Format-independent checksums
  core/
    schema.rs          # Parse/emit schema.sql, type normalization
    table.rs           # Row operations, PK handling
    csv.rs             # Deterministic CSV I/O
tests/
  functional/
    conftest.py        # Pytest fixtures
    test_commands.py   # Functional tests
pyproject.toml         # Python dependencies (uv)
```
## Development

### Testing
```bash
# Rust unit tests
cargo test

# Python functional tests (151 tests)
# Cross-platform (avoids .venv collision)
uv run pytest tests/functional/
```
## License
MIT