duckquill
A Rust text2sql engine for agents, packaged as duckquill and published on crates.io as v0.2.5.
The primary user is an agent running in Codex CLI or Claude Code that opens this repo, reads AGENTS.md plus .codex/skills/duckquill-ops/SKILL.md, and hands off to .codex/skills/object-storage-ops/SKILL.md when the workflow becomes S3/MinIO-specific.
Core idea
raw file -> parquet -> schema -> query -> benchmark
Why this shape:
- Parquet lets DuckDB prune columns and push predicates
- schema grounds SQL before execution
- parquet_selective stays the default path
- full_download exists as a measurable fallback, not the main happy path
Architecture and product intent:
Who this is for
Codex CLI / Claude Code
Use this repo when the agent needs:
- one Cargo package
- one binary entrypoint
- local CLI loops faster than standing up a bigger service
- repo-local guidance for parquet/object-storage workflows
Preferred agent loop:
- cargo run -- convert
- cargo run -- schema
- cargo run -- query
- cargo run -- serve only when the HTTP contract itself needs verification
- cargo run -- benchmark before changing guidance around full_download
Example prompts:
Use the duckquill-ops skill. Convert ./testdata/owid-covid-latest.csv to parquet, inspect schema, then run a parquet_selective query. If the workflow moves to S3 or MinIO, hand off to object-storage-ops.
Agent startup contract
For agent operators, the startup contract is:
- read AGENTS.md first
- load .codex/skills/duckquill-ops/SKILL.md for the broad install + operator workflow
- if the workflow touches S3/MinIO/object storage, also load .codex/skills/object-storage-ops/SKILL.md
- install with:
  - cargo install duckquill outside the repo
  - cargo install --path . inside a checkout
  - cargo install --git https://github.com/sigridjineth/text2sql duckquill only for unreleased repo changes
- if duckquill is not on PATH yet, export PATH="$HOME/.cargo/bin:$PATH" and verify with duckquill --help
- then run convert -> schema -> query
- when answering an end user, cite the exact query/command you used and the result rows or aggregate that support the answer
- start serve only when the HTTP contract itself needs verification
- run benchmark before recommending full_download as a fallback
Humans
Humans can use the binary directly, but the docs are optimized for agent operators first.
What the binary exposes
CLI:
- serve
- convert
- schema
- query
- benchmark
HTTP:
- GET /
- GET /health
- POST /schema
- POST /query
- POST /convert
- POST /benchmark
Query modes:
- parquet_selective: the default and the recommended path for larger datasets plus remote/object-storage parquet
- hybrid: for local parquet files at or below 10 MiB, run the materialize-first full_download path; otherwise resolve to parquet_selective
- full_download: materialize the full dataset before querying; intended for debugging or as a measured fallback
Install
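duckquill is published on crates.io:
    cargo install duckquill
If you need unreleased changes from the current GitHub repository instead of the published crate:
    cargo install --git https://github.com/sigridjineth/text2sql duckquill
For local development from this checkout:
    cargo install --path .
If duckquill is not found after install, add Cargo's bin directory to your shell first:
    export PATH="$HOME/.cargo/bin:$PATH"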
Quickstart
1) Convert a real fixture to parquet
If a CSV is not UTF-8, pass an explicit encoding label:
For Korean public-data workflows, the same flag is where you would pass labels such as cp949 or euc-kr.
2) Inspect schema first
3) Query from the CLI
Query is intentionally sandboxed to read-only analytical SQL over the registered dataset alias:
- allowed: SELECT ... FROM dataset ...
- allowed: WITH ... SELECT ... FROM dataset ...
- rejected: COPY, INSTALL, LOAD, ATTACH, CREATE, ALTER, DROP, INSERT, UPDATE, DELETE, SET
- rejected: direct file/network access helpers such as read_parquet(...), parquet_scan(...), read_csv(...), read_csv_auto(...), csv_scan(...), read_ndjson(...), read_json(...), read_json_auto(...), read_json_objects(...), delta_scan(...), iceberg_scan(...), read_blob(...), read_text(...), glob(...), and parquet metadata helpers inside user SQL
- rejected: multi-statement SQL
- defense-in-depth: query connections also disable extension auto-install/auto-load, lock configuration changes, and restrict external file access to the configured dataset location only
- operator warning: validator + connection hardening reduce blast radius, but this is still not a full SQL sandbox; keep the CLI/HTTP query surfaces behind trusted local or single-tenant boundaries
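The allow/reject rules above can be illustrated with a small token-level check. This is a minimal sketch, not duckquill's actual validator: the names validate, Rejection, REJECTED_KEYWORDS, and REJECTED_FUNCTIONS are hypothetical, the function list is only a subset of the helpers named above, and the check is deliberately conservative (it also rejects a semicolon or a listed keyword that appears inside a string literal or as a column name).

```rust
/// Statement keywords listed as rejected in this README.
const REJECTED_KEYWORDS: &[&str] = &[
    "COPY", "INSTALL", "LOAD", "ATTACH", "CREATE", "ALTER",
    "DROP", "INSERT", "UPDATE", "DELETE", "SET",
];

/// A subset of the file/network helpers listed as rejected (illustrative only).
const REJECTED_FUNCTIONS: &[&str] = &[
    "read_parquet", "parquet_scan", "read_csv", "read_csv_auto",
    "csv_scan", "read_json", "read_blob", "read_text", "glob",
];

/// Hypothetical rejection reasons for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum Rejection {
    Empty,
    MultiStatement,
    NotSelect,
    Keyword(String),
    Function(String),
}

/// Accepts a single SELECT/WITH statement that mentions no rejected token.
pub fn validate(sql: &str) -> Result<(), Rejection> {
    // Allow trailing semicolons, but nothing after them.
    let trimmed = sql.trim().trim_end_matches(';').trim();
    if trimmed.is_empty() {
        return Err(Rejection::Empty);
    }
    if trimmed.contains(';') {
        return Err(Rejection::MultiStatement);
    }

    // Split into identifier-like tokens so OFFSET is not mistaken for SET.
    let tokens: Vec<String> = trimmed
        .split(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
        .filter(|t| !t.is_empty())
        .map(|t| t.to_ascii_uppercase())
        .collect();

    match tokens.first().map(String::as_str) {
        Some("SELECT") | Some("WITH") => {}
        _ => return Err(Rejection::NotSelect),
    }

    for token in &tokens {
        if REJECTED_KEYWORDS.contains(&token.as_str()) {
            return Err(Rejection::Keyword(token.clone()));
        }
        if REJECTED_FUNCTIONS.iter().any(|f| f.eq_ignore_ascii_case(token)) {
            return Err(Rejection::Function(token.to_ascii_lowercase()));
        }
    }
    Ok(())
}
```

Calling validate("SELECT location FROM dataset") returns Ok(()), while validate("SELECT 1; DROP TABLE dataset") returns Err(Rejection::MultiStatement). A keyword filter like this only narrows the input; the connection hardening and trust boundary described above still apply.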
4) Benchmark the current approach
See BENCHMARK.md for the measured local results and the Hugging Face comparison case.
HTTP examples
Start the server
GET / returns the configured service name, crate version, and a small capability list for quick operator sanity checks.
Inspect schema via HTTP
/schema now returns typed columns plus a small preview_rows sample to help agents ground SQL generation with real row context.
Query via HTTP
Use hybrid when you want the binary to choose the small-file fast path automatically:
- local parquet files <= 10 MiB resolve to full_download
- larger local parquet files and remote/object-storage datasets resolve to parquet_selective
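The hybrid resolution rule above fits in a few lines of Rust. This is a sketch under stated assumptions, not the crate's real types: QueryMode, DatasetLocation, HYBRID_MAX_LOCAL_BYTES, and resolve_mode are hypothetical names, and the only facts taken from this README are the 10 MiB threshold and the resolution targets.

```rust
/// Query modes named in this README.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QueryMode {
    ParquetSelective,
    Hybrid,
    FullDownload,
}

/// Hypothetical description of where a dataset lives.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DatasetLocation {
    /// A local parquet file with a known size in bytes.
    LocalParquet { size_bytes: u64 },
    /// Any remote or object-storage dataset (S3, MinIO, ...).
    Remote,
}

/// The 10 MiB threshold quoted above.
pub const HYBRID_MAX_LOCAL_BYTES: u64 = 10 * 1024 * 1024;

/// Resolves the requested mode to the mode that would actually run.
pub fn resolve_mode(requested: QueryMode, location: DatasetLocation) -> QueryMode {
    match (requested, location) {
        // Small local parquet files take the materialize-first path.
        (QueryMode::Hybrid, DatasetLocation::LocalParquet { size_bytes })
            if size_bytes <= HYBRID_MAX_LOCAL_BYTES =>
        {
            QueryMode::FullDownload
        }
        // Larger local files and all remote datasets stay parquet-first.
        (QueryMode::Hybrid, _) => QueryMode::ParquetSelective,
        // Explicit modes are passed through unchanged.
        (explicit, _) => explicit,
    }
}
```

Under these assumptions, resolve_mode(QueryMode::Hybrid, DatasetLocation::Remote) yields QueryMode::ParquetSelective, which is why hybrid remains safe for mixed local/remote workflows.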
The same read-only query contract applies to HTTP /query; it is not designed for arbitrary DuckDB session control or ad-hoc file/network access. Treat it as a local developer / agent workflow surface, not a publicly exposed multi-tenant SQL service.
Object storage
The binary supports S3 / MinIO directly. hybrid is safe for mixed local/remote workflows because remote and object-storage datasets still resolve to parquet_selective.
Convert to S3:
TEXT2SQL_S3_REGION=ap-northeast-2 \
TEXT2SQL_S3_ENDPOINT=http://127.0.0.1:9000 \
TEXT2SQL_S3_ACCESS_KEY_ID=minioadmin \
TEXT2SQL_S3_SECRET_ACCESS_KEY=minioadmin \
TEXT2SQL_S3_ALLOW_HTTP=true \
For agent-facing workflow guidance, use:
- .codex/skills/duckquill-ops/SKILL.md for broad install + convert/schema/query/benchmark usage
- .codex/skills/object-storage-ops/SKILL.md for S3/MinIO-specific setup, remote Parquet, and credential debugging
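For operators debugging credentials, it can help to see the TEXT2SQL_S3_* variables from the Convert to S3 example gathered into one value. The sketch below is illustrative only: S3Settings and s3_settings_from are hypothetical names, and which variables are required versus optional, as well as the parsing of TEXT2SQL_S3_ALLOW_HTTP as a case-insensitive "true", are assumptions rather than documented behavior of the binary.

```rust
use std::collections::HashMap;

/// Hypothetical settings struct; fields mirror the TEXT2SQL_S3_* variables.
#[derive(Debug, PartialEq, Eq)]
pub struct S3Settings {
    pub region: String,
    pub endpoint: Option<String>,
    pub access_key_id: String,
    pub secret_access_key: String,
    pub allow_http: bool,
}

/// Builds settings from a map of environment variables.
/// Assumption: region and both keys are required, endpoint is optional.
pub fn s3_settings_from(vars: &HashMap<String, String>) -> Result<S3Settings, String> {
    let required = |name: &str| {
        vars.get(name)
            .cloned()
            .ok_or_else(|| format!("missing {name}"))
    };
    Ok(S3Settings {
        region: required("TEXT2SQL_S3_REGION")?,
        endpoint: vars.get("TEXT2SQL_S3_ENDPOINT").cloned(),
        access_key_id: required("TEXT2SQL_S3_ACCESS_KEY_ID")?,
        secret_access_key: required("TEXT2SQL_S3_SECRET_ACCESS_KEY")?,
        // Assumption: only a case-insensitive "true" enables plain HTTP.
        allow_http: vars
            .get("TEXT2SQL_S3_ALLOW_HTTP")
            .map(|v| v.eq_ignore_ascii_case("true"))
            .unwrap_or(false),
    })
}
```

To check the current shell, pass std::env::vars().collect::<HashMap<_, _>>() to s3_settings_from; a missing variable is then reported by name instead of failing later inside an object-storage request.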
Current guidance
- keep parquet_selective as the default for larger datasets and object storage
- use hybrid when you want the binary to auto-pick full_download for local parquet files <= 10 MiB and otherwise stay parquet-first
- use full_download for debugging or when you intentionally want full materialization first
- quoted local parquet globs already work for multi-file queries, e.g. --dataset './tmp/shard-*.parquet'
- globbed parquet inputs do not currently add a filename/source column automatically; for file-by-file comparisons, either query each file separately or prepare Parquet with an explicit source column upstream
- blank numeric CSV cells already aggregate as NULL after convert; TRY_CAST is not required on that path
- use --csv-encoding <label> when the CSV is not UTF-8
- spreadsheet ingestion currently passes XLS/XLSX cells through CSV text before Parquet inference; mixed numeric/text spreadsheet columns can therefore land as Utf8, so clean the column in the sheet first or export to CSV and convert with --csv-encoding <label> when typing matters
- when agents answer users from schema/query runs, include the exact command or SQL used plus the result rows/aggregates that support the answer
- benchmark before changing the guidance around full_download
- load .codex/skills/duckquill-ops/SKILL.md for general operator guidance and hand off to .codex/skills/object-storage-ops/SKILL.md when the workflow becomes object-storage-specific
- do not claim selective reads lose matching data unless tests prove it
Real fixtures in this repo
- testdata/owid-covid-latest.csv
- testdata/owid-covid-latest.json
- testdata/canada-wastewater-aggregate.csv
- testdata/keyfoods_0708.xlsx
Acceptance snapshots
Convert

Schema

Query

Benchmark

Verification
Current status
What is proven locally:
- convert works
- schema works
- query works
- benchmark works
- a real Hugging Face parquet comparison case is documented in BENCHMARK.md
What is still environment-blocked on this machine:
- writable remote MinIO/S3 acceptance for live end-to-end object-storage benchmarking