duckquill
A Rust text2sql engine for agents, packaged as duckquill.
The primary user is an agent running in Codex CLI or Claude Code that opens this repo, reads AGENTS.md plus .codex/skills/object-storage-ops/SKILL.md, and drives one binary through a schema-first workflow.
Core idea
raw file -> parquet -> schema -> query -> benchmark
Why this shape:
- Parquet lets DuckDB prune columns and push predicates
- schema grounds SQL before execution
- parquet_selective stays the default path
- full_download exists as a measurable fallback, not the main happy path
Architecture and product intent:
Who this is for
Codex CLI / Claude Code
Use this repo when the agent needs:
- one Cargo package
- one binary entrypoint
- local CLI loops faster than standing up a bigger service
- repo-local guidance for parquet/object-storage workflows
Preferred agent loop:
- cargo run -- convert
- cargo run -- schema
- cargo run -- query
- cargo run -- serve only when the HTTP contract itself needs verification
- cargo run -- benchmark before changing guidance around full_download
Example prompt:
Use the object-storage-ops skill. Convert ./testdata/owid-covid-latest.csv to parquet, inspect schema, then run a parquet_selective query.
Humans
Humans can use the binary directly, but the docs are optimized for agent operators first.
What the binary exposes
CLI:
- serve
- convert
- schema
- query
- benchmark
HTTP:
- GET /health
- POST /schema
- POST /query
- POST /convert
- POST /benchmark
Query modes:
- hybrid: for local parquet files at or below 10 MiB, run the materialize-first full_download path; otherwise resolve to parquet_selective
- parquet_selective
- full_download
Install
duckquill is published on crates.io:
If you need unreleased changes from the current GitHub repository instead of the published crate:
For local development from this checkout:
Quickstart
1) Convert a real fixture to parquet
If a CSV is not UTF-8, pass an explicit encoding label:
For Korean public-data workflows, the same flag is where you would pass labels such as cp949 or euc-kr.
2) Inspect schema first
3) Query from the CLI
Query is intentionally sandboxed to read-only analytical SQL over the registered dataset alias:
- allowed: SELECT ... FROM dataset ...
- allowed: WITH ... SELECT ... FROM dataset ...
- rejected: COPY, INSTALL, LOAD, ATTACH, CREATE, ALTER, DROP, INSERT, UPDATE, DELETE, SET
- rejected: direct file/network access helpers such as read_parquet(...), parquet_scan(...), read_csv(...), csv_scan(...), read_json(...), read_blob(...), read_text(...), and parquet metadata helpers inside user SQL
- rejected: multi-statement SQL
- defense-in-depth: query connections also disable extension auto-install/auto-load, lock configuration changes, and restrict external file access to the configured dataset location only
- operator warning: validator + connection hardening reduce blast radius, but this is still not a full SQL sandbox; keep the CLI/HTTP query surfaces behind trusted local or single-tenant boundaries
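The allow/reject rules above can be illustrated with a small token-based check. This is only a minimal sketch under stated assumptions, not duckquill's actual validator: the function name validate_query, the constant names, and the error strings are hypothetical, and the tokenizer is deliberately conservative (it does not parse SQL, so a string literal containing a rejected word is also refused).

```rust
// Hypothetical sketch of a read-only SQL gate; NOT duckquill's real validator.
// Keyword and helper lists are copied from the rules listed above.
const REJECTED_KEYWORDS: &[&str] = &[
    "COPY", "INSTALL", "LOAD", "ATTACH", "CREATE", "ALTER",
    "DROP", "INSERT", "UPDATE", "DELETE", "SET",
];
const REJECTED_FUNCTIONS: &[&str] = &[
    "READ_PARQUET", "PARQUET_SCAN", "READ_CSV", "CSV_SCAN",
    "READ_JSON", "READ_BLOB", "READ_TEXT",
];

fn validate_query(sql: &str) -> Result<(), String> {
    // Tolerate trailing semicolons, then refuse anything that still contains one.
    let trimmed = sql.trim().trim_end_matches(';').trim();
    if trimmed.contains(';') {
        return Err("multi-statement SQL is rejected".to_string());
    }

    // Split into upper-cased identifier-like tokens (letters, digits, underscore).
    let tokens: Vec<String> = trimmed
        .split(|c: char| !(c.is_ascii_alphanumeric() || c == '_'))
        .filter(|t| !t.is_empty())
        .map(|t| t.to_ascii_uppercase())
        .collect();

    // Only SELECT ... and WITH ... SELECT ... statements are allowed.
    match tokens.first().map(String::as_str) {
        Some("SELECT") | Some("WITH") => {}
        _ => return Err("only SELECT or WITH ... SELECT is allowed".to_string()),
    }

    // Refuse any whole-word match against the rejected lists.
    for token in &tokens {
        if REJECTED_KEYWORDS.contains(&token.as_str())
            || REJECTED_FUNCTIONS.contains(&token.as_str())
        {
            return Err(format!("rejected token: {}", token));
        }
    }
    Ok(())
}

fn main() {
    for sql in [
        "SELECT count(*) FROM dataset",
        "SELECT * FROM read_parquet('x.parquet')",
        "SELECT 1; DROP TABLE dataset",
    ] {
        println!("{:?} -> {:?}", sql, validate_query(sql));
    }
}
```

A check like this only narrows the input; as the operator warning says, it does not replace connection hardening or a trusted deployment boundary.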
4) Benchmark the current approach
See BENCHMARK.md for the measured local results and the Hugging Face comparison case.
HTTP examples
Start the server
Inspect schema via HTTP
/schema now returns typed columns plus a small preview_rows sample to help agents ground SQL generation with real row context.
Query via HTTP
Use hybrid when you want the binary to choose the small-file fast path automatically:
- local parquet files <= 10 MiB resolve to full_download
- larger local parquet files and remote/object-storage datasets resolve to parquet_selective
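The hybrid resolution rule can be sketched as a small pure function. This is an illustrative sketch only: the type and function names (QueryMode, ResolvedMode, DatasetLocation, resolve_mode) are hypothetical and not taken from the duckquill source; only the 10 MiB threshold and the local/remote distinction come from the rule described above.

```rust
// Hypothetical sketch of hybrid mode resolution; names are illustrative only.
#[derive(Debug, PartialEq, Clone, Copy)]
enum QueryMode {
    Hybrid,
    ParquetSelective,
    FullDownload,
}

#[derive(Debug, PartialEq, Clone, Copy)]
enum ResolvedMode {
    ParquetSelective,
    FullDownload,
}

enum DatasetLocation {
    Local { size_bytes: u64 },
    Remote,
}

// 10 MiB, the threshold stated in this README.
const HYBRID_THRESHOLD_BYTES: u64 = 10 * 1024 * 1024;

fn resolve_mode(requested: QueryMode, location: &DatasetLocation) -> ResolvedMode {
    match requested {
        // Explicit modes are honored as requested.
        QueryMode::ParquetSelective => ResolvedMode::ParquetSelective,
        QueryMode::FullDownload => ResolvedMode::FullDownload,
        // Hybrid picks full_download only for small local files.
        QueryMode::Hybrid => match location {
            DatasetLocation::Local { size_bytes } if *size_bytes <= HYBRID_THRESHOLD_BYTES => {
                ResolvedMode::FullDownload
            }
            _ => ResolvedMode::ParquetSelective,
        },
    }
}

fn main() {
    let small = DatasetLocation::Local { size_bytes: 4 * 1024 * 1024 };
    let large = DatasetLocation::Local { size_bytes: 64 * 1024 * 1024 };
    println!("{:?}", resolve_mode(QueryMode::Hybrid, &small));
    println!("{:?}", resolve_mode(QueryMode::Hybrid, &large));
    println!("{:?}", resolve_mode(QueryMode::Hybrid, &DatasetLocation::Remote));
}
```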
The same read-only query contract applies to HTTP /query; it is not designed for arbitrary DuckDB session control or ad-hoc file/network access. Treat it as a local developer / agent workflow surface, not a publicly exposed multi-tenant SQL service.
Object storage
The binary supports S3 / MinIO directly.
Convert to S3:
TEXT2SQL_S3_REGION=ap-northeast-2 \
TEXT2SQL_S3_ENDPOINT=http://127.0.0.1:9000 \
TEXT2SQL_S3_ACCESS_KEY_ID=minioadmin \
TEXT2SQL_S3_SECRET_ACCESS_KEY=minioadmin \
TEXT2SQL_S3_ALLOW_HTTP=true \
For agent-facing object-storage workflow guidance, use:
.codex/skills/object-storage-ops/SKILL.md
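For readers wiring the TEXT2SQL_S3_* variables into their own tooling, the sketch below shows one way to collect them into a settings struct. It is an assumption-laden illustration, not the binary's real configuration code: S3Settings and load_s3_settings are hypothetical names, every field is treated as optional, and an unset TEXT2SQL_S3_ALLOW_HTTP is assumed to mean false. Only the variable names themselves come from this README.

```rust
// Hypothetical sketch; duckquill's real configuration loading may differ.
#[derive(Debug, Default, PartialEq)]
struct S3Settings {
    region: Option<String>,
    endpoint: Option<String>,
    access_key_id: Option<String>,
    secret_access_key: Option<String>,
    allow_http: bool,
}

// Takes a lookup function so the logic can be exercised without
// mutating the process environment.
fn load_s3_settings(lookup: impl Fn(&str) -> Option<String>) -> S3Settings {
    S3Settings {
        region: lookup("TEXT2SQL_S3_REGION"),
        endpoint: lookup("TEXT2SQL_S3_ENDPOINT"),
        access_key_id: lookup("TEXT2SQL_S3_ACCESS_KEY_ID"),
        secret_access_key: lookup("TEXT2SQL_S3_SECRET_ACCESS_KEY"),
        // Assumption: anything other than "true" (case-insensitive) means false.
        allow_http: lookup("TEXT2SQL_S3_ALLOW_HTTP")
            .map(|v| v.eq_ignore_ascii_case("true"))
            .unwrap_or(false),
    }
}

fn main() {
    // Read from the real environment; the secret is deliberately not printed.
    let settings = load_s3_settings(|key| std::env::var(key).ok());
    println!(
        "region={:?} endpoint={:?} allow_http={}",
        settings.region, settings.endpoint, settings.allow_http
    );
}
```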
Current guidance
- keep parquet_selective as the default for larger datasets and object storage
- use hybrid when you want the binary to auto-pick full_download for local parquet files <= 10 MiB and otherwise stay parquet-first
- use full_download for debugging or when you intentionally want full materialization first
- quoted local parquet globs already work for multi-file queries, e.g. --dataset './tmp/shard-*.parquet'
- blank numeric CSV cells already aggregate as NULL after convert; TRY_CAST is not required for that current path
- use --csv-encoding <label> when the CSV is not UTF-8
- benchmark before changing that recommendation
- do not claim selective reads lose matching data unless tests prove it
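The note on blank numeric cells relies on standard SQL behavior: aggregates such as SUM skip NULL inputs. The sketch below mimics that behavior in plain Rust to make the semantics concrete. It is a conceptual illustration only; parse_cell and sum_non_null are hypothetical helpers and say nothing about how duckquill's convert step is implemented.

```rust
use std::num::ParseFloatError;

// Hypothetical helper: a blank cell becomes None (the analogue of SQL NULL);
// a non-blank cell must parse as a number or the error is surfaced.
fn parse_cell(cell: &str) -> Result<Option<f64>, ParseFloatError> {
    let trimmed = cell.trim();
    if trimmed.is_empty() {
        Ok(None)
    } else {
        trimmed.parse::<f64>().map(Some)
    }
}

// Mirrors SQL SUM: NULLs are skipped, and the result is NULL (None)
// when there are no non-NULL inputs at all.
fn sum_non_null(values: &[Option<f64>]) -> Option<f64> {
    let mut seen = false;
    let mut total = 0.0;
    for v in values.iter().flatten() {
        seen = true;
        total += v;
    }
    if seen { Some(total) } else { None }
}

fn main() -> Result<(), ParseFloatError> {
    let cells = ["1.5", "", "2.5"];
    let values = cells
        .iter()
        .map(|c| parse_cell(c))
        .collect::<Result<Vec<_>, _>>()?;
    println!("{:?}", sum_non_null(&values));
    Ok(())
}
```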
Real fixtures in this repo
- testdata/owid-covid-latest.csv
- testdata/owid-covid-latest.json
- testdata/canada-wastewater-aggregate.csv
- testdata/keyfoods_0708.xlsx
Acceptance snapshots
Convert

Schema

Query

Benchmark

Verification
Current status
What is proven locally:
- convert works
- schema works
- query works
- benchmark works
- a real Hugging Face parquet comparison case is documented in BENCHMARK.md
What is still environment-blocked on this machine:
- writable remote MinIO/S3 acceptance for live end-to-end object-storage benchmarking