duckquill

A Rust text2sql engine for agents, packaged as duckquill and published on crates.io as v0.2.3.

The primary user is an agent running in Codex CLI or Claude Code that opens this repo, reads AGENTS.md plus .codex/skills/object-storage-ops/SKILL.md, and drives one binary through a schema-first workflow.

Core idea

raw file -> parquet -> schema -> query -> benchmark

Why this shape:

Parquet lets DuckDB prune columns and push predicates
schema grounds SQL before execution
parquet_selective stays the default path
full_download exists as a measurable fallback, not the main happy path

Architecture and product intent:

Who this is for

Codex CLI / Claude Code

Use this repo when the agent needs:

one Cargo package
one binary entrypoint
local CLI loops faster than standing up a bigger service
repo-local guidance for parquet/object-storage workflows

Preferred agent loop:

cargo run -- convert
cargo run -- schema
cargo run -- query
cargo run -- serve only when the HTTP contract itself needs verification
cargo run -- benchmark before changing guidance around full_download

Example prompt:

Use the object-storage-ops skill. Convert ./testdata/owid-covid-latest.csv to parquet, inspect schema, then run a parquet_selective query.

Agent startup contract

For agent operators, the startup contract is:

read AGENTS.md first
if the workflow touches S3/MinIO/object storage, load .codex/skills/object-storage-ops/SKILL.md
install with:
- cargo install duckquill outside the repo
- cargo install --path . inside a checkout
- cargo install --git https://github.com/sigridjineth/text2sql duckquill only for unreleased repo changes
if duckquill is not on PATH yet, export PATH="$HOME/.cargo/bin:$PATH" and verify with duckquill --help
then run convert -> schema -> query
when answering an end user, cite the exact query/command you used and the result rows or aggregate that support the answer
start serve only when the HTTP contract itself needs verification

Humans

Humans can use the binary directly, but the docs are optimized for agent operators first.

What the binary exposes

CLI:

serve
convert
schema
query
benchmark

HTTP:

GET /
GET /health
POST /schema
POST /query
POST /convert
POST /benchmark

Query modes:

parquet_selective — the default and the recommended path for larger datasets plus remote/object-storage parquet
hybrid — for local parquet files at or below 10 MiB, run the materialize-first full_download path; otherwise resolve to parquet_selective
full_download

Install

duckquill is published on crates.io:

cargo install duckquill

If you need unreleased changes from the current GitHub repository instead of the published crate:

cargo install --git https://github.com/sigridjineth/text2sql duckquill

For local development from this checkout:

cargo install --path .

If duckquill is not found after install, add Cargo's bin directory to your shell first:

export PATH="$HOME/.cargo/bin:$PATH"
duckquill --help

Quickstart

1) Convert a real fixture to parquet

cargo run -- convert \
  --input ./testdata/owid-covid-latest.csv \
  --output ./tmp/owid-covid-latest.parquet \
  --overwrite

If a CSV is not UTF-8, pass an explicit encoding label:

cargo run -- convert \
  --input ./data/seoul-public-data.csv \
  --csv-encoding windows-1252 \
  --output ./data/seoul-public-data.parquet \
  --overwrite

For Korean public-data workflows, the same flag is where you would pass labels such as cp949 or euc-kr.

2) Inspect schema first

cargo run -- schema \
  --dataset ./tmp/owid-covid-latest.parquet

3) Query from the CLI

cargo run -- query \
  --dataset ./tmp/owid-covid-latest.parquet \
  --sql "SELECT location, population FROM dataset WHERE continent = 'Asia' AND population IS NOT NULL ORDER BY population DESC LIMIT 3"

Query is intentionally sandboxed to read-only analytical SQL over the registered dataset alias:

allowed: SELECT ... FROM dataset ...
allowed: WITH ... SELECT ... FROM dataset ...
rejected: COPY, INSTALL, LOAD, ATTACH, CREATE, ALTER, DROP, INSERT, UPDATE, DELETE, SET
rejected: direct file/network access helpers such as read_parquet(...), parquet_scan(...), read_csv(...), read_csv_auto(...), csv_scan(...), read_ndjson(...), read_json(...), read_json_auto(...), read_json_objects(...), delta_scan(...), iceberg_scan(...), read_blob(...), read_text(...), glob(...), and parquet metadata helpers inside user SQL
rejected: multi-statement SQL
defense-in-depth: query connections also disable extension auto-install/auto-load, lock configuration changes, and restrict external file access to the configured dataset location only
operator warning: validator + connection hardening reduce blast radius, but this is still not a full SQL sandbox; keep the CLI/HTTP query surfaces behind trusted local or single-tenant boundaries

4) Benchmark the current approach

cargo run -- benchmark --output-dir ./tmp/benchmark-report

See BENCHMARK.md for the measured local results and the Hugging Face comparison case.

HTTP examples

Start the server

cargo run -- serve --host 127.0.0.1 --port 3000

GET / returns the configured service name, crate version, and a small capability list for quick operator sanity checks.

Inspect schema via HTTP

{
  "dataset": {
    "uri": "./benchmark-data/benchmark-10000.parquet",
    "storage": { "type": "local" }
  },
  "table_name": "dataset"
}

/schema now returns typed columns plus a small preview_rows sample to help agents ground SQL generation with real row context.

Query via HTTP

{
  "sql": "SELECT symbol, AVG(trade_value) AS avg_trade_value FROM dataset WHERE trade_month = 9 GROUP BY symbol ORDER BY avg_trade_value DESC",
  "table_name": "dataset",
  "mode": "hybrid",
  "dataset": {
    "uri": "./benchmark-data/benchmark-10000.parquet",
    "storage": { "type": "local" }
  }
}

Use hybrid when you want the binary to choose the small-file fast path automatically:

local parquet files <= 10 MiB resolve to full_download
larger local parquet files and remote/object-storage datasets resolve to parquet_selective

The same read-only query contract applies to HTTP /query; it is not designed for arbitrary DuckDB session control or ad-hoc file/network access. Treat it as a local developer / agent workflow surface, not a publicly exposed multi-tenant SQL service.

Object storage

The binary supports S3 / MinIO directly. hybrid is safe for mixed local/remote workflows because remote and object-storage datasets still resolve to parquet_selective.

Convert to S3:

TEXT2SQL_S3_REGION=ap-northeast-2 \
TEXT2SQL_S3_ENDPOINT=http://127.0.0.1:9000 \
TEXT2SQL_S3_ACCESS_KEY_ID=minioadmin \
TEXT2SQL_S3_SECRET_ACCESS_KEY=minioadmin \
TEXT2SQL_S3_ALLOW_HTTP=true \
cargo run -- convert \
  --input ./data/prices.csv \
  --output s3://bench/prices.parquet

For agent-facing object-storage workflow guidance, use:

.codex/skills/object-storage-ops/SKILL.md

Current guidance

keep parquet_selective as the default for larger datasets and object storage
use hybrid when you want the binary to auto-pick full_download for local parquet files <= 10 MiB and otherwise stay parquet-first
use full_download for debugging or when you intentionally want full materialization first
quoted local parquet globs already work for multi-file queries, e.g. --dataset './tmp/shard-*.parquet'
globbed parquet inputs do not currently add a filename/source column automatically; for file-by-file comparisons, either query each file separately or prepare Parquet with an explicit source column upstream
blank numeric CSV cells already aggregate as NULL after convert; TRY_CAST is not required for that current path
use --csv-encoding <label> when the CSV is not UTF-8
spreadsheet ingestion currently passes XLS/XLSX cells through CSV text before Parquet inference; mixed numeric/text spreadsheet columns can therefore land as Utf8, so clean the column in the sheet first or export to CSV and convert with --csv-encoding <label> when typing matters
when agents answer users from schema/query runs, include the exact command or SQL used plus the result rows/aggregates that support the answer
benchmark before changing that recommendation
do not claim selective reads lose matching data unless tests prove it

Real fixtures in this repo

testdata/owid-covid-latest.csv
testdata/owid-covid-latest.json
testdata/canada-wastewater-aggregate.csv
testdata/keyfoods_0708.xlsx

Acceptance snapshots

Convert

Local convert acceptance

Schema

Local schema acceptance

Query

Local query acceptance

Benchmark

Local benchmark acceptance

Verification

cargo fmt --check
cargo test
cargo check
cargo clippy --all-targets --all-features -- -D warnings

Current status

What is proven locally:

convert works
schema works
query works
benchmark works
a real Hugging Face parquet comparison case is documented in BENCHMARK.md

What is still environment-blocked on this machine:

writable remote MinIO/S3 acceptance for live end-to-end object-storage benchmarking

duckquill 0.2.3