duckquill

A Rust text2sql engine for agents, packaged as duckquill.

The primary user is an agent running in Codex CLI or Claude Code that opens this repo, reads AGENTS.md plus .codex/skills/object-storage-ops/SKILL.md, and drives one binary through a schema-first workflow.

Core idea

raw file -> parquet -> schema -> query -> benchmark

Why this shape:

Parquet lets DuckDB prune columns and push predicates
schema grounds SQL before execution
parquet_selective stays the default path
full_download exists as a measurable fallback, not the main happy path

Architecture and product intent:

Who this is for

Codex CLI / Claude Code

Use this repo when the agent needs:

one Cargo package
one binary entrypoint
local CLI loops faster than standing up a bigger service
repo-local guidance for parquet/object-storage workflows

Preferred agent loop:

cargo run -- convert
cargo run -- schema
cargo run -- query
cargo run -- serve only when the HTTP contract itself needs verification
cargo run -- benchmark before changing guidance around full_download

Example prompt:

Use the object-storage-ops skill. Convert ./testdata/owid-covid-latest.csv to parquet, inspect schema, then run a parquet_selective query.

Humans

Humans can use the binary directly, but the docs are optimized for agent operators first.

What the binary exposes

CLI:

serve
convert
schema
query
benchmark

HTTP:

GET /health
POST /schema
POST /query
POST /convert
POST /benchmark

Query modes:

hybrid — for local parquet files at or below 10 MiB, run the materialize-first full_download path; otherwise resolve to parquet_selective
parquet_selective
full_download

Install

duckquill is published on crates.io:

cargo install duckquill

If you need unreleased changes from the current GitHub repository instead of the published crate:

cargo install --git https://github.com/sigridjineth/text2sql duckquill

For local development from this checkout:

cargo install --path .

Quickstart

1) Convert a real fixture to parquet

cargo run -- convert \
  --input ./testdata/owid-covid-latest.csv \
  --output ./tmp/owid-covid-latest.parquet \
  --overwrite

If a CSV is not UTF-8, pass an explicit encoding label:

cargo run -- convert \
  --input ./data/seoul-public-data.csv \
  --csv-encoding windows-1252 \
  --output ./data/seoul-public-data.parquet \
  --overwrite

For Korean public-data workflows, the same flag is where you would pass labels such as cp949 or euc-kr.

2) Inspect schema first

cargo run -- schema \
  --dataset ./tmp/owid-covid-latest.parquet

3) Query from the CLI

cargo run -- query \
  --dataset ./tmp/owid-covid-latest.parquet \
  --sql "SELECT location, population FROM dataset WHERE continent = 'Asia' AND population IS NOT NULL ORDER BY population DESC LIMIT 3"

Query is intentionally sandboxed to read-only analytical SQL over the registered dataset alias:

allowed: SELECT ... FROM dataset ...
allowed: WITH ... SELECT ... FROM dataset ...
rejected: COPY, INSTALL, LOAD, ATTACH, CREATE, ALTER, DROP, INSERT, UPDATE, DELETE, SET
rejected: direct file/network access helpers such as read_parquet(...), parquet_scan(...), read_csv(...), csv_scan(...), read_json(...), read_blob(...), read_text(...), and parquet metadata helpers inside user SQL
rejected: multi-statement SQL
defense-in-depth: query connections also disable extension auto-install/auto-load, lock configuration changes, and restrict external file access to the configured dataset location only
operator warning: validator + connection hardening reduce blast radius, but this is still not a full SQL sandbox; keep the CLI/HTTP query surfaces behind trusted local or single-tenant boundaries

4) Benchmark the current approach

cargo run -- benchmark --output-dir ./tmp/benchmark-report

See BENCHMARK.md for the measured local results and the Hugging Face comparison case.

HTTP examples

Start the server

cargo run -- serve --host 127.0.0.1 --port 3000

Inspect schema via HTTP

{
  "dataset": {
    "uri": "./benchmark-data/benchmark-10000.parquet",
    "storage": { "type": "local" }
  },
  "table_name": "dataset"
}

/schema now returns typed columns plus a small preview_rows sample to help agents ground SQL generation with real row context.

Query via HTTP

{
  "sql": "SELECT symbol, AVG(trade_value) AS avg_trade_value FROM dataset WHERE trade_month = 9 GROUP BY symbol ORDER BY avg_trade_value DESC",
  "table_name": "dataset",
  "mode": "hybrid",
  "dataset": {
    "uri": "./benchmark-data/benchmark-10000.parquet",
    "storage": { "type": "local" }
  }
}

Use hybrid when you want the binary to choose the small-file fast path automatically:

local parquet files <= 10 MiB resolve to full_download
larger local parquet files and remote/object-storage datasets resolve to parquet_selective

The same read-only query contract applies to HTTP /query; it is not designed for arbitrary DuckDB session control or ad-hoc file/network access. Treat it as a local developer / agent workflow surface, not a publicly exposed multi-tenant SQL service.

Object storage

The binary supports S3 / MinIO directly.

Convert to S3:

TEXT2SQL_S3_REGION=ap-northeast-2 \
TEXT2SQL_S3_ENDPOINT=http://127.0.0.1:9000 \
TEXT2SQL_S3_ACCESS_KEY_ID=minioadmin \
TEXT2SQL_S3_SECRET_ACCESS_KEY=minioadmin \
TEXT2SQL_S3_ALLOW_HTTP=true \
cargo run -- convert \
  --input ./data/prices.csv \
  --output s3://bench/prices.parquet

For agent-facing object-storage workflow guidance, use:

.codex/skills/object-storage-ops/SKILL.md

Current guidance

keep parquet_selective as the default for larger datasets and object storage
use hybrid when you want the binary to auto-pick full_download for local parquet files <= 10 MiB and otherwise stay parquet-first
use full_download for debugging or when you intentionally want full materialization first
quoted local parquet globs already work for multi-file queries, e.g. --dataset './tmp/shard-*.parquet'
blank numeric CSV cells already aggregate as NULL after convert; TRY_CAST is not required for that current path
use --csv-encoding <label> when the CSV is not UTF-8
benchmark before changing that recommendation
do not claim selective reads lose matching data unless tests prove it

Real fixtures in this repo

testdata/owid-covid-latest.csv
testdata/owid-covid-latest.json
testdata/canada-wastewater-aggregate.csv
testdata/keyfoods_0708.xlsx

Acceptance snapshots

Convert

Local convert acceptance

Schema

Local schema acceptance

Query

Local query acceptance

Benchmark

Local benchmark acceptance

Verification

cargo fmt --check
cargo test
cargo check
cargo clippy --all-targets --all-features -- -D warnings

Current status

What is proven locally:

convert works
schema works
query works
benchmark works
a real Hugging Face parquet comparison case is documented in BENCHMARK.md

What is still environment-blocked on this machine:

writable remote MinIO/S3 acceptance for live end-to-end object-storage benchmarking

duckquill 0.2.2