duckquill 0.2.1

Parquet-backed text2sql engine and CLI for schema-first querying workflows
Documentation

duckquill

A Rust text2sql engine for agents, packaged as duckquill.

The primary user is an agent running in Codex CLI or Claude Code that opens this repo, reads AGENTS.md plus .codex/skills/object-storage-ops/SKILL.md, and drives one binary through a schema-first workflow.

Core idea

raw file -> parquet -> schema -> query -> benchmark

Why this shape:

  • Parquet lets DuckDB prune columns and push predicates
  • schema grounds SQL before execution
  • parquet_selective stays the default path
  • full_download exists as a measurable fallback, not the main happy path

Architecture and product intent:

Who this is for

Codex CLI / Claude Code

Use this repo when the agent needs:

  • one Cargo package
  • one binary entrypoint
  • local CLI loops faster than standing up a bigger service
  • repo-local guidance for parquet/object-storage workflows

Preferred agent loop:

  1. cargo run -- convert
  2. cargo run -- schema
  3. cargo run -- query
  4. cargo run -- serve only when the HTTP contract itself needs verification
  5. cargo run -- benchmark before changing guidance around full_download

Example prompt:

Use the object-storage-ops skill. Convert ./testdata/owid-covid-latest.csv to parquet, inspect schema, then run a parquet_selective query.

Humans

Humans can use the binary directly, but the docs are optimized for agent operators first.

What the binary exposes

CLI:

  • serve
  • convert
  • schema
  • query
  • benchmark

HTTP:

  • GET /health
  • POST /schema
  • POST /query
  • POST /convert
  • POST /benchmark

Query modes:

  • hybrid — for local parquet files at or below 10 MiB, run the materialize-first full_download path; otherwise resolve to parquet_selective
  • parquet_selective
  • full_download

Install

If the crate is published on crates.io:

cargo install duckquill

If you want to install the current GitHub version before a crates.io release:

cargo install --git https://github.com/sigridjineth/text2sql duckquill

For local development from this checkout:

cargo install --path .

Quickstart

1) Convert a real fixture to parquet

cargo run -- convert \
  --input ./testdata/owid-covid-latest.csv \
  --output ./tmp/owid-covid-latest.parquet \
  --overwrite

If a CSV is not UTF-8, pass an explicit encoding label:

cargo run -- convert \
  --input ./data/seoul-public-data.csv \
  --csv-encoding windows-1252 \
  --output ./data/seoul-public-data.parquet \
  --overwrite

For Korean public-data workflows, the same flag is where you would pass labels such as cp949 or euc-kr.

2) Inspect schema first

cargo run -- schema \
  --dataset ./tmp/owid-covid-latest.parquet

3) Query from the CLI

cargo run -- query \
  --dataset ./tmp/owid-covid-latest.parquet \
  --sql "SELECT location, population FROM dataset WHERE continent = 'Asia' AND population IS NOT NULL ORDER BY population DESC LIMIT 3"

4) Benchmark the current approach

cargo run -- benchmark --output-dir ./tmp/benchmark-report

See BENCHMARK.md for the measured local results and the Hugging Face comparison case.

HTTP examples

Start the server

cargo run -- serve --host 127.0.0.1 --port 3000

Inspect schema via HTTP

{
  "dataset": {
    "uri": "./benchmark-data/benchmark-10000.parquet",
    "storage": { "type": "local" }
  },
  "table_name": "dataset"
}

/schema now returns typed columns plus a small preview_rows sample to help agents ground SQL generation with real row context.

Query via HTTP

{
  "sql": "SELECT symbol, AVG(trade_value) AS avg_trade_value FROM dataset WHERE trade_month = 9 GROUP BY symbol ORDER BY avg_trade_value DESC",
  "table_name": "dataset",
  "mode": "hybrid",
  "dataset": {
    "uri": "./benchmark-data/benchmark-10000.parquet",
    "storage": { "type": "local" }
  }
}

Use hybrid when you want the binary to choose the small-file fast path automatically:

  • local parquet files <= 10 MiB resolve to full_download
  • larger local parquet files and remote/object-storage datasets resolve to parquet_selective

Object storage

The binary supports S3 / MinIO directly.

Convert to S3:

TEXT2SQL_S3_REGION=ap-northeast-2 \
TEXT2SQL_S3_ENDPOINT=http://127.0.0.1:9000 \
TEXT2SQL_S3_ACCESS_KEY_ID=minioadmin \
TEXT2SQL_S3_SECRET_ACCESS_KEY=minioadmin \
TEXT2SQL_S3_ALLOW_HTTP=true \
cargo run -- convert \
  --input ./data/prices.csv \
  --output s3://bench/prices.parquet

For agent-facing object-storage workflow guidance, use:

  • .codex/skills/object-storage-ops/SKILL.md

Current guidance

  • keep parquet_selective as the default for larger datasets and object storage
  • use hybrid when you want the binary to auto-pick full_download for local parquet files <= 10 MiB and otherwise stay parquet-first
  • use full_download for debugging or when you intentionally want full materialization first
  • quoted local parquet globs already work for multi-file queries, e.g. --dataset './tmp/shard-*.parquet'
  • blank numeric CSV cells already aggregate as NULL after convert; TRY_CAST is not required for that current path
  • use --csv-encoding <label> when the CSV is not UTF-8
  • benchmark before changing that recommendation
  • do not claim selective reads lose matching data unless tests prove it

Real fixtures in this repo

  • testdata/owid-covid-latest.csv
  • testdata/owid-covid-latest.json
  • testdata/canada-wastewater-aggregate.csv
  • testdata/keyfoods_0708.xlsx

Acceptance snapshots

Convert

Local convert acceptance

Schema

Local schema acceptance

Query

Local query acceptance

Benchmark

Local benchmark acceptance

Verification

cargo fmt --check
cargo test
cargo check
cargo clippy --all-targets --all-features -- -D warnings

Current status

What is proven locally:

  • convert works
  • schema works
  • query works
  • benchmark works
  • a real Hugging Face parquet comparison case is documented in BENCHMARK.md

What is still environment-blocked on this machine:

  • writable remote MinIO/S3 acceptance for live end-to-end object-storage benchmarking