duckquill 0.2.3

Parquet-backed text2sql engine and CLI for schema-first querying workflows
Documentation

duckquill

A Rust text2sql engine for agents, packaged as duckquill and published on crates.io as v0.2.3.

The primary user is an agent running in Codex CLI or Claude Code that opens this repo, reads AGENTS.md plus .codex/skills/object-storage-ops/SKILL.md, and drives one binary through a schema-first workflow.

Core idea

raw file -> parquet -> schema -> query -> benchmark

Why this shape:

  • Parquet lets DuckDB prune columns and push predicates
  • schema grounds SQL before execution
  • parquet_selective stays the default path
  • full_download exists as a measurable fallback, not the main happy path

Architecture and product intent:

Who this is for

Codex CLI / Claude Code

Use this repo when the agent needs:

  • one Cargo package
  • one binary entrypoint
  • local CLI loops faster than standing up a bigger service
  • repo-local guidance for parquet/object-storage workflows

Preferred agent loop:

  1. cargo run -- convert
  2. cargo run -- schema
  3. cargo run -- query
  4. cargo run -- serve only when the HTTP contract itself needs verification
  5. cargo run -- benchmark before changing guidance around full_download

Example prompt:

Use the object-storage-ops skill. Convert ./testdata/owid-covid-latest.csv to parquet, inspect schema, then run a parquet_selective query.

Agent startup contract

For agent operators, the startup contract is:

  1. read AGENTS.md first
  2. if the workflow touches S3/MinIO/object storage, load .codex/skills/object-storage-ops/SKILL.md
  3. install with:
    • cargo install duckquill outside the repo
    • cargo install --path . inside a checkout
    • cargo install --git https://github.com/sigridjineth/text2sql duckquill only for unreleased repo changes
  4. if duckquill is not on PATH yet, export PATH="$HOME/.cargo/bin:$PATH" and verify with duckquill --help
  5. then run convert -> schema -> query
  6. when answering an end user, cite the exact query/command you used and the result rows or aggregate that support the answer
  7. start serve only when the HTTP contract itself needs verification

Humans

Humans can use the binary directly, but the docs are optimized for agent operators first.

What the binary exposes

CLI:

  • serve
  • convert
  • schema
  • query
  • benchmark

HTTP:

  • GET /
  • GET /health
  • POST /schema
  • POST /query
  • POST /convert
  • POST /benchmark

Query modes:

  • parquet_selective — the default and the recommended path for larger datasets plus remote/object-storage parquet
  • hybrid — for local parquet files at or below 10 MiB, run the materialize-first full_download path; otherwise resolve to parquet_selective
  • full_download

Install

duckquill is published on crates.io:

cargo install duckquill

If you need unreleased changes from the current GitHub repository instead of the published crate:

cargo install --git https://github.com/sigridjineth/text2sql duckquill

For local development from this checkout:

cargo install --path .

If duckquill is not found after install, add Cargo's bin directory to your shell first:

export PATH="$HOME/.cargo/bin:$PATH"
duckquill --help

Quickstart

1) Convert a real fixture to parquet

cargo run -- convert \
  --input ./testdata/owid-covid-latest.csv \
  --output ./tmp/owid-covid-latest.parquet \
  --overwrite

If a CSV is not UTF-8, pass an explicit encoding label:

cargo run -- convert \
  --input ./data/seoul-public-data.csv \
  --csv-encoding windows-1252 \
  --output ./data/seoul-public-data.parquet \
  --overwrite

For Korean public-data workflows, the same flag is where you would pass labels such as cp949 or euc-kr.

2) Inspect schema first

cargo run -- schema \
  --dataset ./tmp/owid-covid-latest.parquet

3) Query from the CLI

cargo run -- query \
  --dataset ./tmp/owid-covid-latest.parquet \
  --sql "SELECT location, population FROM dataset WHERE continent = 'Asia' AND population IS NOT NULL ORDER BY population DESC LIMIT 3"

Query is intentionally sandboxed to read-only analytical SQL over the registered dataset alias:

  • allowed: SELECT ... FROM dataset ...
  • allowed: WITH ... SELECT ... FROM dataset ...
  • rejected: COPY, INSTALL, LOAD, ATTACH, CREATE, ALTER, DROP, INSERT, UPDATE, DELETE, SET
  • rejected: direct file/network access helpers such as read_parquet(...), parquet_scan(...), read_csv(...), read_csv_auto(...), csv_scan(...), read_ndjson(...), read_json(...), read_json_auto(...), read_json_objects(...), delta_scan(...), iceberg_scan(...), read_blob(...), read_text(...), glob(...), and parquet metadata helpers inside user SQL
  • rejected: multi-statement SQL
  • defense-in-depth: query connections also disable extension auto-install/auto-load, lock configuration changes, and restrict external file access to the configured dataset location only
  • operator warning: validator + connection hardening reduce blast radius, but this is still not a full SQL sandbox; keep the CLI/HTTP query surfaces behind trusted local or single-tenant boundaries

4) Benchmark the current approach

cargo run -- benchmark --output-dir ./tmp/benchmark-report

See BENCHMARK.md for the measured local results and the Hugging Face comparison case.

HTTP examples

Start the server

cargo run -- serve --host 127.0.0.1 --port 3000

GET / returns the configured service name, crate version, and a small capability list for quick operator sanity checks.

Inspect schema via HTTP

{
  "dataset": {
    "uri": "./benchmark-data/benchmark-10000.parquet",
    "storage": { "type": "local" }
  },
  "table_name": "dataset"
}

/schema now returns typed columns plus a small preview_rows sample to help agents ground SQL generation with real row context.

Query via HTTP

{
  "sql": "SELECT symbol, AVG(trade_value) AS avg_trade_value FROM dataset WHERE trade_month = 9 GROUP BY symbol ORDER BY avg_trade_value DESC",
  "table_name": "dataset",
  "mode": "hybrid",
  "dataset": {
    "uri": "./benchmark-data/benchmark-10000.parquet",
    "storage": { "type": "local" }
  }
}

Use hybrid when you want the binary to choose the small-file fast path automatically:

  • local parquet files <= 10 MiB resolve to full_download
  • larger local parquet files and remote/object-storage datasets resolve to parquet_selective

The same read-only query contract applies to HTTP /query; it is not designed for arbitrary DuckDB session control or ad-hoc file/network access. Treat it as a local developer / agent workflow surface, not a publicly exposed multi-tenant SQL service.

Object storage

The binary supports S3 / MinIO directly. hybrid is safe for mixed local/remote workflows because remote and object-storage datasets still resolve to parquet_selective.

Convert to S3:

TEXT2SQL_S3_REGION=ap-northeast-2 \
TEXT2SQL_S3_ENDPOINT=http://127.0.0.1:9000 \
TEXT2SQL_S3_ACCESS_KEY_ID=minioadmin \
TEXT2SQL_S3_SECRET_ACCESS_KEY=minioadmin \
TEXT2SQL_S3_ALLOW_HTTP=true \
cargo run -- convert \
  --input ./data/prices.csv \
  --output s3://bench/prices.parquet

For agent-facing object-storage workflow guidance, use:

  • .codex/skills/object-storage-ops/SKILL.md

Current guidance

  • keep parquet_selective as the default for larger datasets and object storage
  • use hybrid when you want the binary to auto-pick full_download for local parquet files <= 10 MiB and otherwise stay parquet-first
  • use full_download for debugging or when you intentionally want full materialization first
  • quoted local parquet globs already work for multi-file queries, e.g. --dataset './tmp/shard-*.parquet'
  • globbed parquet inputs do not currently add a filename/source column automatically; for file-by-file comparisons, either query each file separately or prepare Parquet with an explicit source column upstream
  • blank numeric CSV cells already aggregate as NULL after convert; TRY_CAST is not required for that current path
  • use --csv-encoding <label> when the CSV is not UTF-8
  • spreadsheet ingestion currently passes XLS/XLSX cells through CSV text before Parquet inference; mixed numeric/text spreadsheet columns can therefore land as Utf8, so clean the column in the sheet first or export to CSV and convert with --csv-encoding <label> when typing matters
  • when agents answer users from schema/query runs, include the exact command or SQL used plus the result rows/aggregates that support the answer
  • benchmark before changing that recommendation
  • do not claim selective reads lose matching data unless tests prove it

Real fixtures in this repo

  • testdata/owid-covid-latest.csv
  • testdata/owid-covid-latest.json
  • testdata/canada-wastewater-aggregate.csv
  • testdata/keyfoods_0708.xlsx

Acceptance snapshots

Convert

Local convert acceptance

Schema

Local schema acceptance

Query

Local query acceptance

Benchmark

Local benchmark acceptance

Verification

cargo fmt --check
cargo test
cargo check
cargo clippy --all-targets --all-features -- -D warnings

Current status

What is proven locally:

  • convert works
  • schema works
  • query works
  • benchmark works
  • a real Hugging Face parquet comparison case is documented in BENCHMARK.md

What is still environment-blocked on this machine:

  • writable remote MinIO/S3 acceptance for live end-to-end object-storage benchmarking