# duckquill
A Rust text2sql engine for agents, packaged as `duckquill` and published on crates.io as **v0.2.5**.
The primary user is an agent running in **Codex CLI** or **Claude Code** that opens this repo, reads `AGENTS.md` plus `.codex/skills/duckquill-ops/SKILL.md`, and hands off to `.codex/skills/object-storage-ops/SKILL.md` when the workflow becomes S3/MinIO-specific.
## Core idea
```text
raw file -> parquet -> schema -> query -> benchmark
```
Why this shape:
- Parquet lets DuckDB prune columns and push predicates
- `schema` grounds SQL before execution
- `parquet_selective` stays the default path
- `full_download` exists as a measurable fallback, not the main happy path
Architecture and product intent:
- [`ARCHITECUTRE.md`](ARCHITECUTRE.md)
- [`BENCHMARK.md`](BENCHMARK.md)
- [`docs/foundation.md`](docs/foundation.md)
- [`docs/schema-exposure-audit.md`](docs/schema-exposure-audit.md)
## Who this is for
### Codex CLI / Claude Code
Use this repo when the agent needs:
- one Cargo package
- one binary entrypoint
- local CLI loops faster than standing up a bigger service
- repo-local guidance for parquet/object-storage workflows
Preferred agent loop:
1. `cargo run -- convert`
2. `cargo run -- schema`
3. `cargo run -- query`
4. `cargo run -- serve` only when the HTTP contract itself needs verification
5. `cargo run -- benchmark` before changing guidance around `full_download`
Example prompts:
```text
Use the duckquill-ops skill. Convert ./testdata/owid-covid-latest.csv to parquet, inspect schema, then run a parquet_selective query. If the workflow moves to S3 or MinIO, hand off to object-storage-ops.
```
#### Agent startup contract
For agent operators, the startup contract is:
1. read `AGENTS.md` first
2. load `.codex/skills/duckquill-ops/SKILL.md` for the broad install + operator workflow
3. if the workflow touches S3/MinIO/object storage, also load `.codex/skills/object-storage-ops/SKILL.md`
4. install with:
- `cargo install duckquill` outside the repo
- `cargo install --path .` inside a checkout
- `cargo install --git https://github.com/sigridjineth/text2sql duckquill` only for unreleased repo changes
5. if `duckquill` is not on `PATH` yet, export `PATH="$HOME/.cargo/bin:$PATH"` and verify with `duckquill --help`
6. then run `convert -> schema -> query`
7. when answering an end user, cite the exact query/command you used and the result rows or aggregate that support the answer
8. start `serve` only when the HTTP contract itself needs verification
9. run `benchmark` before recommending `full_download` as a fallback
### Humans
Humans can use the binary directly, but the docs are optimized for agent operators first.
## What the binary exposes
CLI:
- `serve`
- `convert`
- `schema`
- `query`
- `benchmark`
HTTP:
- `GET /`
- `GET /health`
- `POST /schema`
- `POST /query`
- `POST /convert`
- `POST /benchmark`
Query modes:
- `parquet_selective` — the default and the recommended path for larger datasets plus remote/object-storage parquet
- `hybrid` — for local parquet files at or below 10 MiB, run the materialize-first `full_download` path; otherwise resolve to `parquet_selective`
- `full_download`
## Install
`duckquill` is published on crates.io:
```bash
cargo install duckquill
```
If you need unreleased changes from the current GitHub repository instead of the published crate:
```bash
cargo install --git https://github.com/sigridjineth/text2sql duckquill
```
For local development from this checkout:
```bash
cargo install --path .
```
If `duckquill` is not found after install, add Cargo's bin directory to your shell first:
```bash
export PATH="$HOME/.cargo/bin:$PATH"
duckquill --help
```
## Quickstart
### 1) Convert a real fixture to parquet
```bash
cargo run -- convert \
--input ./testdata/owid-covid-latest.csv \
--output ./tmp/owid-covid-latest.parquet \
--overwrite
```
If a CSV is not UTF-8, pass an explicit encoding label:
```bash
cargo run -- convert \
--input ./data/seoul-public-data.csv \
--csv-encoding windows-1252 \
--output ./data/seoul-public-data.parquet \
--overwrite
```
For Korean public-data workflows, the same flag is where you would pass labels such as `cp949` or `euc-kr`.
### 2) Inspect schema first
```bash
cargo run -- schema \
--dataset ./tmp/owid-covid-latest.parquet
```
### 3) Query from the CLI
```bash
cargo run -- query \
--dataset ./tmp/owid-covid-latest.parquet \
--sql "SELECT location, population FROM dataset WHERE continent = 'Asia' AND population IS NOT NULL ORDER BY population DESC LIMIT 3"
```
Query is intentionally sandboxed to **read-only analytical SQL** over the registered `dataset` alias:
- allowed: `SELECT ... FROM dataset ...`
- allowed: `WITH ... SELECT ... FROM dataset ...`
- rejected: `COPY`, `INSTALL`, `LOAD`, `ATTACH`, `CREATE`, `ALTER`, `DROP`, `INSERT`, `UPDATE`, `DELETE`, `SET`
- rejected: direct file/network access helpers such as `read_parquet(...)`, `parquet_scan(...)`, `read_csv(...)`, `read_csv_auto(...)`, `csv_scan(...)`, `read_ndjson(...)`, `read_json(...)`, `read_json_auto(...)`, `read_json_objects(...)`, `delta_scan(...)`, `iceberg_scan(...)`, `read_blob(...)`, `read_text(...)`, `glob(...)`, and parquet metadata helpers inside user SQL
- rejected: multi-statement SQL
- defense-in-depth: query connections also disable extension auto-install/auto-load, lock configuration changes, and restrict external file access to the configured dataset location only
- operator warning: validator + connection hardening reduce blast radius, but this is still not a full SQL sandbox; keep the CLI/HTTP query surfaces behind trusted local or single-tenant boundaries
### 4) Benchmark the current approach
```bash
cargo run -- benchmark --output-dir ./tmp/benchmark-report
```
See [`BENCHMARK.md`](BENCHMARK.md) for the measured local results and the Hugging Face comparison case.
## HTTP examples
### Start the server
```bash
cargo run -- serve --host 127.0.0.1 --port 3000
```
`GET /` returns the configured service name, crate version, and a small capability list for quick operator sanity checks.
### Inspect schema via HTTP
```json
{
"dataset": {
"uri": "./benchmark-data/benchmark-10000.parquet",
"storage": { "type": "local" }
},
"table_name": "dataset"
}
```
`/schema` now returns typed columns plus a small `preview_rows` sample to help agents ground SQL generation with real row context.
### Query via HTTP
```json
{
"sql": "SELECT symbol, AVG(trade_value) AS avg_trade_value FROM dataset WHERE trade_month = 9 GROUP BY symbol ORDER BY avg_trade_value DESC",
"table_name": "dataset",
"mode": "hybrid",
"dataset": {
"uri": "./benchmark-data/benchmark-10000.parquet",
"storage": { "type": "local" }
}
}
```
Use `hybrid` when you want the binary to choose the small-file fast path automatically:
- local parquet files `<= 10 MiB` resolve to `full_download`
- larger local parquet files and remote/object-storage datasets resolve to `parquet_selective`
The same read-only query contract applies to HTTP `/query`; it is not designed for arbitrary DuckDB session control or ad-hoc file/network access. Treat it as a local developer / agent workflow surface, not a publicly exposed multi-tenant SQL service.
## Object storage
The binary supports S3 / MinIO directly. `hybrid` is safe for mixed local/remote workflows because remote and object-storage datasets still resolve to `parquet_selective`.
Convert to S3:
```bash
TEXT2SQL_S3_REGION=ap-northeast-2 \
TEXT2SQL_S3_ENDPOINT=http://127.0.0.1:9000 \
TEXT2SQL_S3_ACCESS_KEY_ID=minioadmin \
TEXT2SQL_S3_SECRET_ACCESS_KEY=minioadmin \
TEXT2SQL_S3_ALLOW_HTTP=true \
cargo run -- convert \
--input ./data/prices.csv \
--output s3://bench/prices.parquet
```
For agent-facing workflow guidance, use:
- `.codex/skills/duckquill-ops/SKILL.md` for broad install + convert/schema/query/benchmark usage
- `.codex/skills/object-storage-ops/SKILL.md` for S3/MinIO-specific setup, remote Parquet, and credential debugging
## Current guidance
- keep `parquet_selective` as the default for larger datasets and object storage
- use `hybrid` when you want the binary to auto-pick `full_download` for local parquet files `<= 10 MiB` and otherwise stay parquet-first
- use `full_download` for debugging or when you intentionally want full materialization first
- quoted local parquet globs already work for multi-file queries, e.g. `--dataset './tmp/shard-*.parquet'`
- globbed parquet inputs do **not** currently add a filename/source column automatically; for file-by-file comparisons, either query each file separately or prepare Parquet with an explicit source column upstream
- blank numeric CSV cells already aggregate as NULL after convert; `TRY_CAST` is not required for that current path
- use `--csv-encoding <label>` when the CSV is not UTF-8
- spreadsheet ingestion currently passes XLS/XLSX cells through CSV text before Parquet inference; mixed numeric/text spreadsheet columns can therefore land as `Utf8`, so clean the column in the sheet first or export to CSV and convert with `--csv-encoding <label>` when typing matters
- when agents answer users from schema/query runs, include the exact command or SQL used plus the result rows/aggregates that support the answer
- benchmark before changing that recommendation
- load `.codex/skills/duckquill-ops/SKILL.md` for general operator guidance and hand off to `.codex/skills/object-storage-ops/SKILL.md` when the workflow becomes object-storage-specific
- do not claim selective reads lose matching data unless tests prove it
## Real fixtures in this repo
- `testdata/owid-covid-latest.csv`
- `testdata/owid-covid-latest.json`
- `testdata/canada-wastewater-aggregate.csv`
- `testdata/keyfoods_0708.xlsx`
## Acceptance snapshots
### Convert

### Schema

### Query

### Benchmark

## Verification
```bash
cargo fmt --check
cargo test
cargo check
cargo clippy --all-targets --all-features -- -D warnings
```
## Current status
What is proven locally:
- convert works
- schema works
- query works
- benchmark works
- a real Hugging Face parquet comparison case is documented in `BENCHMARK.md`
What is still environment-blocked on this machine:
- writable remote MinIO/S3 acceptance for live end-to-end object-storage benchmarking