duckquill 0.2.5

# duckquill

A Rust text2sql engine for agents, packaged as `duckquill` and published on crates.io as **v0.2.5**.

The primary user is an agent running in **Codex CLI** or **Claude Code** that opens this repo, reads `AGENTS.md` plus `.codex/skills/duckquill-ops/SKILL.md`, and hands off to `.codex/skills/object-storage-ops/SKILL.md` when the workflow becomes S3/MinIO-specific.

## Core idea

```text
raw file -> parquet -> schema -> query -> benchmark
```

Why this shape:
- Parquet lets DuckDB prune columns and push predicates
- `schema` grounds SQL before execution
- `parquet_selective` stays the default path
- `full_download` exists as a measurable fallback, not the main happy path

Architecture and product intent:
- [`ARCHITECUTRE.md`](ARCHITECUTRE.md)
- [`BENCHMARK.md`](BENCHMARK.md)
- [`docs/foundation.md`](docs/foundation.md)
- [`docs/schema-exposure-audit.md`](docs/schema-exposure-audit.md)

## Who this is for

### Codex CLI / Claude Code

Use this repo when the agent needs:
- one Cargo package
- one binary entrypoint
- local CLI loops faster than standing up a bigger service
- repo-local guidance for parquet/object-storage workflows

Preferred agent loop:
1. `cargo run -- convert`
2. `cargo run -- schema`
3. `cargo run -- query`
4. `cargo run -- serve` only when the HTTP contract itself needs verification
5. `cargo run -- benchmark` before changing guidance around `full_download`

Example prompts:

```text
Use the duckquill-ops skill. Convert ./testdata/owid-covid-latest.csv to parquet, inspect schema, then run a parquet_selective query. If the workflow moves to S3 or MinIO, hand off to object-storage-ops.
```

#### Agent startup contract

For agent operators, the startup contract is:

1. read `AGENTS.md` first
2. load `.codex/skills/duckquill-ops/SKILL.md` for the broad install + operator workflow
3. if the workflow touches S3/MinIO/object storage, also load `.codex/skills/object-storage-ops/SKILL.md`
4. install with:
   - `cargo install duckquill` outside the repo
   - `cargo install --path .` inside a checkout
   - `cargo install --git https://github.com/sigridjineth/text2sql duckquill` only for unreleased repo changes
5. if `duckquill` is not on `PATH` yet, export `PATH="$HOME/.cargo/bin:$PATH"` and verify with `duckquill --help`
6. then run `convert -> schema -> query`
7. when answering an end user, cite the exact query/command you used and the result rows or aggregate that support the answer
8. start `serve` only when the HTTP contract itself needs verification
9. run `benchmark` before recommending `full_download` as a fallback

### Humans

Humans can use the binary directly, but the docs are optimized for agent operators first.

## What the binary exposes

CLI:
- `serve`
- `convert`
- `schema`
- `query`
- `benchmark`

HTTP:
- `GET /`
- `GET /health`
- `POST /schema`
- `POST /query`
- `POST /convert`
- `POST /benchmark`

Query modes:
- `parquet_selective` — the default and the recommended path for larger datasets plus remote/object-storage parquet
- `hybrid` — for local parquet files at or below 10 MiB, run the materialize-first `full_download` path; otherwise resolve to `parquet_selective`
- `full_download`

## Install

`duckquill` is published on crates.io:

```bash
cargo install duckquill
```

If you need unreleased changes from the current GitHub repository instead of the published crate:

```bash
cargo install --git https://github.com/sigridjineth/text2sql duckquill
```

For local development from this checkout:

```bash
cargo install --path .
```

If `duckquill` is not found after install, add Cargo's bin directory to your shell first:

```bash
export PATH="$HOME/.cargo/bin:$PATH"
duckquill --help
```

## Quickstart

### 1) Convert a real fixture to parquet

```bash
cargo run -- convert \
  --input ./testdata/owid-covid-latest.csv \
  --output ./tmp/owid-covid-latest.parquet \
  --overwrite
```

If a CSV is not UTF-8, pass an explicit encoding label:

```bash
cargo run -- convert \
  --input ./data/seoul-public-data.csv \
  --csv-encoding windows-1252 \
  --output ./data/seoul-public-data.parquet \
  --overwrite
```

For Korean public-data workflows, the same flag is where you would pass labels such as `cp949` or `euc-kr`.

### 2) Inspect schema first

```bash
cargo run -- schema \
  --dataset ./tmp/owid-covid-latest.parquet
```

### 3) Query from the CLI

```bash
cargo run -- query \
  --dataset ./tmp/owid-covid-latest.parquet \
  --sql "SELECT location, population FROM dataset WHERE continent = 'Asia' AND population IS NOT NULL ORDER BY population DESC LIMIT 3"
```

Query is intentionally sandboxed to **read-only analytical SQL** over the registered `dataset` alias:

- allowed: `SELECT ... FROM dataset ...`
- allowed: `WITH ... SELECT ... FROM dataset ...`
- rejected: `COPY`, `INSTALL`, `LOAD`, `ATTACH`, `CREATE`, `ALTER`, `DROP`, `INSERT`, `UPDATE`, `DELETE`, `SET`
- rejected: direct file/network access helpers such as `read_parquet(...)`, `parquet_scan(...)`, `read_csv(...)`, `read_csv_auto(...)`, `csv_scan(...)`, `read_ndjson(...)`, `read_json(...)`, `read_json_auto(...)`, `read_json_objects(...)`, `delta_scan(...)`, `iceberg_scan(...)`, `read_blob(...)`, `read_text(...)`, `glob(...)`, and parquet metadata helpers inside user SQL
- rejected: multi-statement SQL
- defense-in-depth: query connections also disable extension auto-install/auto-load, lock configuration changes, and restrict external file access to the configured dataset location only
- operator warning: validator + connection hardening reduce blast radius, but this is still not a full SQL sandbox; keep the CLI/HTTP query surfaces behind trusted local or single-tenant boundaries

### 4) Benchmark the current approach

```bash
cargo run -- benchmark --output-dir ./tmp/benchmark-report
```

See [`BENCHMARK.md`](BENCHMARK.md) for the measured local results and the Hugging Face comparison case.

## HTTP examples

### Start the server

```bash
cargo run -- serve --host 127.0.0.1 --port 3000
```

`GET /` returns the configured service name, crate version, and a small capability list for quick operator sanity checks.

### Inspect schema via HTTP

```json
{
  "dataset": {
    "uri": "./benchmark-data/benchmark-10000.parquet",
    "storage": { "type": "local" }
  },
  "table_name": "dataset"
}
```

`/schema` now returns typed columns plus a small `preview_rows` sample to help agents ground SQL generation with real row context.

### Query via HTTP

```json
{
  "sql": "SELECT symbol, AVG(trade_value) AS avg_trade_value FROM dataset WHERE trade_month = 9 GROUP BY symbol ORDER BY avg_trade_value DESC",
  "table_name": "dataset",
  "mode": "hybrid",
  "dataset": {
    "uri": "./benchmark-data/benchmark-10000.parquet",
    "storage": { "type": "local" }
  }
}
```

Use `hybrid` when you want the binary to choose the small-file fast path automatically:
- local parquet files `<= 10 MiB` resolve to `full_download`
- larger local parquet files and remote/object-storage datasets resolve to `parquet_selective`

The same read-only query contract applies to HTTP `/query`; it is not designed for arbitrary DuckDB session control or ad-hoc file/network access. Treat it as a local developer / agent workflow surface, not a publicly exposed multi-tenant SQL service.

## Object storage

The binary supports S3 / MinIO directly. `hybrid` is safe for mixed local/remote workflows because remote and object-storage datasets still resolve to `parquet_selective`.

Convert to S3:

```bash
TEXT2SQL_S3_REGION=ap-northeast-2 \
TEXT2SQL_S3_ENDPOINT=http://127.0.0.1:9000 \
TEXT2SQL_S3_ACCESS_KEY_ID=minioadmin \
TEXT2SQL_S3_SECRET_ACCESS_KEY=minioadmin \
TEXT2SQL_S3_ALLOW_HTTP=true \
cargo run -- convert \
  --input ./data/prices.csv \
  --output s3://bench/prices.parquet
```

For agent-facing workflow guidance, use:
- `.codex/skills/duckquill-ops/SKILL.md` for broad install + convert/schema/query/benchmark usage
- `.codex/skills/object-storage-ops/SKILL.md` for S3/MinIO-specific setup, remote Parquet, and credential debugging

## Current guidance

- keep `parquet_selective` as the default for larger datasets and object storage
- use `hybrid` when you want the binary to auto-pick `full_download` for local parquet files `<= 10 MiB` and otherwise stay parquet-first
- use `full_download` for debugging or when you intentionally want full materialization first
- quoted local parquet globs already work for multi-file queries, e.g. `--dataset './tmp/shard-*.parquet'`
- globbed parquet inputs do **not** currently add a filename/source column automatically; for file-by-file comparisons, either query each file separately or prepare Parquet with an explicit source column upstream
- blank numeric CSV cells already aggregate as NULL after convert; `TRY_CAST` is not required for that current path
- use `--csv-encoding <label>` when the CSV is not UTF-8
- spreadsheet ingestion currently passes XLS/XLSX cells through CSV text before Parquet inference; mixed numeric/text spreadsheet columns can therefore land as `Utf8`, so clean the column in the sheet first or export to CSV and convert with `--csv-encoding <label>` when typing matters
- when agents answer users from schema/query runs, include the exact command or SQL used plus the result rows/aggregates that support the answer
- benchmark before changing that recommendation
- load `.codex/skills/duckquill-ops/SKILL.md` for general operator guidance and hand off to `.codex/skills/object-storage-ops/SKILL.md` when the workflow becomes object-storage-specific
- do not claim selective reads lose matching data unless tests prove it

## Real fixtures in this repo

- `testdata/owid-covid-latest.csv`
- `testdata/owid-covid-latest.json`
- `testdata/canada-wastewater-aggregate.csv`
- `testdata/keyfoods_0708.xlsx`

## Acceptance snapshots

### Convert
![Local convert acceptance](docs/assets/acceptance/convert-local.png)

### Schema
![Local schema acceptance](docs/assets/acceptance/schema-local.png)

### Query
![Local query acceptance](docs/assets/acceptance/query-local.png)

### Benchmark
![Local benchmark acceptance](docs/assets/acceptance/benchmark-local.png)

## Verification

```bash
cargo fmt --check
cargo test
cargo check
cargo clippy --all-targets --all-features -- -D warnings
```

## Current status

What is proven locally:
- convert works
- schema works
- query works
- benchmark works
- a real Hugging Face parquet comparison case is documented in `BENCHMARK.md`

What is still environment-blocked on this machine:
- writable remote MinIO/S3 acceptance for live end-to-end object-storage benchmarking