duckquill 0.2.2

Parquet-backed text2sql engine and CLI for schema-first querying workflows
Documentation
# duckquill

A Rust text2sql engine for agents, packaged as `duckquill`.

The primary user is an agent running in **Codex CLI** or **Claude Code** that opens this repo, reads `AGENTS.md` plus `.codex/skills/object-storage-ops/SKILL.md`, and drives one binary through a schema-first workflow.

## Core idea

```text
raw file -> parquet -> schema -> query -> benchmark
```

Why this shape:
- Parquet lets DuckDB prune columns and push predicates
- `schema` grounds SQL before execution
- `parquet_selective` stays the default path
- `full_download` exists as a measurable fallback, not the main happy path

Architecture and product intent:
- [`ARCHITECUTRE.md`]ARCHITECUTRE.md
- [`BENCHMARK.md`]BENCHMARK.md

## Who this is for

### Codex CLI / Claude Code

Use this repo when the agent needs:
- one Cargo package
- one binary entrypoint
- local CLI loops faster than standing up a bigger service
- repo-local guidance for parquet/object-storage workflows

Preferred agent loop:
1. `cargo run -- convert`
2. `cargo run -- schema`
3. `cargo run -- query`
4. `cargo run -- serve` only when the HTTP contract itself needs verification
5. `cargo run -- benchmark` before changing guidance around `full_download`

Example prompt:

```text
Use the object-storage-ops skill. Convert ./testdata/owid-covid-latest.csv to parquet, inspect schema, then run a parquet_selective query.
```

### Humans

Humans can use the binary directly, but the docs are optimized for agent operators first.

## What the binary exposes

CLI:
- `serve`
- `convert`
- `schema`
- `query`
- `benchmark`

HTTP:
- `GET /health`
- `POST /schema`
- `POST /query`
- `POST /convert`
- `POST /benchmark`

Query modes:
- `hybrid` — for local parquet files at or below 10 MiB, run the materialize-first `full_download` path; otherwise resolve to `parquet_selective`
- `parquet_selective`
- `full_download`

## Install

`duckquill` is published on crates.io:

```bash
cargo install duckquill
```

If you need unreleased changes from the current GitHub repository instead of the published crate:

```bash
cargo install --git https://github.com/sigridjineth/text2sql duckquill
```

For local development from this checkout:

```bash
cargo install --path .
```

## Quickstart

### 1) Convert a real fixture to parquet

```bash
cargo run -- convert \
  --input ./testdata/owid-covid-latest.csv \
  --output ./tmp/owid-covid-latest.parquet \
  --overwrite
```

If a CSV is not UTF-8, pass an explicit encoding label:

```bash
cargo run -- convert \
  --input ./data/seoul-public-data.csv \
  --csv-encoding windows-1252 \
  --output ./data/seoul-public-data.parquet \
  --overwrite
```

For Korean public-data workflows, the same flag is where you would pass labels such as `cp949` or `euc-kr`.

### 2) Inspect schema first

```bash
cargo run -- schema \
  --dataset ./tmp/owid-covid-latest.parquet
```

### 3) Query from the CLI

```bash
cargo run -- query \
  --dataset ./tmp/owid-covid-latest.parquet \
  --sql "SELECT location, population FROM dataset WHERE continent = 'Asia' AND population IS NOT NULL ORDER BY population DESC LIMIT 3"
```

Query is intentionally sandboxed to **read-only analytical SQL** over the registered `dataset` alias:

- allowed: `SELECT ... FROM dataset ...`
- allowed: `WITH ... SELECT ... FROM dataset ...`
- rejected: `COPY`, `INSTALL`, `LOAD`, `ATTACH`, `CREATE`, `ALTER`, `DROP`, `INSERT`, `UPDATE`, `DELETE`, `SET`
- rejected: direct file/network access helpers such as `read_parquet(...)`, `parquet_scan(...)`, `read_csv(...)`, `csv_scan(...)`, `read_json(...)`, `read_blob(...)`, `read_text(...)`, and parquet metadata helpers inside user SQL
- rejected: multi-statement SQL
- defense-in-depth: query connections also disable extension auto-install/auto-load, lock configuration changes, and restrict external file access to the configured dataset location only
- operator warning: validator + connection hardening reduce blast radius, but this is still not a full SQL sandbox; keep the CLI/HTTP query surfaces behind trusted local or single-tenant boundaries

### 4) Benchmark the current approach

```bash
cargo run -- benchmark --output-dir ./tmp/benchmark-report
```

See [`BENCHMARK.md`](BENCHMARK.md) for the measured local results and the Hugging Face comparison case.

## HTTP examples

### Start the server

```bash
cargo run -- serve --host 127.0.0.1 --port 3000
```

### Inspect schema via HTTP

```json
{
  "dataset": {
    "uri": "./benchmark-data/benchmark-10000.parquet",
    "storage": { "type": "local" }
  },
  "table_name": "dataset"
}
```

`/schema` now returns typed columns plus a small `preview_rows` sample to help agents ground SQL generation with real row context.

### Query via HTTP

```json
{
  "sql": "SELECT symbol, AVG(trade_value) AS avg_trade_value FROM dataset WHERE trade_month = 9 GROUP BY symbol ORDER BY avg_trade_value DESC",
  "table_name": "dataset",
  "mode": "hybrid",
  "dataset": {
    "uri": "./benchmark-data/benchmark-10000.parquet",
    "storage": { "type": "local" }
  }
}
```

Use `hybrid` when you want the binary to choose the small-file fast path automatically:
- local parquet files `<= 10 MiB` resolve to `full_download`
- larger local parquet files and remote/object-storage datasets resolve to `parquet_selective`

The same read-only query contract applies to HTTP `/query`; it is not designed for arbitrary DuckDB session control or ad-hoc file/network access. Treat it as a local developer / agent workflow surface, not a publicly exposed multi-tenant SQL service.

## Object storage

The binary supports S3 / MinIO directly.

Convert to S3:

```bash
TEXT2SQL_S3_REGION=ap-northeast-2 \
TEXT2SQL_S3_ENDPOINT=http://127.0.0.1:9000 \
TEXT2SQL_S3_ACCESS_KEY_ID=minioadmin \
TEXT2SQL_S3_SECRET_ACCESS_KEY=minioadmin \
TEXT2SQL_S3_ALLOW_HTTP=true \
cargo run -- convert \
  --input ./data/prices.csv \
  --output s3://bench/prices.parquet
```

For agent-facing object-storage workflow guidance, use:
- `.codex/skills/object-storage-ops/SKILL.md`

## Current guidance

- keep `parquet_selective` as the default for larger datasets and object storage
- use `hybrid` when you want the binary to auto-pick `full_download` for local parquet files `<= 10 MiB` and otherwise stay parquet-first
- use `full_download` for debugging or when you intentionally want full materialization first
- quoted local parquet globs already work for multi-file queries, e.g. `--dataset './tmp/shard-*.parquet'`
- blank numeric CSV cells already aggregate as NULL after convert; `TRY_CAST` is not required for that current path
- use `--csv-encoding <label>` when the CSV is not UTF-8
- benchmark before changing that recommendation
- do not claim selective reads lose matching data unless tests prove it

## Real fixtures in this repo

- `testdata/owid-covid-latest.csv`
- `testdata/owid-covid-latest.json`
- `testdata/canada-wastewater-aggregate.csv`
- `testdata/keyfoods_0708.xlsx`

## Acceptance snapshots

### Convert
![Local convert acceptance](docs/assets/acceptance/convert-local.png)

### Schema
![Local schema acceptance](docs/assets/acceptance/schema-local.png)

### Query
![Local query acceptance](docs/assets/acceptance/query-local.png)

### Benchmark
![Local benchmark acceptance](docs/assets/acceptance/benchmark-local.png)

## Verification

```bash
cargo fmt --check
cargo test
cargo check
cargo clippy --all-targets --all-features -- -D warnings
```

## Current status

What is proven locally:
- convert works
- schema works
- query works
- benchmark works
- a real Hugging Face parquet comparison case is documented in `BENCHMARK.md`

What is still environment-blocked on this machine:
- writable remote MinIO/S3 acceptance for live end-to-end object-storage benchmarking