sql-cli 1.79.1 - Docs.rs

# Feature Roadmap — 2026 Q2

_Living document. Written 2026-04-20 after the READ_TEXT/GREP/READ_WORDS landing._

Organised by **leverage per hour of work**, not urgency. The guiding principle
is the one-shop philosophy (see `docs/` and project memory): prefer features
that make a *new class of data queryable* over yet another scalar function.

---

## Tier 1 — Readers (highest leverage)

Each reader below turns a new class of data into a queryable table. They all
follow the established `TableGenerator` pattern in
`src/sql/generators/` — look at `file_readers.rs` (READ_TEXT, GREP, READ_WORDS)
as the template.

### READ_JSONL(path [, match_regex])  ✅ landed (2026-05-04)

- Newline-delimited JSON. One row per JSON object; schema is the *union* of
  keys across the first 100 records so heterogeneous logs (auth events with
  user_id, http events with status, db events with query_ms) coexist as
  columns. Optional regex pre-filters lines before JSON parsing.
- The same auto-detect (array vs JSONL) was wired into the file-level JSON
  loader, so `sql-cli logs.json` Just Works either way. `.jsonl` and `.ndjson`
  extensions are recognised alongside `.json`.
- See `examples/jsonl_logs.sql` and `data/app_logs.jsonl` for the demo.

### READ_FIELDS(path, delim_regex [, match_regex])

- The awk-style reader we designed but didn't build on 2026-04-19.
- Yields `(line_num, field_num, field)` — user pivots with `MAX(CASE WHEN
  field_num = 1 THEN field END) AS col1` etc.
- Default `delim_regex` to `\s+` when NULL/omitted (classical awk behavior
  already agreed).
- Real log files (access.log, syslog, colon-separated configs) need this.

### READ_STDIN([match_regex])  ✅ landed (2026-05-17)

- `cat file | sql-cli -q "SELECT ... FROM READ_STDIN()"` works.
- Yields `(line_num, line)` like READ_TEXT; optional regex pre-filter.
- Cached once per process via `OnceLock` so multiple `READ_STDIN()` calls in
  the same query (CTE self-joins, etc.) see the same rows — stdin is
  consumable, this preserves composability.
- TTY-check on `is_terminal()` short-circuits with a helpful error instead of
  blocking forever on keyboard input.
- See `examples/stdin_pipeline.sql` for usage.

### Stdin sentinel `-` for all line-based readers  ✅ landed (2026-05-17)

- `READ_TEXT('-')`, `READ_JSONL('-')`, `GREP('-', pat)`, `READ_WORDS('-')` all
  read stdin (cached, same buffer shared with `READ_STDIN()`).
- Critically this unlocks **field-level JSONL pipelines** without needing a
  scalar `JSON_VALUE` function:
  ```
  cat events.jsonl | sql-cli -q "SELECT level, COUNT(*) FROM READ_JSONL('-')
                                  GROUP BY level"
  ```
- Universal Unix `-` convention; no surprise for shell users.

### READ_CSV(path)  ✅ landed (2026-05-17)

- `READ_CSV('data/trades.csv')` and `cat foo.csv | sql-cli -q "SELECT ... FROM
  READ_CSV('-')"` both work.
- Reuses `AdvancedCsvLoader::load_csv_from_reader` so type inference, string
  interning, and column-categorisation heuristics are inherited from the main
  CSV loader — behaviour matches `sql-cli file.csv -q "..."`.
- v1 is minimal (no `has_header`/`delimiter` args). Add those later if a real
  use case asks for them.

With READ_CSV the reader family now covers the four core formats — CSV, JSON
array files, JSONL, plain text — and all of them accept `-` for stdin. This
is the "complete loop" the user flagged on 2026-05-17.

#### Arbitrary delimiter argument  ✅ landed (2026-05-23)

```sql
READ_CSV(path [, delimiter])    -- ',' default; '|' for PSV, '\t' for TSV
```

Delivered across four commits, all on the shared `CsvReadOptions` plumbing
(`da3e739` infra → `c32986a` READ_CSV arg → `a488fac` CLI flag → `1515112`
WEB CTE clause):

- **READ_CSV 2nd arg** resolves the delimiter as: explicit arg wins → else
  extension auto-detect (`.tsv` → tab, `.psv` → pipe) → else comma (and comma
  for stdin `-`). `parse_delimiter_arg` accepts `\t`/`\n`/`\r` escapes since
  literal tabs in SQL strings are awkward; multi-char / non-ASCII error clearly.
- **Extension auto-detect for every caller** — the convenience loaders call
  `detect_delimiter_from_path` internally, so `sql-cli sales.psv` Just Works.
- **CLI `--delimiter` flag** for explicit override at the file level.
- **WEB CTE `DELIMITER` clause** for non-comma HTTP sources.
- Resolved delimiter is recorded in `table.metadata["delimiter"]` for F5
  visibility; `is_null_field` is parameterised on the delimiter so NULL
  detection works for non-comma sources.
- Demo: `examples/iris_tsv.sql` over `data/iris.tsv`.

This closes the four-format reader loop with full delimiter flexibility. The
only deferred follow-up is optional `has_header` (Bool, default true) for
headerless CSVs where we'd synthesise `col_1`, `col_2`, etc. — left until a
real use case asks for it.

### Glob support in existing readers

- `READ_TEXT('logs/*.log')` yields an extra `source_file` column.
- One new column; query shape unchanged.
- Multi-file ingestion without UNION ALL is a big ergonomic win.

### READ_HTTP(url [, headers_json])

- We have web CTE but a first-class reader is more discoverable.
- Natural pair with existing HTTP fetcher (see `src/web/http_fetcher.rs`).
- Consider returning raw body + a second reader like READ_JSON_HTTP that parses.

### READ_PARQUET(path)

- Columnar format is common in data work. Would need the `parquet` crate.
- Heavier lift (schema handling, type mapping) — only tackle if there's a
  clear use case, since it adds a non-trivial dependency.

---

## Tier 2 — Ship what we have

### examples/everything.sql

- Already planned in project memory. Curated tour of the whole engine
  (~30-40 deterministic queries). Ideal for a lull — pure composition,
  no new code, doubles as formal regression test via captured expectations.
- Keep deterministic: no timestamps, no filesystem paths, use RANGE/UNION ALL
  for data generation where possible.

### More captured expectations

- `examples/text_processing_demo.sql` is a great candidate to promote from
  smoke test to formal. `uv run python tests/integration/test_examples.py
  --capture text_processing_demo` after confirming output is stable.
- Each captured expectation = one more regression guard.

### README refresh

- The one-shop story deserves a headline section. Current README undersells
  what makes this project distinctive.
- Concrete "wow" query example: full-book scan at sub-100ms, heterogeneous
  sources in one query (join CSV with READ_TEXT output), etc.

### Benchmark integration

- `benchmark_baseline.json` and `benchmark_results_20260411_223032.json` sit
  uncommitted in the repo root. Either wire them into CI (perf regression
  guard) or delete them.

### Session-level config for safety caps

- `MAX_LINES_PER_FILE = 1_000_000` is hardcoded in `file_readers.rs`.
- Expose via `SET max_lines_per_file = N` so power users can raise it
  without recompiling. Same mechanism can cover future reader caps.

---

## Tier 3 — Engine gaps (narrower, lower leverage)

Per `project_leetcode_gap_next.md`, end-to-end coverage sits at 87%. Remaining
gaps are small and contained.

### MySQL DATE_FORMAT specifier translation  ✅ landed (2026-05-31)

- `DateFormatFunction` now runs a single-pass `translate_mysql_date_format`
  before delegating to `FormatDateFunction`. MySQL-only specifiers are remapped:
  `%i`→`%M` (minute), `%M`→`%B` (full month name), `%W`→`%A` (full weekday),
  `%s`→`%S` (seconds, dodging strftime's epoch `%s`), `%f`→`%6f` (microseconds).
  Shared specifiers (`%Y %m %d %H %h %p %a %b %r %T %%` …) pass through verbatim.
- Single-pass is deliberate: a sequential string-replace would double-translate
  because `%i`→`%M` and `%M`→`%B` collide on the intermediate `%M`.
- `DATE_FORMAT('...', '%W, %M %d, %Y %H:%i:%s')` → `Sunday, May 31, 2026 14:05:09`.
- 4 unit tests in `format.rs::date_format_tests`. Demo + formal regression
  guard: `examples/date_format_mysql.sql` →
  `examples/expectations/date_format_mysql.json`.

### BETWEEN-in-CASE → now landed (2026-04-19)

- ✅ Fixed. LC 3705 runs verbatim.

### GROUP BY output projection bug

- `SELECT user_id + 10 AS shifted FROM t GROUP BY user_id` outputs
  `user_id, shifted` instead of just `shifted`. Real defect.
- **Risk:** may be load-bearing — other code might depend on group keys
  being present in the output. Audit carefully before changing; probe path
  is `apply_group_by` in the query engine.

### HAVING expressions parity — probe first

- HAVING arithmetic + aliases work. Systematic sweep: does HAVING+CASE,
  HAVING+function-calls, HAVING+scalar-subqueries work? File gaps as they
  surface.

### 608 InSubquery in CASE  ✅ landed (2026-05-31)

- Was broken (verified): `CASE WHEN x IN (SELECT ...) THEN ...` errored with
  `Unsupported expression type for arithmetic evaluation`. Two-part root cause,
  both fixed:
  1. `SubqueryExecutor::process_expression` didn't recurse into `CaseExpression`
     / `SimpleCaseExpression` / `Not`, so the subquery was never pre-executed
     and rewritten to an `InList`. Added those arms (line ~345 used to say
     `// CaseWhen doesn't exist in current AST, skip for now`).
  2. Once rewritten to `InList`/`NotInList`, the *arithmetic* evaluator had no
     dispatch for those nodes (only the WHERE evaluator did). Added `InList`/
     `NotInList` arms to `arithmetic_evaluator.rs` returning Boolean via the
     shared `compare_with_op`.
- Same fix also closes Tier 3b's "scalar subquery inside CASE" (below).
- Demo + formal regression guard: `examples/subquery_in_case.sql` (scalar,
  IN, and NOT IN subqueries inside CASE), captured to
  `examples/expectations/subquery_in_case.json`.

### Functional dependency relaxation (LC 3521, 3601) — needs scoping

- MySQL allows `SELECT p1.category FROM ... GROUP BY product_id` when
  category is FD on the grouped key. Don't implement blind.
- Three options: (a) full FD analysis, (b) "pick any value from group" with
  config flag, (c) detect JOIN-primary-key case specifically.
- Have the conversation first.

### Permanent out of scope

- LC 177 CREATE FUNCTION (procedural SQL).

---

## Tier 3b — Quirks surfaced while writing showcase SQL

Found while building `examples/us_states.sql` (2026-05-04). Each is small,
contained, and has a clean workaround — so they don't block the showcase, but
they're worth documenting before they're forgotten. Reproducers all use
`data/us_states.csv`.

### `__hidden_orderby_1` error when ORDER BY references the inner alias of a GROUP BY CTE

```sql
WITH lengths AS (SELECT name, LENGTH(name) AS n FROM us_states)
SELECT n AS name_length, COUNT(*) AS states
FROM lengths GROUP BY n
ORDER BY n;       -- → "Column '__hidden_orderby_1' not found"
```

- Workaround: `ORDER BY name_length` (the projected alias).
- Hidden-orderby promotion (the same machinery that fixed LC 1341) doesn't
  account for GROUP BY rewriting `n` out of the post-aggregation projection.
- ~probably a 10-line fix in whatever resolves the hidden-orderby column
  against the group-by output schema.

### Scalar subqueries inside CASE not supported in arithmetic eval  ✅ landed (2026-05-31)

```sql
SELECT CASE WHEN latitude = (SELECT MAX(latitude) FROM us_states)
            THEN 'Northernmost' END FROM us_states;   -- now works
```

- Fixed alongside Tier 3's "608 InSubquery in CASE" (same `SubqueryExecutor`
  recursion gap — see above). `SubqueryExecutor::process_expression` now
  descends into CASE branches, so the scalar subquery is pre-executed and
  materialised to a literal before arithmetic evaluation.
- The four-compass-extremes showcase query can now be a single CASE-labelled
  query instead of four `ORDER BY ... LIMIT 1` batches.

### Cross-CTE column equality silently filters to 0 rows

```sql
WITH conus AS (...),
     agg AS (SELECT MAX(latitude) AS lat_max FROM conus)
SELECT c.name FROM conus c CROSS JOIN agg a
WHERE c.latitude = a.lat_max;       -- → 0 rows
```

- The CROSS JOIN expands rows correctly (verified with `SELECT * LIMIT 3`),
  but post-join equality between a base column and an aggregate-CTE column
  yields no matches. Same shape in `JOIN ... ON OR ...` returns just the
  first matching row.
- Suspicion: type or precision mismatch when comparing the float column
  against the aggregated value. Worth a test case before fixing.
- Workaround: use a scalar subquery in WHERE — `WHERE latitude = (SELECT MAX(latitude) FROM conus)`
  works.

### UNION ALL legs ignore inner ORDER BY/LIMIT and lose per-leg literals

```sql
SELECT 'A' AS dir, name FROM us_states ORDER BY latitude DESC LIMIT 1
UNION ALL
SELECT 'B' AS dir, name FROM us_states ORDER BY latitude ASC LIMIT 1;
-- → all rows from both sides, all labelled 'A'
```

- Each leg's `ORDER BY ... LIMIT 1` is dropped, AND the literal alias from
  the first leg propagates to all rows.
- Wrapping each leg in `SELECT * FROM (... LIMIT 1) sub` doesn't help.
- Probable cause: the UNION ALL planner pulls SELECT lists from the first
  leg only and doesn't preserve per-leg limit/order. Two distinct bugs in
  one place.
- Workaround in showcase: four separate `GO` batches.

---

## Suggested working cadence

- **One reader per session** when there's time for a full slice.
- **Tier 2 items** (everything.sql, README, expectations) fit into shorter sessions.
- **Tier 3 engine gaps** are good "while waiting for something else" work.
- Update this document as items land — tick them off, add new gaps as they
  surface.

## Related reading

- `docs/DEVELOPMENT_ROADMAP.md` — older roadmap (bitwise ops, mostly shipped).
- `src/sql/generators/file_readers.rs` — template for new readers.
- Project memory: `project_leetcode_gap_next.md`, `project_design_philosophy.md`,
  `project_everything_sql.md`.