sql-cli 1.73.1

SQL query tool for CSV/JSON with both interactive TUI and non-interactive CLI modes - perfect for exploration and automation
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
# SQL CLI Expressive Power Analysis & AST Rewriting Strategy

## Executive Summary

**Key Finding:** SQL CLI has **near-parity expressive power** between subqueries and top-level queries. The primary limitation is **correlated subqueries**, which require variable binding/row context passing.

**Strategic Recommendation:** Implement an **AST Preprocessing/Rewriting layer** to automatically transform unsupported SQL patterns into supported CTE-based equivalents. This avoids massive executor changes while expanding SQL compatibility.

---

## Current Expressive Power Matrix

### ✅ What Works EVERYWHERE (Top-level, Subqueries, CTEs)

| Feature | Top-Level | FROM Subquery | WHERE Subquery | SELECT Scalar | CTE |
|---------|-----------|---------------|----------------|---------------|-----|
| Window Functions ||| N/A |||
| Aggregates (GROUP BY) ||| N/A |||
| JOINs ||| N/A |||
| JOIN Expressions (Phase 1+2) ||| N/A |||
| UNION/INTERSECT/EXCEPT ||||||
| ORDER BY ||| N/A | N/A ||
| LIMIT ||| N/A | N/A ||
| Functions (all 200+) ||||||
| Expressions (arithmetic, etc.) ||||||
| CASE expressions ||||||

**Conclusion:** We have **full expressive parity** for non-correlated operations!

---

## ❌ What Doesn't Work (and Why)

### 1. Correlated Subqueries

**Example:**
```sql
-- Does NOT work: o.id reference inside subquery
SELECT
    o.id,
    (SELECT MAX(amount) FROM orders WHERE customer_id = o.id) as max_order
FROM customers o;
```

**Why it fails:**
- Subquery executor has no access to "outer row" context
- Would need to pass `o.id` binding into inner query
- Requires variable substitution mechanism

**Can it be rewritten?** ✅ **YES** (with limitations)

```sql
-- Rewrite as CTE with JOIN
WITH order_maxes AS (
    SELECT customer_id, MAX(amount) as max_order
    FROM orders
    GROUP BY customer_id
)
SELECT
    o.id,
    om.max_order
FROM customers o
LEFT JOIN order_maxes om ON o.id = om.customer_id;
```

**Limitation:** Only works if the correlated reference is in a JOIN/WHERE condition, not in arbitrary expressions.

### 2. Lateral Joins (CROSS APPLY style)

**Example:**
```sql
-- Does NOT work
SELECT *
FROM customers c
CROSS APPLY (SELECT TOP 3 * FROM orders WHERE customer_id = c.id ORDER BY date DESC) AS recent_orders;
```

**Can it be rewritten?** ⚠️ **PARTIALLY**

Only if we can express it as a window function:

```sql
WITH ranked_orders AS (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date DESC) as rn
    FROM orders
)
SELECT c.*, ro.*
FROM customers c
JOIN ranked_orders ro ON c.id = ro.customer_id
WHERE ro.rn <= 3;
```

### 3. Recursive CTEs with Depth/Breadth Control

**Example:**
```sql
-- Works but no way to limit depth during recursion
WITH RECURSIVE hierarchy AS (
    SELECT id, parent_id, 1 as depth
    FROM nodes
    WHERE parent_id IS NULL
    UNION ALL
    SELECT n.id, n.parent_id, h.depth + 1
    FROM nodes n
    JOIN hierarchy h ON n.parent_id = h.id
    WHERE h.depth < 5  -- This works but is inefficient
)
SELECT * FROM hierarchy;
```

**Can it be rewritten?** ✅ **YES** (already supported)

Our recursive CTE implementation already supports this!

### 4. UPDATE/DELETE with Correlated Subqueries

**Example:**
```sql
-- Does NOT work (no UPDATE/DELETE support at all)
UPDATE products
SET price = (SELECT AVG(price) FROM products WHERE category = products.category);
```

**Can it be rewritten?** ❌ **NO** (would need DML support)

This requires a full DML implementation, not just AST rewriting.

---

## AST Preprocessing/Rewriting Strategy

### Architecture

```
┌─────────────┐
│   Parser    │
│             │
│  SQL Text   │
└─────┬───────┘
      │ Produces AST
┌─────────────────────────────────┐
│   AST Preprocessor/Rewriter     │◄──── NEW LAYER
│                                 │
│  Pattern Detection & Rewriting  │
└─────┬───────────────────────────┘
      │ Transformed AST
┌─────────────┐
│  Executor   │
│             │
│ (Unchanged) │
└─────────────┘
```

### Rewriting Rules

#### Rule 1: Scalar Correlated Subquery → CTE + JOIN

**Pattern:**
```sql
SELECT
    a.id,
    (SELECT aggregate_fn(...) FROM b WHERE b.fk = a.id) as result
FROM a;
```

**Rewrite to:**
```sql
WITH _cte_agg AS (
    SELECT fk, aggregate_fn(...) as result
    FROM b
    GROUP BY fk
)
SELECT
    a.id,
    _cte_agg.result
FROM a
LEFT JOIN _cte_agg ON a.id = _cte_agg.fk;
```

**Applicability:** ✅ Works for aggregates with simple correlation (equality in WHERE)

#### Rule 2: IN Correlated Subquery → EXISTS via SEMI JOIN

**Pattern:**
```sql
SELECT * FROM a
WHERE a.id IN (SELECT b.fk FROM b WHERE b.status = 'active');
```

**Rewrite to:**
```sql
WITH _cte_exists AS (
    SELECT DISTINCT fk FROM b WHERE status = 'active'
)
SELECT a.*
FROM a
JOIN _cte_exists ON a.id = _cte_exists.fk;
```

**Applicability:** ✅ Always works for IN/EXISTS patterns

#### Rule 3: Top-N per Group → Window Function

**Pattern:**
```sql
SELECT *
FROM orders o
WHERE (
    SELECT COUNT(*) FROM orders o2
    WHERE o2.customer_id = o.customer_id AND o2.date > o.date
) < 3;
```

**Rewrite to:**
```sql
WITH ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date DESC) as rn
    FROM orders
)
SELECT * FROM ranked WHERE rn <= 3;
```

**Applicability:** ✅ Works for many "top-N per group" patterns

#### Rule 4: Nested Aggregates → Multiple CTEs

**Pattern:**
```sql
SELECT
    dept,
    AVG(total_sales) as avg_dept_sales
FROM (
    SELECT
        employee_id,
        dept,
        SUM(amount) as total_sales
    FROM sales
    GROUP BY employee_id, dept
) AS emp_sales
GROUP BY dept;
```

**Rewrite to:**
```sql
WITH emp_sales AS (
    SELECT
        employee_id,
        dept,
        SUM(amount) as total_sales
    FROM sales
    GROUP BY employee_id, dept
),
dept_avgs AS (
    SELECT
        dept,
        AVG(total_sales) as avg_dept_sales
    FROM emp_sales
    GROUP BY dept
)
SELECT * FROM dept_avgs;
```

**Applicability:** ✅ Always works (actually simplifies the query!)

---

## What CAN'T Be Rewritten?

### 1. True Row-by-Row Correlation with Complex Expressions

```sql
-- Cannot rewrite if correlation involves complex computation
SELECT
    id,
    (SELECT COUNT(*) FROM events e WHERE DATEDIFF(e.date, outer.created) < outer.threshold_days)
FROM configs outer;
```

**Why:** The `outer.threshold_days` is used in an expression, not just a JOIN condition.

**Workaround:** User must restructure using CROSS JOIN + WHERE (manual CTE)

### 2. Mutually Recursive Queries

```sql
-- Cannot represent without special syntax
WITH RECURSIVE
    even_nums(n) AS (SELECT 1 UNION ALL SELECT n+1 FROM odd_nums WHERE n < 10),
    odd_nums(n) AS (SELECT 0 UNION ALL SELECT n+1 FROM even_nums WHERE n < 10)
SELECT * FROM even_nums UNION ALL SELECT * FROM odd_nums;
```

**Why:** Circular dependency between CTEs

**Status:** Theoretically possible but requires dependency resolution

### 3. Lateral Joins with True Lateral Correlation

```sql
-- True lateral join where function depends on left row
SELECT * FROM customers c
CROSS APPLY generate_series(1, c.num_orders) AS s(i);
```

**Why:** The `c.num_orders` directly parameterizes a table-valued function

**Workaround:** Only possible if we implement table-valued function "unrolling"

---

## Implementation Plan: AST Preprocessor

### Phase 1: Infrastructure (Week 1-2)

**Goal:** Build the preprocessing framework

```rust
// src/preprocessing/mod.rs
pub trait ASTRewriter {
    fn rewrite(&self, ast: SelectStatement) -> Result<SelectStatement>;
}

pub struct PreprocessorPipeline {
    rewriters: Vec<Box<dyn ASTRewriter>>,
}

impl PreprocessorPipeline {
    pub fn apply(&self, ast: SelectStatement) -> Result<SelectStatement> {
        let mut current = ast;
        for rewriter in &self.rewriters {
            current = rewriter.rewrite(current)?;
        }
        Ok(current)
    }
}
```

**Files:**
- `src/preprocessing/mod.rs` - Pipeline framework
- `src/preprocessing/pattern_matcher.rs` - AST pattern matching utilities
- `src/preprocessing/cte_builder.rs` - Helper to construct CTE nodes

### Phase 2: Correlated Scalar Subquery Rewriter (Week 3-4)

**Goal:** Rewrite scalar correlated subqueries to CTE + JOIN

**Algorithm:**
1. Detect `ScalarSubquery` in SELECT list
2. Analyze subquery for correlation (references to outer query)
3. If correlation is simple equality in WHERE → extract to CTE
4. Generate aggregate CTE with GROUP BY on correlated column
5. Replace scalar subquery with LEFT JOIN to CTE

**Benefit:** Enables most common correlated subquery patterns

### Phase 3: IN/EXISTS Subquery Rewriter (Week 5)

**Goal:** Rewrite IN/EXISTS to SEMI JOIN pattern

**Algorithm:**
1. Detect `InSubquery` in WHERE clause
2. Extract subquery to CTE with DISTINCT
3. Replace IN with INNER JOIN (or LEFT JOIN + IS NOT NULL check)

**Benefit:** Performance + readability

### Phase 4: Top-N per Group Rewriter (Week 6)

**Goal:** Rewrite top-N correlated patterns to window functions

**Pattern Detection:**
```sql
WHERE (SELECT COUNT(*) FROM same_table WHERE correlation AND condition) < N
```

**Rewrite:**
```sql
WITH ranked AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY correlated_col ORDER BY condition) as rn
    FROM same_table
)
SELECT * FROM ranked WHERE rn <= N
```

### Phase 5: Documentation & Examples (Week 7)

- Document all rewriting rules
- Provide before/after examples
- Add debugging mode to show rewritten queries
- Performance benchmarks

---

## Debugging & Observability

### Show Rewritten Query

```bash
# New flag to see what the preprocessor did
./target/release/sql-cli -q "SELECT ..." --show-rewritten
```

Output:
```
=== Original Query ===
SELECT id, (SELECT MAX(amount) FROM orders WHERE cust_id = c.id) FROM customers c;

=== Rewritten Query ===
WITH _cte_0 AS (
    SELECT cust_id, MAX(amount) as _result_0 FROM orders GROUP BY cust_id
)
SELECT c.id, _cte_0._result_0
FROM customers c LEFT JOIN _cte_0 ON c.id = _cte_0.cust_id;

=== Execution ===
...
```

### Statistics

Track which rewrites are applied:
```
Preprocessing Statistics:
- Scalar correlated subqueries hoisted: 3
- IN subqueries converted to SEMI JOIN: 1
- Top-N patterns converted to window functions: 0
- Total preprocessing time: 2.3ms
```

---

## Benefits of This Approach

### 1. **No Executor Changes** ✅
- Executor sees standard CTEs and JOINs
- No need to implement variable binding
- No need to thread row context through subqueries

### 2. **Automatic Optimization** ✅
- Rewrites often produce MORE efficient queries
- CTE materialization vs repeated subquery execution
- Enables future CTE optimization (memoization, etc.)

### 3. **SQL Compatibility** ✅
- Users can write "natural" SQL with subqueries
- Under the hood, we convert to efficient CTEs
- Best of both worlds

### 4. **Debuggability** ✅
- `--show-rewritten` lets users see what's happening
- Learn CTE patterns by seeing rewrites
- Easier to optimize when you see the CTE form

### 5. **Incremental Implementation** ✅
- Add rewriters one at a time
- Each rewriter is independent
- Can ship partial functionality

---

## Risks & Limitations

### 1. **Not All Correlations Can Be Rewritten**
Some complex expressions with outer references can't be converted to CTEs.

**Mitigation:**
- Detect unrewritable patterns
- Provide clear error message
- Suggest manual CTE refactoring

### 2. **Query Plan Changes**
Rewriting changes execution plan - could be slower in some cases.

**Mitigation:**
- Benchmark rewrites
- Add `--no-preprocessing` flag to disable
- Make individual rewrites configurable

### 3. **Debugging Complexity**
Users see error messages on rewritten query, not original.

**Mitigation:**
- Map error locations back to original query
- Always show both queries in verbose mode

### 4. **Maintenance Burden**
More code to maintain, more edge cases.

**Mitigation:**
- Comprehensive test suite for each rewriter
- Clear documentation of rewrite rules
- Modular design allows disabling buggy rewrites

---

## Success Metrics

### Coverage
- **Target:** 90% of common correlated subquery patterns rewritable
- **Measure:** Test against TPC-H, real-world SQL samples

### Performance
- **Target:** Rewrites should not slow queries by >10%
- **Measure:** Benchmark suite comparing original CTE vs rewritten

### Usability
- **Target:** 80% of users shouldn't need to understand rewrites
- **Measure:** User feedback, support tickets

---

## Alternative: Full Correlated Subquery Implementation

### What It Would Require

1. **Variable Binding System**
   - Pass outer row values into subquery
   - Scope management (nested subqueries)
   - Type checking across scopes

2. **Execution Model Changes**
   - Execute subquery once per outer row (O(n×m) complexity)
   - Cache subquery results for same parameter values
   - Thread safety for concurrent evaluation

3. **Optimizer Integration**
   - Detect when correlated subquery can be decorrelated
   - Choose between nested loop, hash join, etc.
   - Cost-based decision making

**Estimated Effort:** 6-8 weeks vs 4-6 weeks for preprocessor

**Risk:** Much higher - touches executor core

**Benefit:** Marginally better for edge cases that can't be rewritten

**Recommendation:** ❌ **Don't do it** - preprocessor is better ROI

---

## Conclusion

### What We Know

1. **Expressive Power:** Near-complete parity between top-level and subqueries (except correlated)
2. **Gap:** Correlated subqueries are the main limitation
3. **Solution:** 90%+ of correlated subqueries can be rewritten to CTEs
4. **Strategy:** AST preprocessing is lower risk than executor changes

### Recommended Path Forward

**Phase 1 (Immediate):** Build AST preprocessing infrastructure

**Phase 2 (Next):** Implement scalar correlated subquery rewriter

**Phase 3 (Future):** Add IN/EXISTS and Top-N rewriters

**Phase 4 (Polish):** Debugging tools and documentation

### What This Enables

Users can write:
```sql
-- Natural SQL with subqueries
SELECT
    customer_name,
    (SELECT COUNT(*) FROM orders WHERE customer_id = c.id) as order_count
FROM customers c
WHERE region = 'EMEA';
```

We automatically transform to:
```sql
-- Efficient CTE-based execution
WITH _order_counts AS (
    SELECT customer_id, COUNT(*) as cnt FROM orders GROUP BY customer_id
)
SELECT
    c.customer_name,
    COALESCE(oc.cnt, 0) as order_count
FROM customers c
LEFT JOIN _order_counts oc ON c.id = oc.customer_id
WHERE c.region = 'EMEA';
```

Best of both worlds! 🎉

---

**Next Steps:**
1. Review and approve this analysis
2. Create detailed design doc for preprocessor architecture
3. Implement Phase 1: Infrastructure
4. Build test suite of correlated subquery patterns
5. Implement Phase 2: First rewriter