sql-cli 1.73.1

SQL query tool for CSV/JSON with both interactive TUI and non-interactive CLI modes - perfect for exploration and automation
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
# JOIN Expression Support Analysis & Implementation Proposal

## Executive Summary

**Current State**: JOIN conditions only support simple column references
**User Need**: Support functions and expressions in JOIN conditions
**Complexity**: Medium - requires changes to parser, AST, and executor
**Benefit**: High - enables common real-world scenarios like TRIM(), case-insensitive joins, etc.

---

## 1. What Currently Works ✓

### Multiple Column Joins (Fully Supported)
```sql
-- Multiple AND conditions work perfectly
SELECT * FROM orders
JOIN executions
  ON orders.id = executions.order_id
 AND orders.account = executions.account
 AND orders.date = executions.date;
```

### All Comparison Operators
```sql
-- Equality, inequality, and range operators all work
SELECT * FROM buy
JOIN sell
  ON buy.instrument = sell.instrument
 AND buy.price > sell.price
 AND buy.quantity <> sell.quantity;
```

### Smart Algorithm Selection
- **Hash Join**: Used for single equality condition (O(n+m) complexity)
- **Nested Loop**: Used for multiple conditions or inequalities (O(n*m) complexity)
- Automatic selection based on condition analysis

### Advanced Features
- Subquery JOINs (DerivedTable)
- CTE JOINs
- Table aliases
- All JOIN types: INNER, LEFT, RIGHT, CROSS (FULL not yet implemented)

---

## 2. Critical Gap: No Expressions in JOIN Conditions ✗

### What Doesn't Work Today

**User's Exact Use Cases:**
```sql
-- ❌ Function on right side (padding/trimming issue)
ON fix_portfolio.portfolio = TRIM(fund_portfolio.Name)

-- ❌ Function on either side (string manipulation)
ON lhs.order_id = SUBSTRING(rhs.field, 1, 10)
ON lhs.order_id = SPLIT_LEFT(rhs.field, '.')

-- ❌ Case-insensitive comparison
ON LOWER(a.email) = LOWER(b.email)

-- ❌ Arithmetic expressions
ON users.id = orders.user_id + 1
ON ROUND(buy.price, 2) = ROUND(sell.price, 2)

-- ❌ Complex transformations
ON DATE_TRUNC('day', a.timestamp) = DATE_TRUNC('day', b.timestamp)
```

### Why It Doesn't Work

**Parser Limitation** (`src/sql/recursive_parser.rs:1820-1844`):
```rust
fn parse_single_join_condition(&mut self) -> Result<SingleJoinCondition, String> {
    let left_column = self.parse_column_reference()?;  // ❌ Only column names
    let operator = /* ... */;
    let right_column = self.parse_column_reference()?; // ❌ Only column names

    Ok(SingleJoinCondition {
        left_column,   // String, not Expression
        operator,
        right_column,  // String, not Expression
    })
}
```

**AST Limitation** (`src/sql/parser/ast.rs:481-487`):
```rust
pub struct SingleJoinCondition {
    pub left_column: String,    // ❌ Just a string
    pub operator: JoinOperator,
    pub right_column: String,   // ❌ Just a string
}
```

Compare this to WHERE clauses, GROUP BY, and HAVING which all use `SqlExpression`:
```rust
pub struct SelectStatement {
    pub where_clause: Option<WhereClause>,        // Uses SqlExpression
    pub group_by: Option<Vec<SqlExpression>>,     // Uses SqlExpression ✓
    pub having: Option<SqlExpression>,            // Uses SqlExpression ✓
    pub joins: Vec<JoinClause>,                   // Uses String ✗
}
```

---

## 3. Proposed Solution: Expression-Based JOIN Conditions

### Design Overview

Upgrade JOIN conditions to use `SqlExpression` (already defined in `ast.rs:136-200`) instead of `String`.

**Benefits:**
- Reuse existing expression evaluation infrastructure
- Consistent with WHERE, GROUP BY, HAVING
- Supports all functions in function registry
- Enables complex transformations

**Complexity:**
- **Parser changes**: Medium (swap `parse_column_reference()``parse_expression()`)
- **AST changes**: Low (change String → SqlExpression)
- **Executor changes**: Medium (evaluate expressions before comparison)

### Architecture Changes

#### Change 1: AST Structure
```rust
// BEFORE (src/sql/parser/ast.rs:481-487)
pub struct SingleJoinCondition {
    pub left_column: String,
    pub operator: JoinOperator,
    pub right_column: String,
}

// AFTER
pub struct SingleJoinCondition {
    pub left_expr: SqlExpression,    // Can be Column, FunctionCall, BinaryOp, etc.
    pub operator: JoinOperator,
    pub right_expr: SqlExpression,   // Can be Column, FunctionCall, BinaryOp, etc.
}
```

#### Change 2: Parser
```rust
// BEFORE (src/sql/recursive_parser.rs:1820-1844)
fn parse_single_join_condition(&mut self) -> Result<SingleJoinCondition, String> {
    let left_column = self.parse_column_reference()?;
    let operator = /* parse operator */;
    let right_column = self.parse_column_reference()?;

    Ok(SingleJoinCondition { left_column, operator, right_column })
}

// AFTER
fn parse_single_join_condition(&mut self) -> Result<SingleJoinCondition, String> {
    let left_expr = self.parse_expression()?;  // Supports functions, ops, etc.
    let operator = /* parse operator */;
    let right_expr = self.parse_expression()?; // Supports functions, ops, etc.

    Ok(SingleJoinCondition { left_expr, operator, right_expr })
}
```

#### Change 3: Executor

**Hash Join** (`src/data/hash_join.rs:221-529`):
Currently extracts column indices:
```rust
// BEFORE: Assumes simple column names
let left_col_idx = find_column_index(&left_table, &condition.left_column)?;
let right_col_idx = find_column_index(&right_table, &condition.right_column)?;
```

**AFTER**: Evaluate expressions for each row
```rust
// Build hash table
for row in left_table.rows {
    let key_value = evaluate_expression(&condition.left_expr, row, left_table)?;
    hash_map.entry(key_value).or_insert(vec![]).push(row);
}

// Probe hash table
for row in right_table.rows {
    let key_value = evaluate_expression(&condition.right_expr, row, right_table)?;
    if let Some(matching_rows) = hash_map.get(&key_value) {
        // Emit joined rows
    }
}
```

**Nested Loop Join** (`src/data/hash_join.rs:637-1092`):
Currently compares column values:
```rust
// BEFORE
let left_value = &left_row[left_idx];
let right_value = &right_row[right_idx];
let matches = compare_values(left_value, right_value, operator);
```

**AFTER**: Evaluate both expressions
```rust
let left_value = evaluate_expression(&condition.left_expr, left_row, left_table)?;
let right_value = evaluate_expression(&condition.right_expr, right_row, right_table)?;
let matches = compare_values(&left_value, &right_value, operator);
```

---

## 4. Implementation Complexity Analysis

### Low Complexity: AST Changes
- **Files**: `src/sql/parser/ast.rs`
- **Changes**: Rename fields, change types
- **Risk**: Low (breaking change, but straightforward)

### Medium Complexity: Parser Changes
- **Files**: `src/sql/recursive_parser.rs`
- **Changes**:
  - Replace `parse_column_reference()` with `parse_expression()`
  - Expression parser already exists and is mature
  - Need to handle qualified column names (table.column) in expressions
- **Risk**: Medium (parser context tracking for table references)

### Medium Complexity: Executor Changes
- **Files**:
  - `src/data/hash_join.rs` (hash join and nested loop algorithms)
  - `src/data/arithmetic_evaluator.rs` (expression evaluation)
- **Changes**:
  - Add expression evaluation in join execution
  - Handle table context for column resolution (left vs right table)
  - Performance consideration: expression evaluation per row pair in nested loop
- **Risk**: Medium (performance impact, context handling)

### High Complexity Areas to Watch
1. **Column Context Resolution**:
   - In JOIN condition `ON users.id = orders.user_id`, need to resolve which table each column comes from
   - Current expression evaluator may not have table context awareness
   - May need to pass table metadata to evaluator

2. **Performance**:
   - Hash join currently uses column index lookup (fast)
   - Expression evaluation per row is slower
   - May need to detect simple column references and optimize them

3. **Error Messages**:
   - Need clear errors when expression references wrong table
   - "Column 'users.name' not found in right table 'orders'"

---

## 5. Workarounds Available Today

While waiting for expression support, users can use CTEs to pre-compute transformations:

### Pattern: CTE Transformation
```sql
-- Instead of: ON a.portfolio = TRIM(b.Name)
-- Use CTE to pre-trim:

WITH trimmed_funds AS (
    SELECT
        fund_id,
        TRIM(Name) AS trimmed_name,
        other_columns
    FROM fund_portfolio
)
SELECT *
FROM fix_portfolio
JOIN trimmed_funds
  ON fix_portfolio.portfolio = trimmed_funds.trimmed_name;
```

### Pattern: Case-Insensitive Join
```sql
-- Instead of: ON LOWER(a.email) = LOWER(b.email)
-- Use CTE:

WITH normalized_users AS (
    SELECT user_id, LOWER(email) AS email_lower
    FROM users
),
normalized_contacts AS (
    SELECT contact_id, LOWER(email) AS email_lower
    FROM contacts
)
SELECT *
FROM normalized_users u
JOIN normalized_contacts c ON u.email_lower = c.email_lower;
```

**Pros**: Works today, readable, can be reused
**Cons**: Verbose, requires materialization, loses performance of hash join on expressions

---

## 6. Recommended Implementation Plan

### Phase 1: Simple Function Support (MVP)
**Goal**: Support functions on ONE side of join condition

Example: `ON portfolio = TRIM(fund_portfolio.Name)`

**Changes**:
1. AST: Change `right_column` to `right_expr: SqlExpression`
2. Parser: Parse right side as expression
3. Executor: Evaluate right expression for each right table row

**Benefit**: Solves most common use cases with minimal changes
**Complexity**: Low-Medium

### Phase 2: Full Expression Support
**Goal**: Support expressions on BOTH sides

Example: `ON LOWER(a.email) = LOWER(b.email)`

**Changes**:
1. AST: Change both fields to SqlExpression
2. Parser: Parse both sides as expressions
3. Executor: Evaluate both expressions
4. Add column context resolution

**Benefit**: Complete parity with WHERE clause
**Complexity**: Medium

### Phase 3: Optimization
**Goal**: Maintain performance for simple column-to-column joins

**Changes**:
1. Detect when expression is simple column reference
2. Use fast path (current index-based approach)
3. Only evaluate expressions when necessary

**Benefit**: No performance regression for existing queries
**Complexity**: Medium

---

## 7. Testing Strategy

### Parser Tests
```rust
#[test]
fn test_join_with_function() {
    let sql = "SELECT * FROM a JOIN b ON a.id = TRIM(b.name)";
    let ast = parse(sql).unwrap();
    // Verify right_expr is FunctionCall
}

#[test]
fn test_join_with_both_expressions() {
    let sql = "SELECT * FROM a JOIN b ON LOWER(a.email) = LOWER(b.email)";
    let ast = parse(sql).unwrap();
    // Verify both are FunctionCall
}
```

### Integration Tests
```sql
-- Test with actual data (tests/sql_examples/test_join_expressions.sql)

-- #! ../data/users_padded.csv
-- Test: TRIM() in join condition
WITH contacts AS (
    SELECT 1 as id, '  John  ' as name
)
SELECT * FROM users
JOIN contacts ON users.name = TRIM(contacts.name);
GO

-- Test: Case-insensitive email matching
WITH emails AS (
    SELECT 'User1' as source, 'JOHN@EXAMPLE.COM' as email
)
SELECT * FROM users
JOIN emails ON LOWER(users.email) = LOWER(emails.email);
GO

-- Test: Arithmetic in join
SELECT * FROM orders o1
JOIN orders o2 ON o1.id = o2.parent_id + 1;
GO
```

### Performance Tests
```bash
# Benchmark: Simple column join (should use hash join)
time ./target/release/sql-cli -q "SELECT * FROM large1 JOIN large2 ON large1.id = large2.id"

# Benchmark: Expression join (will use evaluated approach)
time ./target/release/sql-cli -q "SELECT * FROM large1 JOIN large2 ON large1.id = TRIM(large2.id)"

# Target: < 10% performance degradation for simple joins after changes
```

---

## 8. User Impact & Examples

### Real-World Use Cases Enabled

**1. String Normalization (User's Exact Problem)**
```sql
-- Fix padded spaces in database exports
SELECT *
FROM fix_portfolio
JOIN fund_portfolio
  ON fix_portfolio.portfolio = TRIM(fund_portfolio.Name);
```

**2. Case-Insensitive Matching**
```sql
-- Match emails regardless of case
SELECT *
FROM users
JOIN login_attempts
  ON LOWER(users.email) = LOWER(login_attempts.email);
```

**3. Date Truncation**
```sql
-- Join on date ignoring time component
SELECT *
FROM transactions t1
JOIN transactions t2
  ON DATE_TRUNC('day', t1.timestamp) = DATE_TRUNC('day', t2.timestamp)
 AND t1.account <> t2.account;
```

**4. Numeric Precision**
```sql
-- Join on rounded prices
SELECT *
FROM buy_orders
JOIN sell_orders
  ON buy_orders.instrument = sell_orders.instrument
 AND ROUND(buy_orders.price, 2) = ROUND(sell_orders.price, 2);
```

**5. String Extraction**
```sql
-- Join on extracted identifiers
SELECT *
FROM events e1
JOIN events e2
  ON SUBSTRING(e1.event_id, 1, 10) = SUBSTRING(e2.event_id, 1, 10)
 AND e1.event_id <> e2.event_id;
```

---

## 9. Alternative Approaches Considered

### Option 1: Add Specific Functions (REJECTED)
**Idea**: Add `TRIMJOIN()` or similar specialized functions

**Pros**: Simple, focused
**Cons**: Doesn't scale, inconsistent with SQL standards

### Option 2: Pre-Join Transformation Hint (REJECTED)
**Idea**: Special syntax like `JOIN ... ON TRANSFORM(TRIM) column = column`

**Pros**: Explicit performance control
**Cons**: Non-standard, doesn't support arbitrary expressions

### Option 3: Expression Support (RECOMMENDED)
**Idea**: Full `SqlExpression` support in JOIN conditions

**Pros**:
- Standard SQL behavior
- Consistent with WHERE/HAVING/GROUP BY
- Supports all existing functions
- Future-proof

**Cons**:
- More complex implementation
- Potential performance impact (mitigable with optimization)

---

## 10. Open Questions

1. **Should we support OR in join conditions?**
   - Current: Only AND is supported
   - Proposal: Keep AND-only for Phase 1/2, consider OR in Phase 3
   - Rationale: OR in joins is rare, complicates hash join algorithm

2. **Should we optimize simple column references?**
   - Current: N/A (no expressions yet)
   - Proposal: Yes, detect `SqlExpression::Column` and use fast path
   - Rationale: Avoid performance regression for existing queries

3. **How to handle aggregate functions in JOIN conditions?**
   - Example: `ON a.id = MAX(b.id)` (semantically unclear)
   - Proposal: Disallow aggregates in JOIN conditions (like WHERE clause)
   - Rationale: Aggregates belong in HAVING, not JOIN

4. **Column context for table qualification?**
   - Challenge: `ON table1.id = table2.id` - which table does each column come from?
   - Current evaluator may not have table context
   - Proposal: Extend evaluator to accept table context parameter

---

## 11. Conclusion & Recommendation

**Recommendation**: Implement in phases starting with Phase 1 (MVP)

**Justification**:
- Solves real user pain points (trimming, case-insensitive joins)
- Aligns with SQL standards
- Reuses existing expression infrastructure
- Medium complexity with manageable risk

**Next Steps**:
1. Create feature branch: `feature/join-expressions`
2. Implement Phase 1: Right-side expression support
3. Add comprehensive tests
4. Validate performance impact
5. If successful, proceed to Phase 2

**Estimated Effort**:
- Phase 1 (MVP): 2-3 days
- Phase 2 (Full): 3-4 days
- Phase 3 (Optimization): 2-3 days
- Testing: 1-2 days per phase

**Total**: 8-12 days for complete implementation

---

## Appendix: Code Location Reference

| Component | File Path | Lines |
|-----------|-----------|-------|
| Parser: JOIN parsing | `src/sql/recursive_parser.rs` | 1648-1869 |
| AST: JoinCondition | `src/sql/parser/ast.rs` | 481-502 |
| AST: SqlExpression | `src/sql/parser/ast.rs` | 136-200 |
| Executor: Hash Join | `src/data/hash_join.rs` | 221-529 |
| Executor: Nested Loop | `src/data/hash_join.rs` | 637-1092 |
| Evaluator: Expression eval | `src/data/arithmetic_evaluator.rs` | All |
| Query Engine: JOIN orchestration | `src/data/query_engine.rs` | ~1000-1070 |

---

**Document Version**: 1.0
**Date**: 2025-01-22
**Author**: Analysis based on codebase exploration
**Status**: Proposal - Awaiting decision