sql-cli 1.73.1 - Docs.rs

# JOIN Expression Support Analysis & Implementation Proposal

## Executive Summary

**Current State**: JOIN conditions only support simple column references
**User Need**: Support functions and expressions in JOIN conditions
**Complexity**: Medium - requires changes to parser, AST, and executor
**Benefit**: High - enables common real-world scenarios like TRIM(), case-insensitive joins, etc.

---

## 1. What Currently Works ✓

### Multiple Column Joins (Fully Supported)
```sql
-- Multiple AND conditions work perfectly
SELECT * FROM orders
JOIN executions
  ON orders.id = executions.order_id
 AND orders.account = executions.account
 AND orders.date = executions.date;
```

### All Comparison Operators
```sql
-- Equality, inequality, and range operators all work
SELECT * FROM buy
JOIN sell
  ON buy.instrument = sell.instrument
 AND buy.price > sell.price
 AND buy.quantity <> sell.quantity;
```

### Smart Algorithm Selection
- **Hash Join**: Used for single equality condition (O(n+m) complexity)
- **Nested Loop**: Used for multiple conditions or inequalities (O(n*m) complexity)
- Automatic selection based on condition analysis

### Advanced Features
- Subquery JOINs (DerivedTable)
- CTE JOINs
- Table aliases
- All JOIN types: INNER, LEFT, RIGHT, CROSS (FULL not yet implemented)

---

## 2. Critical Gap: No Expressions in JOIN Conditions ✗

### What Doesn't Work Today

**User's Exact Use Cases:**
```sql
-- ❌ Function on right side (padding/trimming issue)
ON fix_portfolio.portfolio = TRIM(fund_portfolio.Name)

-- ❌ Function on either side (string manipulation)
ON lhs.order_id = SUBSTRING(rhs.field, 1, 10)
ON lhs.order_id = SPLIT_LEFT(rhs.field, '.')

-- ❌ Case-insensitive comparison
ON LOWER(a.email) = LOWER(b.email)

-- ❌ Arithmetic expressions
ON users.id = orders.user_id + 1
ON ROUND(buy.price, 2) = ROUND(sell.price, 2)

-- ❌ Complex transformations
ON DATE_TRUNC('day', a.timestamp) = DATE_TRUNC('day', b.timestamp)
```

### Why It Doesn't Work

**Parser Limitation** (`src/sql/recursive_parser.rs:1820-1844`):
```rust
fn parse_single_join_condition(&mut self) -> Result<SingleJoinCondition, String> {
    let left_column = self.parse_column_reference()?;  // ❌ Only column names
    let operator = /* ... */;
    let right_column = self.parse_column_reference()?; // ❌ Only column names

    Ok(SingleJoinCondition {
        left_column,   // String, not Expression
        operator,
        right_column,  // String, not Expression
    })
}
```

**AST Limitation** (`src/sql/parser/ast.rs:481-487`):
```rust
pub struct SingleJoinCondition {
    pub left_column: String,    // ❌ Just a string
    pub operator: JoinOperator,
    pub right_column: String,   // ❌ Just a string
}
```

Compare this to WHERE clauses, GROUP BY, and HAVING which all use `SqlExpression`:
```rust
pub struct SelectStatement {
    pub where_clause: Option<WhereClause>,        // Uses SqlExpression
    pub group_by: Option<Vec<SqlExpression>>,     // Uses SqlExpression ✓
    pub having: Option<SqlExpression>,            // Uses SqlExpression ✓
    pub joins: Vec<JoinClause>,                   // Uses String ✗
}
```

---

## 3. Proposed Solution: Expression-Based JOIN Conditions

### Design Overview

Upgrade JOIN conditions to use `SqlExpression` (already defined in `ast.rs:136-200`) instead of `String`.

**Benefits:**
- Reuse existing expression evaluation infrastructure
- Consistent with WHERE, GROUP BY, HAVING
- Supports all functions in function registry
- Enables complex transformations

**Complexity:**
- **Parser changes**: Medium (swap `parse_column_reference()` → `parse_expression()`)
- **AST changes**: Low (change String → SqlExpression)
- **Executor changes**: Medium (evaluate expressions before comparison)

### Architecture Changes

#### Change 1: AST Structure
```rust
// BEFORE (src/sql/parser/ast.rs:481-487)
pub struct SingleJoinCondition {
    pub left_column: String,
    pub operator: JoinOperator,
    pub right_column: String,
}

// AFTER
pub struct SingleJoinCondition {
    pub left_expr: SqlExpression,    // Can be Column, FunctionCall, BinaryOp, etc.
    pub operator: JoinOperator,
    pub right_expr: SqlExpression,   // Can be Column, FunctionCall, BinaryOp, etc.
}
```

#### Change 2: Parser
```rust
// BEFORE (src/sql/recursive_parser.rs:1820-1844)
fn parse_single_join_condition(&mut self) -> Result<SingleJoinCondition, String> {
    let left_column = self.parse_column_reference()?;
    let operator = /* parse operator */;
    let right_column = self.parse_column_reference()?;

    Ok(SingleJoinCondition { left_column, operator, right_column })
}

// AFTER
fn parse_single_join_condition(&mut self) -> Result<SingleJoinCondition, String> {
    let left_expr = self.parse_expression()?;  // Supports functions, ops, etc.
    let operator = /* parse operator */;
    let right_expr = self.parse_expression()?; // Supports functions, ops, etc.

    Ok(SingleJoinCondition { left_expr, operator, right_expr })
}
```

#### Change 3: Executor

**Hash Join** (`src/data/hash_join.rs:221-529`):
Currently extracts column indices:
```rust
// BEFORE: Assumes simple column names
let left_col_idx = find_column_index(&left_table, &condition.left_column)?;
let right_col_idx = find_column_index(&right_table, &condition.right_column)?;
```

**AFTER**: Evaluate expressions for each row
```rust
// Build hash table
for row in left_table.rows {
    let key_value = evaluate_expression(&condition.left_expr, row, left_table)?;
    hash_map.entry(key_value).or_insert(vec![]).push(row);
}

// Probe hash table
for row in right_table.rows {
    let key_value = evaluate_expression(&condition.right_expr, row, right_table)?;
    if let Some(matching_rows) = hash_map.get(&key_value) {
        // Emit joined rows
    }
}
```

**Nested Loop Join** (`src/data/hash_join.rs:637-1092`):
Currently compares column values:
```rust
// BEFORE
let left_value = &left_row[left_idx];
let right_value = &right_row[right_idx];
let matches = compare_values(left_value, right_value, operator);
```

**AFTER**: Evaluate both expressions
```rust
let left_value = evaluate_expression(&condition.left_expr, left_row, left_table)?;
let right_value = evaluate_expression(&condition.right_expr, right_row, right_table)?;
let matches = compare_values(&left_value, &right_value, operator);
```

---

## 4. Implementation Complexity Analysis

### Low Complexity: AST Changes
- **Files**: `src/sql/parser/ast.rs`
- **Changes**: Rename fields, change types
- **Risk**: Low (breaking change, but straightforward)

### Medium Complexity: Parser Changes
- **Files**: `src/sql/recursive_parser.rs`
- **Changes**:
  - Replace `parse_column_reference()` with `parse_expression()`
  - Expression parser already exists and is mature
  - Need to handle qualified column names (table.column) in expressions
- **Risk**: Medium (parser context tracking for table references)

### Medium Complexity: Executor Changes
- **Files**:
  - `src/data/hash_join.rs` (hash join and nested loop algorithms)
  - `src/data/arithmetic_evaluator.rs` (expression evaluation)
- **Changes**:
  - Add expression evaluation in join execution
  - Handle table context for column resolution (left vs right table)
  - Performance consideration: expression evaluation per row pair in nested loop
- **Risk**: Medium (performance impact, context handling)

### High Complexity Areas to Watch
1. **Column Context Resolution**:
   - In JOIN condition `ON users.id = orders.user_id`, need to resolve which table each column comes from
   - Current expression evaluator may not have table context awareness
   - May need to pass table metadata to evaluator

2. **Performance**:
   - Hash join currently uses column index lookup (fast)
   - Expression evaluation per row is slower
   - May need to detect simple column references and optimize them

3. **Error Messages**:
   - Need clear errors when expression references wrong table
   - "Column 'users.name' not found in right table 'orders'"

---

## 5. Workarounds Available Today

While waiting for expression support, users can use CTEs to pre-compute transformations:

### Pattern: CTE Transformation
```sql
-- Instead of: ON a.portfolio = TRIM(b.Name)
-- Use CTE to pre-trim:

WITH trimmed_funds AS (
    SELECT
        fund_id,
        TRIM(Name) AS trimmed_name,
        other_columns
    FROM fund_portfolio
)
SELECT *
FROM fix_portfolio
JOIN trimmed_funds
  ON fix_portfolio.portfolio = trimmed_funds.trimmed_name;
```

### Pattern: Case-Insensitive Join
```sql
-- Instead of: ON LOWER(a.email) = LOWER(b.email)
-- Use CTE:

WITH normalized_users AS (
    SELECT user_id, LOWER(email) AS email_lower
    FROM users
),
normalized_contacts AS (
    SELECT contact_id, LOWER(email) AS email_lower
    FROM contacts
)
SELECT *
FROM normalized_users u
JOIN normalized_contacts c ON u.email_lower = c.email_lower;
```

**Pros**: Works today, readable, can be reused
**Cons**: Verbose, requires materialization, loses performance of hash join on expressions

---

## 6. Recommended Implementation Plan

### Phase 1: Simple Function Support (MVP)
**Goal**: Support functions on ONE side of join condition

Example: `ON portfolio = TRIM(fund_portfolio.Name)`

**Changes**:
1. AST: Change `right_column` to `right_expr: SqlExpression`
2. Parser: Parse right side as expression
3. Executor: Evaluate right expression for each right table row

**Benefit**: Solves most common use cases with minimal changes
**Complexity**: Low-Medium

### Phase 2: Full Expression Support
**Goal**: Support expressions on BOTH sides

Example: `ON LOWER(a.email) = LOWER(b.email)`

**Changes**:
1. AST: Change both fields to SqlExpression
2. Parser: Parse both sides as expressions
3. Executor: Evaluate both expressions
4. Add column context resolution

**Benefit**: Complete parity with WHERE clause
**Complexity**: Medium

### Phase 3: Optimization
**Goal**: Maintain performance for simple column-to-column joins

**Changes**:
1. Detect when expression is simple column reference
2. Use fast path (current index-based approach)
3. Only evaluate expressions when necessary

**Benefit**: No performance regression for existing queries
**Complexity**: Medium

---

## 7. Testing Strategy

### Parser Tests
```rust
#[test]
fn test_join_with_function() {
    let sql = "SELECT * FROM a JOIN b ON a.id = TRIM(b.name)";
    let ast = parse(sql).unwrap();
    // Verify right_expr is FunctionCall
}

#[test]
fn test_join_with_both_expressions() {
    let sql = "SELECT * FROM a JOIN b ON LOWER(a.email) = LOWER(b.email)";
    let ast = parse(sql).unwrap();
    // Verify both are FunctionCall
}
```

### Integration Tests
```sql
-- Test with actual data (tests/sql_examples/test_join_expressions.sql)

-- #! ../data/users_padded.csv
-- Test: TRIM() in join condition
WITH contacts AS (
    SELECT 1 as id, '  John  ' as name
)
SELECT * FROM users
JOIN contacts ON users.name = TRIM(contacts.name);
GO

-- Test: Case-insensitive email matching
WITH emails AS (
    SELECT 'User1' as source, 'JOHN@EXAMPLE.COM' as email
)
SELECT * FROM users
JOIN emails ON LOWER(users.email) = LOWER(emails.email);
GO

-- Test: Arithmetic in join
SELECT * FROM orders o1
JOIN orders o2 ON o1.id = o2.parent_id + 1;
GO
```

### Performance Tests
```bash
# Benchmark: Simple column join (should use hash join)
time ./target/release/sql-cli -q "SELECT * FROM large1 JOIN large2 ON large1.id = large2.id"

# Benchmark: Expression join (will use evaluated approach)
time ./target/release/sql-cli -q "SELECT * FROM large1 JOIN large2 ON large1.id = TRIM(large2.id)"

# Target: < 10% performance degradation for simple joins after changes
```

---

## 8. User Impact & Examples

### Real-World Use Cases Enabled

**1. String Normalization (User's Exact Problem)**
```sql
-- Fix padded spaces in database exports
SELECT *
FROM fix_portfolio
JOIN fund_portfolio
  ON fix_portfolio.portfolio = TRIM(fund_portfolio.Name);
```

**2. Case-Insensitive Matching**
```sql
-- Match emails regardless of case
SELECT *
FROM users
JOIN login_attempts
  ON LOWER(users.email) = LOWER(login_attempts.email);
```

**3. Date Truncation**
```sql
-- Join on date ignoring time component
SELECT *
FROM transactions t1
JOIN transactions t2
  ON DATE_TRUNC('day', t1.timestamp) = DATE_TRUNC('day', t2.timestamp)
 AND t1.account <> t2.account;
```

**4. Numeric Precision**
```sql
-- Join on rounded prices
SELECT *
FROM buy_orders
JOIN sell_orders
  ON buy_orders.instrument = sell_orders.instrument
 AND ROUND(buy_orders.price, 2) = ROUND(sell_orders.price, 2);
```

**5. String Extraction**
```sql
-- Join on extracted identifiers
SELECT *
FROM events e1
JOIN events e2
  ON SUBSTRING(e1.event_id, 1, 10) = SUBSTRING(e2.event_id, 1, 10)
 AND e1.event_id <> e2.event_id;
```

---

## 9. Alternative Approaches Considered

### Option 1: Add Specific Functions (REJECTED)
**Idea**: Add `TRIMJOIN()` or similar specialized functions

**Pros**: Simple, focused
**Cons**: Doesn't scale, inconsistent with SQL standards

### Option 2: Pre-Join Transformation Hint (REJECTED)
**Idea**: Special syntax like `JOIN ... ON TRANSFORM(TRIM) column = column`

**Pros**: Explicit performance control
**Cons**: Non-standard, doesn't support arbitrary expressions

### Option 3: Expression Support (RECOMMENDED)
**Idea**: Full `SqlExpression` support in JOIN conditions

**Pros**:
- Standard SQL behavior
- Consistent with WHERE/HAVING/GROUP BY
- Supports all existing functions
- Future-proof

**Cons**:
- More complex implementation
- Potential performance impact (mitigable with optimization)

---

## 10. Open Questions

1. **Should we support OR in join conditions?**
   - Current: Only AND is supported
   - Proposal: Keep AND-only for Phase 1/2, consider OR in Phase 3
   - Rationale: OR in joins is rare, complicates hash join algorithm

2. **Should we optimize simple column references?**
   - Current: N/A (no expressions yet)
   - Proposal: Yes, detect `SqlExpression::Column` and use fast path
   - Rationale: Avoid performance regression for existing queries

3. **How to handle aggregate functions in JOIN conditions?**
   - Example: `ON a.id = MAX(b.id)` (semantically unclear)
   - Proposal: Disallow aggregates in JOIN conditions (like WHERE clause)
   - Rationale: Aggregates belong in HAVING, not JOIN

4. **Column context for table qualification?**
   - Challenge: `ON table1.id = table2.id` - which table does each column come from?
   - Current evaluator may not have table context
   - Proposal: Extend evaluator to accept table context parameter

---

## 11. Conclusion & Recommendation

**Recommendation**: Implement in phases starting with Phase 1 (MVP)

**Justification**:
- Solves real user pain points (trimming, case-insensitive joins)
- Aligns with SQL standards
- Reuses existing expression infrastructure
- Medium complexity with manageable risk

**Next Steps**:
1. Create feature branch: `feature/join-expressions`
2. Implement Phase 1: Right-side expression support
3. Add comprehensive tests
4. Validate performance impact
5. If successful, proceed to Phase 2

**Estimated Effort**:
- Phase 1 (MVP): 2-3 days
- Phase 2 (Full): 3-4 days
- Phase 3 (Optimization): 2-3 days
- Testing: 1-2 days per phase

**Total**: 8-12 days for complete implementation

---

## Appendix: Code Location Reference

| Component | File Path | Lines |
|-----------|-----------|-------|
| Parser: JOIN parsing | `src/sql/recursive_parser.rs` | 1648-1869 |
| AST: JoinCondition | `src/sql/parser/ast.rs` | 481-502 |
| AST: SqlExpression | `src/sql/parser/ast.rs` | 136-200 |
| Executor: Hash Join | `src/data/hash_join.rs` | 221-529 |
| Executor: Nested Loop | `src/data/hash_join.rs` | 637-1092 |
| Evaluator: Expression eval | `src/data/arithmetic_evaluator.rs` | All |
| Query Engine: JOIN orchestration | `src/data/query_engine.rs` | ~1000-1070 |

---

**Document Version**: 1.0
**Date**: 2025-01-22
**Author**: Analysis based on codebase exploration
**Status**: Proposal - Awaiting decision