sql-cli 1.73.1

SQL query tool for CSV/JSON with both interactive TUI and non-interactive CLI modes - perfect for exploration and automation
Documentation
# AST Transformation Ordering

This document explains the order in which AST transformers are applied in the preprocessing pipeline and why that order matters.

## Current Standard Pipeline

The standard pipeline (created by `create_standard_pipeline()` in `src/query_plan/mod.rs`) applies transformers in this order:

1. **ExpressionLifter** - Lifts window functions and column alias dependencies to CTEs
2. **CTEHoister** - Hoists nested WITH clauses to top level
3. **InOperatorLifter** - Optimizes large IN expressions by lifting to CTEs

## Why Order Matters

### 1. ExpressionLifter Must Run First

**Why:** Window functions and complex expressions need to be lifted into CTEs before any other transformations. This is because:

- Window functions can't be used in WHERE clauses or GROUP BY
- Column alias dependencies need to be resolved before other optimizations
- Creates a clean, normalized AST for subsequent transformers

**Example:**
```sql
-- Input
SELECT
    amount,
    ROW_NUMBER() OVER (ORDER BY amount) as rn
FROM sales
WHERE amount > 100

-- After ExpressionLifter (creates intermediate CTE)
WITH __lifted_1 AS (
    SELECT
        amount,
        ROW_NUMBER() OVER (ORDER BY amount) as rn
    FROM sales
)
SELECT * FROM __lifted_1 WHERE amount > 100
```

### 2. CTEHoister Runs Second

**Why:** After ExpressionLifter creates new CTEs, we may have nested CTEs that need to be flattened:

- ExpressionLifter might create CTEs inside queries that already have CTEs
- Nested CTEs are harder to optimize and execute
- Hoisting brings all CTEs to the top level for better visibility

**Example:**
```sql
-- After ExpressionLifter might produce nested CTEs
WITH outer_cte AS (
    SELECT * FROM (
        WITH inner_cte AS (SELECT 1 as x)
        SELECT * FROM inner_cte
    )
)
SELECT * FROM outer_cte

-- After CTEHoister (flattened)
WITH inner_cte AS (SELECT 1 as x),
     outer_cte AS (SELECT * FROM inner_cte)
SELECT * FROM outer_cte
```

### 3. InOperatorLifter Runs Last

**Why:** IN expression optimization should happen after structural transformations:

- Works on the clean, normalized AST
- Doesn't interfere with window function lifting
- Can safely lift IN expressions to CTEs knowing the structure is stable

**Example:**
```sql
-- Input (after previous transformations)
SELECT * FROM sales
WHERE product_id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)

-- After InOperatorLifter (if threshold met)
WITH __in_values_1 AS (
    SELECT 1 as value UNION ALL
    SELECT 2 UNION ALL
    -- ... (11 values)
)
SELECT * FROM sales
WHERE product_id IN (SELECT value FROM __in_values_1)
```

## Transformation Dependencies

```
ExpressionLifter
    ↓ (may create nested CTEs)
CTEHoister
    ↓ (flattens structure)
InOperatorLifter
    ↓ (optimizes on clean AST)
Final AST
```

## Adding New Transformers

When adding new transformers, consider these guidelines:

### Early Stage (before ExpressionLifter)
- Very rare! Only for transformations that need to see the original query structure
- Must be able to handle window functions in expressions
- Example: Syntax sugar expansion that doesn't affect semantics

### Middle Stage (between ExpressionLifter and CTEHoister)
- Transformations that work with expressions but don't create CTEs
- Must handle window functions that are now in CTEs
- Example: Expression simplification, constant folding

### Late Stage (after CTEHoister, before InOperatorLifter)
- Transformations that benefit from flat CTE structure
- May create new CTEs (which are already at top level)
- Example: Correlated subquery rewriting (Phase 3)

### Final Stage (after InOperatorLifter)
- Optimizations that work on fully normalized queries
- Should not create new CTEs
- Example: Predicate pushdown, join reordering

## Phase 0 Configuration

Currently in Phase 0, the pipeline is configured to:

- ✅ Apply all three existing transformers in correct order
- ✅ Collect statistics on transformation duration
- ✅ Support verbose logging via `--show-preprocessing` flag
- ✅ Handle errors gracefully (fall back to original AST if transformation fails)

## Future Phases

### Phase 1 Additions (Quick Wins)
- **HAVING Auto-aliaser** - Insert between ExpressionLifter and CTEHoister
  - Adds implicit aliases to aggregate functions in HAVING clauses
  - Must run after expression lifting but before CTE hoisting

- **SELECT * Expander** - Run very early (before ExpressionLifter)
  - Expands `SELECT *` to explicit column names
  - Helps other transformers by making column references explicit

### Phase 2 Additions (Parallel CTEs)
- **CTE Dependency Analyzer** - Run after CTEHoister
  - Analyzes CTE dependencies to identify parallelizable CTEs
  - Doesn't modify AST, just annotates for executor

### Phase 3 Additions (Correlated Subqueries)
- **Correlated Subquery Rewriter** - Run after CTEHoister
  - Converts correlated subqueries to CTEs + JOINs
  - Must run after all CTEs are hoisted to top level
  - May create additional CTEs (which are already normalized)

## Testing Transformation Order

The integration test `tests/integration/test_preprocessing_pipeline.sh` verifies:

1. All transformers are invoked in correct order
2. Statistics show each transformer's execution time
3. Transformations are applied correctly
4. Pipeline handles errors gracefully

Run with:
```bash
./tests/integration/test_preprocessing_pipeline.sh
```

## Debugging

To see transformation order in action:

```bash
# Show preprocessing statistics
./target/release/sql-cli data/test.csv \
    -q "YOUR_QUERY" \
    --check-in-lifting \
    --show-preprocessing

# Output will show:
# === Preprocessing Pipeline ===
# 3 transformer(s) applied in 0.04ms
#   ExpressionLifter - 0.02ms
#   CTEHoister - 0.01ms
#   InOperatorLifter - 0.01ms
```

## References

- Pipeline implementation: `src/query_plan/pipeline.rs`
- Transformer adapters: `src/query_plan/transformer_adapters.rs`
- Standard pipeline factory: `src/query_plan/mod.rs` (lines 52-74)
- Individual transformers:
  - ExpressionLifter: `src/query_plan/expression_lifter.rs`
  - CTEHoister: `src/query_plan/cte_hoister.rs`
  - InOperatorLifter: `src/query_plan/in_operator_lifter.rs`