# Window Function Performance Profiling Plan
## Current Performance (Benchmarks)
### 10k rows
- LAG + LEAD combined: **321ms**
- LAG with PARTITION BY: **111ms**
- ROW_NUMBER: **161ms**
### 20k rows
- LAG function: **402ms**
- LEAD function: **400ms**
- LAG + LEAD combined: **648ms**
- LAG with PARTITION BY: **221ms**
- ROW_NUMBER: **322ms**
### 50k rows (Performance bottleneck)
- **LAG function: 1.04s** ⚠️
- **LEAD function: 992ms** ⚠️
- **LAG + LEAD combined: 1.67s** ⚠️
- LAG with PARTITION BY: 565ms
- ROW_NUMBER: 839ms
**Comparison** (50k rows):
- GROUP BY with multiple aggregates: **965ms** ✅ (faster than LAG/LEAD)
- Hash JOIN (equality): **6ms** ✅✅ (very efficient!)
## Performance Gap Analysis
Window functions are ~2x slower than GROUP BY operations:
- LAG/LEAD: ~1 second for 50k rows
- GROUP BY: ~500-600ms for 50k rows
- Hash JOIN: Negligible (~6ms)
**Scaling**: Window functions appear to scale linearly (O(n)) but with higher constant factor than GROUP BY.
## Current Profiling Capabilities
### ✅ Already Have Timing For:
1. **CTE execution** - `cte_start.elapsed()`
2. **JOIN operations** - `join_start.elapsed()`
3. **WHERE filtering** - `filter_start.elapsed()`
4. **GROUP BY aggregation** - `group_start.elapsed()`
5. **DISTINCT** - `distinct_start.elapsed()`
6. **ORDER BY sorting** - `sort_start.elapsed()`
7. **Set operations** (UNION/INTERSECT/EXCEPT) - `op_start.elapsed()`
### ❌ Missing Timing For:
1. **Window function evaluation** - NO TIMING CURRENTLY
2. **WindowContext creation** - NO TIMING
3. **Partition computation** - NO TIMING
4. **LAG/LEAD offset access** - NO TIMING
5. **Per-row window function calls** - NO TIMING
## Proposed Profiling Additions
### Phase 1: Coarse-Grained Timing (Quick Win - 1 hour)
Add top-level timing for window function operations in query_engine.rs:
```rust
// Around line 1200+ (after GROUP BY, before SELECT item evaluation)
let window_start = Instant::now();
// ... window function evaluation ...
info!(
"Window function evaluation took {:.2}ms for {} rows",
window_start.elapsed().as_secs_f64() * 1000.0,
filtered_table.len()
);
```
**Location**: `src/data/query_engine.rs` in `execute()` method
**Benefit**: Confirms window functions are the bottleneck
**Effort**: 15 minutes
### Phase 2: WindowContext Profiling (Medium - 2 hours)
Add timing breakdown in `arithmetic_evaluator.rs`:
1. **WindowContext creation timing**:
```rust
let context_start = Instant::now();
let context = WindowContext::new_with_spec(...)?;
debug!(
"WindowContext creation took {:.2}ms (partitions: {}, rows: {})",
context_start.elapsed().as_secs_f64() * 1000.0,
context.partition_count(),
table.len()
);
```
2. **Per-function timing**:
```rust
let func_start = Instant::now();
debug!(
"{} evaluation took {:.2}μs",
name_upper,
func_start.elapsed().as_micros()
);
```
**Location**: `src/data/arithmetic_evaluator.rs`
**Benefit**: Identifies if creation or evaluation is slow
**Effort**: 30 minutes
### Phase 3: Detailed Window Context Timing (Fine-Grained - 4 hours)
Add timing in `window_context.rs` for key operations:
1. **Partition creation**:
```rust
let partition_start = Instant::now();
debug!(
"Partition creation: {:.2}ms ({} partitions, {} rows)",
partition_start.elapsed().as_secs_f64() * 1000.0,
partitions.len(),
total_rows
);
```
2. **Partition sorting**:
```rust
let sort_start = Instant::now();
debug!(
"Partition sorting: {:.2}ms ({} rows)",
sort_start.elapsed().as_secs_f64() * 1000.0,
rows.len()
);
```
3. **LAG/LEAD offset access**:
```rust
let offset_start = Instant::now();
trace!(
"Offset access: {:.2}μs (row: {}, offset: {})",
offset_start.elapsed().as_micros(),
row_index,
offset
);
```
**Location**: `src/sql/window_context.rs`
**Benefit**: Pinpoints exact bottleneck (partition creation, sorting, or access)
**Effort**: 1-2 hours
### Phase 4: Execution Plan Integration (Polish - 2 hours)
Add window function steps to execution plan output:
```rust
// In query_engine.rs
if let Some(ref mut plan) = execution_plan {
plan.add_step(
StepType::WindowFunction,
"Evaluate window functions",
format!("{} window functions, {} partitions",
window_func_count, partition_count),
window_duration
);
}
```
**Update ExecutionPlan enum**: Add `WindowFunction` variant to `StepType`
**Benefit**: Shows window timing in `--execution-plan` output
**Effort**: 30 minutes
## Recommended Execution Order
### Week 1: Monday-Tuesday (Investigation)
1. ✅ **Phase 1** (15 min) - Add coarse-grained timing - **COMPLETED**
- Added `contains_window_function()` helper in `query_engine.rs`
- Modified `apply_select_items()` to detect and time window functions
- Added execution plan step with `StepType::WindowFunction`
- Logs total window function evaluation time and count
2. ⏸️ **Phase 2** (30 min) - Add WindowContext profiling - **TODO**
3. ⏸️ **Phase 3** (2 hours) - Add detailed timing - **TODO**
4. 🔬 **Run benchmarks** with `RUST_LOG=debug` to collect timing data
5. 📊 **Analyze** where time is spent
### Week 1: Wednesday-Thursday (Optimization)
Based on profiling data, likely hotspots to investigate:
#### Hotspot 1: Partition Creation
- Are we creating too many partitions?
- Can we cache partition boundaries?
- Memory allocation patterns?
#### Hotspot 2: Offset Access (LAG/LEAD)
- Are we cloning data unnecessarily?
- Can we use indices instead of cloning?
- Is the offset lookup O(1) or O(n)?
#### Hotspot 3: Per-Row Overhead
- Are we re-computing WindowContext for every row?
- Can we batch window function calls?
- Is there redundant work per row?
### Week 1: Friday (Validation)
4. ✅ **Phase 4** (30 min) - Add to execution plan
5. 📈 **Re-benchmark** to measure improvements
6. ✅ **Document** findings and optimizations
## Expected Outcomes
### Immediate (after profiling)
- Clear understanding of where time is spent
- Data-driven optimization targets
- Better --execution-plan output
### After Optimization (Target)
- **LAG/LEAD at 50k rows**: Reduce from 1.04s to ~600ms (40% improvement)
- **WindowContext creation**: Should be sub-100ms
- **Per-row overhead**: Should be sub-microseconds
### Success Metrics
- [ ] LAG/LEAD comparable to GROUP BY performance (~500-600ms at 50k rows)
- [ ] Clear timing breakdown in `--execution-plan`
- [ ] No more than 20% overhead from window function infrastructure
- [ ] Linear scaling confirmed (O(n) with acceptable constant factor)
## Debug Commands for Testing
```bash
# Basic window function with timing
RUST_LOG=debug ./target/release/sql-cli data.csv \
-q "SELECT *, LAG(value) OVER (ORDER BY id) as prev FROM data" \
--execution-plan
# Partitioned window function
RUST_LOG=debug ./target/release/sql-cli data.csv \
-q "SELECT *, LAG(value) OVER (PARTITION BY category ORDER BY id) as prev FROM data" \
--execution-plan
# Multiple window functions (worst case)
RUST_LOG=debug ./target/release/sql-cli data.csv \
-q "SELECT *,
LAG(value) OVER (ORDER BY id) as prev,
LEAD(value) OVER (ORDER BY id) as next,
ROW_NUMBER() OVER (ORDER BY id) as rn
FROM data" \
--execution-plan
# Large dataset benchmark
./target/release/sql-cli large_data.csv \
-q "SELECT *, LAG(sales) OVER (PARTITION BY region ORDER BY date) as prev_sales FROM data" \
--execution-plan
```
## Files to Modify
1. **src/data/query_engine.rs** - Add coarse window function timing (~line 1200+)
2. **src/data/arithmetic_evaluator.rs** - Add WindowContext & function timing
3. **src/sql/window_context.rs** - Add detailed partition/sort/access timing
4. **src/execution_plan.rs** - Add WindowFunction step type
5. **docs/WINDOW_FUNCTION_OPTIMIZATION.md** - Document findings (new file)
## Success Criteria
After completing profiling and optimization:
- ✅ Window functions show explicit timing in logs
- ✅ `--execution-plan` includes window function steps
- ⏸️ LAG/LEAD performance within 20% of GROUP BY
- ⏸️ Clear documentation of bottlenecks and solutions
- ⏸️ Benchmark improvements validated
## Implementation Notes
### Phase 1: Coarse-Grained Timing (COMPLETED - 2025-11-02)
**Changes Made**:
1. **Added `contains_window_function()` helper** in `src/data/query_engine.rs:405-465`
- Recursively checks if an expression tree contains window functions
- Similar to existing `contains_unnest()` helper
- Handles all expression variants: BinaryOp, Not, FunctionCall, CaseExpression, etc.
2. **Modified `apply_select_items()` signature** in `src/data/query_engine.rs:1589`
- Added `plan: &mut ExecutionPlanBuilder` parameter
- Updated caller at line 1330
3. **Added window function detection and timing** in `src/data/query_engine.rs:1606-1630`
- Detects window functions in select items
- Counts window functions for reporting
- Starts timing if window functions are present
4. **Added timing log and execution plan step** in `src/data/query_engine.rs:1811-1835`
- Logs: "Window function evaluation took X.XXms for N rows (M window functions)"
- Creates execution plan step with:
- StepType::WindowFunction (already existed in execution_plan.rs:33)
- Input/output row counts
- Window function count
- Evaluation time
**Test Results**:
- 10 rows, 1 window function: ~0.28ms
- 10 rows, 3 window functions: ~0.49ms
- 24 rows, 2 window functions (partitioned): ~0.67ms
**Example Output**:
```
└─ WINDOW Evaluate 3 window function(s) (rows: 10) [0.001ms]
• Input: 10 rows
• 3 window functions evaluated
• Evaluation time: 0.490ms
```
**Benefits Achieved**:
- ✅ Window functions now visible in `--execution-plan` output
- ✅ Baseline timing established for future optimization
- ✅ Can confirm window functions are the bottleneck when present
**Next Steps for Phase 2**:
- Add timing in `src/data/arithmetic_evaluator.rs` for WindowContext creation
- Add per-function timing (LAG vs LEAD vs ROW_NUMBER)
- Identify if creation or evaluation is the bottleneck