# GROUP BY Performance Analysis
## Executive Summary
In the tests below, GROUP BY operations in sql-cli are dominated by a single phase: **roughly 90-95% of execution time is spent building groups**, with the remainder going to aggregation and HAVING filtering.
## Performance Breakdown
### Test: 30,000 rows grouped into 10 categories
```
GROUP BY Total Time: 2421ms
├─ Phase 1 - Group Building: 2191ms (90.5%)
├─ Phase 2 - Aggregation: 230ms (9.5%)
└─ Phase 3 - HAVING Filter: 0ms (0%)
```
### Test: 50,000 rows grouped into 100 categories with HAVING filter
```
GROUP BY Total Time: 3822ms
├─ Phase 1 - Group Building: 3596ms (94.1%)
├─ Phase 2 - Aggregation: 212ms (5.5%)
└─ Phase 3 - HAVING Filter: 13ms (0.3%)
```
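Each percentage in the two breakdowns is that phase's share of the total time. As a quick check, the following Python sketch recomputes the shares from the timings above. It is illustrative only, and `phase_shares` is a hypothetical helper, not part of sql-cli.

```python
def phase_shares(total_ms, phases_ms):
    """Return each phase's share of total_ms as a percentage (1 decimal place)."""
    return {name: round(100.0 * ms / total_ms, 1) for name, ms in phases_ms.items()}


# Timings copied from the two test runs above.
run_30k = phase_shares(2421, {"group_building": 2191, "aggregation": 230, "having": 0})
run_50k = phase_shares(3822, {"group_building": 3596, "aggregation": 212, "having": 13})

print(run_30k)  # {'group_building': 90.5, 'aggregation': 9.5, 'having': 0.0}
print(run_50k)  # {'group_building': 94.1, 'aggregation': 5.5, 'having': 0.3}
```

In the second run the three phases sum to 3821ms, 1ms short of the reported total, presumably because the individual timings were rounded.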
## Key Findings
### 1. Group Building Dominates (90-95% of time)
This phase includes:
- Evaluating GROUP BY expressions for each row
- Creating hash keys from expression values
- Building hash map of group keys to row indices
- Creating DataView instances for each group
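The steps above can be sketched in a few lines of Python. This is a simplified model, not the sql-cli implementation: `build_groups` and `key_exprs` are hypothetical names, rows are plain dicts, and each group's DataView is represented only by its list of row indices.

```python
from collections import defaultdict


def build_groups(rows, key_exprs):
    """Map each distinct group key to the indices of the rows that produced it.

    rows      -- list of dicts (column name -> value), standing in for a table
    key_exprs -- list of callables, one per GROUP BY expression (hypothetical)
    """
    groups = defaultdict(list)
    for idx, row in enumerate(rows):
        # Evaluate every GROUP BY expression for this row and combine the
        # results into a hashable key (a tuple here).
        key = tuple(expr(row) for expr in key_exprs)
        # Record the row index under that key in the hash map.
        groups[key].append(idx)
    return dict(groups)


rows = [
    {"category": "a", "price": 10},
    {"category": "b", "price": 25},
    {"category": "a", "price": 5},
]
groups = build_groups(rows, [lambda r: r["category"]])
print(groups)  # {('a',): [0, 2], ('b',): [1]}
```

Every row pays for expression evaluation, key construction, and a hash lookup, so this loop does per-row work for the whole input regardless of how few groups result.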
### 2. Aggregation is Fast (5-10% of time)
Once groups are built, computing aggregates takes only 5-10% of total time. This phase covers:
- COUNT, SUM, AVG, MIN, MAX operations
- Iterating through pre-grouped rows
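As a rough illustration of this phase, the Python sketch below computes the five aggregates by walking each group's row indices. The names (`aggregate`, the dict-based rows) are assumptions made for the example, and NULL handling is ignored.

```python
def aggregate(rows, groups, column):
    """Compute COUNT, SUM, AVG, MIN and MAX of `column` for each pre-built group."""
    results = {}
    for key, indices in groups.items():
        # The group already lists its row indices, so no hashing happens here.
        values = [rows[i][column] for i in indices]
        total = sum(values)
        results[key] = {
            "count": len(values),
            "sum": total,
            "avg": total / len(values),
            "min": min(values),
            "max": max(values),
        }
    return results


rows = [
    {"category": "a", "price": 10},
    {"category": "b", "price": 25},
    {"category": "a", "price": 5},
]
groups = {("a",): [0, 2], ("b",): [1]}  # output of the group building phase
stats = aggregate(rows, groups, "price")
print(stats[("a",)])  # {'count': 2, 'sum': 15, 'avg': 7.5, 'min': 5, 'max': 10}
```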
### 3. HAVING Filtering is Negligible (<1% of time)
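Post-aggregation filtering adds minimal overhead: 13ms of 3822ms (0.3%) in the 50,000-row test, which is the run above that included a HAVING filter.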
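One plausible reason the cost is so low is that a HAVING predicate runs once per group rather than once per input row. A minimal Python sketch, with hypothetical names and made-up aggregate records:

```python
def apply_having(aggregated, predicate):
    """Keep only the groups whose aggregate record satisfies the HAVING predicate."""
    return {key: agg for key, agg in aggregated.items() if predicate(agg)}


# Hypothetical aggregation output: one small record per group.
aggregated = {
    ("a",): {"count": 2, "sum": 15},
    ("b",): {"count": 1, "sum": 25},
}

# Equivalent in spirit to: HAVING COUNT(*) > 1
kept = apply_having(aggregated, lambda agg: agg["count"] > 1)
print(kept)  # {('a',): {'count': 2, 'sum': 15}}
```

With 100 groups, as in the second test, the predicate is evaluated 100 times, compared with 50,000 key evaluations during group building.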
## Performance Characteristics
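| Rows | Groups | Total Time | Group Building | Aggregation |
|------|--------|------------|----------------|-------------|
| 1,000 | 10 | 80ms | 72ms (90%) | 8ms |
| 5,000 | 10 | 375ms | 340ms (91%) | 35ms |
| 10,000 | 10 | 737ms | 670ms (91%) | 67ms |
| 30,000 | 10 | 2,175ms | 1,960ms (90%) | 215ms |
| 50,000 | 10 | 3,675ms | 3,300ms (90%) | 375ms |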
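To see how total time scales with input size, the figures above can be converted to a per-row cost. The Python snippet below uses only the numbers from the table; the variable names are illustrative.

```python
# (rows, total_ms) pairs copied from the table above; every run uses 10 groups.
measurements = [
    (1_000, 80),
    (5_000, 375),
    (10_000, 737),
    (30_000, 2_175),
    (50_000, 3_675),
]

# Cost per row in microseconds: total milliseconds * 1000 / row count.
per_row_us = [round(1000 * total_ms / rows, 1) for rows, total_ms in measurements]
print(per_row_us)  # [80.0, 75.0, 73.7, 72.5, 73.5]
```

The per-row cost stays between roughly 72 and 80 microseconds across a 50x increase in rows, which suggests that total time grows approximately linearly with row count when the number of groups is fixed.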
## Optimization Opportunities
Because group building dominates, optimization effort should target that phase:
1. **Optimize Expression Evaluation**
- Cache ArithmeticEvaluator instances
- Batch evaluate expressions
- Use specialized fast paths for simple column references
2. **Improve Hash Key Building**
- Pre-allocate hash map with estimated cardinality
- Use faster hashing for simple types
- Avoid cloning values when building keys
3. **Reduce DataView Creation Overhead**
- Lazy DataView creation (only when needed)
- Share immutable structures between groups
- Use more efficient row index storage
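Two of these ideas, a fast path for simple column references and lazy DataView creation, can be sketched together. The Python below is an illustration under stated assumptions rather than sql-cli code: `LazyGroupView` and `build_groups_fast` are hypothetical names, and rows are plain dicts. Hash map pre-allocation is not shown because Python dicts do not expose a capacity hint.

```python
from collections import defaultdict


class LazyGroupView:
    """Holds only row indices; copies out the group's rows on first access."""

    def __init__(self, rows, indices):
        self._rows = rows          # shared reference, not copied per group
        self.indices = indices
        self._materialized = None

    @property
    def is_materialized(self):
        return self._materialized is not None

    @property
    def data(self):
        # Build the row list only when a caller actually asks for it.
        if self._materialized is None:
            self._materialized = [self._rows[i] for i in self.indices]
        return self._materialized


def build_groups_fast(rows, column):
    """Fast path for GROUP BY <single column>: the raw column value is the key."""
    index_lists = defaultdict(list)
    for idx, row in enumerate(rows):
        # No expression evaluator and no composite key for a plain column.
        index_lists[row[column]].append(idx)
    return {key: LazyGroupView(rows, idxs) for key, idxs in index_lists.items()}


rows = [{"category": "a"}, {"category": "b"}, {"category": "a"}]
views = build_groups_fast(rows, "category")
count_a = len(views["a"].indices)  # COUNT(*) needs only the indices
print(count_a, views["a"].is_materialized)  # 2 False
```

In this sketch an aggregate such as COUNT never triggers materialization, so the view's row list is built only for groups whose rows are actually read.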
## Conclusion
GROUP BY in sql-cli is dominated by a single phase: group building. It accounts for 90-95% of execution time in these tests, making it the primary target for optimization. Aggregation is already efficient at 5-10% of total time, and HAVING filtering is negligible.