sql-cli 1.54.0 - Docs.rs

# GROUP BY Performance Analysis

## Executive Summary
GROUP BY operations in sql-cli are dominated by a single phase: **group building accounts for roughly 90-95% of execution time** in the tests below.

## Performance Breakdown

### Test: 30,000 rows grouped into 10 categories
```
GROUP BY Total Time: 2421ms
├─ Phase 1 - Group Building: 2191ms (90.5%)
├─ Phase 2 - Aggregation: 230ms (9.5%)
└─ Phase 3 - HAVING Filter: 0ms (0%)
```

### Test: 50,000 rows grouped into 100 categories with HAVING filter
```
GROUP BY Total Time: 3822ms
├─ Phase 1 - Group Building: 3596ms (94.1%)
├─ Phase 2 - Aggregation: 212ms (5.5%)
└─ Phase 3 - HAVING Filter: 13ms (0.3%)
```
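
Breakdowns like the two above could be collected by timing each phase separately. The following is a minimal sketch using `std::time::Instant`; `PhaseTimings` and `timed` are hypothetical names for illustration and are not taken from sql-cli's code.

```rust
use std::time::{Duration, Instant};

/// Hypothetical container for per-phase timings (not part of sql-cli's API).
#[derive(Debug, Default)]
struct PhaseTimings {
    group_building: Duration,
    aggregation: Duration,
    having_filter: Duration,
}

impl PhaseTimings {
    /// Sum of the three phases.
    fn total(&self) -> Duration {
        self.group_building + self.aggregation + self.having_filter
    }

    /// Share of the total spent in `phase`, as a percentage.
    fn percent(&self, phase: Duration) -> f64 {
        let total = self.total().as_secs_f64();
        if total == 0.0 {
            0.0
        } else {
            100.0 * phase.as_secs_f64() / total
        }
    }
}

/// Runs `f` and returns its result together with the elapsed wall-clock time.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}
```

Each phase would be wrapped in `timed(...)` and its duration stored in the matching field; `percent` then yields the figures shown in parentheses.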

## Key Findings

### 1. Group Building Dominates (90-95% of time)
This phase includes:
- Evaluating GROUP BY expressions for each row
- Creating hash keys from expression values
- Building hash map of group keys to row indices
- Creating DataView instances for each group
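
The first three steps above can be sketched as a hash-based grouping loop. This is an illustration under simplifying assumptions, not sql-cli's implementation: the `Value` enum and `build_groups` function are hypothetical, GROUP BY expressions are reduced to plain column references, and the final DataView creation step is omitted.

```rust
use std::collections::HashMap;

/// Simplified cell value; an assumption for illustration, not sql-cli's value type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Value {
    Null,
    Int(i64),
    Text(String),
}

/// Maps each distinct group key to the indices of the rows that produced it.
fn build_groups(rows: &[Vec<Value>], key_columns: &[usize]) -> HashMap<Vec<Value>, Vec<usize>> {
    let mut groups: HashMap<Vec<Value>, Vec<usize>> = HashMap::new();
    for (row_idx, row) in rows.iter().enumerate() {
        // "Evaluate" the GROUP BY expressions; here they are plain column references.
        let key: Vec<Value> = key_columns.iter().map(|&c| row[c].clone()).collect();
        // Hash the key and append this row's index to its group.
        groups.entry(key).or_default().push(row_idx);
    }
    groups
}
```

Even in this reduced form, every input row pays for a key allocation, value clones, and a hash lookup, which is consistent with this phase scaling with row count.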

### 2. Aggregation is Fast (5-10% of time)
Once groups are built, computing aggregates is comparatively cheap. This phase covers:
- COUNT, SUM, AVG, MIN, MAX operations
- Iterating through pre-grouped rows
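
As a rough illustration of why this phase is cheap, all five aggregates can be computed in a single pass over a group's row indices. The `Aggregates` struct and `aggregate` function below are hypothetical and assume a numeric column already available as `f64` values.

```rust
/// Aggregates for one group; hypothetical struct for illustration.
#[derive(Debug, PartialEq)]
struct Aggregates {
    count: usize,
    sum: f64,
    avg: f64,
    min: f64,
    max: f64,
}

/// Computes COUNT, SUM, AVG, MIN and MAX in one pass over a group's row indices.
/// Returns `None` for an empty group.
fn aggregate(values: &[f64], row_indices: &[usize]) -> Option<Aggregates> {
    let (&first, rest) = row_indices.split_first()?;
    let mut sum = values[first];
    let mut min = values[first];
    let mut max = values[first];
    for &i in rest {
        let v = values[i];
        sum += v;
        min = min.min(v);
        max = max.max(v);
    }
    let count = row_indices.len();
    Some(Aggregates { count, sum, avg: sum / count as f64, min, max })
}
```

No hashing or key construction happens here; the loop only reads values through indices that the group building phase already produced.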

### 3. HAVING Filtering is Negligible (<1% of time)
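HAVING is applied after aggregation, so it evaluates one row per group rather than one per input row. In the second test above it took 13ms for 100 groups.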
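
A minimal sketch of post-aggregation filtering, assuming a hypothetical `GroupRow` output shape and a predicate such as `HAVING COUNT(*) > 2`:

```rust
/// One output row after aggregation; hypothetical shape for illustration.
#[derive(Debug, Clone, PartialEq)]
struct GroupRow {
    key: String,
    count: usize,
}

/// Keeps only the groups that satisfy the HAVING predicate.
fn apply_having(groups: Vec<GroupRow>, predicate: impl Fn(&GroupRow) -> bool) -> Vec<GroupRow> {
    groups.into_iter().filter(|g| predicate(g)).collect()
}
```

The input here is the list of aggregated groups (10 or 100 in the tests above), not the 30,000 or 50,000 source rows.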

## Performance Characteristics

| Row Count | Groups | Total Time | Group Building | Aggregation |
|-----------|--------|------------|---------------|-------------|
| 1,000     | 10     | 80ms       | 72ms (90%)    | 8ms         |
| 5,000     | 10     | 375ms      | 340ms (91%)   | 35ms        |
| 10,000    | 10     | 737ms      | 670ms (91%)   | 67ms        |
| 30,000    | 10     | 2,175ms    | 1,960ms (90%) | 215ms       |
| 50,000    | 10     | 3,675ms    | 3,300ms (90%) | 375ms       |
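
Dividing total time by row count gives a quick check of how the cost scales. The sketch below hard-codes the figures from the table; the constant and function names are illustrative only.

```rust
/// (row count, total ms, group-building ms) copied from the table above.
const MEASUREMENTS: [(u32, u32, u32); 5] = [
    (1_000, 80, 72),
    (5_000, 375, 340),
    (10_000, 737, 670),
    (30_000, 2_175, 1_960),
    (50_000, 3_675, 3_300),
];

/// Average total cost per input row, in microseconds.
fn micros_per_row(rows: u32, total_ms: u32) -> f64 {
    f64::from(total_ms) * 1_000.0 / f64::from(rows)
}

/// Percentage of total time spent in group building.
fn group_building_share(total_ms: u32, group_ms: u32) -> f64 {
    100.0 * f64::from(group_ms) / f64::from(total_ms)
}
```

For these measurements the per-row cost stays between 72.5 and 80 microseconds, which suggests roughly linear scaling with row count when the number of groups is fixed at 10.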

## Optimization Opportunities

Based on the single-phase dominance pattern:

1. **Optimize Expression Evaluation**
   - Cache ArithmeticEvaluator instances
   - Batch evaluate expressions
   - Use specialized fast paths for simple column references

2. **Improve Hash Key Building**
   - Pre-allocate hash map with estimated cardinality
   - Use faster hashing for simple types
   - Avoid cloning values when building keys

3. **Reduce DataView Creation Overhead**
   - Lazy DataView creation (only when needed)
   - Share immutable structures between groups
   - Use more efficient row index storage
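
Several of these ideas can be combined in a fast path for the common case of grouping by a single text column. The sketch below is an assumption about what such a path might look like, not sql-cli's code: `group_by_column` is a hypothetical function, and `estimated_groups` would have to come from some cardinality estimate.

```rust
use std::collections::HashMap;

/// Groups rows by a single text column without cloning values.
/// Keys borrow from `column`, the map is pre-sized, and row indices are
/// stored as `u32` (assumes fewer than 2^32 rows) to reduce index memory.
fn group_by_column<'a>(
    column: &'a [String],
    estimated_groups: usize,
) -> HashMap<&'a str, Vec<u32>> {
    debug_assert!(column.len() <= u32::MAX as usize);
    // Pre-allocate so the map does not rehash while groups are discovered.
    let mut groups: HashMap<&'a str, Vec<u32>> = HashMap::with_capacity(estimated_groups);
    for (row_idx, value) in column.iter().enumerate() {
        // Borrowed key: no String clone and no per-row key allocation.
        groups.entry(value.as_str()).or_default().push(row_idx as u32);
    }
    groups
}
```

Returning only row indices also leaves room for lazy DataView creation: a view would be materialized for a group only when a later step needs it. Whether any of this helps in practice would need to be confirmed by re-running the measurements above.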

## Conclusion

The GROUP BY operation is dominated by a single phase: group building. It accounts for 90-95% of execution time, which makes it the primary target for optimization. Aggregation takes only 5-10% of total time, and HAVING filtering less than 1%.