sql-cli 1.73.1 - Docs.rs

# GROUP BY Architecture Design

## Overview
Implement GROUP BY using a recursive descent approach with DataView partitions, where each grouping level creates a map of child views.

## Core Concepts

### 1. Single Column Grouping per Level
- Each DataView can handle ONE GROUP BY column at a time
- Creates a HashMap<GroupValue, DataView> for that level
- Multiple columns handled through recursive descent

### 2. Master View Pattern
```rust
struct GroupedView {
    source: DataView,
    group_column: String,
    partitions: HashMap<DataValue, DataView>,
    aggregates: Vec<AggregateExpr>,
}
```

### 3. Recursive Descent for Multiple Columns
For `GROUP BY country, city, department`:
```
MasterView
├── country="USA" → View
│   ├── city="NYC" → View
│   │   ├── department="Sales" → View (leaf - apply aggregates)
│   │   └── department="IT" → View (leaf - apply aggregates)
│   └── city="LA" → View
│       └── department="Sales" → View (leaf - apply aggregates)
└── country="UK" → View
    └── city="London" → View
        └── department="IT" → View (leaf - apply aggregates)
```

## Implementation Steps

### Phase 1: Additional Aggregate Functions
- [ ] GREATEST(col1, col2, ...) - Maximum value across columns
- [ ] LEAST(col1, col2, ...) - Minimum value across columns  
- [ ] GROUP_CONCAT(col) - Concatenate values
- [ ] STDDEV(col) - Standard deviation
- [ ] VARIANCE(col) - Variance

### Phase 2: Single Column GROUP BY
- [ ] Create PartitionManager that takes DataView + group column
- [ ] Build HashMap<DataValue, Vec<usize>> for row indices
- [ ] Create child DataView for each unique value
- [ ] Apply aggregates to each partition

### Phase 3: Multi-Column GROUP BY
- [ ] Recursive descent through group columns
- [ ] Each level creates its own partition map
- [ ] Leaf nodes apply aggregate calculations
- [ ] Bubble results back up through the tree

### Phase 4: HAVING Clause
- [ ] Filter aggregated results
- [ ] Apply after GROUP BY completes
- [ ] Reuse WHERE evaluator with aggregate results

## Example Query Flow

```sql
SELECT country, city, COUNT(*) as cnt, AVG(salary) as avg_sal
FROM employees
WHERE active = true
GROUP BY country, city
HAVING COUNT(*) > 10
ORDER BY avg_sal DESC
```

1. Apply WHERE → filtered DataView
2. Partition by country → Map<country, DataView>
3. For each country, partition by city → Map<city, DataView>
4. Apply aggregates to leaf views (COUNT, AVG)
5. Collect results into new DataTable
6. Apply HAVING on aggregate results
7. Apply ORDER BY on final results

## Key Design Decisions

### Why Single Column per View?
- Simpler implementation
- Clear ownership of partitions
- Natural recursive structure
- Easier to debug and optimize

### Why DataView-based?
- Reuses existing infrastructure
- Virtual views are lightweight
- Already have filtering/sorting
- Natural for window functions later

### Memory Considerations
- Views share underlying DataTable
- Only store row indices, not duplicate data
- Lazy evaluation where possible
- Can spill large groups to disk later

## Testing Strategy

### Start Simple
1. Single column GROUP BY with COUNT(*)
2. Add SUM, AVG, MIN, MAX
3. Multi-column GROUP BY
4. Complex aggregates (GREATEST, LEAST)
5. HAVING clause
6. Performance testing with large datasets

### Test Data
- Use trades data (GROUP BY symbol, exchange)
- Geographic data (GROUP BY country, city)
- Time series (GROUP BY date functions)

## Future Extensions

### Window Functions (Later)
- Can use similar partitioning approach
- OVER (PARTITION BY x ORDER BY y)
- Running totals, rank, row_number

### Optimization Opportunities
- Hash-based aggregation for large groups
- Sort-based aggregation for ordered data
- Parallel aggregation for independent groups
- Incremental aggregation for streaming

## Implementation Notes

### Tomorrow's Priority:
1. Start with GREATEST/LEAST functions (simpler, builds on existing)
2. Implement single-column GROUP BY with COUNT(*) 
3. Test with trades data grouping by symbol
4. Add remaining aggregate functions
5. Move to multi-column if time permits

### Code Location:
- `src/data/partition_manager.rs` - New partitioning logic
- `src/data/grouped_view.rs` - GROUP BY view implementation  
- `src/sql/aggregates/` - Additional aggregate functions
- `src/data/query_engine.rs` - Wire up GROUP BY execution