# GROUP BY Implementation Design
## Current State Analysis
### AST Representation
The parser already captures GROUP BY in the AST:
```rust
pub struct SelectStatement {
pub group_by: Option<Vec<String>>, // GROUP BY columns
pub order_by: Option<Vec<OrderByColumn>>,
// ... other fields
}
```
**Issue Found**: Parser currently parses ORDER BY before GROUP BY, but SQL standard requires:
```sql
SELECT ... FROM ... WHERE ... GROUP BY ... ORDER BY ... LIMIT ...
```
### Current Gaps
1. No GROUP BY execution in query_executor.rs or query_engine.rs
2. Parser order needs fixing (GROUP BY should be parsed before ORDER BY)
3. No aggregation support when GROUP BY is present
## Proposed DataView-Based Implementation
### Core Concept
Use DataView's ability to create filtered views as the foundation for GROUP BY:
```rust
impl DataView {
/// Create a map of group keys to DataViews for each group
pub fn group_by(&self, group_columns: &[String]) -> Result<HashMap<Vec<DataValue>, DataView>> {
// 1. Get column indices for group columns
// 2. Build unique combinations of group values
// 3. For each unique combination, create a filtered DataView
// 4. Return map of group_key -> DataView
}
}
```
### Implementation Strategy
#### Phase 1: Basic GROUP BY Support
1. **Fix parser order** - GROUP BY before ORDER BY
2. **Implement DataView::group_by()** method
3. **Handle simple GROUP BY without aggregates**
#### Phase 2: Aggregate Functions with GROUP BY
1. **Detect aggregate functions** in SELECT when GROUP BY is present
2. **Apply aggregates** to each group's DataView
3. **Build result table** from aggregated values
#### Phase 3: Multi-level GROUP BY
For queries like `GROUP BY trader, book`:
```rust
// Recursive approach for multi-level grouping
pub fn group_by_recursive(&self, columns: &[String]) -> GroupedView {
if columns.is_empty() {
return GroupedView::Leaf(self.clone());
}
let first_col = &columns[0];
let remaining = &columns[1..];
let mut groups = HashMap::new();
for unique_value in self.unique_values(first_col) {
let filtered = self.filter_by_value(first_col, unique_value);
groups.insert(
unique_value,
filtered.group_by_recursive(remaining)
);
}
GroupedView::Node(groups)
}
```
### Execution Flow
```mermaid
graph TD
A[Parse Query] --> B{Has GROUP BY?}
B -->|No| C[Normal Query Execution]
B -->|Yes| D[Create Grouped Views]
D --> E{Has Aggregates?}
E -->|No| F[Select Distinct Group Keys]
E -->|Yes| G[Apply Aggregates to Each Group]
G --> H[Build Result Table]
F --> H
H --> I{Has ORDER BY?}
I -->|Yes| J[Sort Results]
I -->|No| K[Return Results]
J --> K
```
## Implementation Details
### 1. DataView Group By Method
```rust
impl DataView {
pub fn group_by(&self, group_columns: &[String]) -> Result<BTreeMap<GroupKey, DataView>> {
let mut groups = BTreeMap::new();
let source = self.source();
// Get column indices
let col_indices: Vec<usize> = group_columns
.iter()
.map(|col| source.get_column_index(col))
.collect::<Option<Vec<_>>>()
.ok_or_else(|| anyhow!("Invalid group by column"))?;
// Iterate through visible rows
for row_idx in self.get_filtered_rows() {
// Build group key from row values
let mut key = Vec::new();
for &col_idx in &col_indices {
key.push(source.get_value(row_idx, col_idx).clone());
}
// Add row to appropriate group
groups.entry(GroupKey(key))
.or_insert_with(|| DataView::new(source.clone()))
.add_row_to_filter(row_idx);
}
Ok(groups)
}
}
```
### 2. Query Executor Integration
```rust
// In query_executor.rs
pub fn execute_select(&mut self, stmt: &SelectStatement) -> Result<DataTable> {
// ... existing WHERE clause handling ...
if let Some(group_by_cols) = &stmt.group_by {
// Create grouped views
let groups = self.data_view.group_by(group_by_cols)?;
// Check for aggregate functions
let has_aggregates = stmt.select_items.iter().any(|item| {
matches!(item, SelectItem::Expression { expr, .. }
if is_aggregate_function(expr))
});
if has_aggregates {
// Process aggregates for each group
return self.execute_grouped_aggregates(groups, stmt);
} else {
// Return distinct group keys
return self.execute_grouped_select(groups, stmt);
}
}
// ... existing non-grouped execution ...
}
```
### 3. Aggregate Processing
```rust
fn execute_grouped_aggregates(
&self,
groups: BTreeMap<GroupKey, DataView>,
stmt: &SelectStatement
) -> Result<DataTable> {
let mut result = DataTable::new("result");
// Add columns for group keys and aggregates
for col in &stmt.columns {
result.add_column(col);
}
// Process each group
for (group_key, group_view) in groups {
let mut row = Vec::new();
for item in &stmt.select_items {
match item {
SelectItem::Column(col) if stmt.group_by.contains(col) => {
// Add group key value
let idx = stmt.group_by.iter().position(|c| c == col).unwrap();
row.push(group_key.0[idx].clone());
}
SelectItem::Expression { expr, .. } => {
// Evaluate aggregate on group
let value = self.evaluate_aggregate(expr, &group_view)?;
row.push(value);
}
_ => return Err(anyhow!("Non-aggregate column in GROUP BY"))
}
}
result.add_row(row)?;
}
Ok(result)
}
```
## Testing Strategy
### Test Cases
1. **Simple GROUP BY**
```sql
SELECT trader FROM trades GROUP BY trader
```
2. **GROUP BY with COUNT**
```sql
SELECT trader, COUNT(*) as total FROM trades GROUP BY trader
```
3. **Multi-column GROUP BY**
```sql
SELECT trader, book, SUM(quantity) FROM trades GROUP BY trader, book
```
4. **GROUP BY with ORDER BY**
```sql
SELECT trader, SUM(quantity) as total
FROM trades
GROUP BY trader
ORDER BY total DESC
```
5. **GROUP BY with WHERE**
```sql
SELECT trader, COUNT(*)
FROM trades
WHERE price > 50
GROUP BY trader
```
## Performance Considerations
1. **Memory Usage**: Each group creates a new DataView (lightweight - just indices)
2. **Optimization**: Use BTreeMap for sorted group keys
3. **Lazy Evaluation**: Don't materialize groups until needed
4. **Index Reuse**: Share source DataTable across all group views
## Next Steps
1. โ
Analyze current AST structure
2. ๐ Document implementation plan (this document)
3. ๐ง Fix parser to handle GROUP BY before ORDER BY
4. ๐๏ธ Implement DataView::group_by() method
5. ๐ Integrate into query executor
6. ๐งช Add comprehensive tests
7. ๐ Handle aggregate functions with GROUP BY
8. ๐ฏ Optimize performance for large datasets