featherdb-query 1.0.0

SQL parser, planner, optimizer, and executor for FeatherDB
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
# FeatherDB Query Engine

The query engine transforms SQL strings into executable query plans and produces results. It features a cost-based optimizer with index selection, join ordering, and prepared statement support with plan caching.

## Architecture Overview

```
                              SQL String
                                  |
                                  v
+-------------------------------------------------------------+
|  Parser (parser.rs)                                          |
|  - Uses sqlparser-rs for SQL parsing                         |
|  - Rich error messages with keyword suggestions              |
|  - Data type conversion to FeatherDB types                   |
+-------------------------------------------------------------+
                                  |
                                  v
+-------------------------------------------------------------+
|  Planner (planner.rs)                                        |
|  - AST to LogicalPlan conversion                             |
|  - Table/column resolution with suggestions                  |
|  - Parameter placeholder support ($1, $2, ?1, ?2)            |
+-------------------------------------------------------------+
                                  |
                                  v
+-------------------------------------------------------------+
|  Optimizer (optimizer.rs)                                    |
|  - Predicate pushdown                                        |
|  - Projection pushdown                                       |
|  - Constant folding                                          |
|  - Index selection (100-1000x speedup)                       |
|  - Join order optimization (greedy algorithm)                |
|  - Cost estimation with table statistics                     |
+-------------------------------------------------------------+
                                  |
                                  v
+-------------------------------------------------------------+
|  Executor (executor.rs)                                      |
|  - Volcano model (pull-based iteration)                      |
|  - Hash join for equi-joins                                  |
|  - Index scan execution                                      |
|  - MVCC visibility checking                                  |
|  - EXPLAIN with cost annotations                             |
+-------------------------------------------------------------+
                                  |
                                  v
                           Query Results
```

## Components

### 1. Parser (`parser.rs`)

Wraps `sqlparser-rs` to parse SQL into AST with enhanced error messages.

**Features:**
- Parse SQL strings into AST using `GenericDialect`
- Rich error context with line/column positions
- Keyword typo detection and suggestions (e.g., "SELEKT" suggests "SELECT")
- Data type conversion to FeatherDB's `ColumnType`

**Supported SQL:**
```sql
-- DDL
CREATE TABLE name (col TYPE [constraints], ...)
DROP TABLE [IF EXISTS] name
CREATE INDEX name ON table (col)

-- DML
SELECT [DISTINCT] cols FROM table [AS alias]
       [JOIN table ON condition]
       [WHERE condition]
       [GROUP BY cols] [HAVING condition]
       [ORDER BY cols [ASC|DESC]]
       [LIMIT n [OFFSET m]]

-- CTEs (Common Table Expressions)
WITH cte_name [(col1, col2, ...)] AS (
  SELECT ...
)
SELECT * FROM cte_name;

-- Subqueries
SELECT * FROM table WHERE col IN (SELECT ...);
SELECT * FROM table WHERE EXISTS (SELECT ...);
SELECT * FROM (SELECT ...) AS derived_table;

INSERT INTO table [(cols)] VALUES (...)
UPDATE table SET col = expr [WHERE condition]
DELETE FROM table [WHERE condition]

-- Query analysis
EXPLAIN [VERBOSE] [ANALYZE] query
```

**Data Types:**
| SQL Type | FeatherDB Type |
|----------|----------------|
| `BOOLEAN` | `Boolean` |
| `SMALLINT`, `INT`, `INTEGER`, `BIGINT`, `TINYINT` | `Integer` |
| `FLOAT`, `REAL`, `DOUBLE`, `DECIMAL`, `NUMERIC` | `Real` |
| `CHAR(n)`, `VARCHAR(n)`, `TEXT` | `Text { max_len }` |
| `BINARY`, `VARBINARY`, `BLOB`, `BYTEA` | `Blob { max_len }` |
| `TIMESTAMP`, `DATETIME`, `DATE` | `Timestamp` |

### 2. Planner (`planner.rs`)

Converts SQL AST to logical query plans with comprehensive error handling.

**Logical Plan Types:**
```rust
enum LogicalPlan {
    // Data access
    Scan { table, alias, projection, filter },
    IndexScan { table, alias, index, range, residual_filter, projection },

    // Subqueries and CTEs
    SubqueryScan { input, alias, schema },
    CTE { name, input, column_aliases },
    ScalarSubquery { input },
    Exists { input },
    InSubquery { expr, subquery },

    // Special join types
    SemiJoin { left, right, condition },
    AntiJoin { left, right, condition },

    // Relational operators
    Filter { input, predicate },
    Project { input, exprs },
    Join { left, right, join_type, condition },
    Aggregate { input, group_by, aggregates },
    Sort { input, order_by },
    Limit { input, limit, offset },
    Distinct { input },

    // DML
    Insert { table, columns, values },
    Update { table, assignments, filter },
    Delete { table, filter },

    // DDL
    CreateTable { table },
    DropTable { name, if_exists },

    // Utility
    EmptyRelation,
    Explain { input, verbose, analyze },
}
```

**Join Types:**
- `Inner` - Standard inner join
- `Left` - Left outer join
- `Right` - Right outer join
- `Full` - Full outer join

**Index Range Types:**
- `point(value)` - Equality lookup (fastest)
- `between(start, end)` - Range with both bounds
- `from_inclusive/exclusive(value)` - Range >= or >
- `to_inclusive/exclusive(value)` - Range <= or <
- `full()` - Full index scan

**Error Handling:**
- Table not found with suggestions for typos
- Column not found with similar name suggestions
- Ambiguous column references in joins
- Type validation for constraints

### 3. Optimizer (`optimizer.rs`)

Applies transformation rules and cost-based optimization.

**Optimization Rules:**

| Rule | Description | Benefit |
|------|-------------|---------|
| **Predicate Pushdown** | Push filters closer to table scans | Reduces rows early |
| **Projection Pushdown** | Only read needed columns | Reduces I/O |
| **Constant Folding** | Evaluate constant expressions at plan time | Avoids runtime computation |
| **Index Selection** | Convert Scan+Filter to IndexScan | 100-1000x speedup |
| **Join Order Optimization** | Reorder joins by estimated cardinality | 10-100x speedup |

**Index Selection:**

The optimizer automatically selects indexes for:
- Equality predicates: `WHERE id = 5` -> point lookup
- Range predicates: `WHERE age > 18 AND age < 65` -> range scan
- Combined predicates: indexed predicate + residual filter

```sql
-- Example: query with index
SELECT * FROM users WHERE id = 5 AND name = 'Alice';
-- Optimizer creates: IndexScan(id=5) + residual_filter(name='Alice')
```

**Cost Estimation:**

The `CostEstimator` provides:
- Cardinality estimation for each plan node
- Cost estimation based on:
  - Sequential scan: 1.0 per row
  - Index scan: 1.5 per row (but fewer rows)
  - Hash join: build cost + probe cost
  - Sort: n log n amortized

**Cost Constants:**
```rust
SEQ_SCAN_COST_PER_ROW: 1.0
INDEX_SCAN_COST_PER_ROW: 1.5
HASH_JOIN_COST_PER_ROW: 0.1
HASH_BUILD_COST_PER_ROW: 1.0
SORT_COST_PER_ROW: 2.0
DEFAULT_SELECTIVITY: 0.1
DEFAULT_RANGE_SELECTIVITY: 0.3
DEFAULT_JOIN_SELECTIVITY: 0.1
```

**Join Order Optimization:**

For multi-way joins (3+ tables), uses greedy algorithm:
1. Extract all base relations and predicates
2. Estimate cardinalities using catalog statistics
3. Iteratively join the pair with smallest intermediate result
4. Place smaller table on build side of hash join

### 4. Executor (`executor.rs`)

Executes physical operators using the Volcano (iterator) model.

**Physical Operators:**

| Operator | Description |
|----------|-------------|
| `SeqScan` | Sequential table scan with optional pushed filter |
| `IndexScan` | Index-based lookup (point or range) |
| `SubqueryScan` | Execute subquery and expose as table |
| `SemiJoin` | Semi-join for EXISTS and IN subqueries |
| `AntiJoin` | Anti-join for NOT EXISTS and NOT IN subqueries |
| `Filter` | Evaluate predicate on each row |
| `Project` | Compute output expressions, expand wildcards |
| `HashJoin` | Hash-based equi-join (preferred for equality) |
| `NestedLoopJoin` | Fallback for non-equi joins |
| `HashAggregate` | Group-by aggregation with hash table |
| `Sort` | In-memory quicksort |
| `Limit` | Limit/offset row output |
| `Distinct` | Remove duplicate rows via hash set |

**Execution Context:**
```rust
struct ExecutionContext<'a> {
    catalog: &'a Catalog,
    buffer_pool: &'a Arc<BufferPool>,
    snapshot: &'a Snapshot,      // MVCC visibility
    txn_manager: &'a TransactionManager,
}
```

**Query Results:**
```rust
enum QueryResult {
    Rows(Vec<Row>),      // SELECT results
    Affected(usize),      // INSERT/UPDATE/DELETE count
    SchemaChanged,        // CREATE/DROP
}
```

**EXPLAIN Support:**

```sql
-- Show query plan with cost estimates
EXPLAIN SELECT * FROM users WHERE id = 5;

-- Show detailed costs
EXPLAIN VERBOSE SELECT * FROM users WHERE id = 5;

-- Actually execute and show timing
EXPLAIN ANALYZE SELECT * FROM users WHERE id = 5;
```

Output includes:
- Plan tree structure
- Cost estimates per node
- Row estimates per node
- For ANALYZE: actual execution time and row counts

### 5. Expressions (`expr.rs`)

Expression representation and evaluation.

**Expression Types:**
```rust
enum Expr {
    Column { table, name, index },
    Literal(Value),
    Parameter { index },           // For prepared statements
    BinaryOp { left, op, right },
    UnaryOp { op, expr },
    Function { name, args },
    Aggregate { func, arg, distinct },
    Case { operand, when_clauses, else_result },
    Between { expr, low, high, negated },
    InList { expr, list, negated },
    Like { expr, pattern, negated },
    Wildcard,                      // SELECT *
    QualifiedWildcard(table),      // SELECT table.*
}
```

**Binary Operators:**
- Arithmetic: `+`, `-`, `*`, `/`, `%`
- Comparison: `=`, `!=`, `<`, `<=`, `>`, `>=`
- Logical: `AND`, `OR`
- String: `||` (concatenation)

**Unary Operators:**
- `NOT` - Logical negation
- `-` - Numeric negation
- `IS NULL` / `IS NOT NULL`

**Aggregate Functions:**
- `COUNT(*)`, `COUNT(col)`, `COUNT(DISTINCT col)`
- `SUM(col)`, `AVG(col)`
- `MIN(col)`, `MAX(col)`

**Built-in Functions:**

*String Functions:*
- `UPPER(text)`, `LOWER(text)` - Case conversion
- `LENGTH(text)`, `LENGTH(blob)` - Get length
- `SUBSTR(text, start [, length])` - Extract substring (1-based indexing, SQL standard)
- `TRIM(text)` - Remove leading and trailing whitespace
- `LTRIM(text)` - Remove leading whitespace
- `RTRIM(text)` - Remove trailing whitespace
- `REPLACE(text, from, to)` - Replace all occurrences

*Date/Time Functions:*
- `DATE(timestamp)` - Extract date portion ("YYYY-MM-DD")
- `YEAR(timestamp)` - Extract year as integer
- `MONTH(timestamp)` - Extract month as integer (1-12)
- `DAY(timestamp)` - Extract day as integer (1-31)
- `NOW()` - Get current timestamp

*Numeric Functions:*
- `ABS(numeric)` - Absolute value

*Utility Functions:*
- `COALESCE(val1, val2, ...)` - First non-NULL value
- `NULLIF(val1, val2)` - Return NULL if equal

**NULL Handling:**
- Most operations propagate NULL
- `AND`: `FALSE AND NULL` = `FALSE`
- `OR`: `TRUE OR NULL` = `TRUE`
- Division by zero returns NULL

### 6. Prepared Statements (`prepared.rs`)

Plan caching for repeated queries (5-10x speedup).

**PreparedStatement:**
```rust
struct PreparedStatement {
    sql: String,
    plan: LogicalPlan,        // Cached optimized plan
    param_types: Vec<Option<ColumnType>>,
    param_count: usize,
    id: u64,
}
```

**Parameter Placeholders:**
- PostgreSQL style: `$1`, `$2`, `$3`, ...
- Question mark with index: `?1`, `?2`, `?3`, ...

**PlanCache:**
- LRU cache with configurable max size (default: 1000)
- Automatic eviction of least recently used plans
- Schema-aware invalidation (on DDL)
- Table-specific invalidation

**Cache Statistics:**
```rust
struct CacheStats {
    hits: u64,          // Cache hits
    misses: u64,        // Cache misses
    evictions: u64,     // LRU evictions
    invalidations: u64, // Schema invalidations
}
```

**Usage Example:**
```rust
use featherdb_query::{PlanCache, PreparedStatement, substitute_plan_params};

// Create cache
let cache = PlanCache::new(1000);

// Prepare statement (parse + plan + optimize once)
let stmt = db.prepare("SELECT * FROM users WHERE id = $1")?;

// Execute multiple times with different parameters
let params = vec![Value::Integer(1)];
let plan = substitute_plan_params(stmt.plan(), &params)?;
let result = Executor::execute(&ctx, &plan)?;

// Cache statistics
println!("Hit ratio: {:.2}%", cache.stats().hit_ratio() * 100.0);
```

## Query Execution Flow

```
1. Parse:    "SELECT name FROM users WHERE id = $1"
                              |
2. Plan:     Project(name)
                 |__ Filter(id = $1)
                       |__ Scan(users)
                              |
3. Optimize: IndexScan(users, idx_id, point($1), project=[name])
                              |
4. Substitute: IndexScan(users, idx_id, point(42), project=[name])
                              |
5. Execute:  IndexScan iterator -> MVCC visibility check -> project
                              |
6. Results:  Row { name: "Alice" }
```

## Usage Examples

### Basic Query Execution

```rust
use featherdb_query::{Parser, Planner, Optimizer, Executor, ExecutionContext};

// Parse SQL
let stmt = Parser::parse_one("SELECT * FROM users WHERE age > 21")?;

// Create logical plan
let planner = Planner::new(&catalog);
let plan = planner.plan(&stmt)?;

// Optimize (with index selection and cost-based optimization)
let optimizer = Optimizer::with_catalog(catalog.clone());
let optimized = optimizer.optimize(plan);

// Execute
let ctx = ExecutionContext {
    catalog: &catalog,
    buffer_pool: &buffer_pool,
    snapshot: &snapshot,
    txn_manager: &txn_manager,
};

match Executor::execute(&ctx, &optimized)? {
    QueryResult::Rows(rows) => {
        for row in rows {
            println!("{}: {:?}", row.columns[0], row.values[0]);
        }
    }
    QueryResult::Affected(n) => println!("{} rows affected", n),
    QueryResult::SchemaChanged => println!("Schema updated"),
}
```

### Using Prepared Statements

```rust
use featherdb_query::{PlanCache, Parser, Planner, Optimizer};
use featherdb_query::{PreparedStatement, substitute_plan_params, count_plan_params};

// Create plan cache
let cache = PlanCache::new(100);

// Parse and plan with parameters
let sql = "SELECT * FROM users WHERE id = $1 AND status = $2";
let stmt = Parser::parse_one(sql)?;
let plan = planner.plan(&stmt)?;
let optimized = optimizer.optimize(plan);

// Create prepared statement
let param_count = count_plan_params(&optimized);
let prepared = PreparedStatement::new(
    sql.to_string(),
    optimized,
    vec![None; param_count],
    param_count,
    cache.next_id(),
);

// Cache it
cache.insert(sql.to_string(), prepared.clone());

// Execute with parameters
let params = vec![Value::Integer(42), Value::Text("active".into())];
prepared.validate_params(&params)?;
let bound_plan = substitute_plan_params(prepared.plan(), &params)?;
let result = Executor::execute(&ctx, &bound_plan)?;
```

### Using EXPLAIN

```rust
// Show query plan
let stmt = Parser::parse_one("EXPLAIN SELECT * FROM users WHERE id = 5")?;
let plan = planner.plan(&stmt)?;
let result = Executor::execute(&ctx, &plan)?;

// Output:
// QUERY PLAN
// ============================================================
// -> Index Lookup using idx_users_id on users
//
// Estimated total cost: 1.50
// Estimated rows: 1

// With verbose costs
let stmt = Parser::parse_one("EXPLAIN VERBOSE SELECT * FROM users")?;
// Shows cost and row estimates for each node

// With actual execution stats
let stmt = Parser::parse_one("EXPLAIN ANALYZE SELECT * FROM users")?;
// Shows actual execution time and row counts
```

## Configuration Options

### Optimizer Configuration

```rust
// Without catalog (no index selection or join ordering)
let optimizer = Optimizer::new();

// With catalog (enables all optimizations)
let optimizer = Optimizer::with_catalog(catalog.clone());
```

### Plan Cache Configuration

```rust
// Custom cache size
let cache = PlanCache::new(500);

// Default size (1000 entries)
let cache = PlanCache::default_size();

// Invalidate on schema changes
cache.invalidate_all();

// Invalidate specific table
cache.invalidate_table("users");
```

## Testing

```bash
# Run all query engine tests (using Make)
make test-crate CRATE=featherdb-query

# Or with cargo directly
cargo test -p featherdb-query

# Test specific component
cargo test -p featherdb-query parser
cargo test -p featherdb-query planner
cargo test -p featherdb-query optimizer
cargo test -p featherdb-query executor
cargo test -p featherdb-query prepared
cargo test -p featherdb-query expr

# Run with coverage
make coverage  # or: cargo llvm-cov --workspace
```

## Performance Considerations

1. **Index Selection**: Ensure indexes exist on frequently filtered columns
2. **Predicate Pushdown**: Filters are automatically pushed to scans
3. **Projection Pruning**: Select only needed columns
4. **Join Order**: Optimizer reorders joins for multi-table queries
5. **Prepared Statements**: Use for repeated queries (5-10x speedup)
6. **Statistics**: Keep table statistics updated for accurate cost estimates

## Implemented Features

**Core Query Processing:**
- [x] SQL parsing with rich error messages
- [x] Logical plan generation
- [x] Predicate pushdown optimization
- [x] Projection pushdown optimization
- [x] Constant folding
- [x] Index selection (point and range)
- [x] Join order optimization
- [x] Cost-based estimation with statistics
- [x] Prepared statements with plan caching
- [x] EXPLAIN with cost annotations
- [x] Hash join for equi-joins
- [x] MVCC-aware execution

**Advanced SQL:**
- [x] Common Table Expressions (WITH clause, column aliasing, chained CTEs)
- [x] Subquery support (scalar, IN, NOT IN, EXISTS, NOT EXISTS, correlated)
- [x] Semi-join and anti-join optimization
- [x] Derived tables (subquery in FROM clause)
- [x] UNION / UNION ALL / INTERSECT / EXCEPT set operations
- [x] CREATE VIEW / DROP VIEW
- [x] CAST, EXTRACT, date interval arithmetic

**Functions:**
- [x] String: UPPER, LOWER, LENGTH, SUBSTR, TRIM, LTRIM, RTRIM, REPLACE
- [x] Date/time: DATE, YEAR, MONTH, DAY, NOW
- [x] Math: ABS, SQRT, POWER, FLOOR, CEIL, LOG, MOD
- [x] Aggregate: COUNT, SUM, AVG, MIN, MAX
- [x] Window: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD
- [x] Utility: COALESCE, NULLIF

**Constraints:**
- [x] CHECK constraints (parsing and enforcement)
- [x] Foreign key constraints (CASCADE, RESTRICT, SET NULL, NO ACTION)
- [x] UNIQUE constraints
- [x] NOT NULL constraints

## Future Improvements

- [ ] Parallel query execution
- [ ] Lateral joins
- [ ] Recursive CTEs
- [ ] External sort for large datasets
- [ ] Merge join algorithm
- [ ] Bitmap index scans
- [ ] Adaptive hash aggregate with spill-to-disk