kore_fileformat 1.1.4

KORE — Killer Optimized Record Exchange: standalone Rust crate (zero deps)
Documentation
# Phase 7: Query Optimization

**Status:** 🚀 In Progress  
**Timeline:** 1-2 weeks  
**Target:** Advanced compression, caching, indexing  

## Overview

Query optimization enables:
- ✅ Adaptive compression selection
- ✅ Cost-based query planning
- ✅ Metadata caching
- ✅ Column/partition pruning
- ✅ Predicate pushdown
- ✅ Index-based lookups

## Optimizations

### 1. Adaptive Compression

Select best codec per column:
- **RLE** → Low cardinality (states, flags)
- **FOR** → Numeric (integers, timestamps)
- **Dictionary** → Text/strings with repeats
- **LZSS** → High entropy data

### 2. Cost-Based Planning

Estimate query cost:
- Rows to read
- Bytes to decompress
- CPU cycles needed
- Choose fastest execution plan

### 3. Metadata Caching

Cache expensive operations:
- Schema inference
- Column statistics
- Partition metadata
- TTL-based invalidation

### 4. Column Pruning

Read only needed columns:
- Query `SELECT name, age` → skip salary
- Reduces I/O by 2-3x

### 5. Predicate Pushdown

Apply filters at read layer:
- `WHERE age > 30` → filter during read
- Reduces rows processed

### 6. Partition Pruning

Skip irrelevant partitions:
- `WHERE date >= '2024-01-01'` → skip old partitions
- Reduces data scanned

### 7. Index-Based Lookups

Fast point queries:
- Hash index for equality
- B-tree for range queries
- Bitmap index for boolean

## Implementation

### Statistics Collection

```rust
struct ColumnStats {
    cardinality: u64,      // Unique values
    null_count: u64,       // NULL rows
    min_value: String,     // Min (nullable)
    max_value: String,     // Max (nullable)
    compression_ratio: f32, // Actual ratio
}
```

### Compression Selection

```rust
fn select_compression(column: &Column) -> CompressionCodec {
    match (column.data_type, column.cardinality) {
        (String, low) => Dictionary,    // < 1000 unique
        (String, high) => Snappy,       // Many unique
        (Integer, _) => FOR,            // Good for numbers
        (Float, _) => LZSS,             // General purpose
        (Boolean, _) => RLE,            // 2 values max
        _ => Snappy,                    // Fallback
    }
}
```

### Query Planning

```rust
fn plan_query(query: &str, stats: &[ColumnStats]) -> QueryPlan {
    // 1. Parse query
    let parsed = parse_sql(query);
    
    // 2. Extract required columns
    let needed_cols = extract_columns(&parsed);
    
    // 3. Extract predicates
    let predicates = extract_predicates(&parsed);
    
    // 4. Estimate costs
    let cost = estimate_cost(&needed_cols, &predicates, stats);
    
    // 5. Choose plan
    QueryPlan { columns: needed_cols, predicates, cost }
}
```

## Performance Targets

| Optimization | Impact | Example |
|--------------|--------|---------|
| Column pruning | 2-3x | Read 2 of 10 columns |
| Predicate pushdown | 2-5x | Filter 1M → 100K rows |
| Partition pruning | 5-10x | Skip 9 of 10 partitions |
| Compression selection | 1.5-2x | Right codec per column |
| Metadata caching | 10x | Cached vs fresh stat |

## Implementation Phases

### Phase 7A: Compression Optimization (Week 1)
- [ ] Adaptive codec selection
- [ ] Per-column compression
- [ ] Compression benchmarking

### Phase 7B: Statistics & Caching (Week 1)
- [ ] Statistics collection
- [ ] Metadata cache layer
- [ ] Cache invalidation

### Phase 7C: Index Management (Week 2)
- [ ] Hash index
- [ ] B-tree index
- [ ] Bitmap index

## Roadmap

- [ ] Statistics collection
- [ ] Compression selection logic
- [ ] Cost estimator
- [ ] Metadata cache
- [ ] Column pruning
- [ ] Predicate pushdown
- [ ] Partition pruning
- [ ] Index structures
- [ ] Query planner
- [ ] Unit tests
- [ ] Benchmarks
- [ ] Documentation

## Testing

```bash
# Unit tests
cargo test --release

# Benchmarks
cargo bench

# Compression ratios
./bench_compression

# Query plans
./bench_query_planning
```

## Expected Results

After Phase 7, queries should be:
- **10-100x faster** with proper optimization
- **50-70% compression** with adaptive codecs
- **Sub-millisecond** metadata lookups (cached)
- **1-10ms** for typical analytical queries

## Contributors

Assigned for Phase 7 development.

---

**Next:** Begin with compression selection and statistics collection.