sql-stream 0.1.0

A production-grade CLI tool for executing SQL queries against CSV/JSON files using streaming architecture
Documentation
# Performance Profiling Guide


This guide covers how to profile and optimize the performance of SQL Stream.

## Prerequisites


Install the flamegraph tool:

```bash
cargo install flamegraph
```

On Linux, you may need to install perf:

```bash
# Ubuntu/Debian

sudo apt-get install linux-tools-common linux-tools-generic

# Fedora/RHEL

sudo dnf install perf
```

On Windows, use the built-in profiling tools or WSL2.

## Generating Flamegraphs


### Basic Flamegraph


```bash
# On Linux (requires sudo for perf access)

sudo cargo flamegraph -- -f large_dataset.csv -q "SELECT * FROM data"

# Output: flamegraph.svg

```

### Profiling Specific Queries


```bash
# Profile aggregation query

sudo cargo flamegraph -- -f data.csv -q "SELECT category, COUNT(*), AVG(price) FROM data GROUP BY category"

# Profile sorting

sudo cargo flamegraph -- -f data.csv -q "SELECT * FROM data ORDER BY timestamp DESC LIMIT 1000"

# Profile joins (requires registering multiple tables in code)

```

### Interpreting Results


The flamegraph shows:
- **Width**: Time spent in each function
- **Height**: Call stack depth
- **Color**: Different modules/crates

Look for:
1. **Wide bars at the top**: Hot paths in your code
2. **DataFusion internals**: Normal to see significant time here
3. **I/O operations**: File reading bottlenecks
4. **Serialization**: Arrow formatting overhead

## Benchmarking


### Create Test Datasets


Generate large CSV files for realistic testing:

```bash
# Python script to generate test data

python3 << 'EOF'
import csv
import random

with open('large_test.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'name', 'value', 'category', 'timestamp'])
    
    for i in range(1000000):  # 1 million rows
        writer.writerow([
            i,
            f'Item_{i}',
            random.randint(1, 10000),
            f'Category_{random.randint(1, 100)}',
            f'2024-{random.randint(1,12):02d}-{random.randint(1,28):02d}'
        ])
EOF
```

### Timing Queries


Use the built-in time command:

```bash
# Linux/macOS

time cargo run --release -- -f large_test.csv -q "SELECT category, COUNT(*), AVG(value) FROM data GROUP BY category"

# PowerShell (Windows)

Measure-Command { cargo run --release -- -f large_test.csv -q "SELECT category, COUNT(*), AVG(value) FROM data GROUP BY category" }
```

## Performance Tips


### 1. File Format Selection


- **CSV**: Better for wide tables with many columns
- **JSON**: Better for nested or semi-structured data
- **Parquet** (future): Best for analytical workloads (consider adding support)

### 2. Query Optimization


```sql
-- Good: Filter early
SELECT name FROM data WHERE category = 'A' ORDER BY value LIMIT 10

-- Bad: Filter late  
SELECT * FROM data ORDER BY value LIMIT 1000 -- then filter in app
```

### 3. Memory Considerations


DataFusion streams results, but aggregations require materialization:

```sql
-- Low memory: Streaming
SELECT * FROM data WHERE value > 100

-- Higher memory: Full aggregation
SELECT category, SUM(value) FROM data GROUP BY category
```

### 4. Build Configuration


Always use release mode for production:

```toml
[profile.release]
opt-level = 3
lto = true
codegen-units = 1
```

## Bottleneck Checklist


If queries are slow, check:

1. **File I/O**: Is disk reading the bottleneck?
   - Use SSD instead of HDD
   - Consider file compression
   
2. **CPU**: Is computation the bottleneck?
   - Check multi-core utilization
   - Optimize SQL query
   
3. **Memory**: Is swapping occurring?
   - Reduce batch size in DataFusion config
   - Use LIMIT clauses for exploratory queries

4. **Network**: For remote files
   - Consider local caching
   - Use faster network connection

## Continuous Performance Testing


Add benchmark tests:

```rust
// benches/benchmark.rs (requires criterion crate)
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn benchmark_query(c: &mut Criterion) {
    c.bench_function("csv_aggregation", |b| {
        b.iter(|| {
            // Your query logic here
        });
    });
}

criterion_group!(benches, benchmark_query);
criterion_main!(benches);
```

## Resources


- [DataFusion Performance Guide]https://arrow.apache.org/datafusion/user-guide/performance.html
- [Arrow Memory Model]https://arrow.apache.org/docs/format/Columnar.html
- [Flamegraph Documentation]https://github.com/flamegraph-rs/flamegraph