# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
`dft` (datafusion-dft) is a batteries-included suite of DataFusion applications providing four interfaces: TUI, CLI, FlightSQL Server, and HTTP Server. All interfaces share a common execution engine built on Apache DataFusion and Apache Arrow.
## Building and Running
### Development Commands
```bash
# Build the project (default features: functions-parquet, s3)
cargo build
# Build with TUI support
cargo build --features=tui
# Build with all features
cargo build --all-features
# Run the TUI (requires tui feature)
cargo run --features=tui
# Run CLI with a query
cargo run -- -c "SELECT 1 + 2"
# Start HTTP server (requires http feature)
cargo run --features=http -- serve-http
# Start FlightSQL server (requires flightsql feature)
cargo run --features=flightsql -- serve-flightsql
# Generate TPC-H data
cargo run -- generate-tpch
```
### Benchmarking
Benchmarks measure query performance with detailed timing breakdowns:
```bash
# Serial benchmark (default, 10 iterations)
cargo run -- -c "SELECT 1" --bench
# Custom iteration count
cargo run -- -c "SELECT 1" --bench -n 100
# Concurrent benchmark (measures throughput under load)
cargo run -- -c "SELECT 1" --bench --concurrent
# With custom iterations and concurrency
cargo run -- -c "SELECT 1" --bench -n 100 --concurrent
# Save results to CSV
cargo run -- -c "SELECT 1" --bench --save results.csv
# Append to existing results
cargo run -- -c "SELECT 2" --bench --concurrent --save results.csv --append
# Run setup SQL before the benchmark (here, creating the table being queried)
cargo run -- -c "SELECT * FROM t" --bench --run-before "CREATE TABLE t AS VALUES (1)"
```
**Benchmark Modes:**
- **Serial** (default): Measures query performance in isolation
  - Shows pure query execution time without contention
  - Ideal for understanding baseline performance
- **Concurrent** (`--concurrent`): Measures performance under load
  - Runs iterations in parallel (concurrency = min(iterations, CPU cores))
  - Shows throughput (queries/second) with multiple clients
  - Reveals resource contention and bottlenecks
  - Higher mean/median times are expected due to concurrent load
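The concurrency rule for concurrent mode can be illustrated with a minimal sketch. The helper name `effective_concurrency` is hypothetical and is not taken from the `dft` source; it only restates the `min(iterations, CPU cores)` rule using the standard library.

```rust
use std::thread;

/// Hypothetical helper (not from the dft codebase): caps the number of
/// parallel workers at the smaller of the iteration count and the CPU cores.
fn effective_concurrency(iterations: usize, cpu_cores: usize) -> usize {
    iterations.min(cpu_cores)
}

fn main() {
    // `available_parallelism` can fail on some platforms, so fall back to 1.
    let cores = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);

    for iterations in [1, 10, 100] {
        println!(
            "iterations={iterations} cores={cores} concurrency={}",
            effective_concurrency(iterations, cores)
        );
    }
}
```

Under this rule, a run with fewer iterations than cores never uses more workers than there are iterations.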
**Output:**
- Timing breakdown: logical planning, physical planning, execution, total
- Statistics: min, max, mean, median for each phase
- CSV format includes a `concurrency_mode` column (`serial` or `concurrent(N)`)
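As a rough illustration of the per-phase statistics listed above, the following sketch computes min, max, mean, and median over a set of timing samples. The `PhaseStats` struct and `summarize` function are assumptions made for this example; the actual benchmark code may compute or store these values differently.

```rust
use std::time::Duration;

/// Hypothetical summary of one phase (e.g. execution) across iterations.
#[derive(Debug, PartialEq)]
struct PhaseStats {
    min: Duration,
    max: Duration,
    mean: Duration,
    median: Duration,
}

/// Returns `None` when there are no samples to summarize.
fn summarize(samples: &[Duration]) -> Option<PhaseStats> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    let n = sorted.len();

    let total: Duration = sorted.iter().sum();
    let mean = total / n as u32;
    // For an even count, average the two middle samples.
    let median = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2
    };

    Some(PhaseStats {
        min: sorted[0],
        max: sorted[n - 1],
        mean,
        median,
    })
}

fn main() {
    let samples: Vec<Duration> = [4, 1, 3, 2]
        .iter()
        .map(|ms| Duration::from_millis(*ms))
        .collect();
    println!("{:?}", summarize(&samples));
}
```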
**FlightSQL Benchmarks:**
```bash
# Benchmark FlightSQL server (requires --flightsql flag and server running)
cargo run -- -c "SELECT 1" --bench --flightsql --concurrent
```
### Testing
Tests are organized by feature and component:
```bash
# Run core database tests
cargo test db
# Run CLI tests
cargo test cli_cases
# Run TUI tests (requires tui feature)
cargo test --features=tui tui_cases
# Run feature-specific tests
cargo test --features=flightsql extension_cases::flightsql -- --test-threads=1
cargo test --features=s3 extension_cases::s3
cargo test --features=functions-json extension_cases::functions_json
cargo test --features=deltalake extension_cases::deltalake
cargo test --features="deltalake s3" extension_cases::deltalake::test_deltalake_s3 # Requires LocalStack
cargo test --features=udfs-wasm extension_cases::udfs_wasm
cargo test --features=vortex extension_cases::vortex
cargo test --features=vortex cli_cases::basic::test_output_vortex
# Run tests for specific crates
cargo test --manifest-path crates/datafusion-app/Cargo.toml --all-features
cargo test --manifest-path crates/datafusion-functions-parquet/Cargo.toml
cargo test --manifest-path crates/datafusion-udfs-wasm/Cargo.toml
# Run a single test
cargo test <test_name>
```
Note: FlightSQL tests require `--test-threads=1` because they spin up servers on the same port.
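The port conflict behind this requirement can be reproduced with the standard library alone. The sketch below is not part of the test suite, and the function name `second_bind_conflicts` is hypothetical; it binds a listener to an OS-assigned port and then tries to bind a second listener to the same address, which is what two parallel test servers would effectively do.

```rust
use std::net::TcpListener;

/// Hypothetical demonstration: returns `Ok(true)` when a second listener
/// cannot bind to an address that is already in use.
fn second_bind_conflicts() -> std::io::Result<bool> {
    // Port 0 asks the OS for any free port.
    let first = TcpListener::bind("127.0.0.1:0")?;
    let addr = first.local_addr()?;

    // `first` is still alive and listening, so this bind should be rejected.
    let second = TcpListener::bind(addr);
    Ok(second.is_err())
}

fn main() -> std::io::Result<()> {
    println!("second bind rejected: {}", second_bind_conflicts()?);
    Ok(())
}
```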
### Code Quality
```bash
# Format code
cargo fmt --all
# Check formatting (CI check)
cargo fmt --all -- --check
# Run clippy
cargo clippy --all-features --workspace -- -D warnings
# Check for unused dependencies
cargo machete
# Format TOML files
taplo format --check
```
## Architecture
### Crate Structure
The project is organized as a workspace with multiple crates:
- **Root crate (`datafusion-dft`)**: Main binary and application logic
  - `src/main.rs` - Entry point that routes to TUI, CLI, or servers
  - `src/tui/` - TUI implementation using ratatui
  - `src/cli/` - CLI implementation
  - `src/server/` - HTTP and FlightSQL server implementations
  - `src/config.rs` - Configuration management
  - `src/args.rs` - Command-line argument parsing
- **`crates/datafusion-app`**: Core execution engine (reusable library)
  - `src/local.rs` - ExecutionContext wrapping DataFusion SessionContext
  - `src/executor/` - Dedicated executors for CPU-intensive work (inspired by InfluxDB)
  - `src/catalog/` - Catalog management
  - `src/extensions/` - DataFusion extensions
  - `src/tables/` - Table provider implementations
  - `src/stats.rs` - Query execution statistics
  - `src/config.rs` - Execution configuration
- **`crates/datafusion-functions-parquet`**: Parquet-specific UDFs
- **`crates/datafusion-udfs-wasm`**: WASM-based UDF support
- **`crates/datafusion-auth`**: Authentication implementations
- **`crates/datafusion-ffi-table-providers`**: FFI table provider support
### Key Components
#### ExecutionContext
The `ExecutionContext` (in `crates/datafusion-app/src/local.rs`) is the core abstraction that wraps DataFusion's `SessionContext` with:
- Extension registration (UDFs, table formats, object stores)
- DDL file execution
- Dedicated executor for CPU-intensive work
- Query execution and statistics collection
- Observability integration
#### TUI Architecture
The TUI (in `src/tui/`) follows a state-based architecture:
- `src/tui/state/` - Application state management with tab-specific state
- `src/tui/ui/` - Rendering logic separated from state
- `src/tui/handlers/` - Event handling
- `src/tui/execution.rs` - Async query execution
Built with ratatui and crossterm.
#### Server Implementations
- **FlightSQL**: `src/server/flightsql/` - Arrow Flight SQL protocol server
- **HTTP**: `src/server/http/` - REST API using Axum
Both servers share the same `ExecutionContext` from `datafusion-app`.
### Feature Flags
The project uses extensive feature flags to keep binary size manageable:
- `tui` - Terminal user interface (ratatui-based)
- `s3` - S3 object store integration (default)
- `functions-parquet` - Parquet-specific functions (default)
- `functions-json` - JSON functions
- `deltalake` - Delta Lake table format support
- `vortex` - Vortex file format support
- `flightsql` - FlightSQL server and client
- `http` - HTTP server
- `huggingface` - HuggingFace dataset integration
- `udfs-wasm` - WASM UDF support
- `observability` - Metrics and tracing (required by servers)
When adding code that depends on a feature, use conditional compilation:
```rust
#[cfg(feature = "flightsql")]
use datafusion_app::flightsql;
```
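When feature-gated code is called from code that is always compiled, one common approach is to pair the `#[cfg(feature = ...)]` item with a `#[cfg(not(feature = ...))]` fallback so the crate builds either way. The sketch below assumes a hypothetical `flightsql_status` function that does not exist in the repository; only the `flightsql` feature name comes from the list above.

```rust
/// Compiled only when the `flightsql` feature is enabled.
#[cfg(feature = "flightsql")]
fn flightsql_status() -> &'static str {
    "flightsql: compiled in"
}

/// Fallback so callers still compile when the feature is disabled.
#[cfg(not(feature = "flightsql"))]
fn flightsql_status() -> &'static str {
    "flightsql: not compiled in"
}

fn main() {
    // Exactly one of the two definitions above exists in any given build.
    println!("{}", flightsql_status());
}
```

Building with `--features=flightsql` selects the first definition; any other build selects the fallback.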
### Configuration
Configuration files use TOML and are located in `~/.config/dft/`. Key config files:
- Main config: `~/.config/dft/config.toml`
- DDL file: `~/.config/dft/ddl.sql` (auto-loaded by TUI)
See `src/config.rs` and `crates/datafusion-app/src/config.rs` for configuration structure.
### Executor Pattern
The project uses a dedicated executor pattern (inspired by InfluxDB) for CPU-intensive work. This separates network I/O (on the main Tokio runtime) from CPU-bound query execution. See `crates/datafusion-app/src/executor/`.
The main runtime in `src/main.rs` uses a single-threaded Tokio runtime optimized for network I/O.
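The idea of handing CPU-bound work to a dedicated worker can be sketched with the standard library alone. This is only an analogy built on plain threads and channels, not the implementation in `crates/datafusion-app/src/executor/`; the `DedicatedExecutor` type and its `spawn` method shown here are assumptions made for illustration.

```rust
use std::sync::mpsc;
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical std-only stand-in: one named worker thread that runs
/// CPU-bound jobs so the caller's thread stays free for I/O.
struct DedicatedExecutor {
    sender: Option<mpsc::Sender<Job>>,
    worker: Option<thread::JoinHandle<()>>,
}

impl DedicatedExecutor {
    fn new(name: &str) -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        let worker = thread::Builder::new()
            .name(name.to_string())
            .spawn(move || {
                // Runs until every sender has been dropped.
                for job in receiver {
                    job();
                }
            })
            .expect("failed to spawn worker thread");
        Self { sender: Some(sender), worker: Some(worker) }
    }

    /// Queues `work` on the worker and returns a channel carrying its result.
    fn spawn<T, F>(&self, work: F) -> mpsc::Receiver<T>
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            // The caller may have dropped the receiver; ignore that case.
            let _ = tx.send(work());
        });
        self.sender
            .as_ref()
            .expect("executor already shut down")
            .send(job)
            .expect("worker thread exited");
        rx
    }
}

impl Drop for DedicatedExecutor {
    fn drop(&mut self) {
        // Closing the channel ends the worker loop, then wait for it to exit.
        drop(self.sender.take());
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}

fn main() {
    let executor = DedicatedExecutor::new("cpu-worker");
    let result = executor.spawn(|| (1..=1_000u64).sum::<u64>());
    println!("sum = {}", result.recv().unwrap());
}
```

The caller only blocks when it asks for the result, which mirrors the goal of keeping the I/O runtime responsive while queries execute elsewhere.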
## Development Workflow
### Adding New Features
1. Update `Cargo.toml` feature flags if needed
2. Add the feature to CI test matrix in `.github/workflows/test.yml`
3. Implement feature in appropriate crate
4. Add tests in the `extension_cases` suite or in the relevant crate's own test suite
5. Update documentation
### Testing Against LocalStack
Some tests (S3, TUI, CLI, Delta Lake + S3) require LocalStack to emulate S3. The CI workflow shows the setup:
```bash
# Start LocalStack
localstack start -d
awslocal s3api create-bucket --bucket test --acl public-read
awslocal s3 mv data/aggregate_test_100.csv s3://test/
# Run S3 tests
cargo test --features=s3 extension_cases::s3
# For Delta Lake + S3 tests, also sync the delta lake data
awslocal s3 sync data/deltalake/simple_table s3://test/deltalake/simple_table
cargo test --features="deltalake s3" extension_cases::deltalake::test_deltalake_s3
```
### Benchmarking
HTTP server benchmarks are available as `just` recipes:
```bash
# Start HTTP server for benchmarking
just serve-http
# Run basic HTTP benchmark (requires oha tool)
just bench-http-basic
# Run custom benchmark
just bench-http-custom <file>
```
Criterion benchmarks are available in `crates/datafusion-app/benches/`.
## Common Patterns
### Adding a New UDF
1. Implement in `crates/datafusion-app/src/extensions/`
2. Register in the appropriate extension registration function
3. Add tests with SQL queries exercising the UDF
### Adding Table Provider Support
1. Implement in `crates/datafusion-app/src/tables/`
2. Register in catalog creation (`crates/datafusion-app/src/catalog/`)
3. Add integration tests
### Working with the TUI
The TUI uses a tab-based interface. Each tab has:
- State struct in `src/tui/state/tabs/`
- UI rendering in `src/tui/ui/tabs/`
- Event handlers in `src/tui/handlers/`
When modifying TUI code, ensure proper separation between state management and rendering.
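The separation between state and rendering can be pictured with a small std-only sketch. The names `EditorTabState`, `TabEvent`, and `render` are hypothetical and do not correspond to types in `src/tui/`; the real code renders with ratatui rather than returning a `String`.

```rust
/// Hypothetical tab state: owns the data and contains no drawing code.
#[derive(Debug, Default)]
struct EditorTabState {
    query: String,
    executing: bool,
}

/// Hypothetical events a handler might translate key presses into.
enum TabEvent {
    Char(char),
    Backspace,
    Submit,
    Finished,
}

impl EditorTabState {
    /// Event handling mutates state only; it never draws.
    fn handle(&mut self, event: TabEvent) {
        match event {
            TabEvent::Char(c) if !self.executing => self.query.push(c),
            TabEvent::Backspace if !self.executing => {
                self.query.pop();
            }
            TabEvent::Submit if !self.query.is_empty() => self.executing = true,
            TabEvent::Finished => self.executing = false,
            // Input is ignored while a query is running.
            _ => {}
        }
    }
}

/// Rendering reads state through a shared reference and cannot modify it.
fn render(state: &EditorTabState) -> String {
    let status = if state.executing { "running" } else { "idle" };
    format!("[{status}] {}", state.query)
}

fn main() {
    let mut state = EditorTabState::default();
    for c in "SELECT 1".chars() {
        state.handle(TabEvent::Char(c));
    }
    println!("{}", render(&state));
    state.handle(TabEvent::Submit);
    println!("{}", render(&state));
}
```

Because `render` takes `&EditorTabState`, the compiler enforces that drawing cannot change state, which is the property the guidance above asks for.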
## Important Notes
- The project is licensed under Apache 2.0
- Clippy lint `clone_on_ref_ptr` is set to "deny"
- The main Tokio runtime is single-threaded and should only be used for network I/O
- CPU-intensive query execution uses dedicated executors
- FlightSQL tests must run with `--test-threads=1` due to port conflicts
- All server implementations share the same execution engine