# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Priority #1: CORRECTNESS
**Correctness is our number one priority.** No hacks, no workarounds, no shortcuts.
- Code must be maximally correct - if something feels wrong, it probably is
- Match established semantics (Kubernetes, SQL/PostgreSQL, Rust idioms)
- Special cases are acceptable ONLY when they reflect actual underlying semantics (e.g., ConfigMaps have `data` not `spec` because that's how K8s works)
- NEVER assume or decide to take shortcuts without consulting me first
- Any deviations from standards, best practices, or established patterns must be discussed before implementation
- Anything that would trigger the "spidey senses" of a very senior engineer/architect should make you pause and ask
- When in doubt, ask - do not proceed with assumptions
## Priority #2: Performance
Performance is our second priority - still very important.
- Minimize API calls to Kubernetes (push filters to API when possible)
- Parallel queries across clusters
- Efficient data structures and algorithms
- But never sacrifice correctness for performance
## Git Commits
- Do NOT add Claude attributions to commit messages (no "Generated with Claude Code", no "Co-Authored-By: Claude")
- Write clear, conventional commit messages focusing on what changed and why
- Always run `cargo fmt` before committing
- Always verify with `cargo clippy` for lint issues before committing
## Code Navigation
- Use the Serena MCP with rust-analyzer for code navigation and symbol lookup
- Prefer symbolic tools (find_symbol, find_referencing_symbols, get_symbols_overview) over grep for Rust code
## Project Overview
k8sql exposes the Kubernetes API as a SQL-compatible database using Apache DataFusion as the query engine. kubectl contexts are treated as databases (switchable with `USE cluster1;`) and Kubernetes resources are exposed as tables. The `_cluster` column is part of every table's primary key, enabling cross-cluster queries.
## Build Commands
```bash
cargo build # Debug build
cargo build --release # Release build
cargo test # Run all tests
```
## Usage Examples
```bash
# Interactive REPL (default)
k8sql
# Batch mode - single query
k8sql -q "SELECT * FROM pods"
k8sql -q "SELECT name, status->>'phase' as phase FROM pods WHERE namespace = 'kube-system'"
# Multi-cluster queries
k8sql -q "SELECT * FROM pods WHERE _cluster = 'prod'"
k8sql -q "SELECT * FROM pods WHERE _cluster = '*'" # All clusters
# Batch mode with context and output format
k8sql -c prod-cluster -q "SELECT * FROM deployments" -o json
# Daemon mode (PostgreSQL wire protocol)
k8sql daemon --port 15432
```
## Architecture
```
src/
├── main.rs # Entry point, CLI argument parsing, mode dispatch
├── cli/
│ ├── mod.rs
│ ├── args.rs # Clap argument definitions
│ └── repl.rs # Rustyline REPL with completion/highlighting
├── datafusion_integration/
│ ├── mod.rs
│ ├── context.rs # K8sSessionContext - DataFusion session setup
│ ├── provider.rs # K8sTableProvider - TableProvider implementation
│ ├── convert.rs # JSON to Arrow RecordBatch conversion
│ ├── preprocess.rs # SQL preprocessing (fixes ->> operator precedence)
│ └── hooks.rs # Query hooks for filter extraction
├── kubernetes/
│ ├── mod.rs # ApiFilters type for filter pushdown
│ ├── cache.rs # Resource discovery cache persistence
│ ├── client.rs # K8sClientPool - multi-cluster connection caching
│ ├── discovery.rs # Dynamic resource discovery and schema generation
│ └── field_selectors.rs # Field selector registry and types for K8s API pushdown
├── output/
│ ├── mod.rs # QueryResult type and format dispatch
│ ├── table.rs # Pretty table output
│ ├── json.rs # JSON output
│ ├── csv.rs # CSV output
│ └── yaml.rs # YAML output
└── daemon/
├── mod.rs
└── pgwire_server.rs # PostgreSQL wire protocol server
```
## Key Components
### DataFusion Integration (`src/datafusion_integration/`)
k8sql uses Apache DataFusion as its SQL query engine:
- **K8sSessionContext** (`context.rs`): Wraps DataFusion's SessionContext, registers all K8s resources as tables, provides JSON functions for querying nested fields.
- **K8sTableProvider** (`provider.rs`): Implements DataFusion's TableProvider trait. Fetches data from Kubernetes API on scan, extracts and pushes filters (namespace, cluster, label selectors) to the API.
- **JSON to Arrow conversion** (`convert.rs`): Converts K8s JSON resources to Arrow RecordBatches. Metadata fields use native Arrow types (Timestamp for dates, Int64 for integers); labels/annotations/spec/status are stored as JSON strings.
- **SQL Preprocessing** (`preprocess.rs`): Fixes `->>` operator precedence before DataFusion parsing. Wraps `col->>'key' = 'value'` as `(col->>'key') = 'value'` to work around parser quirks.
### Resource Discovery (`src/kubernetes/discovery.rs`)
Dynamic resource discovery at startup:
- Discovers all available resources including CRDs
- Generates consistent schema: `_cluster`, `api_version`, `kind`, metadata fields, `spec`/`status` as JSON
- Special schemas for ConfigMaps, Secrets, Events, and metrics resources
- Core resources (v1, apps, batch, etc.) take priority over conflicting names from other API groups
#### ColumnDataType Enum
Schema definitions use a type-safe enum for Arrow data types:
```rust
pub enum ColumnDataType {
Text, // Arrow Utf8 - strings, JSON blobs
Timestamp, // Arrow Timestamp(ms) - creation/deletion timestamps
Integer, // Arrow Int64 - generation, count fields
}
```
Helper constructors simplify schema definition:
```rust
ColumnDef::text("name", "metadata.name")
ColumnDef::timestamp("created", "metadata.creationTimestamp")
ColumnDef::integer("generation", "metadata.generation")
ColumnDef::text_raw("_cluster") // No JSON path (virtual column)
```
### K8sClientPool (`src/kubernetes/client.rs`)
Manages connections to multiple Kubernetes clusters:
- Caches clients by context name
- Caches resource registries with TTL
- Supports parallel queries across clusters
## Key Dependencies
- **datafusion**: SQL query engine (Apache Arrow-based)
- **datafusion-functions-json**: JSON querying functions (`json_get_str`, `json_get_int`, etc.)
- **datafusion-postgres**: PostgreSQL wire protocol for daemon mode
- **kube**: Kubernetes client (kube-rs)
- **k8s-openapi**: Kubernetes API type definitions (v1.32)
- **rustyline**: REPL with readline support
- **clap**: CLI argument parsing
## Table Schema
All resources share a consistent schema:
| `_cluster` | Utf8 | Kubernetes context name |
| `api_version` | Utf8 | API version (v1, apps/v1, etc.) |
| `kind` | Utf8 | Resource kind (Pod, Deployment, etc.) |
| `name` | Utf8 | Resource name |
| `namespace` | Utf8 | Namespace (null for cluster-scoped) |
| `uid` | Utf8 | Unique identifier |
| `created` | Timestamp(ms) | Creation timestamp (native, supports date comparisons) |
| `generation` | Int64 | Spec change counter (native, supports numeric comparisons) |
| `labels` | Utf8 (JSON) | JSON object, access with `labels->>'key'` |
| `annotations` | Utf8 (JSON) | JSON object, access with `annotations->>'key'` |
| `spec` | Utf8 (JSON) | Resource specification |
| `status` | Utf8 (JSON) | Resource status |
Special cases:
- **ConfigMaps/Secrets**: `data` column instead of spec/status
- **Events**: `type`, `reason`, `message`, `count`, etc.
- **PodMetrics/NodeMetrics**: `timestamp`, `window`, `containers`/`usage`
- **CustomResourceDefinitions**: `group`, `scope`, `resource_kind`, `plural`, `singular`, `short_names`, `categories` (queryable as `crds` or `customresourcedefinitions`)
## Query Optimization
### Server-side (pushed to K8s API):
| `namespace = 'x'` | Uses namespaced API endpoint |
| `labels->>'app' = 'nginx'` | K8s label selector |
| `labels->>'app' = 'x' AND labels->>'env' = 'y'` | Combined label selector: `app=x,env=y` |
| `status->>'phase' = 'Running'` | K8s field selector (pods only) |
| `spec->>'nodeName' = 'node-1'` | K8s field selector (pods only) |
| `type = 'Opaque'` | K8s field selector (secrets only) |
| `name = 'pod-123'` | K8s field selector (`metadata.name`) |
| `_cluster = 'prod'` | Only queries that cluster |
| `_cluster = '*'` | Queries all clusters in parallel |
**Field Selectors** (`src/kubernetes/field_selectors.rs`):
- Limited to `=` and `!=` operators (K8s restriction)
- Resource-specific support (e.g., `status.phase` for pods, `type` for secrets)
- Hardcoded registry based on K8s API documentation
- Namespace field selector intentionally not used (namespaced API endpoints are more efficient)
Supported resources and fields:
- **All resources**: `metadata.name`
- **Pods**: `spec.nodeName`, `spec.restartPolicy`, `spec.schedulerName`, `spec.serviceAccountName`, `spec.hostNetwork`, `status.phase`, `status.podIP`, `status.nominatedNodeName`
- **Events**: `involvedObject.kind`, `involvedObject.namespace`, `involvedObject.name`, `involvedObject.uid`, `involvedObject.apiVersion`, `involvedObject.resourceVersion`, `involvedObject.fieldPath`, `reason`, `reportingComponent`, `source`, `type`
- **Secrets**: `type`
- **Namespaces**: `status.phase`
- **ReplicaSets**: `status.replicas`
- **Jobs**: `status.successful`
- **Nodes**: `spec.unschedulable`
- **CertificateSigningRequests**: `spec.signerName`
### Client-side (DataFusion):
- Unsupported field selectors (fields not in registry)
- LIKE patterns
- Complex expressions
- ORDER BY, LIMIT, GROUP BY
- Operators not supported by K8s field selectors (IN, NOT IN, etc.)
### JSON Array Access
Use `json_get_array()` with `UNNEST` for array expansion:
```sql
SELECT json_get_str(UNNEST(json_get_array(spec, 'containers')), 'image') FROM pods
```
### Performance Tips
**Projection Pushdown** (automatic optimization):
- k8sql automatically skips converting unrequested JSON columns
- `SELECT name FROM pods` only processes ~50 bytes per pod (vs ~2KB for full resource)
- `SELECT name, namespace, labels FROM pods` processes ~200 bytes per pod
- Impact: **10-40x reduction** in CPU/memory for selective queries
- No user action needed - optimization happens automatically
**Query Performance Best Practices**:
1. **Select only needed columns**: `SELECT name, namespace FROM pods` is much faster than `SELECT * FROM pods`
2. **Use namespace filters**: `WHERE namespace = 'kube-system'` uses efficient namespaced API endpoint
3. **Use label selectors**: `WHERE labels->>'app' = 'nginx'` pushes to K8s API (reduces network transfer)
4. **Use field selectors**: `WHERE status->>'phase' = 'Running'` for pods pushes to K8s API
5. **Cluster filters**: `WHERE _cluster = 'prod'` avoids querying unnecessary clusters
6. **Combine filters**: Multiple AND conditions on labels/fields push to K8s as combined selectors
**What to Avoid**:
- `SELECT * FROM pods` when you only need a few columns (wastes CPU converting large JSON fields)
- Queries without namespace/cluster filters on large clusters (fetches all resources)
- Client-side filtering (LIKE, complex expressions) when API-pushable filters exist
## REPL Controls
- **Tab**: Auto-complete keywords/tables/columns
- **↑/↓**: Navigate history
- **Ctrl+D**: Quit
- **\dt**: SHOW TABLES (displays: table_name, aliases, group, version, kind, scope, resource_type)
- **\l**: SHOW DATABASES
- **\d table**: DESCRIBE table
- **\x**: Toggle expanded display
- **\q**: Quit