hamelin_translation 0.7.13

Lowering and IR for the Hamelin query language
# hamelin_translation

This crate lowers Hamelin's type-checked AST into a backend-agnostic intermediate
representation (IR). It sits between `hamelin_lib` (parsing and type-checking)
and backend crates like `hamelin_datafusion`.

```
hamelin_lib (TypedAST)
          ↓
hamelin_translation (IR)          ← this crate
          ↓
hamelin_datafusion, hamelin_trino, etc.
```

The `TypedStatement` AST goes through a series of normalization passes that
simplify and canonicalize the query structure. The resulting `IRStatement`
enforces invariants that make backend translation straightforward—backends
only need to handle a reduced set of commands with predictable shapes.

## IR guarantees

### Commands that get lowered away

These commands do not exist in the IR:

- **`LET`** — fused into `SELECT` by `fuse_projections`
- **`DROP`** — fused into `SELECT` by `fuse_projections`
- **`WITHIN`** — converted to `WHERE` with explicit timestamp bounds
- **`PARSE`** — converted to `LET` + `WHERE` (regex extraction + filter)
- **`UNNEST`** — converted to `EXPLODE` (if array) + `LET` + `DROP`
- **`NEST`** — converted to `SELECT` with compound identifiers, then packed
- **`MATCH`** — lowered to `FROM` + `LET` + `WHERE` + `WINDOW` + `WHERE` + `DROP`

### Identifier normalization

All assignments in the IR use simple identifiers. Compound paths like `user.name`
get packed into struct literals, so `x.a = 1, x.b = 2` becomes `x = {a: 1, b: 2}`.
FROM aliases and JOIN/LOOKUP right sides get hoisted into named subqueries,
leaving only simple identifier references in the main pipeline.
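The packing step can be sketched as a small string-based routine over (path, expression) pairs. This is a toy illustration (one level of nesting, expressions carried as text), not the crate's actual types:

```rust
use std::collections::BTreeMap;

/// Pack compound assignments like `x.a = 1, x.b = 2` into a single
/// struct-literal assignment `x = {a: 1, b: 2}`, preserving the order
/// in which roots first appear. Toy sketch: one nesting level only.
fn pack_assignments(assignments: &[(&str, &str)]) -> Vec<(String, String)> {
    let mut order: Vec<String> = Vec::new();
    let mut groups: BTreeMap<String, Vec<(String, String)>> = BTreeMap::new();
    for (path, expr) in assignments {
        // Split `x.a` into root `x` and field `a`; simple names have no field.
        let (root, field) = match path.split_once('.') {
            Some((r, f)) => (r.to_string(), f.to_string()),
            None => (path.to_string(), String::new()),
        };
        if !groups.contains_key(&root) {
            order.push(root.clone());
        }
        groups.entry(root).or_default().push((field, expr.to_string()));
    }
    order
        .into_iter()
        .map(|root| {
            let fields = &groups[&root];
            if fields.len() == 1 && fields[0].0.is_empty() {
                // Already a simple identifier: keep the assignment as-is.
                (root, fields[0].1.clone())
            } else {
                // Compound paths: pack the fields into a struct literal.
                let body = fields
                    .iter()
                    .map(|(f, e)| format!("{f}: {e}"))
                    .collect::<Vec<_>>()
                    .join(", ");
                (root, format!("{{{body}}}"))
            }
        })
        .collect()
}

fn main() {
    let packed = pack_assignments(&[("x.a", "1"), ("x.b", "2"), ("y", "3")]);
    assert_eq!(packed[0], ("x".to_string(), "{a: 1, b: 2}".to_string()));
    assert_eq!(packed[1], ("y".to_string(), "3".to_string()));
    println!("{packed:?}");
}
```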

### Schema normalization

When FROM/UNION combines tables with differing schemas, the IR widens each table
to a common schema via named subqueries. Array literals get similar treatment—if
struct elements have different fields, they get widened to match the array's
element type.

### Window frames

`WINDOW` commands have pre-computed `WindowFrame` values rather than expressions.
Time-dependent functions like `now()` and `today()` are disallowed in window
frame expressions, since the frame bounds must be constant.
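The constancy check amounts to a recursive walk that rejects any call to `now()` or `today()`. A minimal sketch over a toy expression tree (not the crate's real AST):

```rust
/// Toy expression tree, for illustration only.
enum Expr {
    IntLiteral(i64),
    Interval(i64),           // e.g. 5 minutes, stored in seconds
    Call(String, Vec<Expr>), // a function call such as now()
    Add(Box<Expr>, Box<Expr>),
}

/// A window frame bound must evaluate to the same constant for every
/// row and every run, so reject time-dependent calls anywhere in it.
fn is_constant_frame(expr: &Expr) -> bool {
    match expr {
        Expr::IntLiteral(_) | Expr::Interval(_) => true,
        Expr::Call(name, args) => {
            !matches!(name.as_str(), "now" | "today")
                && args.iter().all(is_constant_frame)
        }
        Expr::Add(l, r) => is_constant_frame(l) && is_constant_frame(r),
    }
}

fn main() {
    let ok = Expr::Add(
        Box::new(Expr::IntLiteral(10)),
        Box::new(Expr::Interval(300)),
    );
    let bad = Expr::Add(
        Box::new(Expr::Call("now".into(), vec![])),
        Box::new(Expr::Interval(300)),
    );
    assert!(is_constant_frame(&ok));
    assert!(!is_constant_frame(&bad));
    println!("frame checks passed");
}
```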

### EXPLODE canonical form

After normalization, EXPLODE is always in canonical form: `EXPLODE col = col`.
The column explodes in place, keeping the same name but changing from an array
type to its element type.

### JOIN normalization

Both `JOIN` and `LOOKUP` lower to `IRJoinCommand`—`JOIN` becomes an inner join,
`LOOKUP` becomes a left join. Missing `ON` conditions default to `true` for
cross join semantics.

## Usage

```rust
use hamelin_translation::{lower, normalize_with};

// Simple case
let ir = lower(typed_statement)?;

// With options
let ir = normalize_with()
    .with_timestamp_field("event_time")
    .lower(typed_statement)?;
```

## IR types

| Type | Description |
|------|-------------|
| `IRStatement` | Top-level: WITH clauses + main pipeline |
| `IRPipeline` | Sequence of `IRCommand`s with output schema |
| `IRCommand` | Command with kind, span, and output schema |
| `IRExpression` | Wrapper around `TypedExpression` |
| `IRAssignment` | `SimpleIdentifier` + `IRExpression` pair |

### Command kinds

| IR Command | Source Commands |
|------------|-----------------|
| `From` | `FROM`, `UNION` |
| `Where` | `WHERE`, `WITHIN` |
| `Select` | `SELECT`, `LET`, `DROP`, `UNNEST` |
| `Agg` | `AGG` |
| `Window` | `WINDOW`, `MATCH` |
| `Sort` | `SORT` |
| `Limit` | `LIMIT` |
| `Explode` | `EXPLODE` |
| `Join` | `JOIN`, `LOOKUP` |

## Normalization passes

Lowering runs a series of normalization passes before converting to IR.

### Statement-level passes

**lower_match** — Lowers `MATCH` into `FROM` + `LET` + `WHERE` + `WINDOW` + `WHERE`
+ `DROP`. It assigns a synthetic label per source row, aggregates those labels
into a state string, and filters with a regex. This must run first so later
passes can treat pattern matching as a normal pipeline.

Example (before):
```hamelin
MATCH a=events+ b=logs BY host WITHIN 5m
```
Example (after):
```hamelin
FROM a=events, b=logs
| LET __pattern_label = case(a IS NOT NULL: "a", b IS NOT NULL: "b")
| WHERE __pattern_label IS NOT NULL
| WINDOW __state = array_join(array_agg(__pattern_label), ",")
    WITHIN 5m
    BY host
| WHERE __pattern_label = "a"
  AND regexp_like(__state, "^(a,)+b(,b)*")
| DROP __pattern_label, __state
```

**JOIN/LOOKUP lowering** (IR phase) — During IR lowering (`IRStatement::from_typed`),
JOIN/LOOKUP right sides are hoisted into synthetic CTEs with `FROM <table> | NEST <alias>`.
This preserves the original alias structure (e.g., `users.id`) while making the right
side a nested struct. Missing `ON` clauses default to `true`.

Example (before):
```hamelin
FROM events
| JOIN users ON user_id == users.id
```
Example (after IR lowering):
```hamelin
WITH __join_0 = FROM users | NEST users

FROM events
| JOIN __join_0 ON user_id == users.id
```

**nest_from_aliases** — Rewrites aliased `FROM` clauses into CTEs with `NEST`.
This turns `FROM x=table` into a named subquery that nests all fields under `x`.
The main pipeline then references the generated CTE name.

Example (before):
```hamelin
FROM x=events
| WHERE x.a > 10
```
Example (after):
```hamelin
WITH __alias_0 = FROM events | NEST x

FROM __alias_0
| WHERE x.a > 10
```

**from_to_union** — Converts multi-source `FROM` into `UNION` so later passes
only handle schema widening on `UNION`. Single-source `FROM` stays unchanged.
Aliased sources are handled earlier and are not converted here.

Example (before):
```hamelin
FROM events, logs
```
Example (after):
```hamelin
UNION events, logs
```

**expand_union_schemas** — For `UNION` sources with different schemas, builds
CTEs that project a widened schema with typed NULLs. Each source gets a SELECT
that matches the merged output schema before the UNION runs.

Example (before):
```hamelin
UNION events, logs
```
Example (after):
```hamelin
WITH __union_0 = FROM events
  | SELECT timestamp, event_type, message = CAST(NULL AS String)
WITH __union_1 = FROM logs
  | SELECT timestamp, event_type = CAST(NULL AS String), message

UNION __union_0, __union_1
```
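The widening itself can be sketched in two steps: merge the field sets in first-seen order, then emit one projection per source that fills missing fields with typed NULLs. A toy, string-based version (types are carried as names, not the crate's schema types):

```rust
use std::collections::BTreeSet;

/// Merge the field sets of several sources into one widened schema,
/// then build a per-source SELECT list matching that schema, filling
/// fields a source lacks with typed NULLs.
fn widen(sources: &[Vec<(&str, &str)>]) -> Vec<Vec<String>> {
    // Union of all (name, type) pairs, in first-seen order.
    let mut merged: Vec<(&str, &str)> = Vec::new();
    let mut seen = BTreeSet::new();
    for schema in sources {
        for &(name, ty) in schema {
            if seen.insert(name) {
                merged.push((name, ty));
            }
        }
    }
    // One projection per source, in merged-schema order.
    sources
        .iter()
        .map(|schema| {
            merged
                .iter()
                .map(|&(name, ty)| {
                    if schema.iter().any(|&(n, _)| n == name) {
                        name.to_string()
                    } else {
                        format!("{name} = CAST(NULL AS {ty})")
                    }
                })
                .collect()
        })
        .collect()
}

fn main() {
    let events = vec![("timestamp", "Timestamp"), ("event_type", "String")];
    let logs = vec![("timestamp", "Timestamp"), ("message", "String")];
    for select in widen(&[events, logs]) {
        println!("SELECT {}", select.join(", "));
    }
}
```

Run against the example above, this produces exactly the two `SELECT` lists shown for `__union_0` and `__union_1`.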

### Pipeline-level passes

**normalize_within** — Rewrites `WITHIN` into explicit timestamp bounds using
`WHERE`. It preserves `now()` so evaluation happens at query time, not during
normalization. Ranges and intervals both become `timestamp >= ... AND timestamp <= ...`.

Example (before):
```hamelin
FROM events
| WITHIN -7d
```
Example (after):
```hamelin
FROM events
| WHERE timestamp >= now() + -7d AND timestamp <= now()
```

**normalize_agg** — Replaces compound identifiers in `AGG` with temporary simple
names. It then restores the original compound name with `LET` and removes the
temp with `DROP`. Group-by compound identifiers follow the same pattern.

Example (before):
```hamelin
FROM events
| AGG stats.total = sum(value) BY category
```
Example (after):
```hamelin
FROM events
| AGG __normalize_agg_0 = sum(value) BY category
| LET stats.total = __normalize_agg_0
| DROP __normalize_agg_0
```

**normalize_window** — Same compound-identifier lowering as `normalize_agg`, but
for `WINDOW`. It ensures WINDOW projections and group-bys use simple identifiers
in the command itself.

Example (before):
```hamelin
FROM events
| WINDOW stats.running = sum(value) BY category
```
Example (after):
```hamelin
FROM events
| WINDOW __normalize_window_0 = sum(value) BY category
| LET stats.running = __normalize_window_0
| DROP __normalize_window_0
```

**extract_window_aggregates** — Ensures `WINDOW` projections only contain
top-level aggregate calls. Nested aggregates are extracted into synthetic fields,
then recombined with `LET` and cleaned up with `DROP`.

Example (before):
```hamelin
FROM events
| WINDOW crazy = sum(left) + sum(right) SORT BY order
```
Example (after):
```hamelin
FROM events
| WINDOW __window_agg_0 = sum(left), __window_agg_1 = sum(right) SORT BY order
| LET crazy = __window_agg_0 + __window_agg_1
| DROP __window_agg_0, __window_agg_1
```

**normalize_explode** — Canonicalizes `EXPLODE` to `EXPLODE col = col`. If the
identifier is compound or the expression differs from the column name, it inserts
the necessary `LET`/`DROP` steps so the explode happens in place.

Example (before):
```hamelin
EXPLODE items.expanded = array_field
```
Example (after):
```hamelin
LET __explode_0 = array_field
| EXPLODE __explode_0 = __explode_0
| LET items.expanded = __explode_0
| DROP __explode_0
```

**lower_unnest** — Lowers `UNNEST` to `LET`/`DROP`, with an `EXPLODE` step for
arrays of structs. Complex expressions are assigned to a temp column first. The
original column is always dropped unless the user preserved it earlier.

Example (before):
```hamelin
UNNEST arr
```
Example (after):
```hamelin
EXPLODE arr = arr
| LET a = arr.a, b = arr.b
| DROP arr
```

**lower_parse** — Converts `PARSE` anchor patterns into `regexp_extract` in a
`LET`, then filters non-matching rows with `WHERE` (unless `NODROP` is used).
Throwaway identifiers (`_`) are skipped entirely.

Example (before):
```hamelin
PARSE message "user=* ip=* status=*" AS user_id, src_ip, status
```
Example (after):
```hamelin
LET user_id = regexp_extract(message, "(?s)user=(.*?) ip=(.*?) status=(.*?)", 1),
    src_ip = regexp_extract(message, "(?s)user=(.*?) ip=(.*?) status=(.*?)", 2),
    status = regexp_extract(message, "(?s)user=(.*?) ip=(.*?) status=(.*?)", 3)
| WHERE regexp_count(message, "(?s)user=(.*?) ip=(.*?) status=(.*?)") > 0
```
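The anchor-to-regex conversion can be sketched as: each `*` becomes a lazy capture group, and every other character is escaped so it matches literally. A sketch of the idea (the crate's exact escaping rules may differ):

```rust
/// Convert a PARSE anchor pattern into a regex: `*` becomes a lazy
/// capture group `(.*?)`, regex metacharacters are escaped, and the
/// `(?s)` flag lets `.` span newlines.
fn pattern_to_regex(pattern: &str) -> String {
    let mut out = String::from("(?s)");
    for ch in pattern.chars() {
        match ch {
            '*' => out.push_str("(.*?)"),
            // Escape metacharacters so they match literally.
            c if "\\.^$|?+()[]{}".contains(c) => {
                out.push('\\');
                out.push(c);
            }
            c => out.push(c),
        }
    }
    out
}

fn main() {
    assert_eq!(
        pattern_to_regex("user=* ip=* status=*"),
        "(?s)user=(.*?) ip=(.*?) status=(.*?)"
    );
    println!("{}", pattern_to_regex("user=* ip=* status=*"));
}
```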

**lower_nest** — Replaces `NEST` with a `SELECT` that assigns each field to a
compound identifier under the target name. This avoids struct literals in the
typed AST, and later projection fusion packs them into structs.

Example (before):
```hamelin
FROM events
| NEST user
```
Example (after):
```hamelin
FROM events
| SELECT user.a = a, user.b = b
```

**expand_array_literals** — Widens struct elements inside array literals to match
the array's element type by inserting typed NULLs. If expansion would duplicate
complex expressions, the pass hoists them into `LET`/`DROP` steps around the command.
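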

Example (before):
```hamelin
LET arr = [{a: 1}, {a: 2, b: 3}]
```
Example (after):
```hamelin
LET arr = [{a: 1, b: CAST(NULL AS Int)}, {a: 2, b: 3}]
```

**fuse_projections** — Fuses consecutive `LET`/`DROP`/`SELECT` commands into the
minimal number of `SELECT` steps. It respects dependency barriers when a `LET`
references a field assigned earlier in the pending projection.

Example (before):
```hamelin
FROM events
| SELECT a, b
| LET c = a + b
| DROP b
```
Example (after):
```hamelin
FROM events
| SELECT c = a + b, a
```
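The fusion loop can be sketched as folding each `LET`/`DROP` into a pending projection, flushing the projection whenever a `LET` depends on a field it computes (the dependency barrier). A toy, string-based version, not the crate's implementation:

```rust
/// Toy command stream: only the three fusable commands.
enum Cmd {
    Select(Vec<(String, String)>), // (name, expression)
    Let(String, String),
    Drop(Vec<String>),
}

/// Does `expr` mention `name` as a whole identifier token?
fn references(expr: &str, name: &str) -> bool {
    expr.split(|c: char| !(c.is_alphanumeric() || c == '_'))
        .any(|tok| tok == name)
}

/// Fuse consecutive SELECT/LET/DROP into as few SELECT lists as
/// possible, flushing when a LET reads a computed pending field.
fn fuse(cmds: &[Cmd]) -> Vec<Vec<(String, String)>> {
    let mut fused: Vec<Vec<(String, String)>> = Vec::new();
    let mut pending: Vec<(String, String)> = Vec::new();
    for cmd in cmds {
        match cmd {
            Cmd::Select(list) => {
                if !pending.is_empty() {
                    fused.push(std::mem::take(&mut pending));
                }
                pending = list.clone();
            }
            Cmd::Let(name, expr) => {
                // Barrier: expr depends on a field computed in `pending`
                // (its pending expression is not just the identity).
                let barrier = pending
                    .iter()
                    .any(|(n, e)| n != e && references(expr, n));
                if barrier {
                    let names: Vec<String> =
                        pending.iter().map(|(n, _)| n.clone()).collect();
                    fused.push(std::mem::take(&mut pending));
                    // Restart as the identity projection over prior output.
                    pending = names.into_iter().map(|n| (n.clone(), n)).collect();
                }
                pending.retain(|(n, _)| n != name);
                pending.push((name.clone(), expr.clone()));
            }
            Cmd::Drop(names) => {
                pending.retain(|(n, _)| !names.contains(n));
            }
        }
    }
    if !pending.is_empty() {
        fused.push(pending);
    }
    fused
}

fn main() {
    // SELECT a, b | LET c = a + b | DROP b  →  one fused SELECT.
    let cmds = vec![
        Cmd::Select(vec![
            ("a".to_string(), "a".to_string()),
            ("b".to_string(), "b".to_string()),
        ]),
        Cmd::Let("c".to_string(), "a + b".to_string()),
        Cmd::Drop(vec!["b".to_string()]),
    ];
    assert_eq!(fuse(&cmds).len(), 1);
    println!("{:?}", fuse(&cmds));
}
```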

**align_append_schema** — Inserts a `SELECT` before `APPEND` so the pipeline
output matches the target table schema (order, missing fields, and extra fields).
It uses typed NULLs for missing columns and runs only on the main pipeline.

Example (before):
```hamelin
LET a = 1, b = 2
| APPEND my_table
```
Example (after):
```hamelin
LET a = 1, b = 2
| SELECT b, c = CAST(NULL AS Int), a
| APPEND my_table
```
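The alignment reduces to walking the target schema in order: keep fields the pipeline produced, synthesize typed NULLs for the rest, and implicitly drop extras. A toy sketch (the `(b, c, a)` target schema for `my_table` is assumed from the example above):

```rust
/// Build the SELECT list that aligns the pipeline's output fields to
/// a target table schema before APPEND: target order wins, missing
/// fields become typed NULLs, extra produced fields are dropped.
fn align(target: &[(&str, &str)], produced: &[&str]) -> Vec<String> {
    target
        .iter()
        .map(|&(name, ty)| {
            if produced.contains(&name) {
                name.to_string()
            } else {
                format!("{name} = CAST(NULL AS {ty})")
            }
        })
        .collect()
}

fn main() {
    // Assumed target schema (b, c, a); the pipeline produced a and b.
    let target = [("b", "Int"), ("c", "Int"), ("a", "Int")];
    let select = align(&target, &["a", "b"]);
    assert_eq!(select, vec!["b", "c = CAST(NULL AS Int)", "a"]);
    println!("SELECT {}", select.join(", "));
}
```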

## Query-time evaluation

`IRExpression::freeze()` evaluates constant subexpressions at query execution
time. This resolves `now()`, `today()`, and other time-dependent functions to
concrete timestamp values:

```rust
// Before freeze: WHERE timestamp >= now() + -5h
// After freeze:  WHERE timestamp >= ts("2024-01-15T10:00:00Z")
let frozen_expr = ir_expr.freeze();
```
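Conceptually, freezing is constant folding over the expression tree with `now()` pinned to the evaluation instant. A toy sketch of that idea, not the crate's implementation:

```rust
/// Toy expression tree; timestamps and intervals in epoch seconds.
#[derive(Debug, PartialEq)]
enum Expr {
    Timestamp(i64),
    Interval(i64),
    Now,
    Add(Box<Expr>, Box<Expr>),
}

/// Replace `now()` with the evaluation timestamp and fold constant
/// timestamp arithmetic, so backends see concrete bounds.
fn freeze(expr: Expr, now: i64) -> Expr {
    match expr {
        Expr::Now => Expr::Timestamp(now),
        Expr::Add(l, r) => match (freeze(*l, now), freeze(*r, now)) {
            // timestamp + interval folds to a concrete timestamp.
            (Expr::Timestamp(t), Expr::Interval(d)) => Expr::Timestamp(t + d),
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}

fn main() {
    // The bound from `timestamp >= now() + -5h`.
    let bound = Expr::Add(
        Box::new(Expr::Now),
        Box::new(Expr::Interval(-5 * 3600)),
    );
    assert_eq!(freeze(bound, 1_700_000_000), Expr::Timestamp(1_699_982_000));
    println!("frozen");
}
```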

## Incremental analysis

The `incremental` module analyzes queries for incremental execution:

```rust
use hamelin_translation::incremental::{analyze, IncrementalStrategy};

let analysis = analyze(&ir_statement, "timestamp")?;
match analysis.strategy {
    IncrementalStrategy::Stateless => { /* Simple filter/project */ }
    IncrementalStrategy::Windowed { .. } => { /* Requires time window */ }
}
```