hamelin_translation 0.7.9

Lowering and IR for Hamelin query language
Documentation

hamelin_translation

This crate lowers Hamelin's type-checked AST into a backend-agnostic intermediate representation (IR). It sits between hamelin_lib (parsing and type-checking) and backend crates like hamelin_datafusion.

hamelin_lib (TypedAST)
        ↓
hamelin_translation (IR)  ← this crate
        ↓
hamelin_datafusion, hamelin_trino, etc.

The TypedStatement AST goes through a series of normalization passes that simplify and canonicalize the query structure. The resulting IRStatement enforces invariants that make backend translation straightforward—backends only need to handle a reduced set of commands with predictable shapes.

IR guarantees

Commands that get lowered away

These commands do not exist in the IR:

  • LET — fused into SELECT by fuse_projections
  • DROP — fused into SELECT by fuse_projections
  • WITHIN — converted to WHERE with explicit timestamp bounds
  • PARSE — converted to LET + WHERE (regex extraction + filter)
  • UNNEST — converted to EXPLODE (if array) + LET + DROP
  • NEST — converted to SELECT with compound identifiers, then packed
  • MATCH — lowered to FROM + LET + WHERE + WINDOW + WHERE + DROP

Identifier normalization

All assignments in the IR use simple identifiers. Compound paths like user.name get packed into struct literals, so x.a = 1, x.b = 2 becomes x = {a: 1, b: 2}. FROM aliases and JOIN/LOOKUP right sides get hoisted into named subqueries, leaving only simple identifier references in the main pipeline.

Schema normalization

When FROM/UNION combines tables with differing schemas, the IR widens each table to a common schema via named subqueries. Array literals get similar treatment—if struct elements have different fields, they get widened to match the array's element type.

Window frames

WINDOW commands have pre-computed WindowFrame values rather than expressions. Non-deterministic functions like now() and today() are disallowed in window frame expressions since the frame bounds must be constant.

EXPLODE canonical form

After normalization, EXPLODE is always in canonical form: EXPLODE col = col. The column explodes in place, keeping the same name but changing from an array type to its element type.

JOIN normalization

Both JOIN and LOOKUP lower to IRJoinCommandJOIN becomes an inner join, LOOKUP becomes a left join. Missing ON conditions default to true for cross join semantics.

Usage

use hamelin_translation::{lower, normalize_with};

// Simple case
let ir = lower(typed_statement)?;

// With options
let ir = normalize_with()
    .with_timestamp_field("event_time")
    .lower(typed_statement)?;

IR types

Type Description
IRStatement Top-level: WITH clauses + main pipeline
IRPipeline Sequence of IRCommands with output schema
IRCommand Command with kind, span, and output schema
IRExpression Wrapper around TypedExpression
IRAssignment SimpleIdentifier + IRExpression pair

Command kinds

IR Command Source Commands
From FROM, UNION
Where WHERE, WITHIN
Select SELECT, LET, DROP, UNNEST
Agg AGG
Window WINDOW, MATCH
Sort SORT
Limit LIMIT
Explode EXPLODE
Join JOIN, LOOKUP

Normalization passes

Lowering runs a series of normalization passes before converting to IR.

Statement-level passes

lower_match — Lowers MATCH into FROM + LET + WHERE + WINDOW + WHERE

  • DROP. It assigns a synthetic label per source row, aggregates those labels into a state string, and filters with a regex. This must run first so later passes can treat pattern matching as a normal pipeline.

Example (before):

MATCH a=events+ b=logs BY host WITHIN 5m

Example (after):

FROM a=events, b=logs
| LET __pattern_label = case(a IS NOT NULL: "a", b IS NOT NULL: "b")
| WHERE __pattern_label IS NOT NULL
| WINDOW __state = array_join(array_agg(__pattern_label), ",")
    WITHIN 5m
    BY host
| WHERE __pattern_label = "a"
  AND regexp_like(__state, "^(a,)+b(,b)*")
| DROP __pattern_label, __state

JOIN/LOOKUP lowering (IR phase) — During IR lowering (IRStatement::from_typed), JOIN/LOOKUP right sides are hoisted into synthetic CTEs with FROM <table> | NEST <alias>. This preserves the original alias structure (e.g., users.id) while making the right side a nested struct. Missing ON clauses default to true.

Example (before):

FROM events
| JOIN users ON user_id == users.id

Example (after IR lowering):

WITH __join_0 = FROM users | NEST users

FROM events
| JOIN __join_0 ON user_id == users.id

nest_from_aliases — Rewrites aliased FROM clauses into CTEs with NEST. This turns FROM x=table into a named subquery that nests all fields under x. The main pipeline then references the generated CTE name.

Example (before):

FROM x=events
| WHERE x.a > 10

Example (after):

WITH __alias_0 = FROM events | NEST x

FROM __alias_0
| WHERE x.a > 10

from_to_union — Converts multi-source FROM into UNION so later passes only handle schema widening on UNION. Single-source FROM stays unchanged. Aliased sources are handled earlier and are not converted here.

Example (before):

FROM events, logs

Example (after):

UNION events, logs

expand_union_schemas — For UNION sources with different schemas, builds CTEs that project a widened schema with typed NULLs. Each source gets a SELECT that matches the merged output schema before the UNION runs.

Example (before):

UNION events, logs

Example (after):

WITH __union_0 = FROM events
  | SELECT timestamp, event_type, message = CAST(NULL AS String)
WITH __union_1 = FROM logs
  | SELECT timestamp, event_type = CAST(NULL AS String), message

UNION __union_0, __union_1

Pipeline-level passes

normalize_within — Rewrites WITHIN into explicit timestamp bounds using WHERE. It preserves now() so evaluation happens at query time, not during normalization. Ranges and intervals both become timestamp >= ... AND timestamp <= ....

Example (before):

FROM events
| WITHIN -7d

Example (after):

FROM events
| WHERE timestamp >= now() + -7d AND timestamp <= now()

normalize_agg — Replaces compound identifiers in AGG with temporary simple names. It then restores the original compound name with LET and removes the temp with DROP. Group-by compound identifiers follow the same pattern.

Example (before):

FROM events
| AGG stats.total = sum(value) BY category

Example (after):

FROM events
| AGG __normalize_agg_0 = sum(value) BY category
| LET stats.total = __normalize_agg_0
| DROP __normalize_agg_0

normalize_window — Same compound-identifier lowering as normalize_agg, but for WINDOW. It ensures WINDOW projections and group-bys use simple identifiers in the command itself.

Example (before):

FROM events
| WINDOW stats.running = sum(value) BY category

Example (after):

FROM events
| WINDOW __normalize_window_0 = sum(value) BY category
| LET stats.running = __normalize_window_0
| DROP __normalize_window_0

extract_window_aggregates — Ensures WINDOW projections only contain top-level aggregate calls. Nested aggregates are extracted into synthetic fields, then recombined with LET and cleaned up with DROP.

Example (before):

FROM events
| WINDOW crazy = sum(left) + sum(right) SORT BY order

Example (after):

FROM events
| WINDOW __window_agg_0 = sum(left), __window_agg_1 = sum(right) SORT BY order
| LET crazy = __window_agg_0 + __window_agg_1
| DROP __window_agg_0, __window_agg_1

normalize_explode — Canonicalizes EXPLODE to EXPLODE col = col. If the identifier is compound or the expression differs from the column name, it inserts the necessary LET/DROP steps so the explode happens in place.

Example (before):

EXPLODE items.expanded = array_field

Example (after):

LET __explode_0 = array_field
| EXPLODE __explode_0 = __explode_0
| LET items.expanded = __explode_0
| DROP __explode_0

lower_unnest — Lowers UNNEST to LET/DROP, with an EXPLODE step for arrays of structs. Complex expressions are assigned to a temp column first. The original column is always dropped unless the user preserved it earlier.

Example (before):

UNNEST arr

Example (after):

EXPLODE arr = arr
| LET a = arr.a, b = arr.b
| DROP arr

lower_parse — Converts PARSE anchor patterns into regexp_extract in a LET, then filters non-matching rows with WHERE (unless NODROP is used). Throwaway identifiers (_) are skipped entirely.

Example (before):

PARSE message "user=* ip=* status=*" AS user_id, src_ip, status

Example (after):

LET user_id = regexp_extract(message, "(?s)user=(.*?) ip=(.*?) status=(.*?)", 1),
    src_ip = regexp_extract(message, "(?s)user=(.*?) ip=(.*?) status=(.*?)", 2),
    status = regexp_extract(message, "(?s)user=(.*?) ip=(.*?) status=(.*?)", 3)
| WHERE regexp_count(message, "(?s)user=(.*?) ip=(.*?) status=(.*?)") > 0

lower_nest — Replaces NEST with a SELECT that assigns each field to a compound identifier under the target name. This avoids struct literals in the typed AST, and later projection fusion packs them into structs.

Example (before):

FROM events
| NEST user

Example (after):

FROM events
| SELECT user.a = a, user.b = b

expand_array_literals — Widen struct elements inside array literals to match the array's element type by inserting typed NULLs. If expansion would duplicate complex expressions, the pass hoists them into LET/DROP around the command.

Example (before):

LET arr = [{a: 1}, {a: 2, b: 3}]

Example (after):

LET arr = [{a: 1, b: CAST(NULL AS Int)}, {a: 2, b: 3}]

fuse_projections — Fuses consecutive LET/DROP/SELECT commands into the minimal number of SELECT steps. It respects dependency barriers when a LET references a field assigned earlier in the pending projection.

Example (before):

FROM events
| SELECT a, b
| LET c = a + b
| DROP b

Example (after):

FROM events
| SELECT c = a + b, a

align_append_schema — Inserts a SELECT before APPEND so the pipeline output matches the target table schema (order, missing fields, and extra fields). It uses typed NULLs for missing columns and runs only on the main pipeline.

Example (before):

LET a = 1, b = 2
| APPEND my_table

Example (after):

LET a = 1, b = 2
| SELECT b, c = CAST(NULL AS Int), a
| APPEND my_table

Query-time evaluation

IRExpression::freeze() evaluates constant subexpressions at query execution time. This resolves now(), today(), and other time-dependent functions to concrete timestamp values:

// Before freeze: WHERE timestamp >= now() + -5h
// After freeze:  WHERE timestamp >= ts("2024-01-15T10:00:00Z")
let frozen_expr = ir_expr.freeze();

Incremental analysis

The incremental module analyzes queries for incremental execution:

use hamelin_translation::incremental::{analyze, IncrementalStrategy};

let analysis = analyze(&ir_statement, "timestamp")?;
match analysis.strategy {
    IncrementalStrategy::Stateless => { /* Simple filter/project */ }
    IncrementalStrategy::Windowed { .. } => { /* Requires time window */ }
}