hamelin_translation 0.9.4

Lowering and IR for Hamelin query language

This crate lowers Hamelin's type-checked AST into a backend-agnostic intermediate representation (IR). It sits between hamelin_lib (parsing and type-checking) and backend crates like hamelin_datafusion.

hamelin_lib (TypedAST)
        ↓
hamelin_translation (IR)  ← this crate
        ↓
hamelin_datafusion, hamelin_trino, etc.

The TypedStatement AST goes through a series of normalization passes that simplify and canonicalize the query structure. The resulting IRStatement enforces invariants that make backend translation straightforward—backends only need to handle a reduced set of commands with predictable shapes.

IR guarantees

Commands that get lowered away

These commands do not exist in the IR:

  • SET — fused into SELECT by fuse_projections
  • DROP — fused into SELECT by fuse_projections
  • WITHIN — converted to WHERE with explicit timestamp bounds
  • PARSE — converted to SET + WHERE (regex extraction + filter)
  • UNNEST — converted to EXPLODE (if array) + SET + DROP
  • NEST — converted to SELECT with compound identifiers, then packed
  • MATCH — lowered to FROM + SET + WHERE + WINDOW + WHERE + DROP

Identifier normalization

All assignments in the IR use simple identifiers. Compound paths like user.name get packed into struct literals, so x.a = 1, x.b = 2 becomes x = {a: 1, b: 2}. FROM aliases and JOIN/LOOKUP right sides get hoisted into named subqueries, leaving only simple identifier references in the main pipeline.
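The packing step can be sketched as a trie-style insert over dotted paths. This is an illustrative model, not the crate's actual types: `Value` and `pack` are hypothetical names, and expressions are kept as plain strings.

```rust
use std::collections::BTreeMap;

// A toy value: either a leaf expression (kept as a string here) or a struct.
#[derive(Debug, PartialEq)]
enum Value {
    Leaf(String),
    Struct(BTreeMap<String, Value>),
}

// Pack assignments like ("x.a", "1"), ("x.b", "2") into x = {a: 1, b: 2}.
fn pack(assignments: &[(&str, &str)]) -> BTreeMap<String, Value> {
    let mut root: BTreeMap<String, Value> = BTreeMap::new();
    for (path, expr) in assignments {
        let parts: Vec<&str> = path.split('.').collect();
        let mut node = &mut root;
        // Walk (and create) intermediate structs for every path segment
        // except the last.
        for part in &parts[..parts.len() - 1] {
            node = match node
                .entry(part.to_string())
                .or_insert_with(|| Value::Struct(BTreeMap::new()))
            {
                Value::Struct(m) => m,
                Value::Leaf(_) => panic!("path collides with a leaf assignment"),
            };
        }
        node.insert(
            parts[parts.len() - 1].to_string(),
            Value::Leaf(expr.to_string()),
        );
    }
    root
}
```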

Schema normalization

When FROM/UNION combines tables with differing schemas, the IR widens each table to a common schema via named subqueries. Array literals get similar treatment—if struct elements have different fields, they get widened to match the array's element type.
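For flat schemas, widening can be sketched as a field-list union plus per-source projections with typed NULLs. A minimal sketch with assumed helper names (`widen`, `projection`); the real pass works on typed ASTs:

```rust
// Widen two flat schemas (name, type) to a common schema: the union of
// fields, preserving first-seen order.
fn widen(a: &[(&str, &str)], b: &[(&str, &str)]) -> Vec<(String, String)> {
    let mut merged: Vec<(String, String)> = Vec::new();
    for (name, ty) in a.iter().chain(b.iter()) {
        if !merged.iter().any(|(n, _)| n.as_str() == *name) {
            merged.push((name.to_string(), ty.to_string()));
        }
    }
    merged
}

// Build the projection for one source: its own column where present,
// otherwise a typed NULL matching the merged schema.
fn projection(source: &[(&str, &str)], merged: &[(String, String)]) -> Vec<String> {
    merged
        .iter()
        .map(|(name, ty)| {
            if source.iter().any(|(n, _)| *n == name.as_str()) {
                name.clone()
            } else {
                format!("{name} = CAST(NULL AS {ty})")
            }
        })
        .collect()
}
```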

Window frames

WINDOW commands have pre-computed WindowFrame values rather than expressions. Non-deterministic functions like now() and today() are disallowed in window frame expressions since the frame bounds must be constant.

EXPLODE canonical form

After normalization, EXPLODE is always in canonical form: EXPLODE col = col. The column explodes in place, keeping the same name but changing from an array type to its element type.

JOIN normalization

Both JOIN and LOOKUP lower to IRJoinCommand: JOIN becomes an inner join, LOOKUP becomes a left join. A missing ON condition defaults to true, giving cross-join semantics.

Usage

use hamelin_translation::{lower, normalize_with};

// Simple case
let ir = lower(typed_statement)?;

// With options
let ir = normalize_with()
    .with_timestamp_field("event_time")
    .lower(typed_statement)?;

IR types

Type          Description
IRStatement   Top-level: DEF statements + main pipeline
IRPipeline    Sequence of IRCommands with output schema
IRCommand     Command with kind, span, and output schema
IRExpression  Wrapper around TypedExpression
IRAssignment  SimpleIdentifier + IRExpression pair

Command kinds

IR Command   Source Commands
From         FROM, UNION
Where        WHERE, WITHIN
Select       SELECT, SET, DROP, UNNEST
Agg          AGG
Window       WINDOW, MATCH
Sort         SORT
Limit        LIMIT
Explode      EXPLODE
Join         JOIN, LOOKUP
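Because the IR reduces everything to these nine kinds, a backend's dispatch can be a single exhaustive match. The enum and function below are an illustrative mirror, not the crate's real types (the actual IRCommand also carries span and output schema):

```rust
// Toy mirror of the reduced command set a backend must handle.
enum CommandKind {
    From, Where, Select, Agg, Window, Sort, Limit, Explode, Join,
}

// Exhaustive dispatch: the IR guarantees no other command kinds exist,
// so the compiler enforces that every case is covered.
fn backend_operator(kind: &CommandKind) -> &'static str {
    match kind {
        CommandKind::From => "scan/union",
        CommandKind::Where => "filter",
        CommandKind::Select => "projection",
        CommandKind::Agg => "aggregate",
        CommandKind::Window => "window",
        CommandKind::Sort => "sort",
        CommandKind::Limit => "limit",
        CommandKind::Explode => "unnest",
        CommandKind::Join => "join",
    }
}
```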

Normalization passes

Lowering runs a series of normalization passes before converting to IR.

Statement-level passes

lower_match — Lowers MATCH into FROM + SET + WHERE + WINDOW + WHERE + DROP. It assigns a synthetic label per source row, aggregates those labels into a state string, and filters with a regex. This must run first so later passes can treat pattern matching as a normal pipeline.

Example (before):

MATCH a=events+ b=logs BY host WITHIN 5m

Example (after):

FROM a=events, b=logs
| SET __pattern_label = case(a IS NOT NULL: "a", b IS NOT NULL: "b")
| WHERE __pattern_label IS NOT NULL
| WINDOW __state = array_join(array_agg(__pattern_label), ",")
    WITHIN 5m
    BY host
| WHERE __pattern_label = "a"
  AND regexp_like(__state, "^(a,)+b(,b)*")
| DROP __pattern_label, __state

JOIN/LOOKUP lowering (IR phase) — During IR lowering (IRStatement::from_typed), JOIN/LOOKUP right sides are hoisted into synthetic CTEs with FROM <table> | NEST <alias>. This preserves the original alias structure (e.g., users.id) while making the right side a nested struct. Missing ON clauses default to true.

Example (before):

FROM events
| JOIN users ON user_id == users.id

Example (after IR lowering):

DEF __join_0 = FROM users | NEST users;

FROM events
| JOIN __join_0 ON user_id == users.id

nest_from_aliases — Rewrites aliased FROM clauses into CTEs with NEST. This turns FROM x=table into a named subquery that nests all fields under x. The main pipeline then references the generated CTE name.

Example (before):

FROM x=events
| WHERE x.a > 10

Example (after):

DEF __alias_0 = FROM events | NEST x;

FROM __alias_0
| WHERE x.a > 10

from_to_union — Converts multi-source FROM into UNION so later passes only handle schema widening on UNION. Single-source FROM stays unchanged. Aliased sources are handled earlier and are not converted here.

Example (before):

FROM events, logs

Example (after):

UNION events, logs

expand_union_schemas — For UNION sources with different schemas, builds CTEs that project a widened schema with typed NULLs. Each source gets a SELECT that matches the merged output schema before the UNION runs.

Example (before):

UNION events, logs

Example (after):

DEF __union_0 = FROM events
  | SELECT timestamp, event_type, message = CAST(NULL AS String);
DEF __union_1 = FROM logs
  | SELECT timestamp, event_type = CAST(NULL AS String), message;

UNION __union_0, __union_1

Pipeline-level passes

normalize_within — Rewrites WITHIN into explicit timestamp bounds using WHERE. It preserves now() so evaluation happens at query time, not during normalization. Ranges and intervals both become timestamp >= ... AND timestamp <= ....

Example (before):

FROM events
| WITHIN -7d

Example (after):

FROM events
| WHERE timestamp >= now() + -7d AND timestamp <= now()

normalize_agg — Replaces compound identifiers in AGG with temporary simple names. It then restores the original compound name with SET and removes the temp with DROP. Group-by compound identifiers follow the same pattern.

Example (before):

FROM events
| AGG stats.total = sum(value) BY category

Example (after):

FROM events
| AGG __normalize_agg_0 = sum(value) BY category
| SET stats.total = __normalize_agg_0
| DROP __normalize_agg_0

normalize_window — Same compound-identifier lowering as normalize_agg, but for WINDOW. It ensures WINDOW projections and group-bys use simple identifiers in the command itself.

Example (before):

FROM events
| WINDOW stats.running = sum(value) BY category

Example (after):

FROM events
| WINDOW __normalize_window_0 = sum(value) BY category
| SET stats.running = __normalize_window_0
| DROP __normalize_window_0

extract_window_aggregates — Ensures WINDOW projections only contain top-level aggregate calls. Nested aggregates are extracted into synthetic fields, then recombined with SET and cleaned up with DROP.

Example (before):

FROM events
| WINDOW crazy = sum(left) + sum(right) SORT order

Example (after):

FROM events
| WINDOW __window_agg_0 = sum(left), __window_agg_1 = sum(right) SORT order
| SET crazy = __window_agg_0 + __window_agg_1
| DROP __window_agg_0, __window_agg_1

normalize_explode — Canonicalizes EXPLODE to EXPLODE col = col. If the identifier is compound or the expression differs from the column name, it inserts the necessary SET/DROP steps so the explode happens in place.

Example (before):

EXPLODE items.expanded = array_field

Example (after):

SET __explode_0 = array_field
| EXPLODE __explode_0 = __explode_0
| SET items.expanded = __explode_0
| DROP __explode_0

lower_unnest — Lowers UNNEST to SET/DROP, with an EXPLODE step for arrays of structs. Complex expressions are assigned to a temp column first. When the expression is a simple field reference, the source column is dropped before extracting struct fields to avoid duplicating data.

Example (before):

UNNEST arr

Example (after):

SET __unnest_0 = arr
| EXPLODE __unnest_0 = __unnest_0
| DROP arr
| SET a = __unnest_0.a, b = __unnest_0.b
| DROP __unnest_0

lower_parse — Converts PARSE anchor patterns into regexp_extract in a SET, then filters non-matching rows with WHERE (unless NODROP is used). Throwaway identifiers (_) are skipped entirely.

Example (before):

PARSE message "user=* ip=* status=*" AS user_id, src_ip, status

Example (after):

SET user_id = regexp_extract(message, "(?s)user=(.*?) ip=(.*?) status=(.*?)", 1),
    src_ip = regexp_extract(message, "(?s)user=(.*?) ip=(.*?) status=(.*?)", 2),
    status = regexp_extract(message, "(?s)user=(.*?) ip=(.*?) status=(.*?)", 3)
| WHERE regexp_count(message, "(?s)user=(.*?) ip=(.*?) status=(.*?)") > 0
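The pattern-to-regex translation shown above can be sketched as a character walk: each `*` becomes a lazy capture group, literal text is regex-escaped, and `(?s)` makes `.` match newlines. This is a sketch with an assumed function name; the real pass also handles NODROP and throwaway identifiers:

```rust
// Convert a PARSE anchor pattern like "user=* ip=*" into the regex the
// lowering emits.
fn anchor_to_regex(pattern: &str) -> String {
    let mut out = String::from("(?s)");
    for ch in pattern.chars() {
        match ch {
            // Each wildcard becomes a lazy (non-greedy) capture group.
            '*' => out.push_str("(.*?)"),
            // Escape regex metacharacters in the literal parts.
            c if "\\.^$|?+()[]{}".contains(c) => {
                out.push('\\');
                out.push(c);
            }
            c => out.push(c),
        }
    }
    out
}
```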

lower_nest — Replaces NEST with a SELECT that assigns each field to a compound identifier under the target name. This avoids struct literals in the typed AST, and later projection fusion packs them into structs.

Example (before):

FROM events
| NEST user

Example (after):

FROM events
| SELECT user.a = a, user.b = b

expand_array_literals — Widens struct elements inside array literals to match the array's element type by inserting typed NULLs. If expansion would duplicate complex expressions, the pass hoists them into SET/DROP around the command.

Example (before):

SET arr = [{a: 1}, {a: 2, b: 3}]

Example (after):

SET arr = [{a: 1, b: CAST(NULL AS Int)}, {a: 2, b: 3}]

fuse_projections — Fuses consecutive SET/DROP/SELECT commands into the minimal number of SELECT steps. It respects dependency barriers when a SET references a field assigned earlier in the pending projection.

Example (before):

FROM events
| SELECT a, b
| SET c = a + b
| DROP b

Example (after):

FROM events
| SELECT c = a + b, a
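The fusion above can be modeled as folding SET/DROP steps into one ordered projection map: SET inserts or overwrites a column, DROP removes it. A simplified sketch with assumed names (`Step`, `fuse`); it ignores the dependency barriers the real pass respects, and output ordering may differ from the actual SELECT:

```rust
// A pending SET or DROP step over string column names and expressions.
enum Step<'a> {
    Set(&'a str, &'a str),
    Drop(&'a str),
}

// Fold a run of steps into a single projection (column name, expression).
fn fuse(initial: &[&str], steps: &[Step]) -> Vec<(String, String)> {
    // Start from the identity projection over the input columns.
    let mut proj: Vec<(String, String)> = initial
        .iter()
        .map(|c| (c.to_string(), c.to_string()))
        .collect();
    for step in steps {
        match step {
            Step::Set(name, expr) => {
                // Overwrite an existing column or append a new one.
                if let Some(slot) = proj.iter_mut().find(|(n, _)| n.as_str() == *name) {
                    slot.1 = expr.to_string();
                } else {
                    proj.push((name.to_string(), expr.to_string()));
                }
            }
            Step::Drop(name) => proj.retain(|(n, _)| n.as_str() != *name),
        }
    }
    proj
}
```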

align_append_schema — Inserts a SELECT before APPEND so the pipeline output matches the target table schema (order, missing fields, and extra fields). It uses typed NULLs for missing columns and runs only on the main pipeline.

Example (before):

SET a = 1, b = 2
| APPEND my_table

Example (after):

SET a = 1, b = 2
| SELECT b, c = CAST(NULL AS Int), a
| APPEND my_table

Query-time evaluation

IRExpression::freeze() evaluates constant subexpressions at query execution time. This resolves now(), today(), and other time-dependent functions to concrete timestamp values:

// Before freeze: WHERE timestamp >= now() + -5h
// After freeze:  WHERE timestamp >= ts("2024-01-15T10:00:00Z")
let frozen_expr = ir_expr.freeze();

Incremental analysis

The incremental module analyzes queries for incremental execution:

use hamelin_translation::incremental::{analyze, IncrementalStrategy};

let analysis = analyze(&ir_statement, "timestamp")?;
match analysis.strategy {
    IncrementalStrategy::Stateless => { /* Simple filter/project */ }
    IncrementalStrategy::Windowed { .. } => { /* Requires time window */ }
}