## Hamelin Instructions
I am going to teach you a new query language called Hamelin. Its main purpose is to support SIEM use cases. In some ways it is similar to SQL, PRQL, Elastic Query Language, Sumo Logic Query Language, or Splunk Query Language, though it has some distinct properties. It is a pipe language: it processes data through a series of operations separated by pipes. Throughout this guide, I will include some examples in <example></example> tags.
A Hamelin query consists of one or more operations separated by the pipe character ("|"). It is very important to memorize the commands, as each pipe operation must start with one of them. The following are the only commands supported:
- `FROM` is the beginning of every Hamelin query. Every query must pull data from _somewhere._ One exception is that Hamelin can be used to calculate some expression value too, e.g. <example>LET x = 8 * 2</example>.
- `SELECT` allows you to specify which columns are selected: <example>SELECT <col1>, <col2>, <col3> = <val></example> drops all columns except columns 1 and 2, and also defines a third column with a new, specific value. It's optional; by default, `SELECT *` is implied (which selects all columns).
- `WHERE` filters the rows that flow through it based on applying a boolean condition to each of them.
- `LET <col> = <val>` defines a new column value, or shadows an existing value.
- `DROP <col1>, <col2>` removes col1 and col2 from the projection downstream and in the final result set.
- `WITH <tab1> = <operations>` associates tab1 with the specified operations (until two newlines are encountered), so tab1 can be reused later
- `AGG <function1, function2, ...> BY <column1, column2, ...>` aggregates the data and groups it by the given columns
- `WINDOW` applies aggregation or window functions over sliding windows of data.
- `MATCH` correlates multiple event patterns together.
- `SORT <columns> <optional order>` sorts the rows by the specified columns (in `ASC` or `DESC` order; `ASC` is the default)
- `LIMIT <limit>` limits the output rows
- `UNNEST <expr>` lifts a struct or array of structs into the parent or enclosing result set.
- `EXPLODE <expr or assignment_clause>` expands array fields into separate rows.
It is extremely important to start each pipe section with one of these commands and to use the pipe ("|") to separate operations, e.g.
<example>
FROM example | WHERE name = "John Smith" | SELECT name, address
</example>
For example, to select columns "c1" and "c2" from table "A", one needs to write:
<example>
FROM A
| SELECT c1, c2
</example>
To add a filter for "c1=3" condition:
<example>
FROM A
| WHERE c1 == 3
| SELECT c1, c2
</example>
To select all data from table "A" and associate column "v" multiplied by 2 with a new column "x":
<example>
FROM A
| LET x = v*2
</example>
To associate column c1 with new column x and remove c1 and c3, this can be used:
<example>
FROM A
| LET x = c1
| DROP c1, c3
</example>
If only c1, c2 and c3 are present, the above can be rewritten using just `SELECT`:
<example>
FROM A
| SELECT x = c1, c2
</example>
### Aggregations
Aggregations are extremely common in queries and Hamelin supports these via `AGG` keyword: `| AGG <aggregation1>, <aggregation2>, ... BY column1, column2, ...`
The "BY ..." part is optional. Any columns or operations not specified in the aggregation operation will be dropped.
The list of aggregations is optional as well. When not specifying it, the operation acts in a similar way as the DISTINCT operator in SQL.
To aggregate rows in table "A" using count and group by c1:
<example>
FROM A
| AGG count = count() BY c1
</example>
To get distinct c1 values in table "A" and ignore nulls:
<example>
FROM A
| AGG BY c1
| WHERE c1 IS NOT NULL
</example>
To aggregate rows in table "A" using count, group by c1 and select 10 most common values:
<example>
FROM A
| AGG count = count() BY c1
| SORT count DESC
| LIMIT 10
</example>
To take max of column c1, avg of column c2 and count, grouped by c3 and order max(c1) and take top 10 results, this can be issued:
<example>
FROM A
| AGG count = count(), max = max(c1), avg = avg(c2) BY c3
| SORT max DESC
| LIMIT 10
</example>
It is also possible to associate the aggregate values with new column names:
<example>
FROM A
| AGG count = count(), max_c1 = max(c1), avg_c2 = avg(c2) BY c3
| SORT max_c1 DESC
| LIMIT 10
</example>
Very important: remember that aggregation keeps only the columns (and aggregations) specified; everything else is dropped. If you want to filter and aggregate data, consider the right order. E.g. to filter cases where c1 > 500 and then get the average value of c2, grouped by c3, from table tbl, sorted by the average of c2, this is the correct order (WHERE before AGG):
<example>
FROM tbl
| WHERE c1 > 500
| AGG avg_c2 = avg(c2) BY c3
| SORT avg_c2
</example>
Of course, there are also cases when WHERE is applied after aggregation. E.g. to get average of c1 and count, grouped by c3, only where count() > 10, this will be a correct query:
<example>
FROM tbl
| AGG count = count(), avg_c1 = avg(c1) BY c3
| WHERE count > 10
</example>
### Literals
Hamelin adopts most of SQL’s conventions for the basic literal syntax and semantics. Integer, decimal, boolean, string literals, and NULL all behave as they do in SQL, and have equivalent syntax.
In addition, however, Hamelin adds:
- Double-quoted strings, which allow for the use of single quotes in the string without needing escaping.
- Support for multiline strings. (You can add literal newlines into a string.)
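For example, the following query (with a hypothetical table `tbl` and column `note`) uses both additions at once: a double-quoted string that contains an unescaped single quote and a literal newline:
<example>
FROM tbl
| LET note = "it's a
multiline string"
</example>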
### Identifiers
The identifier syntax is similar to how Trino SQL handles identifiers. Identifiers are ASCII letter sequences; they can start with or contain an underscore (`_`), and they can contain digits but cannot start with one. For string _literals_, Hamelin follows many existing programming languages in allowing double quotes, straying from SQL, which requires single quotes (`'`) for strings. This means Hamelin does not support double-quoted identifiers. Instead, it supports backtick-quoted identifiers: `identifier of life`.
The following query retrieves data from the _test table and filters on two conditions: whether the "curry wurst" column equals the string "wiener schnitzel" and whether the "1a_extra" column equals the string 'mit "pommes"' (note the double quote inside):
<example>
FROM _test
| WHERE `curry wurst` == 'wiener schnitzel'
AND `1a_extra` == 'mit "pommes"'
</example>
### Types
Hamelin has the following primitive types:
- `boolean`
- `int` - an umbrella for all integer types of any bit length
- `decimal(precision, scale)` - exact fixed point values with specified precision and scale (e.g. `100.5`)
- `double` - floating point numbers with variable precision (e.g. `-9.87e2`)
- `string` - a string of any length, defined as a literal with either single or double quotes. String literals can escape quotes of either type by doubling them. (e.g. `"She said ""hello"""`, `'It''s raining'`)
- `binary` - translates to `varbinary` in SQL for handling binary data
- `timestamp` - an umbrella for date, timestamp, and all their variants (precision and timezone)
- `interval` - an exact time duration of seconds, minutes, hours, or days (e.g. `30s`, `14d`)
- `calendar_interval` - a calendar based time duration of months, quarters, or years (e.g. `3mon`, `5y`)
- `range` - a span between two values of any primitive type using the `..` operator, which is inclusive on both sides and can be unbounded in either direction (e.g. `ts('2024-01-15')..now()`, `1..10`, `yesterday()..`). This is primarily used for time ranges, but can be used for any sortable type.
Hamelin has four composite types:
- A `struct` has a set of fields, each with name and schema. You can access its field `my_field` of `my_struct` by doing `my_struct.my_field`
- A `map` has a set of key-value pairs with homogeneous types. You can access the value for key `my_key` in `my_map` by doing `my_map["my_key"]`
- An `array` is an ordered list of values with homogeneous types. You can access the value at index 5 in `my_array` by doing `my_array[5]`
- A `variant` is essentially a representation of untyped JSON data. You can access fields of JSON objects with the struct syntax above and you can access items of JSON arrays with the array syntax above.
Hamelin types can be cast to one another with the `AS` operator:
- If the type cast fails, the expression returns `NULL`
- Casting never happens implicitly, so to concatenate strings with other types you need to cast. Often parentheses are necessary to make sure to cast the right expression (e.g. `"Today is " + (now() AS string)`)
- Casting is most often useful for casting `variant`s parsed from JSON into the expected composite type (e.g. `| LET parsed = parse_json(message) AS { user_id: string, email: string}`)
- Array types are casted like `array(string)`, not `string[]`
- Map types are casted like `map(string, int)`, not `{ string: int }`
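As a sketch of casting in practice (the table and column names here are hypothetical), a `variant` parsed from JSON can be cast into the composite types above:
<example>
FROM logs
| LET parsed = parse_json(message) AS { user_id: string, tags: array(string) }
| LET counts = parse_json(raw_counts) AS map(string, int)
</example>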
Any field in Hamelin can also be `NULL`. There is no concept of a `NOT NULL` field. You can check for `NULL` values with `IS NULL` or `IS NOT NULL`.
NOTE: `variant` and `interval` types can never be saved in the output of a table, view, or detection query. For `variant`, you can cast it to a struct, array, etc.. For `interval`, you can use a function like `to_nanos()` to convert it to an `int`.
### Operators
Hamelin's operators are designed to make interactive query authoring fluent and ergonomic.
- Boolean operators
- Negation: `NOT`
- Conjunction: `AND`
- Disjunction: `OR`
- Identity: `IS` (only really used to test against `NULL`).
- Arithmetic:
- Addition: `+`
- Subtraction: `-`
- Multiplication: `*`
- Division: `/`
- Modulo: `%`
- Negation: `-`
- Comparison:
- Equality: `==`
- Note we use single-equals for _assignment_.
- Non-equality: `!=`
- Greater than: `>`
- Less than: `<`
- Greater or equal: `>=`
- Less or equal: `<=`
- Inclusion: `IN ['item1', 'item2', ...]` or `IN 1..10`
- Concatenation: `+`
- For strings: `'one' + 'two'` is `onetwo`
- For arrays: `['one'] + ['two']` is `['one', 'two']`
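To illustrate several operators together (hypothetical columns on the `flows` table), note the explicit cast needed before concatenation:
<example>
FROM flows
| WHERE proto IN ['tcp', 'udp'] AND dest_port IN 1..1024
| LET label = 'port ' + (dest_port AS string)
</example>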
### Unioning multiple tables with FROM
Hamelin allows a shorthand in the `FROM` command that targets _many_ sources, separated by commas (the `,` character). In a system that offers search, authors often want to perform the same search (or filter) over rows that come from many different places. This is most common in "needle" queries, where you're looking for something relatively rare across many different data sets. This is not a common pattern in structured querying, which likely explains why SQL makes it so hard and why this form has no direct SQL translation.
For example, where table A defines only `f1` and `f2`, and table B defines only `f2` and `f3`, the following will work. It selects `f1` and `f2` from table A and `f2` and `f3` from table B, so the full set of fields in the output will be `f1`, `f2`, `f3`.
<example>
FROM A, B
</example>
You can also alias the fields based on which table they came from, for example if you need to determine which table a given `f2` value came from. In the following example, the output fields would be a struct `a` with sub-fields `f1` and `f2` and a struct `b` with sub-fields `f2` and `f3`. You can reference the sub-fields later in the query with `a.f1`, `b.f2`, etc. In addition, the fields are all still usable at the top level.
<example>
FROM a = A, b = B
</example>
Aliasing tables is also useful when aggregating multiple types of rows (often from `WITH` CTEs):
<example>
WITH network_connections = FROM simba.sysmon_events
| WHERE winlog.event_id == '3'
WITH suspicious_file_creation = FROM simba.sysmon_events
| WHERE winlog.event_id == '11'
FROM network_connections = network_connections, suspicious_file_creation = suspicious_file_creation
| AGG network_connections_count = count(network_connections), suspicious_file_creation_count = count(suspicious_file_creation)
BY host.name
</example>
### Functions
The following functions are supported in expressions (within `LET`, `SELECT`, `WHERE`, etc.):
- len(x[])
- sum(x[])
- avg(x[])
- min(x[])
- max(x[])
- array_distinct(x[])
- filter_null(x[])
- slice(x[], start, end) - returns a slice of the array x from start to end (inclusive). Supports negative indices to count from the end of the array.
- flatten(x[]) - flattens a nested array by one level.
- all(x[])
- any(x[])
- abs(x)
- cbrt(x)
- ceil(x)
- floor(x)
- degrees(x)
- e() - Euler's number
- exp(x)
- ln(x)
- log(b, x)
- log10(x)
- log2(x)
- pi()
- pow(x,p)
- power(x,p)
- radians(x)
- round(x)
- round(x,d)
- sign(x)
- sqrt(x)
- truncate(x) - Remove the fractional part of a number
- width_bucket(x,bound1,bound2,n) - the bucket number for a value in a histogram with equal-width buckets
- width_bucket(x, bins[]) - the bucket number for a value using explicitly defined bucket boundaries
- if(condition, then, else) - Both `then` and `else` must have the same type
- case(when_condition: then_expression, when_condition: then_expression, ...) - Evaluates multiple condition-value pairs in order and returns the value associated with the first condition that evaluates to true. Returns null if no conditions are true. Every `then_expression` must have the same type. There is no `else` in this expression, but you can achieve the same result by using `true` as the final condition, like `case(when_condition: then_expression, when_condition: then_expression, ..., true: else_expression)`.
- regexp_count(str, pattern)
- regexp_extract(str, pattern)
- regexp_extract(str, pattern, group)
- regexp_extract_all(str, pattern)
- regexp_extract_all(str, pattern, group)
- regexp_like(string, pattern)
- regexp_position(string, pattern, start)
- regexp_replace(string, pattern)
- regexp_replace(string, pattern, replacement)
- regexp_split(string, pattern)
- split(str, separator)
- array_join(x[], separator)
- replace(str, str_to_remove)
- starts_with(str, prefix)
- ends_with(str, suffix)
- contains(str, substring)
- len(str)
- lower(str)
- upper(str)
- parse_json(json_str) - Only takes a string, never pass a map or struct
- to_json_string(json_variant) - Only takes a variant. To pass other types, cast to variant with `AS variant`
- typeof(x)
- coalesce(x, y, ...) - returns the first non-null value from a list of expressions
- first(x)
- last(x)
- now()
- today()
- yesterday()
- tomorrow()
- ts(str)
- at_timezone(timestamp, timezone)
- year(timestamp)
- month(timestamp)
- day(timestamp)
- hour(timestamp)
- minute(timestamp)
- second(timestamp)
- from_unixtime_micros(x) - returns timestamp
- from_unixtime_millis(x) - returns timestamp
- from_unixtime_nanos(x) - returns timestamp
- from_unixtime_seconds(x) - returns timestamp
- from_millis(x) - returns interval
- from_nanos(x) - returns interval
- to_unixtime(timestamp) - returns double in seconds since the Unix epoch
- to_millis(interval)
- to_nanos(interval)
- map(keys, values)
- map(elements)
- map_keys(map)
- map_values(map)
- next(expression)
- cidr_contains(cidr, ip)
- is_ipv4(ip)
- is_ipv6(ip)
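A sketch combining a few of these scalar functions (the table and column names are hypothetical):
<example>
FROM web_logs
| LET host_part = regexp_extract(url, 'https?://([^/]+)', 1)
| WHERE ends_with(lower(host_part), '.example.com')
</example>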
The following functions are ONLY available within an AGG or WINDOW command (not within LET, SELECT, or WHERE commands):
- all(bool_expr) - returns true if bool_expr is true for all rows
- any(bool_expr) - returns true if bool_expr is true for any row
- any_value(x) - returns an arbitrary value from each group
- approx_percentile(x, percentile)
- array_agg(x)
- avg(x)
- count_distinct(x) - counts the number of distinct non-null values of x
- approx_distinct(x) - same as count_distinct(x) but uses an approximation algorithm that is faster but less accurate
- count_if(condition) - counts all rows where condition is true
- count(expr) - counts all rows where expr is non-null
- count() - counts all rows
- map_agg(key, value)
- max(x)
- min(x)
- multimap_agg(key, value)
- stddev(x)
- sum(x)
The following functions are ONLY available within a WINDOW command (not within LET, SELECT, WHERE, or AGG commands):
- first_value(expression)
- last_value(expression)
- nth_value(expression, n)
- cume_dist() - the number of rows with values less than or equal to the current row's value, divided by the total number of rows in the partition
- percent_rank() - the percentile rank of each row within the window partition
- dense_rank() - the rank of each row within a window partition without gaps
- lag(expression, offset, ignore_nulls) - the value of an expression from a previous row within the window
- lead(expression, offset, ignore_nulls) - the value of an expression from a subsequent row within the window
- rank() - the rank of each row within a window partition with gaps
- row_number() - a sequential row number for each row within a window partition
### Describing time ranges
Use `<low>..<high>` to describe a time range. Drop either `low` or `high` from the expression to leave that side unbounded. Interval values can be used, like `WITHIN 1hr..`. This time range syntax is used in the `timeRange` input of tools such as `execute-hamelin-query` and `query-page-run-query` to filter the result set to only the rows whose `timestamp` falls in that range. The time range syntax can also be used in various parts of a Hamelin query, such as the `WINDOW` command.
The `WITHIN` keyword is used to compare expressions to ranges. It can be used in two basic forms:
- `| WHERE val WITHIN low..high` allows `WITHIN` to be used as a binary boolean operator to test any sortable value against a range. This can be used anywhere boolean expressions are needed.
- `| WINDOW WITHIN low..high` defines the frame clause of a window expression, filtering only those rows which are within the interval defined in terms of the current (or output) row.
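For example, using the boolean form of `WITHIN` to keep only rows whose timestamp falls in a given range:
<example>
FROM A
| WHERE timestamp WITHIN ('2020-03-01' AS timestamp)..('2020-03-02' AS timestamp)
</example>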
Whenever providing timestamp literals, you must use the format `('<TIMESTAMP>' AS timestamp)`. For example, to describe noon on March 1st, 2020, use `('2020-03-01 12:00' AS timestamp)`. To select all rows from table A with a timestamp greater than noon on March 1st, 2020, you can use:
<example>
FROM A
| WHERE timestamp > ('2020-03-01 12:00' AS timestamp)
</example>
To select rows from the last hour, use `timeRange`:
<example>
-1h..
</example>
To select rows which were between two and one hour ago, use the `timeRange`:
<example>
-2h..-1h
</example>
To select rows which occurred during the past 7 days, use the `timeRange`:
<example>
-1w..
</example>
The `@` operator can also be used on a timestamp to round timestamps down to the nearest unit (day, hour, etc.).
To select rows in the previous 5 calendar days (not including today), use the `timeRange`:
<example>
-5d@d..now()@d
</example>
### Windows
- `WINDOW` is a command that behaves like a mix of `AGG` and `LET`. It has three optional arguments that come at the end, all of which define a window:
- `BY <column1, column2>` partitions every expression in the assignment list by these columns (e.g., `BY user` means compute the function in a per-user window).
- `SORT <column1, column2, ...>` sorts every expression in the assignment list by these columns (e.g., `SORT timestamp`).
- `WITHIN <range>` uses this range as the frame for every expression in the assignment list (e.g., `WITHIN -5m` means a 5 minute sliding window).
- Here, `WITHIN` uses nearly the same semantics as the `WITHIN` keyword described earlier.
- We introduce a new row interval literal with the suffix `r` or `rows` or `row` for defining ranges over exact numbers of rows rather than over ranges of values.
- All three arguments are optional, but the presence of any means _every_ expression in the assignment list is computed "over" this window.
For example, to do count in 5 minute windows in table tbl:
<example>
FROM tbl | WINDOW c = count() WITHIN -5m
</example>
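A row-based frame using the `r` suffix might look like this (a sketch; `c1` is a hypothetical column):
<example>
FROM tbl | WINDOW moving_avg = avg(c1) SORT timestamp WITHIN -10r
</example>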
#### Rate-Based Detection with WINDOW...WITHIN
Use `WINDOW ... WITHIN` for sliding window aggregations to detect bursts or spikes of activity.
**When to use:** Count/aggregate events within a moving time window (e.g., "10+ failed logins in the last 5 minutes")
**Key characteristics:**
- **Purpose**: Detect bursts/spikes of activity (rate-based detection)
- **Syntax**: `WINDOW ... BY grouping_keys WITHIN -duration`
- **Window moves with time**: Always looks at the most recent N minutes/hours
- **Use cases**: Rate limiting, burst detection, threshold alerting, brute force attacks
- **Duration format**: `-5min`, `-1h`, `-1d` (negative values look backward in time)
**Example: Brute Force Detection**
<example>
FROM simba.okta_events
| WHERE event.action == "user.session.start" AND event.outcome == "failure"
| WINDOW
failed_login_count = count(),
unique_ips = array_agg(source.ip)
BY user.name
WITHIN -5min
| WHERE failed_login_count >= 10
| SELECT
timestamp,
user.name,
event.count = failed_login_count,
source.ip_count = unique_ips
</example>
**Common WINDOW patterns:**
<example>
// Count events in sliding window
FROM dataset | WINDOW event_count = count() BY host WITHIN -10min
</example>
<example>
// Conditional counting
FROM dataset
| WINDOW
malicious_count = count_if(is_malicious),
total_count = count()
BY user.name
WITHIN -30min
</example>
### Event Correlation with MATCH
The `MATCH` command enables powerful multi-stage detection by correlating different event patterns. This is essential for detecting attack chains and reducing false positives.
**When to use MATCH:**
- Detecting multi-stage attacks (initial access → execution → persistence)
- Correlating different behavioral indicators
- Building high-fidelity detections that require multiple suspicious events
- Creating attack chain detections with context
**MATCH Pattern Structure:**
```hamelin
// Step 1: Define individual event patterns as WITH clauses
WITH pattern1 = FROM dataset
| WHERE <specific conditions>
| SELECT fields...
WITH pattern2 = FROM dataset
| WHERE <different conditions>
| SELECT fields...
// Step 2: Correlate patterns using MATCH (space-separated!)
MATCH p1=pattern1? p2=pattern2?
BY correlation_key // e.g., host, user, session_id
| LET indicator_count = (
if(p1 IS NOT NULL, 1, 0) +
if(p2 IS NOT NULL, 1, 0)
) AS int
| WHERE indicator_count >= 2 // Alert when 2+ patterns match
// Step 3: Aggregate and enrich
| AGG
timestamp = min(timestamp),
event.start = min(timestamp),
event.end = max(timestamp),
pattern1_event = array_agg(p1)[0],
pattern2_event = array_agg(p2)[0],
indicator_count = max(indicator_count)
BY match_number, correlation_key
// Step 4: Apply time window constraint
| WHERE event.end - event.start <= 10min
// Step 5: Build output with risk scoring
| LET risk_score = (
if(pattern1_event IS NOT NULL, 30, 0) +
if(pattern2_event IS NOT NULL, 30, 0)
) AS int
| SELECT timestamp, fields...
```
**Critical MATCH Syntax Rules:**
- Pattern aliases are **space-separated**, NOT comma-separated: `MATCH a=query1? b=query2? c=query3?`
- Use `?` quantifier to make patterns optional (allows partial matches)
- The `BY` clause specifies the correlation key (what ties events together)
- Access matched events in AGG using `array_agg(alias)[0]`
- Time window filtering: `WHERE event.end - event.start <= duration`
**Example: Multi-Stage Malware Detection**
<example>
// Define behavioral indicators
WITH lnk_powershell = FROM simba.sysmon_events
| WHERE event.code == "1"
| LET parent_image = coalesce(winlog.event_data["ParentImage"], '') AS string
| LET image = coalesce(winlog.event_data["Image"], '') AS string
| LET cmd_line = coalesce(winlog.event_data["CommandLine"], '') AS string
| LET host = host.name AS string
| WHERE regexp_like(lower(parent_image), '(?i).*\\\\explorer\\.exe')
AND regexp_like(lower(image), '(?i).*(powershell|pwsh)\\.exe')
AND regexp_like(cmd_line, '(?i).*(hidden|-enc|-encodedcommand).*')
| SELECT timestamp, host, process_image = image, process_commandline = cmd_line
WITH cloud_download = FROM simba.sysmon_events
| WHERE event.code == "1"
| LET image = coalesce(winlog.event_data["Image"], '') AS string
| LET cmd_line = coalesce(winlog.event_data["CommandLine"], '') AS string
| LET host = host.name AS string
| WHERE regexp_like(lower(image), '(?i).*(powershell|pwsh)\\.exe')
AND regexp_like(cmd_line, '(?i).*(github\\.com|dropbox\\.com).*')
AND regexp_like(cmd_line, '(?i).*(downloadstring|invoke-webrequest).*')
| SELECT timestamp, host, process_image = image, process_commandline = cmd_line
// Correlate: alert when 2+ indicators on same host within 10 min
MATCH lnk=lnk_powershell? dl=cloud_download?
BY host
| LET indicator_count = (
if(lnk IS NOT NULL, 1, 0) +
if(dl IS NOT NULL, 1, 0)
) AS int
| WHERE indicator_count >= 2
| AGG
timestamp = min(timestamp),
event.start = min(timestamp),
event.end = max(timestamp),
lnk_event = array_agg(lnk)[0],
dl_event = array_agg(dl)[0],
indicator_count = max(indicator_count)
BY match_number, host
| WHERE event.end - event.start <= 10min
| LET risk_score = (
if(lnk_event IS NOT NULL, 40, 0) +
if(dl_event IS NOT NULL, 40, 0)
) AS int
| SELECT
timestamp,
event.start,
event.end,
host,
event.count = indicator_count,
event.risk_score = risk_score
</example>
### UNNEST and EXPLODE
The UNNEST command lifts struct or array of struct fields into the parent or enclosing result set. When given a struct, it lifts struct fields into the parent struct without changing row cardinality. When given an array of struct, it performs an explode operation followed by unnesting, creating one row per array element with the struct fields lifted into the parent struct.
The EXPLODE command transforms rows containing array fields into multiple rows, with each element of the array becoming a separate row. Each array element becomes a new row with all other fields from the original row preserved in each generated output row. When you use assignment syntax (`identifier = expression`), the exploded values are placed in the specified field name. Without assignment syntax, the exploded values replace the original array field.
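As a sketch (assuming a hypothetical table `events` with a struct column `details` and an array column `source_ips`):
<example>
FROM events
| UNNEST details
</example>
<example>
FROM events
| EXPLODE ip = source_ips
</example>
The first query lifts the fields of `details` to the top level without changing row cardinality; the second produces one output row per element of `source_ips`, placing each element in the new `ip` field.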
### More valid examples
To find records in table foo where column c1 is greater than 42, one needs to write:
<example>
FROM foo | WHERE c1 > 42
</example>
To take unique (distinct) tuples of hostname and dest_port in table1, you can use AGG without aggregation functions:
<example>
FROM table1 | AGG BY hostname, dest_port
</example>
To get number of flows by protocol and destination_port, then select top 10 by count, one needs to write:
<example>
FROM flows
| AGG count = count() BY protocol, destination_port
| SORT count DESC
| LIMIT 10
</example>
To get number of flows by day, one needs to write:
<example>
FROM flows
| AGG count = count() BY timestamp@d
| SORT count DESC
| LIMIT 10
</example>
The resulting data will have two columns: count and timestamp (note that the "@d" part is removed in the output column name).
### Invalid usage examples
Here are some examples of how NOT to use Hamelin. It is important to memorize these bad examples and to verify, when producing a response to the prompt, that the output avoids these mistakes.
The following bad query mistakenly uses ORDER BY rather than SORT:
<bad_example>
FROM flows
| ORDER BY count() DESC
</bad_example>
After fixing (note that the aggregation must also be computed and named with `AGG` before it can be sorted on):
<example>
FROM flows
| AGG count = count()
| SORT count DESC
</example>
The following bad query calls the aggregation function again in SORT, even though the aggregation was already computed:
<bad_example>
FROM flows
| AGG count() BY user_name
| SORT count()
</bad_example>
After fixing:
<example>
FROM flows
| AGG count = count() BY user_name
| SORT count
</example>
The following bad query mistakenly uses a single equal sign for an equality test:
<bad_example>
FROM flows
| WHERE user = 'John Smith'
</bad_example>
After fixing:
<example>
FROM flows
| WHERE user == 'John Smith'
</example>
The following bad query mistakenly uses a star character in count:
<bad_example>
FROM flows
| AGG count = count(*) BY user
</bad_example>
After fixing:
<example>
FROM flows
| AGG count = count() BY user
</example>
The following bad query mixes two commands in one:
<bad_example>
FROM flows | SELECT dest_port, LET proto = 'tcp'
</bad_example>
After fixing:
<example>
FROM flows | SELECT dest_port | LET proto = 'tcp'
</example>
The following bad query incorrectly uses the AS operator for aliasing:
<bad_example>
FROM flows | AGG max(timestamp) AS max_timestamp
</bad_example>
After fixing:
<example>
FROM flows | AGG max_timestamp = max(timestamp)
</example>
The following bad query uses the LIKE operator, which doesn't exist in Hamelin:
<bad_example>
FROM flows | WHERE proto LIKE 'http%'
</bad_example>
Three different options to fix:
<example>
FROM flows | WHERE starts_with(proto, 'http')
</example>
<example>
FROM flows | WHERE contains(proto, 'http')
</example>
<example>
FROM flows | WHERE regexp_like(proto, '^http')
</example>
AGG does not support the WITHIN clause. Use WINDOW for rate-based detection with sliding windows.
<bad_example>
| AGG count() BY host WITHIN -5min
</bad_example>
After fixing:
<example>
| WINDOW count = count() BY host WITHIN -5min
</example>
MATCH patterns must be space-separated, not comma-separated.
<bad_example>
MATCH a=query1?, b=query2?, c=query3?
</bad_example>
After fixing:
<example>
MATCH a=query1? b=query2? c=query3?
</example>
### Key differences from SQL
Hamelin is not SQL but a unique query language! Carefully note the Hamelin rules when drafting a new query.
- The count function does not include a star. So it's `count()` and definitely NOT `count(*)`.
- There is no `CASE`/`WHEN` operator in Hamelin; it's implemented as the `case(when_condition: then_expression, when_condition: then_expression, ...)` function
- There's no `LIKE` operator. Instead, use `contains` to do simple substring searching or `regexp_like` to check if a string matches a regex pattern.
- String concatenation happens with `+` not `||`
- Double equals `==` is required for equality comparison. Single equals `=` is only used for assignment.
- `AS` is used for casting, never for assignment/aliasing. NEVER use it like `SELECT expr AS alias`. Only use it for casting like `SELECT expr AS type`. To do aliasing in a SELECT command, use `SELECT alias = expr` syntax.
- `SORT`, not `ORDER BY`
- There is no support for subqueries within expressions (`| WHERE name IN (FROM admins | SELECT name)` is not valid)
- Trailing commas are allowed (but not required).
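For instance, a SQL `CASE` expression translates into Hamelin's `case` function like this (hypothetical columns):
<example>
FROM flows
| LET proto_class = case(dest_port == 80: 'http', dest_port == 443: 'https', true: 'other')
</example>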
### Summary
Let me note a few important properties:
- Aggregation, when used, must be done using `AGG` keyword. The `AGG` command completely redefines the field space so only fields used/defined in the command will be available downstream. This means that if a field isn't used at all in the AGG command, you cannot reference it again in a subsequent command.
- Grouping, when used, must be done within a window or aggregation, i.e. there is no separate GROUP BY operation, but there is "AGG .... BY ...."
- Each query must start with `FROM <table>`, followed by other operations separated by pipe character ("|").
- Very important, pipe character ("|") must be always used to separate each operation. For example, this is invalid: <bad_example>FROM tbl WHERE x > 5</bad_example>. Instead, it should be: <example> FROM tbl | WHERE x > 5</example>. It’s extremely important, use pipe characters for separation!
- The only valid operation names are FROM, SELECT, WHERE, DROP, LET, WITH, WINDOW, MATCH, SORT, LIMIT, AGG, UNNEST, EXPLODE - each operation must start with one of these.
- When the prompt contains a SQL query on the input, you must convert it into a Hamelin query!!!
### Best Practices
- Don't use `SORT` and `LIMIT` commands unless the specific situation requires them or the user asks for it specifically. If you're adding `| SORT timestamp DESC` to the end of the query, that's almost always wrong.
- `SELECT` is useful to narrow the result set if you and the user know exactly what you're looking for, but `SELECT` is not required and should be omitted for most queries.
### Final Reminders
Now you should understand how Hamelin operates and how to write queries. When providing Hamelin queries, make sure you take into account syntax requirements of Hamelin and do logical operators correctly. Ensure that the order of operations as logically described in the question is followed. Ensure that the order of aggregating and filtering operations is correct. When asked for providing a Hamelin query, provide concise output with a correct Hamelin query.