# MATCH Command Implementations
The MATCH command in Hamelin supports pattern matching over time-ordered event sequences. Two translation strategies are available:
1. **MATCH_RECOGNIZE** - Uses Trino's native `MATCH_RECOGNIZE` clause
2. **CTE** - Uses standard SQL CTEs with window functions and regex
Both strategies implement **ONE ROW PER MATCH** with **AFTER MATCH SKIP TO NEXT ROW** semantics, meaning:
- Only one row is returned per match (the row where the pattern starts)
- Overlapping matches are allowed (each row can start a new match)
## CTE Strategy: Forward-Looking Window Approach
The CTE strategy uses a forward-looking window to match MATCH_RECOGNIZE semantics exactly.
### Why Forward-Looking?
MATCH_RECOGNIZE processes events in timestamp order and looks **ahead** to find pattern matches. The pattern `A B+` starting at row 1 looks forward to find subsequent B events.
A backward-looking window (`PRECEDING...CURRENT ROW`) would fire at every B row in a sequence, producing too many matches. A forward-looking window (`CURRENT ROW...FOLLOWING`) only fires at the A row where the pattern starts.
### Generated SQL Structure
```sql
WITH __ordered_events AS (
-- Add pattern labels to each row
SELECT *,
IF(a IS NOT NULL, 'a', IF(b IS NOT NULL, 'b', NULL)) AS __pattern_label
FROM <input>
),
__with_state AS (
-- Build forward-looking state string
SELECT *,
ARRAY_JOIN(ARRAY_AGG(__pattern_label) OVER (
ORDER BY timestamp ASC
RANGE BETWEEN CURRENT ROW AND <within> FOLLOWING
), ',') AS __state
FROM __ordered_events
WHERE __pattern_label IS NOT NULL
)
SELECT <output_cols>
FROM __with_state
WHERE __pattern_label = '<first_var>'
AND REGEXP_LIKE(__state, '^<pattern>')
ORDER BY timestamp
```
### Key Design Decisions
1. **Forward-looking window**: `RANGE BETWEEN CURRENT ROW AND <within> FOLLOWING`
- Looks ahead from current row to find matches
- Matches MATCH_RECOGNIZE's forward processing model
2. **Start anchor `^`**: The regex uses `^pattern` not `pattern$`
- Ensures pattern starts at the current row
- With forward-looking window, current row is first in the array
3. **Filter on starting variable**: `WHERE __pattern_label = '<first_var>'`
- Only check rows where the pattern can start
- Combined with REGEXP_LIKE for efficient filtering
4. **REGEXP_LIKE for matching**: Simple boolean check if pattern matches
- No need to extract match length since we return the start row
- The timestamp of the matching row IS the start of the pattern
### Example: Pattern A B+ Window Behavior
Given events ordered by timestamp (earliest first), with a `WITHIN 5m` window:
| 1 | 12:00:00 | a | a,b,b,a,b |
| 2 | 12:01:00 | b | b,b,a,b |
| 3 | 12:02:00 | b | b,a,b |
| 4 | 12:03:00 | a | a,b |
| 5 | 12:04:00 | b | b |
Each row's forward window state contains the labels of all events from the current row through the end of the WITHIN window. Row 1 at 12:00:00 looks ahead and sees all 5 events (all within 5 minutes). Row 4 at 12:03:00 only sees itself and row 5.
**Matching process** (pattern `^a,b(,b)*`):
- Row 1 (a): `__pattern_label = 'a'` YES, regex matches `a,b,b` in `a,b,b,a,b` → **MATCH** at 12:00:00
- Row 2 (b): `__pattern_label = 'a'` NO → filtered out
- Row 3 (b): `__pattern_label = 'a'` NO → filtered out
- Row 4 (a): `__pattern_label = 'a'` YES, regex matches `a,b` in `a,b` → **MATCH** at 12:03:00
- Row 5 (b): `__pattern_label = 'a'` NO → filtered out
**Result**: 2 matches at 12:00:00 and 12:03:00
### Comparison with Backward-Looking (Incorrect)
A backward-looking window (`PRECEDING...CURRENT ROW`) with end anchor `$`:
| 1 | 12:00:00 | a | a |
| 2 | 12:01:00 | b | a,b |
| 3 | 12:02:00 | b | a,b,b |
| 4 | 12:03:00 | a | a,b,b,a |
| 5 | 12:04:00 | b | a,b,b,a,b |
With pattern `a,b(,b)*$`:
- Row 2: matches (incorrectly fires early)
- Row 3: matches (incorrectly fires again)
- Row 5: matches
This produces **3 matches** instead of the correct **2 matches** because it fires at every B row rather than at the start of each greedy match sequence.
## Pattern Regex Generation
Patterns are converted to regex for matching comma-separated label strings:
| A B | `^a,b` |
| A B+ | `^a,b(,b)*` |
| A B+ A | `^a,(b,)+a` |
| A B+ A B | `^a,(b,)+a,b` |
| B A{2} B | `^b,a,a,b` |
| B? A B | `^(b,)?a,b` |
| A B? | `^a(,b)?` |
| A B* | `^a(,b(,b)*)?` |
| A B? C? | `^a(,b)?(,c)?` |
**Note**: The same pattern variable can appear multiple times in a pattern sequence. For example, `A B+ A` means "event from A, one or more events from B, then another event from A".
The `^` anchor ensures the pattern starts at the current row (first element in forward window).
**Optional trailing elements**: When optional elements (`?` or `*`) appear at the end of a pattern, the comma is included inside the optional group. This handles cases where the optional element matches zero times (e.g., `A B?` matching just "a" without a trailing comma).
## AGG Clause
The `AGG` clause allows aggregating values across matched pattern elements. This is useful for computing summary statistics over the events that participated in a match.
### Syntax
```
MATCH <patterns> [WITHIN <duration>] [AGG <assignments>]
```
### Example
```
MATCH
login AS event_type == "login"
failed+ AS event_type == "failed_login"
WITHIN 5m
AGG
failed_count = COUNT(failed.timestamp),
first_failure = MIN(failed.timestamp),
last_failure = MAX(failed.timestamp)
```
This matches a login event followed by one or more failed login attempts within 5 minutes, and computes:
- `failed_count`: Number of failed login attempts
- `first_failure`: Timestamp of the first failure
- `last_failure`: Timestamp of the last failure
### How It Works
In the CTE strategy, AGG aggregations are computed using window functions over the events that fall within the match window:
```sql
SELECT *,
COUNT(IF(__pattern_label = 'failed', failed.timestamp, NULL)) OVER (
ORDER BY timestamp ASC
RANGE BETWEEN CURRENT ROW AND <within> FOLLOWING
) AS failed_count,
...
FROM __ordered_events
```
Each aggregation:
1. Filters to only consider rows matching the specified pattern variable
2. Applies the aggregate function (COUNT, MIN, MAX, SUM, etc.) as a window function
3. Uses the same forward-looking window as the pattern matching
This ensures the aggregated values reflect only the events that could participate in the match starting at each row.