hamelin_legacy 0.4.2

# MATCH Command Implementations

The MATCH command in Hamelin supports pattern matching over time-ordered event sequences. Two translation strategies are available:

1. **MATCH_RECOGNIZE** - Uses Trino's native `MATCH_RECOGNIZE` clause
2. **CTE** - Uses standard SQL CTEs with window functions and regex

Both strategies implement **ONE ROW PER MATCH** with **AFTER MATCH SKIP TO NEXT ROW** semantics, meaning:
- Only one row is returned per match (the row where the pattern starts)
- Overlapping matches are allowed (each row can start a new match)

## CTE Strategy: Forward-Looking Window Approach

The CTE strategy uses a forward-looking window to match MATCH_RECOGNIZE semantics exactly.

### Why Forward-Looking?

MATCH_RECOGNIZE processes events in timestamp order and looks **ahead** to find pattern matches. The pattern `A B+` starting at row 1 looks forward to find subsequent B events.

A backward-looking window (`PRECEDING...CURRENT ROW`) would fire at every B row in a sequence, producing too many matches. A forward-looking window (`CURRENT ROW...FOLLOWING`) only fires at the A row where the pattern starts.

### Generated SQL Structure

```sql
WITH __ordered_events AS (
    -- Add pattern labels to each row
    SELECT *, 
           IF(a IS NOT NULL, 'a', IF(b IS NOT NULL, 'b', NULL)) AS __pattern_label
    FROM <input>
),
__with_state AS (
    -- Build forward-looking state string
    SELECT *,
        ARRAY_JOIN(ARRAY_AGG(__pattern_label) OVER (
            ORDER BY timestamp ASC
            RANGE BETWEEN CURRENT ROW AND <within> FOLLOWING
        ), ',') AS __state
    FROM __ordered_events
    WHERE __pattern_label IS NOT NULL
)
SELECT <output_cols>
FROM __with_state
WHERE __pattern_label = '<first_var>'
  AND REGEXP_LIKE(__state, '^<pattern>')
ORDER BY timestamp
```

### Key Design Decisions

1. **Forward-looking window**: `RANGE BETWEEN CURRENT ROW AND <within> FOLLOWING`
   - Looks ahead from current row to find matches
   - Matches MATCH_RECOGNIZE's forward processing model

2. **Start anchor `^`**: The regex uses `^pattern` not `pattern$`
   - Ensures pattern starts at the current row
   - With forward-looking window, current row is first in the array

3. **Filter on starting variable**: `WHERE __pattern_label = '<first_var>'`
   - Only check rows where the pattern can start
   - Combined with REGEXP_LIKE for efficient filtering

4. **REGEXP_LIKE for matching**: Simple boolean check if pattern matches
   - No need to extract match length since we return the start row
   - The timestamp of the matching row IS the start of the pattern

### Example: Pattern A B+ Window Behavior

Given events ordered by timestamp (earliest first), with a `WITHIN 5m` window:

| Row | Timestamp | Label | Forward Window State |
|-----|-----------|-------|---------------------|
| 1 | 12:00:00 | a | a,b,b,a,b |
| 2 | 12:01:00 | b | b,b,a,b |
| 3 | 12:02:00 | b | b,a,b |
| 4 | 12:03:00 | a | a,b |
| 5 | 12:04:00 | b | b |

Each row's forward window state contains the labels of all events from the current row through the end of the WITHIN window. Row 1 at 12:00:00 looks ahead and sees all 5 events (all within 5 minutes). Row 4 at 12:03:00 only sees itself and row 5.

**Matching process** (pattern `^a,b(,b)*`):
- Row 1 (a): `__pattern_label = 'a'` YES, regex matches `a,b,b` in `a,b,b,a,b` → **MATCH** at 12:00:00
- Row 2 (b): `__pattern_label = 'a'` NO → filtered out
- Row 3 (b): `__pattern_label = 'a'` NO → filtered out
- Row 4 (a): `__pattern_label = 'a'` YES, regex matches `a,b` in `a,b` → **MATCH** at 12:03:00
- Row 5 (b): `__pattern_label = 'a'` NO → filtered out

**Result**: 2 matches at 12:00:00 and 12:03:00

### Comparison with Backward-Looking (Incorrect)

A backward-looking window (`PRECEDING...CURRENT ROW`) with end anchor `$`:

| Row | Timestamp | Label | Backward Window State |
|-----|-----------|-------|----------------------|
| 1 | 12:00:00 | a | a |
| 2 | 12:01:00 | b | a,b |
| 3 | 12:02:00 | b | a,b,b |
| 4 | 12:03:00 | a | a,b,b,a |
| 5 | 12:04:00 | b | a,b,b,a,b |

With pattern `a,b(,b)*$`:
- Row 2: matches (incorrectly fires early)
- Row 3: matches (incorrectly fires again)
- Row 5: matches

This produces **3 matches** instead of the correct **2 matches** because it fires at every B row rather than at the start of each greedy match sequence.

## Pattern Regex Generation

Patterns are converted to regex for matching comma-separated label strings:

| Pattern | Regex |
|---------|-------|
| A B | `^a,b` |
| A B+ | `^a,b(,b)*` |
| A B+ A | `^a,(b,)+a` |
| A B+ A B | `^a,(b,)+a,b` |
| B A{2} B | `^b,a,a,b` |
| B? A B | `^(b,)?a,b` |
| A B? | `^a(,b)?` |
| A B* | `^a(,b(,b)*)?` |
| A B? C? | `^a(,b)?(,c)?` |

**Note**: The same pattern variable can appear multiple times in a pattern sequence. For example, `A B+ A` means "event from A, one or more events from B, then another event from A".

The `^` anchor ensures the pattern starts at the current row (first element in forward window).

**Optional trailing elements**: When optional elements (`?` or `*`) appear at the end of a pattern, the comma is included inside the optional group. This handles cases where the optional element matches zero times (e.g., `A B?` matching just "a" without a trailing comma).

## AGG Clause

The `AGG` clause allows aggregating values across matched pattern elements. This is useful for computing summary statistics over the events that participated in a match.

### Syntax

```
MATCH <patterns> [WITHIN <duration>] [AGG <assignments>]
```

### Example

```
MATCH
    login AS event_type == "login"
    failed+ AS event_type == "failed_login"
WITHIN 5m
AGG
    failed_count = COUNT(failed.timestamp),
    first_failure = MIN(failed.timestamp),
    last_failure = MAX(failed.timestamp)
```

This matches a login event followed by one or more failed login attempts within 5 minutes, and computes:
- `failed_count`: Number of failed login attempts
- `first_failure`: Timestamp of the first failure
- `last_failure`: Timestamp of the last failure

### How It Works

In the CTE strategy, AGG aggregations are computed using window functions over the events that fall within the match window:

```sql
SELECT *,
    COUNT(IF(__pattern_label = 'failed', failed.timestamp, NULL)) OVER (
        ORDER BY timestamp ASC
        RANGE BETWEEN CURRENT ROW AND <within> FOLLOWING
    ) AS failed_count,
    ...
FROM __ordered_events
```

Each aggregation:
1. Filters to only consider rows matching the specified pattern variable
2. Applies the aggregate function (COUNT, MIN, MAX, SUM, etc.) as a window function
3. Uses the same forward-looking window as the pattern matching

This ensures the aggregated values reflect only the events that could participate in the match starting at each row.