rsigma-eval 0.12.0

# rsigma-eval

[![CI](https://github.com/timescale/rsigma/actions/workflows/ci.yml/badge.svg)](https://github.com/timescale/rsigma/actions/workflows/ci.yml)

`rsigma-eval` is an evaluator for [Sigma](https://github.com/SigmaHQ/sigma) detection rules. It compiles Sigma rules into optimized in-memory matchers and evaluates them against JSON events. Rules are compiled once; evaluation is zero-allocation on the hot path.

This library is part of [rsigma].

## Public API

### Engine

| Method | Description |
|--------|-------------|
| `Engine::new()` | Create an empty engine |
| `Engine::new_with_pipeline(pipeline)` | Create an engine with an initial pipeline |
| `set_include_event(include: bool)` | Global override: include full event JSON in all match results |
| `add_pipeline(pipeline)` | Add a pipeline (sorted by `priority` after add) |
| `add_rule(rule: &SigmaRule)` | Apply pipelines, compile, and incrementally fold into engine indexes (amortized O(1) per rule) |
| `add_rules(rules)` | Batched add of many rules; returns per-rule compile errors as `(index, error)` pairs and rebuilds engine indexes once |
| `add_collection(collection: &SigmaCollection)` | Add all rules, then apply all filters |
| `add_collection_with_pipelines(collection, pipelines)` | Temporarily replace pipelines, add collection, restore |
| `add_compiled_rule(rule: CompiledRule)` | Add a pre-compiled rule directly; folds into engine indexes incrementally (amortized O(1) per rule) |
| `extend_compiled_rules(rules)` | Batched add of pre-compiled rules; rebuilds engine indexes once |
| `apply_filter(filter: &FilterRule)` | Inject filter as `AND NOT` into referenced rules |
| `evaluate(event: &Event)` | Evaluate all rules against an event |
| `evaluate_with_logsource(event, logsource)` | Evaluate with logsource-based pre-filtering |
| `evaluate_batch(events: &[&Event])` | Evaluate all rules against multiple events (parallel with `parallel` feature) |
| `set_bloom_prefilter(enabled: bool)` | Enable opt-in bloom-filter pre-filtering of positive substring matchers (off by default; see [Bloom Pre-Filter](#bloom-pre-filter-opt-in)) |
| `bloom_prefilter_enabled()` | Whether bloom pre-filtering is currently enabled |
| `set_bloom_max_bytes(max_bytes: usize)` | Override the bloom memory budget (default: 1 MB) |
| `bloom_max_bytes()` | Configured bloom memory budget, or `None` if using the default |
| `set_cross_rule_ac(enabled: bool)` | Enable opt-in cross-rule Aho-Corasick pre-filter (requires `daachorse-index` feature; see [Cross-Rule AC Index](#cross-rule-aho-corasick-index-opt-in-feature-gated)) |
| `cross_rule_ac_enabled()` | Whether the cross-rule AC pre-filter is currently enabled (`daachorse-index` only) |
| `rule_count()` | Number of loaded rules |
| `rules()` | Access the compiled rules slice |

### Correlation Engine

| Method | Description |
|--------|-------------|
| `CorrelationEngine::new(config)` | Create with a `CorrelationConfig` |
| `set_include_event(include: bool)` | Global override for event inclusion |
| `add_collection(collection)` | Add rules and correlations |
| `add_rule(rule: &SigmaRule)` | Add a single detection rule |
| `add_correlation(corr: &CorrelationRule)` | Add a single correlation rule |
| `process_event(event: &Event)` | Evaluate + update correlation state (wall-clock time) |
| `process_event_at(event, timestamp_secs)` | Evaluate + update state with explicit timestamp |
| `evaluate(event: &Event)` | Run detection only (no correlation state update) |
| `process_with_detections(event, detections, ts)` | Feed pre-computed detections into correlation state |
| `process_batch(events: &[&Event])` | Parallel detection + sequential correlation for a batch of events |
| `evict_expired(now)` | Manually evict expired state entries |
| `state_count()` | Number of active correlation state entries |
| `event_buffer_count()` | Total events stored across all buffers |
| `event_buffer_bytes()` | Total bytes of compressed event data |
| `set_bloom_prefilter(enabled: bool)` | Forward to the inner detection engine (off by default) |
| `set_bloom_max_bytes(max_bytes: usize)` | Forward to the inner detection engine |
| `set_cross_rule_ac(enabled: bool)` | Forward to the inner detection engine (requires `daachorse-index` feature) |

### Compilation

| Function | Description |
|----------|-------------|
| `compile_rule(rule: &SigmaRule)` | Compile a parsed rule into a `CompiledRule` |
| `compile_detection(detection: &Detection)` | Compile a detection tree |
| `evaluate_rule(rule: &CompiledRule, event: &Event)` | Evaluate one compiled rule |
| `eval_condition(expr, detections, event, matched)` | Evaluate a condition expression tree |

### Pipeline

| Function | Description |
|----------|-------------|
| `parse_pipeline(yaml: &str)` | Parse a pipeline from a YAML string |
| `parse_pipeline_file(path: &Path)` | Parse a pipeline from a YAML file |
| `apply_pipelines(pipelines, rule)` | Apply all pipelines to a rule in priority order |
| `apply_pipelines_with_state(pipelines, rule)` | Apply pipelines and return the merged `PipelineState` (for backends) |
| `merge_pipelines(pipelines)` | Merge multiple pipelines into one (sorted by priority) |

## Detection Engine

- **Compiled matchers**: optimized matching for all 30 modifier combinations — exact, contains, startswith, endswith, regex, CIDR, numeric comparison, base64 offset (3 alignment variants), windash expansion (5 replacement characters), field references, placeholder expansion, timestamp part extraction
- **Logsource routing**: optional pre-filtering by `category`/`product`/`service` to reduce the number of rules evaluated per event
- **Condition tree evaluation**: short-circuit boolean logic, selector patterns with quantifiers (`1 of selection_*`, `all of them`)
- **Filter application**: runtime injection of filter rules as `AND NOT` conditions on referenced rules

### Compilation Pipeline

1. **Rule compilation** (`compile_rule`): For each named detection, call `compile_detection`. Reads `rsigma.include_event` from `custom_attributes`.
2. **Detection compilation** (`compile_detection`):
   - `AllOf` → compile each item, reject empty.
   - `AnyOf` → recursively compile each sub-detection, reject empty.
   - `Keywords` → compile each value as case-insensitive contains, combine with `AnyOf`.
3. **Value compilation** (`compile_value`): Handles modifiers in this order: `|expand` → timestamp part → `|fieldref` → `|re` → `|cidr` → numeric comparison → `|neq` → string modifiers. String modifiers: `|wide`/`|utf16le` → `|utf16be` → `|utf16` → `|base64` → `|base64offset` → `|windash` → string match.

### Compiled Matcher Types

| Matcher | Modifier | Notes |
|---------|----------|-------|
| `Exact` | (default) | Case-insensitive by default; `\|cased` makes it sensitive |
| `Contains` | `\|contains` | Substring match |
| `StartsWith` | `\|startswith` | Prefix match |
| `EndsWith` | `\|endswith` | Suffix match |
| `Regex` | `\|re` | `\|i` adds `(?i)`, `\|m` adds multiline, `\|s` adds dotall |
| `Cidr` | `\|cidr` | IP network matching via `IpNet` |
| `NumericEq/Gt/Gte/Lt/Lte` | `\|gt`, `\|gte`, etc. | f64 comparison |
| `Exists` | `\|exists` | Accepts `true`/`yes`/`false`/`no` as values |
| `FieldRef` | `\|fieldref` | Compares against another field's value |
| `Null` | — | Matches null or missing values |
| `BoolEq` | — | Boolean equality |
| `Expand` | `\|expand` | Placeholder template expansion |
| `TimestampPart` | `\|minute`, `\|hour`, `\|day`, `\|week`, `\|month`, `\|year` | Extract timestamp component, match inner value |
| `Not` | `\|neq` | Wraps inner matcher with negation |
| `AnyOf` / `AllOf` | — | Multiple values combined (OR / AND with `\|all`) |

### Value Coercion

- **Arrays**: string matchers use OR semantics (`any element matches`).
- **Numbers**: coerced to string for string matchers.
- **Booleans**: `"true"`, `"1"`, `"yes"` → true; `"false"`, `"0"`, `"no"` → false.

### Filter Rule Behavior

- Filters match by `rule.id` or `rule.title` (from `filter.rules`).
- If the filter has a `logsource`, the rule must be compatible (symmetric check).
- Empty `filter.rules` applies the filter to all rules.
- Filter detections are added as `__filter_{counter}_{name}` (counter prevents key collisions when multiple filters share detection names); the condition is wrapped as `original AND NOT filter`.

### Selector Pattern Matching

- `*` — matches any detection name.
- `selection_*` — prefix match.
- `*_filter` — suffix match.
- `exact` — exact match.
- `them` — matches all names except those starting with `_`.
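
The rules above can be sketched as a small matching function. This is an illustrative sketch only, not the crate's code: `selector_matches` is a hypothetical name, and it assumes a pattern carries at most one `*`, at its start or end.

```rust
/// Hypothetical helper: does a selector pattern cover a detection name?
/// Assumes at most one `*`, placed at the start or end of the pattern.
fn selector_matches(pattern: &str, name: &str) -> bool {
    if pattern == "them" {
        // `them` covers every detection except names starting with `_`.
        return !name.starts_with('_');
    }
    if pattern == "*" {
        return true;
    }
    if let Some(prefix) = pattern.strip_suffix('*') {
        // `selection_*` style: prefix match.
        return name.starts_with(prefix);
    }
    if let Some(suffix) = pattern.strip_prefix('*') {
        // `*_filter` style: suffix match.
        return name.ends_with(suffix);
    }
    // No wildcard: exact match.
    pattern == name
}
```

Checking `them` first keeps it from being treated as an ordinary detection name.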

## Event Model

The `Event` wrapper provides flexible field access over `serde_json::Value`:

- **Flat-key precedence**: `"actor.user.name"` as a literal top-level key takes priority over nested traversal.
- **Dot-notation**: if no flat key matches and the path contains `.`, split and traverse nested objects.
- **Array traversal**: arrays are searched with OR semantics (first matching element wins).
- **Keyword detection**: `matches_keyword` searches all string values across all fields recursively.
- **Max nesting depth**: recursive traversal stops at depth **64** (`MAX_NESTING_DEPTH`).
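
A minimal sketch of the lookup order (flat key first, then dot-notation traversal) may help. It uses a tiny stand-in `Val` enum instead of `serde_json::Value` so that it stays dependency-free. `Val` and `get_field` are hypothetical names, and array traversal and the depth limit are left out.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a JSON value (the crate wraps `serde_json::Value`).
#[derive(Debug, PartialEq)]
enum Val {
    Str(String),
    Obj(HashMap<String, Val>),
}

/// Look up `path`: a literal top-level key wins, otherwise walk nested objects.
fn get_field<'a>(root: &'a HashMap<String, Val>, path: &str) -> Option<&'a Val> {
    // 1. Flat-key precedence: "a.b" stored as a literal key takes priority.
    if let Some(v) = root.get(path) {
        return Some(v);
    }
    // 2. Dot-notation is only attempted when the path contains '.'.
    if !path.contains('.') {
        return None;
    }
    let mut parts = path.split('.');
    let mut current = root.get(parts.next()?)?;
    for part in parts {
        match current {
            Val::Obj(map) => current = map.get(part)?,
            _ => return None, // cannot descend into a non-object
        }
    }
    Some(current)
}
```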

## Correlation Engine

Stateful processing with sliding time windows, group-by aggregation, and all 8 correlation types.

### CorrelationConfig

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `timestamp_fields` | `Vec<String>` | `["@timestamp", "timestamp", "EventTime", "TimeCreated", "eventTime"]` | Field names to try for timestamp extraction, in priority order |
| `timestamp_fallback` | `TimestampFallback` | `WallClock` | `WallClock` (use `Utc::now()`) or `Skip` (skip event from correlation) |
| `max_state_entries` | `usize` | `100,000` | Hard cap across all correlations and group keys |
| `suppress` | `Option<u64>` | `None` | Default suppression window in seconds |
| `action_on_match` | `CorrelationAction` | `Alert` | `Alert` (keep state) or `Reset` (clear window state) |
| `emit_detections` | `bool` | `true` | Whether to emit detection-level matches for correlation-only rules |
| `correlation_event_mode` | `CorrelationEventMode` | `None` | `None`, `Full` (deflate-compressed), or `Refs` (timestamp + ID) |
| `max_correlation_events` | `usize` | `10` | Max events stored per `(correlation, group_key)` window |

### Core Features

- **Group-by partitioning**: composite keys with field aliasing across referenced rules
- **Correlation chaining**: correlation results propagate to higher-level correlations (max depth: **10**, `MAX_CHAIN_DEPTH`)
- **Extended temporal conditions**: boolean expressions over rule references (e.g. `rule_a and rule_b and not rule_c`)
- **Cycle detection**: DFS-based validation of the correlation reference graph at load time
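
To illustrate the load-time cycle check, here is a generic DFS sketch over a reference graph (correlation name to the names it references). It shows the general technique, not the crate's implementation; `has_cycle` and `Mark` are hypothetical names.

```rust
use std::collections::HashMap;

/// DFS colouring: a node is either on the current path or fully explored.
#[derive(Clone, Copy)]
enum Mark {
    Visiting,
    Done,
}

/// Returns true if the reference graph contains a cycle.
fn has_cycle<'a>(graph: &HashMap<&'a str, Vec<&'a str>>) -> bool {
    fn visit<'a>(
        node: &'a str,
        graph: &HashMap<&'a str, Vec<&'a str>>,
        marks: &mut HashMap<&'a str, Mark>,
    ) -> bool {
        match marks.get(node) {
            Some(Mark::Visiting) => return true, // back edge: cycle found
            Some(Mark::Done) => return false,    // already explored, no cycle here
            None => {}
        }
        marks.insert(node, Mark::Visiting);
        if let Some(refs) = graph.get(node) {
            for &next in refs {
                if visit(next, graph, marks) {
                    return true;
                }
            }
        }
        marks.insert(node, Mark::Done);
        false
    }

    let mut marks = HashMap::new();
    graph.keys().any(|&node| visit(node, graph, &mut marks))
}
```

Names that appear only as targets (plain detection rules, for example) have no outgoing edges and are treated as leaves.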

### Alert Management

- **Suppression**: per-correlation or global suppression windows prevent alert floods. Once a `(correlation, group_key)` pair fires, re-alerts for that pair are suppressed for the configured duration
- **Action on match** (`action_on_match`): `Alert` (keep state, re-fire on next match) or `Reset` (clear window state, require fresh threshold)
- **Generate flag**: Sigma-standard `generate` support — suppress detection output for correlation-only rules
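
The suppression behaviour can be pictured with a small sketch that remembers when each `(correlation, group_key)` pair last fired. `Suppressor` and `should_alert` are hypothetical names. Whether an alert exactly at the window boundary is suppressed is an assumption here, not something this README specifies.

```rust
use std::collections::HashMap;

/// Hypothetical tracker of the last firing time per (correlation, group key).
struct Suppressor {
    window_secs: i64,
    last_fired: HashMap<(String, String), i64>,
}

impl Suppressor {
    fn new(window_secs: i64) -> Self {
        Self { window_secs, last_fired: HashMap::new() }
    }

    /// Returns true if an alert may be emitted at `now` (seconds), and records it.
    fn should_alert(&mut self, correlation: &str, group_key: &str, now: i64) -> bool {
        let key = (correlation.to_string(), group_key.to_string());
        let suppressed = self
            .last_fired
            .get(&key)
            .map_or(false, |&last| now - last < self.window_secs);
        if suppressed {
            return false;
        }
        // Not suppressed: remember this firing so later alerts are held back.
        self.last_fired.insert(key, now);
        true
    }
}
```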

### Event Inclusion

- **Full mode**: contributing events stored as individually deflate-compressed blobs (compression level 1, 3-5x memory savings on typical JSON)
- **Refs mode**: lightweight references (timestamp + optional event ID) at ~40 bytes per event
- **Event ID extraction** (Refs mode): tries fields in order: `id`, `_id`, `event_id`, `EventRecordID`, `event.id`
- **Configurable cap**: `max_correlation_events` bounds memory per window
- **Zero cost when disabled**: buffers are not allocated unless mode is `Full` or `Refs`
- **Per-correlation override**: set `rsigma.correlation_event_mode` via `custom_attributes` in YAML

### Memory Management

- **Max state entries**: configurable hard cap (default: 100,000) across all correlations and group keys
- **Time-based eviction**: entries outside their correlation window are evicted automatically
- **Hard-cap eviction**: when over the limit, entries are evicted until 90% of the cap is reached (the stalest 10% are dropped in bulk to avoid evicting on every event)
- **Stale alert cleanup**: expired suppression entries are garbage-collected

### Timestamp Extraction

- **Field priority list**: configurable ordered list of fields to try (default: `@timestamp`, `timestamp`, `EventTime`, `TimeCreated`, `eventTime`)
- **Format support**: RFC 3339, `%Y-%m-%dT%H:%M:%S`, `%Y-%m-%dT%H:%M:%S%.f`, `%Y-%m-%d %H:%M:%S`, epoch seconds, epoch milliseconds (auto-detected if value > 10^12)
- **Fallback policy**: `WallClock` (use `Utc::now()`, good for real-time streaming) or `Skip` (skip event from correlation, recommended for batch/replay)
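
The epoch auto-detection rule is simple enough to show directly. This sketch assumes integer input and truncating division; `epoch_to_secs` and `EPOCH_MILLIS_THRESHOLD` are hypothetical names.

```rust
/// Values above this are treated as epoch milliseconds (10^12).
const EPOCH_MILLIS_THRESHOLD: i64 = 1_000_000_000_000;

/// Normalise a numeric timestamp to epoch seconds.
fn epoch_to_secs(value: i64) -> i64 {
    if value > EPOCH_MILLIS_THRESHOLD {
        value / 1000 // milliseconds to seconds (sub-second part dropped)
    } else {
        value // already seconds
    }
}
```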

### Value Percentile

`value_percentile` uses linear interpolation (C=1 method). The condition threshold represents the percentile rank (0-100), clamped to that range.
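
As a worked illustration, the C=1 linear-interpolation method places rank `p` at position `p/100 * (N - 1)` in the sorted values and interpolates between the two neighbours. The function below is a sketch under that definition; `percentile` is a hypothetical name, not the crate's internal function.

```rust
/// Percentile by linear interpolation (C=1 variant).
/// `rank` is clamped to 0..=100; returns `None` for an empty input.
fn percentile(values: &[f64], rank: f64) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let p = rank.clamp(0.0, 100.0) / 100.0;
    // Fractional position in the sorted data (0-based).
    let pos = p * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    let frac = pos - lo as f64;
    Some(sorted[lo] + frac * (sorted[hi] - sorted[lo]))
}
```

For `[1.0, 2.0, 3.0, 4.0]` and rank 50, the position is 1.5, so the result lies halfway between 2.0 and 3.0.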

## Output Types

### MatchResult

| Field | Type | Description |
|-------|------|-------------|
| `rule_title` | `String` | Rule title |
| `rule_id` | `Option<String>` | Rule UUID |
| `level` | `Option<Level>` | Severity level |
| `tags` | `Vec<String>` | Tags |
| `matched_selections` | `Vec<String>` | Detection names that matched |
| `matched_fields` | `Vec<FieldMatch>` | Field/value pairs that contributed to the match |
| `event` | `Option<Value>` | Full event JSON when `include_event` is enabled |

### FieldMatch

| Field | Type |
|-------|------|
| `field` | `String` |
| `value` | `serde_json::Value` |

### CorrelationResult

| Field | Type | Description |
|-------|------|-------------|
| `rule_title` | `String` | Correlation rule title |
| `rule_id` | `Option<String>` | Rule UUID |
| `level` | `Option<Level>` | Severity level |
| `tags` | `Vec<String>` | Tags |
| `correlation_type` | `CorrelationType` | e.g. `event_count`, `temporal` |
| `group_key` | `Vec<(String, String)>` | Group-by field/value pairs |
| `aggregated_value` | `f64` | Computed aggregate (count, sum, avg, percentile, median) |
| `timespan_secs` | `u64` | Correlation window duration |
| `events` | `Option<Vec<Value>>` | Contributing events (Full mode) |
| `event_refs` | `Option<Vec<EventRef>>` | Event references (Refs mode) |

### EventRef

| Field | Type |
|-------|------|
| `timestamp` | `i64` |
| `id` | `Option<String>` |

## Processing Pipelines

pySigma-compatible pipeline system for field mapping, logsource transformation, and backend-specific configuration. Supports multi-pipeline chaining with priority ordering.

### Pipeline Chaining

- **Priority**: `Pipeline.priority` (default `0`); lower runs first.
- **Sorting**: pipelines are sorted by `priority` on add.
- **State isolation**: each pipeline gets its own `PipelineState`; state is not shared across pipelines. Use `apply_pipelines_with_state()` to collect merged state for backends.

### Transformation Item Fields

Each transformation item in a pipeline can have:

| Field | Description |
|-------|-------------|
| `id` | Identifier for `processing_item_applied` conditions |
| `rule_conditions` | All must match (AND logic) for the transformation to apply |
| `rule_cond_expression` | Logical expression over rule condition IDs (alternative to `rule_conditions`) |
| `detection_item_conditions` | Conditions on individual detection items |
| `field_name_conditions` | Conditions on field names |
| `field_name_cond_not` | Negate field name conditions |

### Transformations (26 types)

| Type | Fields | Description |
|------|--------|-------------|
| `field_name_mapping` | `mapping: {k: v \| [v1, v2, ...]}` | Rename fields via a mapping dict; list values expand the matched detection item into an OR over the alternatives (one-to-many, pySigma-compatible) |
| `field_name_prefix_mapping` | `mapping: {prefix: replacement}` | Rename fields matching a prefix |
| `field_name_prefix` | `prefix` | Add a prefix to all field names |
| `field_name_suffix` | `suffix` | Add a suffix to all field names |
| `field_name_transform` | `transform_func`, `mapping` | Case transformation (see below) |
| `drop_detection_item` | — | Remove matching detection items |
| `add_condition` | `conditions: {k: v}`, `negated` (default: `false`) | Inject additional detection conditions |
| `change_logsource` | `category`, `product`, `service` | Modify logsource fields |
| `replace_string` | `regex`, `replacement`, `skip_special` (default: `false`) | Regex-based string replacement (`skip_special` preserves wildcards) |
| `map_string` | `mapping: {k: v \| [v1, v2]}` | Map string values to replacements (supports one-to-many) |
| `set_value` | `value` | Replace detection item values |
| `convert_type` | `target_type` (`str`/`int`/`float`/`bool`, default: `str`) | Convert values between types |
| `value_placeholders` | — | Expand `%placeholder%` in values |
| `wildcard_placeholders` | — | Expand placeholders to wildcards |
| `query_expression_placeholders` | `expression` (default: `""`) | Backend query placeholders (no-op in eval) |
| `set_state` | `key`, `value` | Store key-value pairs in pipeline state |
| `rule_failure` | `message` (default: `"rule failure"`) | Raise an error for matching rules |
| `detection_item_failure` | `message` (default: `"detection item failure"`) | Raise an error for matching detection items |
| `hashes_fields` | `valid_hash_algos`, `field_prefix` (default: `"File"`), `drop_algo_prefix` (default: `false`) | Transform hash field names |
| `add_field` | `field` | Add a new detection item with a fixed value |
| `remove_field` | `field` | Remove a field from detection items |
| `set_field` | `fields: [...]` | Rename the field of a detection item |
| `set_custom_attribute` | `attribute`, `value` | Set key-value attributes on rules |
| `case_transformation` | `case_type` / `case` (`lower`/`upper`/`snake_case`) | Transform case of field values |
| `nest` | `items` or `transformations` | Apply a group of transformations conditionally |
| `regex` | — | Regex transformation (no-op in eval) |

**Aliases**: `case` is accepted as an alias for `case_transformation`.

#### `field_name_transform` Functions

| Value | Behavior |
|-------|----------|
| `lower` / `lowercase` | `to_lowercase` |
| `upper` / `uppercase` | `to_uppercase` |
| `title` | Capitalize each word, join with `_` (e.g. `hello_world` → `Hello_World`) |
| `snake_case` | camelCase → snake_case |

### Conditions (3 levels)

#### Rule Conditions

| Type | Fields |
|------|--------|
| `logsource` | `category`, `product`, `service` |
| `contains_detection_item` | `field`, `value` (optional) |
| `processing_item_applied` | `processing_item_id` |
| `processing_state` | `key`, `val` |
| `is_sigma_rule` | — |
| `is_sigma_correlation_rule` | — |
| `rule_attribute` | `attribute` (`level`/`status`/`author`/`title`/`id`/`date`/`description`), `value` |
| `tag` | `tag` |

#### Detection Item Conditions

| Type | Fields |
|------|--------|
| `match_string` | `pattern` (default: `".*"`), `negate` (default: `false`) |
| `is_null` | `negate` |
| `processing_item_applied` | `processing_item_id` |
| `processing_state` | `key`, `val` |

#### Field Name Conditions

| Type | Fields |
|------|--------|
| `include_fields` | `fields`, `match_type` (`plain` or `regex`, default: `plain`) |
| `exclude_fields` | `fields`, `match_type` |
| `processing_item_applied` | `processing_item_id` |
| `processing_state` | `key`, `val` |

### Finalizers (3 types)

| Type | Fields | Defaults |
|------|--------|----------|
| `concat` | `separator`, `prefix`, `suffix` | `" "`, `""`, `""` |
| `json` | `indent` | — |
| `template` | `template` | `""` |

Finalizers are stored in the pipeline and not executed in eval mode. Each finalizer has an `apply()` method used by `rsigma-convert` backends to transform a `Vec<String>` of queries into a single output string.

## Custom Attributes (`rsigma.*`)

Pipeline transformations can configure engine behavior via `SetCustomAttribute`, following the same pattern as pySigma backends (e.g. [pySigma-backend-loki](https://github.com/grafana/pySigma-backend-loki)):

| Attribute | Effect | CLI equivalent | Scope |
|-----------|--------|----------------|-------|
| `rsigma.timestamp_field` | Prepends a field name to the timestamp extraction priority list | `--timestamp-field` | Engine |
| `rsigma.suppress` | Sets the suppression window (e.g. `5m`) | `--suppress` | Engine + per-correlation |
| `rsigma.action` | Sets the post-fire action (`alert` or `reset`) | `--action` | Engine + per-correlation |
| `rsigma.include_event` | Embeds the full event JSON in detection output | `--include-event` | Per-rule |
| `rsigma.correlation_event_mode` | Sets event inclusion mode (`full` or `refs`) | `--correlation-event-mode` | Per-correlation |
| `rsigma.max_correlation_events` | Caps stored events per correlation window (integer) | `--max-correlation-events` | Per-correlation |

CLI flags and the library API always take precedence over pipeline attributes. Engine-level attributes (`timestamp_field`, `suppress`, `action`) are only applied when the CLI did not already set the corresponding flag. Per-correlation attributes override engine defaults for individual correlation rules.

```yaml
# Example pipeline with custom attributes
transformations:
  - type: set_custom_attribute
    attribute: rsigma.timestamp_field
    value: time
  - type: set_custom_attribute
    attribute: rsigma.suppress
    value: 5m
```

## Bloom Pre-Filter (Opt-In)

The engine can build a per-field bloom filter at rule-load time over every positive substring needle (`Contains` / `StartsWith` / `EndsWith` / `AhoCorasickSet`). When enabled, `Engine::evaluate` short-circuits any positive substring detection item whose field value cannot possibly contain a needle trigram, skipping the matcher entirely.

**Off by default.** The per-event probe (trigram extraction + double hashing) costs ~1 µs on a typical CommandLine field. On rule sets where most events overlap with at least one needle, the probe is pure overhead. The bloom only pays off on substring-heavy rule sets paired with mostly-non-matching events (e.g. high-volume telemetry against an active threat-intel ruleset).

```rust
let mut engine = Engine::new();
engine.set_bloom_prefilter(true);
// Optional: tighten or relax the 1 MB default budget.
engine.set_bloom_max_bytes(2 * 1024 * 1024);
engine.add_collection(&collection)?;
```

CLI equivalents on `rsigma engine eval` and `rsigma engine daemon`:

```
--bloom-prefilter              # enable
--bloom-max-bytes <BYTES>      # override 1 MB default
```

Always benchmark against representative events before flipping it on; the `eval_bloom_rejection` Criterion group in `crates/rsigma-eval/benches/eval.rs` reports throughput with both default-off and bloom-on engines so you can size the win on your corpus.
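
For intuition about why the probe can reject a value without running the matcher, here is a toy trigram bloom filter. It is a sketch of the general technique only: `TrigramBloom` is a hypothetical type, std's `DefaultHasher` stands in for whatever hash functions the crate uses, and needles shorter than three bytes are not covered (a caller would have to bypass the filter for those).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy bloom filter over lowercase byte trigrams of substring needles.
struct TrigramBloom {
    bits: Vec<bool>,
    hashes: u64,
}

impl TrigramBloom {
    fn new(num_bits: usize, hashes: u64) -> Self {
        Self { bits: vec![false; num_bits.max(1)], hashes }
    }

    /// Double hashing: index_i = (h1 + i * h2) mod m.
    fn indexes(&self, trigram: &[u8]) -> Vec<usize> {
        let mut a = DefaultHasher::new();
        trigram.hash(&mut a);
        let h1 = a.finish();
        let mut b = DefaultHasher::new();
        (trigram, 0x9e37_79b9u32).hash(&mut b); // salted second hash
        let h2 = b.finish() | 1;
        (0..self.hashes)
            .map(|i| (h1.wrapping_add(i.wrapping_mul(h2)) % self.bits.len() as u64) as usize)
            .collect()
    }

    /// Record every trigram of a needle (needles under 3 bytes add nothing).
    fn insert_needle(&mut self, needle: &str) {
        let lower = needle.to_lowercase();
        for tri in lower.as_bytes().windows(3) {
            for idx in self.indexes(tri) {
                self.bits[idx] = true;
            }
        }
    }

    /// `false` means no inserted needle of length >= 3 can occur in `haystack`.
    /// `true` means "maybe": the real matcher still has to run.
    fn may_contain_needle(&self, haystack: &str) -> bool {
        let lower = haystack.to_lowercase();
        lower
            .as_bytes()
            .windows(3)
            .any(|tri| self.indexes(tri).iter().all(|&i| self.bits[i]))
    }
}
```

If a needle occurs in the field value, its first trigram is one of the value's trigrams and was inserted at build time, so the probe can never wrongly reject a real match. It can only produce false positives.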

## Cross-Rule Aho-Corasick Index (Opt-In, Feature-Gated)

For deployments with very large rule sets (> ~5K rules) and many shared substring patterns (threat-intel feeds, IOC packs), the engine can build a single per-field [`DoubleArrayAhoCorasick`](https://crates.io/crates/daachorse) automaton over every rule's positive substring needles. At eval time, the engine scans each indexed field once with the per-field automaton and drops AC-prunable rules from the candidate set when none of their patterns hit the event.

A rule is AC-prunable when:

- It has at least one positive substring detection item, AND
- Every detection consists exclusively of positive substring matchers (`Contains` / `StartsWith` / `EndsWith` / `AhoCorasickSet`, possibly nested under `AnyOf`/`AllOf`/`CaseInsensitiveGroup`), AND
- No condition expression contains `Not`.

These rules can be safely pruned because their firing requires at least one substring to match. Rules with `Exact`, `Regex`, `Numeric`, `Cidr`, etc. matchers, or with `not` selectors in their conditions, are kept in the candidate set unfiltered.
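
The three criteria can be expressed as a small predicate. The types below are simplified, hypothetical stand-ins for the crate's compiled rule (a reduced `Matcher` enum without `CaseInsensitiveGroup`, and a boolean for "condition contains `Not`"), so treat this as a sketch of the rule rather than the real check.

```rust
/// Reduced, hypothetical matcher model.
enum Matcher {
    Contains,
    StartsWith,
    EndsWith,
    AhoCorasickSet,
    Exact,
    Regex,
    AnyOf(Vec<Matcher>),
    AllOf(Vec<Matcher>),
}

/// Hypothetical rule model: detection items per detection, plus a negation flag.
struct Rule {
    detections: Vec<Vec<Matcher>>,
    condition_has_not: bool,
}

/// True for positive substring matchers, including nested AnyOf/AllOf groups of them.
fn is_positive_substring(m: &Matcher) -> bool {
    match m {
        Matcher::Contains
        | Matcher::StartsWith
        | Matcher::EndsWith
        | Matcher::AhoCorasickSet => true,
        Matcher::AnyOf(inner) | Matcher::AllOf(inner) => {
            !inner.is_empty() && inner.iter().all(is_positive_substring)
        }
        Matcher::Exact | Matcher::Regex => false,
    }
}

/// A rule is prunable if it has substring items, only substring items, and no `Not`.
fn is_ac_prunable(rule: &Rule) -> bool {
    let items: Vec<&Matcher> = rule.detections.iter().flatten().collect();
    !items.is_empty()
        && !rule.condition_has_not
        && items.iter().all(|m| is_positive_substring(m))
}
```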

**Off by default.** For smaller rule sets the per-rule `AhoCorasickSet` matcher is already optimal; the cross-rule index only adds build time and lookup overhead. Pattern count per field is capped at 100K (rules referencing fields above that cap are kept unfiltered). Build time scales linearly with total pattern count.

Enable via the `daachorse-index` Cargo feature:

```toml
rsigma-eval = { version = "0.12", features = ["daachorse-index"] }
```

```rust
let mut engine = Engine::new();
engine.set_cross_rule_ac(true);
engine.add_collection(&collection)?;
```

CLI equivalents on `rsigma engine eval` and `rsigma engine daemon` (when the CLI is compiled with the feature):

```
--cross-rule-ac
```

Always benchmark against representative rule sets and event streams before flipping it on; the `eval_cross_rule_ac` Criterion group reports throughput at 1K / 5K / 10K rules with the index on and off.

## Constants and Limits

| Constant | Value | Purpose |
|----------|-------|---------|
| `MAX_NESTING_DEPTH` | 64 | Recursive JSON traversal depth for keyword search |
| `MAX_WINDASH_DASHES` | 8 | Maximum dash characters expanded by windash (5^8 variants) |
| `WINDASH_CHARS` | 5 | `-`, `/`, `–` (en-dash), `—` (em-dash), `―` (horizontal bar) |
| `MAX_CHAIN_DEPTH` | 10 | Maximum correlation chaining depth |
| `max_state_entries` | 100,000 | Default hard cap for correlation state |
| Eviction target | 90% | Hard-cap eviction drops the stalest 10% |
| `max_correlation_events` | 10 | Default per-window event cap |
| Epoch threshold | 10^12 | Numeric timestamps above this are treated as milliseconds |

## Error Types

| Error | When |
|-------|------|
| `InvalidRegex` | Regex compilation failure |
| `InvalidCidr` | CIDR parse failure |
| `Base64` | Base64 encoding error |
| `UnknownDetection` | Condition references missing detection (caught at compile time) |
| `InvalidModifiers` | Invalid modifier combo, empty AllOf/AnyOf, windash overflow, pipeline failure |
| `IncompatibleValue` | Wrong type for modifier (e.g. null for string) |
| `ExpectedNumeric` | Numeric modifier with non-numeric value |
| `Parser` | Parser error (from rsigma-parser) |
| `CorrelationError` | Correlation compile/runtime error |
| `UnknownRuleRef` | Correlation references unknown rule (caught at `add_collection` time) |
| `CorrelationCycle` | Cycle in correlation references |

## Usage

**Detection only:**

```rust
use rsigma_parser::parse_sigma_yaml;
use rsigma_eval::{Engine, parse_pipeline};
use rsigma_eval::event::JsonEvent;
use serde_json::json;

let yaml = r#"
title: Detect Whoami
logsource:
    product: windows
    category: process_creation
detection:
    selection:
        CommandLine|contains: 'whoami'
    condition: selection
level: medium
"#;

let collection = parse_sigma_yaml(yaml).unwrap();

let pipeline = parse_pipeline(r#"
name: ECS Mapping
transformations:
  - type: field_name_mapping
    mapping:
      CommandLine: process.command_line
    rule_conditions:
      - type: logsource
        product: windows
"#).unwrap();

let mut engine = Engine::new_with_pipeline(pipeline);
engine.add_collection(&collection).unwrap();

// Rule now expects ECS field names
// Bind the JSON value first so the borrowed event does not outlive a temporary.
let raw = json!({"process.command_line": "whoami"});
let event = JsonEvent::borrow(&raw);
let matches = engine.evaluate(&event);
```

`field_name_mapping` also accepts a list of alternatives, matching pySigma's
`FieldMappingTransformation`. The matched detection item is expanded into an
OR over the alternatives. When the surrounding `AllOf` selection has other
items, they are preserved in each branch via a Cartesian expansion, so the
`AND` / `OR` semantics stay correct:

```yaml
name: Hashes mapping
transformations:
  - type: field_name_mapping
    mapping:
      Hashes:
        - file.hash.md5
        - file.hash.sha1
        - file.hash.sha256
```

After applying this pipeline, a rule selecting `Hashes: 'abc123'` matches an
event populating *any* of `file.hash.md5`, `file.hash.sha1`, or
`file.hash.sha256`. Correlation rules (`group_by`, `aliases`, threshold
`field`) consume only the first listed alternative since those positions are
inherently scalar.

**With correlations:**

```rust
use rsigma_eval::{CorrelationEngine, CorrelationConfig, CorrelationAction, CorrelationEventMode};

// `collection` and `event` are built as in the detection-only example above;
// `timestamp_secs` is the event time in seconds.
let config = CorrelationConfig {
    suppress: Some(300),                                // 5-minute suppression window
    action_on_match: CorrelationAction::Reset,          // clear state after firing
    emit_detections: false,                             // only emit correlation alerts
    correlation_event_mode: CorrelationEventMode::Full, // include full events
    max_correlation_events: 20,                         // keep last 20 events per window
    ..Default::default()
};

let mut engine = CorrelationEngine::new(config);
engine.set_include_event(true);                         // embed event JSON in all match results
engine.add_collection(&collection).unwrap();
let result = engine.process_event_at(&event, timestamp_secs);
// result.detections: Vec<MatchResult>
// result.correlations: Vec<CorrelationResult>
// result.correlations[0].events: Option<Vec<serde_json::Value>>  (Full mode)
// result.correlations[0].event_refs: Option<Vec<EventRef>>       (Refs mode)
```

## Benchmarks

Criterion.rs benchmarks with synthetic rules and events (Apple M-series, single-threaded unless noted).
Rules are pre-filtered at evaluation time using an inverted index on exact-match field-value pairs.

### Detection Evaluation

| Scenario | Baseline | Indexed | Speedup |
|----------|----------|---------|---------|
| Compile 1,000 rules | 740 µs | 946 µs | 0.8x (index build) |
| Compile 5,000 rules | 3.8 ms | 4.7 ms | 0.8x (index build) |
| 1 event vs 100 rules | 5.6 µs | 2.3 µs | **2.4x** |
| 1 event vs 1,000 rules | 81 µs | 30 µs | **2.7x** |
| 1 event vs 5,000 rules | 415 µs | 164 µs | **2.5x** |
| 100K events vs 100 rules | 613 ms (163K/s) | 253 ms (396K/s) | **2.4x** |
| Wildcard-heavy (1,000 rules, 100 events) | 16 µs | 18 µs | ~1x (unindexable) |
| Regex-heavy (1,000 rules, 100 events) | 3.3 µs | 5.1 µs | ~1x (unindexable) |

### Batch Evaluation (sequential vs `parallel` feature)

| Rules | Events | Sequential | Batch |
|-------|--------|------------|-------|
| 100 | 1,000 | 2.5 ms | 2.5 ms |
| 1,000 | 1,000 | 30.5 ms | 31.0 ms |
| 5,000 | 1,000 | 162 ms | 164 ms |

### Correlation Engine

| Scenario | Baseline | Indexed | Speedup |
|----------|----------|---------|---------|
| 1K events, 20 event_count correlations | 1.08 ms (926K/s) | 988 µs (1.01M/s) | 1.1x |
| 1K events, 10 temporal correlations | 489 µs (2.04M/s) | 502 µs (1.99M/s) | ~1x |
| 100K events, 50 detection + 10 correlation rules | 293 ms (342K/s) | 176 ms (568K/s) | **1.7x** |
| Batch 10K events (sequential) | -- | 18.1 ms (552K/s) | -- |
| Batch 10K events (process_batch) | -- | 19.2 ms (521K/s) | -- |
| 50K unique group keys (state pressure) | 32.9 ms (1.52M/s) | 38.4 ms (1.30M/s) | ~1x |

```bash
cargo bench --bench eval
cargo bench --bench correlation
```

## License

MIT License.

[rsigma]: https://github.com/timescale/rsigma