rsigma-parser 0.5.0

Parser for Sigma detection rules, correlations, and filters
Documentation
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
# rsigma-parser

[![CI](https://github.com/timescale/rsigma/actions/workflows/ci.yml/badge.svg)](https://github.com/timescale/rsigma/actions/workflows/ci.yml)

`rsigma-parser` is a parser for [Sigma](https://github.com/SigmaHQ/sigma) detection rules, correlations, and filters. It parses Sigma YAML into a strongly-typed AST covering the full Sigma 2.0 specification, and includes a 65-rule linter derived from the Sigma v2.1.0 spec.

This library is part of [rsigma].

## Public API

### Parsing

| Function | Description |
|----------|-------------|
| `parse_sigma_yaml(yaml: &str)` | Parse a multi-document YAML string into a `SigmaCollection` |
| `parse_sigma_file(path: &Path)` | Parse a single YAML file |
| `parse_sigma_directory(dir: &Path)` | Recursively parse all `.yml`/`.yaml` files in a directory |
| `parse_condition(input: &str)` | Parse a condition expression string into a `ConditionExpr` |
| `parse_field_spec(key: &str)` | Parse a field specification like `"CommandLine\|contains\|all"` into a `FieldSpec` |

### Linting

| Function | Description |
|----------|-------------|
| `lint_yaml_value(value: &Value)` | Lint a single YAML document value (auto-detects document type) |
| `lint_yaml_str(text: &str)` | Lint raw YAML string with source span resolution |
| `lint_yaml_str_with_config(text: &str, config: &LintConfig)` | Lint with config-based suppression |
| `lint_yaml_file(path: &Path)` | Lint all documents in a file |
| `lint_yaml_file_with_config(path: &Path, config: &LintConfig)` | Lint a file with config |
| `lint_yaml_directory(dir: &Path)` | Recursively lint all `.yml`/`.yaml` in a directory |
| `lint_yaml_directory_with_config(dir: &Path, config: &LintConfig)` | Lint a directory with config |
| `parse_inline_suppressions(text: &str)` | Parse `# rsigma-disable` comments from YAML text |
| `apply_suppressions(warnings, config, inline)` | Filter and override warnings using config and inline suppressions |

### Auto-Fix Types

Each `LintWarning` can carry an optional `Fix` describing how to automatically correct the issue.

| Type | Description |
|------|-------------|
| `Fix` | A suggested fix: title, disposition (`Safe`/`Unsafe`), and ordered patches |
| `FixDisposition` | `Safe` (no semantic change) or `Unsafe` (may change meaning) |
| `FixPatch` | A single edit: `ReplaceValue`, `ReplaceKey`, or `Remove` with a JSON-pointer path |

Fix types are yamlpath/yamlpatch-agnostic (all owned data, no external dependencies). Consumers (CLI, LSP) convert `FixPatch` to concrete patch operations at apply time.

### Value Types

| Type/Function | Description |
|---------------|-------------|
| `SigmaString::new(s: &str)` | Parse a string with wildcard interpretation (`*`, `?`, `\` escape) |
| `SigmaString::from_raw(s: &str)` | Create from raw string (no wildcard parsing; used for `\|re` modifier) |
| `SigmaString::is_plain()` | Returns `true` if no wildcards |
| `SigmaString::contains_wildcards()` | Returns `true` if any wildcard present |
| `SigmaString::as_plain()` | Get plain string; `None` if wildcards present |
| `SigmaValue::from_yaml(v: &Value)` | Create from a YAML value |
| `Timespan::parse(s: &str)` | Parse timespan like `1h`, `15s`, `7d` |

## Parsing

- **Multi-document YAML**: `---` separators, `action: global/reset/repeat` for rule templates
- **Condition expressions**: PEG grammar (pest) with Pratt parsing and correct operator precedence (`NOT` > `AND` > `OR`). Supports `and`, `or`, `not`, `1 of`, `all of`, `any of`, `N of`, parenthesized groups, wildcard patterns — `them` excludes `_`-prefixed identifiers per spec
- **Value types**: strings with wildcards (`*`, `?`), escape sequences (`\*`, `\?`, `\\`), integers, floats, booleans, null
- **Timespan parsing**: `15s`, `30m`, `1h`, `7d`, `1w`, `1M`, `1y`
- **Logsource**: `category`, `product`, `service`, `definition`, custom fields

### Multi-Document Behavior

1. `serde_yaml::Deserializer` yields documents separated by `---`.
2. Non-mapping documents are skipped; errors are accumulated in `collection.errors`.
3. YAML parse errors stop iteration (the deserializer may not recover from malformed input).
4. **Collection actions:**
   - `action: global` — store document as a template; remove `action` key; do not produce a rule.
   - `action: reset` — clear the global template.
   - `action: repeat` — merge current document onto the previous document; apply global template if present; parse the merged result. Error if no previous document exists.
5. **Merge order:** For normal documents: `merged = deep_merge(global, value)`. For repeat: `merged = deep_merge(global, deep_merge(previous, repeat_doc))`.
6. **`deep_merge`:** Recursive. Source mappings override destination keys; non-mapping source replaces destination entirely.
7. **Previous tracking:** Updated after each non-action document and after each repeat. Repeat chains from the *last* document, not the original.

### SigmaString Escape Semantics

| Input | Parsed as |
|-------|-----------|
| `\*` | literal `*` (not a wildcard) |
| `\?` | literal `?` (not a wildcard) |
| `\\` | literal `\` |
| `\W` (non-special) | literal `\W` (both characters kept) |

Backslash only consumes itself when followed by `*`, `?`, or `\`. This preserves Windows paths like `\Windows\System32`.

### Condition Expression Grammar (PEG)

The full PEG grammar is defined in [`src/sigma.pest`](src/sigma.pest). It implements the [Sigma condition expression syntax](https://sigmahq.io/docs/basics/conditions.html) using a Pratt parser with `not` > `and` > `or` precedence.

**Parsing quirks:**

- `!ident_char` lookahead ensures `and_filter` parses as a single identifier, not `and` + `filter`.
- `1 of` and `any of` both map to `Quantifier::Any`.
- Nested same-type binary ops are flattened: `a and b and c` becomes `And([a, b, c])`, not `And(a, And(b, c))`.

### Operator Precedence

| Precedence (highest first) | Operator | Associativity |
|---------------------------|----------|---------------|
| 1 | `not` (prefix) ||
| 2 | `and` (infix) | Left |
| 3 | `or` (infix) | Left |

`a or not b and c` parses as `a or ((not b) and c)`.

## AST Types

### Core Types

| Type | Description |
|------|-------------|
| `SigmaCollection` | Collection of rules, correlations, filters, and errors |
| `SigmaRule` | A parsed detection rule with metadata, logsource, and detections |
| `CorrelationRule` | A correlation rule with type, referenced rules, timespan, and conditions |
| `FilterRule` | A filter rule that injects `AND NOT` conditions into referenced rules |
| `Detections` | Named detections, condition expressions, and optional timeframe |
| `Detection` | `AllOf` (AND-linked items), `AnyOf` (OR-linked), or `Keywords` (plain values) |
| `FieldSpec` | Field name + modifier chain; `name` is `None` for keyword detections |
| `ConditionExpr` | `And`, `Or`, `Not`, `Identifier`, or `Selector` with quantifier and pattern |
| `SigmaValue` | `String`, `Integer`, `Float`, `Bool`, or `Null` |
| `SigmaString` | String with wildcard parts (`Plain` text + `WildcardMulti`/`WildcardSingle`) |

### Enums

| Enum | Variants |
|------|----------|
| `Status` | `Stable`, `Test`, `Experimental`, `Deprecated`, `Unsupported` |
| `Level` | `Informational`, `Low`, `Medium`, `High`, `Critical` |
| `RelationType` | `Derived`, `Obsolete`, `Merged`, `Renamed`, `Similar` |
| `Quantifier` | `Any`, `All`, `Count(u64)` |
| `SelectorPattern` | `Them`, `Pattern(String)` |
| `CorrelationType` | `EventCount`, `ValueCount`, `Temporal`, `TemporalOrdered`, `ValueSum`, `ValueAvg`, `ValuePercentile`, `ValueMedian` |

### Detection Parsing

- **YAML mapping**`Detection::AllOf` (AND-linked items)
- **YAML list of mappings**`Detection::AnyOf` (OR-linked)
- **YAML list of plain values**`Detection::Keywords` (keyword search across all fields)
- **Condition as list** (`condition: [s1, s2]`) → multiple `ConditionExpr` parsed independently
- **Empty field name** (`parse_field_spec("")`) → `FieldSpec { name: None, modifiers: [] }` (keyword)

## Field Modifiers (30)

The parser recognizes 30 modifier variants, some with aliases:

| Category | Modifiers | Aliases |
|----------|-----------|---------|
| String matching | `contains`, `startswith`, `endswith` ||
| Value linking | `all` ||
| Encoding | `base64`, `base64offset`, `wide`, `utf16be`, `utf16`, `windash` | `utf16le``wide` |
| Pattern | `re`, `cidr` ||
| Case | `cased` ||
| Existence | `exists` ||
| Placeholder | `expand` ||
| Field reference | `fieldref` ||
| Numeric comparison | `gt`, `gte`, `lt`, `lte`, `neq` ||
| Regex flags | `i`, `m`, `s` | `ignorecase``i`, `multiline``m`, `dotall``s` |
| Timestamp parts | `minute`, `hour`, `day`, `week`, `month`, `year` ||

When the `re` modifier is present, string values are parsed with `SigmaValue::from_raw_string` (no wildcard interpretation).

## Correlation Rules (8 types)

| Type | Description |
|------|-------------|
| `event_count` | Count matching events per group key |
| `value_count` | Count distinct field values per group key |
| `temporal` | Require multiple rules to fire in the same window |
| `temporal_ordered` | Same as temporal, but rules must fire in order |
| `value_sum` | Sum a numeric field across events |
| `value_avg` | Average a numeric field across events |
| `value_percentile` | Compute a percentile of a numeric field |
| `value_median` | Compute the median of a numeric field |

### Correlation Conditions

- **Threshold (mapping):** `{ gte: 100 }` or `{ gt: 10, lte: 100, field: "TargetUser" }`. Operators: `lt`, `lte`, `gt`, `gte`, `eq`, `neq`. Values must be numeric.
- **Extended (string):** `"rule_a and rule_b"` for temporal types — parsed as a boolean expression over rule references.
- **Default (temporal, no condition):** `Threshold { predicates: [(Gte, 1)], field: None }`.
- **Timeframe/timespan:** The parser accepts both `timeframe` and `timespan` keys.
- **Custom attributes:** Correlation rules support a `custom_attributes` mapping (string-to-string) for `rsigma.*` engine extensions.

## Filter Rules

Filter rules inject exclusion conditions into referenced detection rules — enables centralized tuning without modifying original rule files. See the [Sigma Filters Specification](https://sigmahq.io/sigma-specification/specification/sigma-filters-specification.html) for the full standard.

Per the spec, `selection`, `condition`, and `rules` all live inside the `filter` section:

```yaml
title: Exclude Admin Users
logsource:
    category: process_creation
    product: windows
filter:
    rules:
        - <rule-id>
    selection:
        User|startswith: 'adm_'
    condition: selection
```

## Timespan Parsing

| Unit | Suffix | Multiplier (seconds) |
|------|--------|----------------------|
| Second | `s` | 1 |
| Minute | `m` | 60 |
| Hour | `h` | 3,600 |
| Day | `d` | 86,400 |
| Week | `w` | 604,800 |
| Month | `M` (uppercase) | 2,629,746 (~30.44 days) |
| Year | `y` | 31,556,952 (~365.25 days) |

The string must be at least 2 characters (e.g. `1h`). The last character is the unit; the prefix must be a positive integer.

## Linter (65 rules)

65 built-in lint rules derived from the Sigma v2.1.0 specification. Four severity levels: **Error** (spec violation), **Warning** (best-practice issue), **Info** (soft suggestion), **Hint** (stylistic). Info/Hint findings don't cause lint failure.

The linter operates on raw YAML values to catch issues the parser silently ignores.

### Infrastructure (4)

| Rule | Severity | Fix | Trigger |
|------|----------|-----|---------|
| `yaml_parse_error` | Error | | YAML parse failure |
| `not_a_mapping` | Error | | Document is not a YAML mapping |
| `file_read_error` | Error | | Cannot read file |
| `schema_violation` | Error | | JSON schema validation failure (optional) |

### Shared Metadata (16)

| Rule | Severity | Fix | Trigger |
|------|----------|-----|---------|
| `missing_title` | Error | | No `title` field |
| `empty_title` | Error | | `title` is empty or whitespace |
| `title_too_long` | Warning | | `title` exceeds 256 characters |
| `missing_description` | Info | | No `description` |
| `missing_author` | Info | | No `author` |
| `invalid_id` | Warning | | `id` not a valid UUID (8-4-4-4-12 hex) |
| `invalid_status` | Error | Yes | `status` not in `stable`/`test`/`experimental`/`deprecated`/`unsupported` |
| `missing_level` | Warning | | No `level` (detection rules) |
| `invalid_level` | Error | Yes | `level` not in `informational`/`low`/`medium`/`high`/`critical` |
| `invalid_date` | Error | | `date` not `YYYY-MM-DD` with valid day-of-month |
| `invalid_modified` | Error | | `modified` not `YYYY-MM-DD` |
| `modified_before_date` | Warning | | `modified` is earlier than `date` |
| `description_too_long` | Warning | | `description` exceeds 65,535 characters |
| `name_too_long` | Warning | | `name` exceeds 256 characters |
| `taxonomy_too_long` | Warning | | `taxonomy` exceeds 256 characters |
| `non_lowercase_key` | Warning | Yes | Top-level key is not lowercase |

### Detection Rules (17)

| Rule | Severity | Fix | Trigger |
|------|----------|-----|---------|
| `missing_logsource` | Error | | No `logsource` |
| `missing_detection` | Error | | No `detection` |
| `missing_condition` | Error | | No `condition` in detection |
| `empty_detection` | Warning | | No named search identifiers |
| `invalid_related_type` | Error | | `related[].type` not in `derived`/`obsolete`/`merged`/`renamed`/`similar` |
| `invalid_related_id` | Warning | | `related[].id` not a valid UUID |
| `related_missing_required` | Error | | `related[]` missing `id` or `type` |
| `deprecated_without_related` | Warning | | `status: deprecated` but no `related` |
| `invalid_tag` | Warning | | Tag doesn't match `^[a-z0-9_-]+\.[a-z0-9._-]+$` |
| `unknown_tag_namespace` | Warning | | Tag namespace not in `attack`/`car`/`cve`/`d3fend`/`detection`/`stp`/`tlp` |
| `duplicate_tags` | Warning | Yes | Duplicate tag |
| `duplicate_references` | Warning | Yes | Duplicate reference URL |
| `duplicate_fields` | Warning | Yes | Duplicate field name |
| `falsepositive_too_short` | Warning | | `falsepositives` entry under 2 characters |
| `scope_too_short` | Warning | | `scope` entry under 2 characters |
| `logsource_value_not_lowercase` | Warning | Yes | Logsource `category`/`product`/`service` not lowercase |
| `condition_references_unknown` | Error | | Condition references non-existent detection identifier |

### Correlation Rules (13)

| Rule | Severity | Fix | Trigger |
|------|----------|-----|---------|
| `missing_correlation` | Error | | No `correlation` or not a mapping |
| `missing_correlation_type` | Error | | No `correlation.type` |
| `invalid_correlation_type` | Error | | Type not a recognized correlation type |
| `missing_correlation_rules` | Error | | No `correlation.rules` |
| `empty_correlation_rules` | Warning | | `correlation.rules` is empty |
| `missing_correlation_timespan` | Error | | No `correlation.timespan` or `correlation.timeframe` |
| `invalid_timespan_format` | Error | | Timespan format invalid |
| `missing_group_by` | Error | | No `correlation.group-by` |
| `missing_correlation_condition` | Error | | Non-temporal type without condition |
| `missing_condition_field` | Error | | `value_count`/`value_sum`/`value_avg`/`value_percentile` without `condition.field` |
| `invalid_condition_operator` | Error | | Operator not in `gt`/`gte`/`lt`/`lte`/`eq`/`neq` |
| `condition_value_not_numeric` | Error | | Condition value not numeric |
| `generate_not_boolean` | Error | | `generate` is not a boolean |

### Filter Rules (8)

| Rule | Severity | Fix | Trigger |
|------|----------|-----|---------|
| `missing_filter` | Error | | No `filter` or not a mapping |
| `missing_filter_rules` | Error | | No `filter.rules` |
| `empty_filter_rules` | Warning | | `filter.rules` is empty |
| `missing_filter_selection` | Error | | No `filter.selection` |
| `missing_filter_condition` | Error | | No `filter.condition` |
| `filter_has_level` | Warning | Yes | Filter has `level` (not applicable) |
| `filter_has_status` | Warning | Yes | Filter has `status` (not applicable) |
| `missing_filter_logsource` | Warning | | No `logsource` |

### Detection Logic (7)

| Rule | Severity | Fix | Trigger |
|------|----------|-----|---------|
| `null_in_value_list` | Warning | | `null` mixed with other values in a list |
| `single_value_all_modifier` | Warning | Yes | `\|all` with a single value |
| `all_with_re` | Warning | Yes | `\|all` and `\|re` combined |
| `incompatible_modifiers` | Warning | | Incompatible modifier combination (e.g. `contains\|startswith`, `re\|contains`, `gt\|contains`, regex flags without `re`) |
| `empty_value_list` | Warning | | Empty value list |
| `wildcard_only_value` | Warning | Yes | Lone `*` value (suggests `\|exists: true` instead) |
| `unknown_key` | Info | Yes | Top-level key likely a typo of a known key (edit distance ≤ 2); custom fields are allowed per the Sigma spec |

### Rule Suppression

Three-tier system to disable or override lint rules:

- **CLI**: `--disable rule1,rule2` suppresses specific rules globally
- **Config file**: `.rsigma-lint.yml` (or `.rsigma-lint.yaml`) with `disabled_rules` and `severity_overrides`, auto-discovered by walking ancestor directories from the target path upward
- **Inline comments**: `# rsigma-disable`, `# rsigma-disable rule1, rule2`, `# rsigma-disable-next-line`, `# rsigma-disable-next-line rule1, rule2`

```yaml
# .rsigma-lint.yml
disabled_rules:
  - missing_description
  - missing_author
severity_overrides:
  title_too_long: info
```

**Suppression order:** Warnings are first filtered by `disabled_rules`, then by inline suppressions, then `severity_overrides` are applied to remaining warnings.

Inline `#` inside quoted YAML strings is not treated as a comment.

### Optional Schema Validation

The `schema_violation` lint rule optionally validates rules against a JSON schema. The schema can be the official Sigma schema (downloaded and cached for 7 days, with offline fallback to stale cache) or a local file.

## Directory Parsing and Linting

- **`parse_sigma_directory`:** Recursively walks directories; only processes `.yml`/`.yaml` files. File-level parse errors are accumulated in `collection.errors` (not fatal). Sub-collections (rules, correlations, filters) are merged.
- **`lint_yaml_directory`:** Skips hidden directories (`.` prefix). Canonicalizes paths to detect symlink cycles. Sorts entries by path for deterministic output.

## Error Types

| Error | When |
|-------|------|
| `Yaml` | serde_yaml parse failure |
| `Condition` | Condition expression parse failure (PEG/Pratt); carries optional `SourceLocation` with line/column |
| `UnknownModifier` | Unknown modifier in field spec |
| `InvalidFieldSpec` | Invalid field specification |
| `InvalidRule` | Document not a mapping, or invalid structure |
| `MissingField` | Required field missing (e.g. `title`, `detection`) |
| `InvalidDetection` | Detection section invalid |
| `InvalidCorrelation` | Correlation rule invalid |
| `InvalidTimespan` | Timespan string invalid (wrong format, unknown unit) |
| `InvalidValue` | Invalid value in detection |
| `InvalidAction` | Unknown collection action |
| `Io` | File read error |

## Compatibility

Tested against the [SigmaHQ/sigma](https://github.com/SigmaHQ/sigma) rule repository:

| Corpus | Rules Parsed | Errors |
|--------|-------------|--------|
| `rules/` | 3,110 | 0 |
| `rules-emerging-threats/` | 436 | 0 |
| `rules-threat-hunting/` | 133 | 0 |
| `rules-compliance/` | 3 | 0 |
| `rules-placeholder/` | 14 | 0 |
| `unsupported/` | 31 | 58 (deprecated pipe syntax) |
| `deprecated/` | 165 | 1 (deprecated pipe syntax) |
| **Total** | **3,892** | **0 real errors** |

The deprecated pipe aggregation syntax (`selection | count(field) by field > N`) is intentionally rejected, matching [pySigma](https://github.com/SigmaHQ/pySigma) behavior.

## Usage

```rust
use rsigma_parser::{parse_sigma_yaml, parse_condition};

let collection = parse_sigma_yaml(yaml).unwrap();
assert_eq!(collection.rules[0].title, "Detect Whoami");

// Parse a condition expression directly
let expr = parse_condition("selection1 and not filter").unwrap();
```

## Benchmarks

Criterion.rs benchmarks with synthetic rules (Apple M-series, single-threaded):

| Scenario | Time |
|----------|------|
| 1 rule | 11.7 us |
| 100 rules | 1.1 ms |
| 1,000 rules | 11.1 ms |
| Complex condition (8 selections, nested booleans) | 23.2 us |

```bash
cargo bench --bench parse
```

## License

MIT License.

[rsigma]: https://github.com/timescale/rsigma