ferroni 1.2.8

Pure-Rust Oniguruma regex engine with SIMD-accelerated search
<p align="center">
  <strong>Ferroni</strong><br>
  Pure-Rust Oniguruma regex engine. Full feature set, no C toolchain, drop-in compatible.<br>
  Includes a multi-pattern scanner for TextMate grammar tokenization.
</p>

<p align="center">
  <a href="https://github.com/sebastian-software/ferroni/actions"><img src="https://img.shields.io/github/actions/workflow/status/sebastian-software/ferroni/ci.yml?branch=main&style=flat-square&logo=github&label=CI" alt="CI"></a>
  <a href="https://codspeed.io/sebastian-software/ferroni?utm_source=badge"><img src="https://img.shields.io/badge/CodSpeed-measured-blue?style=flat-square&logo=data:image/svg%2bxml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAyNCAyNCI+PHBhdGggZmlsbD0id2hpdGUiIGQ9Ik0yMy4zNSAxMi44NGEuODMuODMgMCAwIDAtLjE1LS42OWwtMS40LTEuNzdhLjgyLjgyIDAgMCAwLS42Ny0uMzJoLTEuNjFsLTEuNjQtMS44YS44My44MyAwIDAgMC0uNjItLjI3SDEwLjlhLjguOCAwIDAgMC0uNTguMjVsLTIuMiAyLjI1SDUuMzNhLjgzLjgzIDAgMCAwLS42LjI2TDIuMTYgMTMuNmEuODQuODQgMCAwIDAgLjYgMS40aDIuMjNsLTIuNjMgMi44YS44My44MyAwIDAgMCAuNjEgMS4zOWg0LjA1YS44My44MyAwIDAgMCAuNjEtLjI3bDMuMzMtMy42MWgyLjk2bC0zLjc5IDQuMDRhLjgyLjgyIDAgMCAwIC42MSAxLjM5aDQuMjRjLjIgMCAuNC0uMDguNTUtLjIybDMuNy0zLjZoMS4yN2wuOS43OGMuMi4yMy41Mi4zLjguMTdsMS44Mi0xLjE0YS44My44MyAwIDAgMCAuMzMtLjYxdi0xLjI3YS44My44MyAwIDAgMC0uMi0uNTN6Ii8+PC9zdmc+" alt="CodSpeed"></a>
  <a href="https://codecov.io/gh/sebastian-software/ferroni"><img src="https://img.shields.io/codecov/c/github/sebastian-software/ferroni?style=flat-square&logo=codecov&label=Coverage" alt="Coverage"></a>
  <a href="https://github.com/sebastian-software/ferroni/blob/main/LICENSE"><img src="https://img.shields.io/badge/license-BSD--2--Clause-blue?style=flat-square" alt="License"></a>
  <a href="https://github.com/sebastian-software/ferroni"><img src="https://img.shields.io/badge/unsafe-0.4%25-green?style=flat-square" alt="Unsafe"></a>
  <a href="https://github.com/sebastian-software/ferroni"><img src="https://img.shields.io/badge/tests-1%2C882_passing-brightgreen?style=flat-square" alt="Tests"></a>
  <a href="https://github.com/sebastian-software/ferroni"><img src="https://img.shields.io/badge/C_parity-100%25-brightgreen?style=flat-square" alt="C Parity"></a>
</p>

---

[Oniguruma](https://github.com/kkos/oniguruma) is the regex engine behind
[Ruby](https://www.ruby-lang.org/), [PHP](https://www.php.net/) (mbstring),
[TextMate](https://macromates.com/) grammars, and tools like
[jq](https://jqlang.github.io/jq/). It supports features that most regex
libraries don't: named captures with multiple syntaxes, look-behind of
variable length, conditional patterns, absent expressions, 886 Unicode
properties, subexpression calls, and 12 syntax modes from Perl to POSIX.

Ferroni is a line-by-line Rust port of this engine (same structure, same
opcodes, same optimization passes) with SIMD-vectorized search via
[`memchr`](https://crates.io/crates/memchr) layered on top. The result:
**up to 61x faster than C** on scanner first-match (TypeScript grammar, see
[Performance](#performance)), while an idiomatic Rust
API (`Regex::new()`, typed errors, `Match`/`Captures`) keeps the ergonomics
clean.

For syntax highlighting, Ferroni also includes a multi-pattern
[Scanner API](#scanner-api) compatible with
[vscode-oniguruma](https://github.com/nicolo-ribaudo/vscode-oniguruma),
used by [Shiki](https://shiki.style/), VS Code, and other TextMate-based
highlighters.

## Why Ferroni?

**Full Oniguruma, pure Rust.** Named captures, variable-length look-behind,
conditionals, absent expressions, Unicode properties, subexpression calls —
everything the C engine supports, without linking against C. If your pattern
works in Oniguruma, it works in Ferroni. Every opcode and optimization pass
is ported 1:1 and verified by [1,882 tests](#test-coverage) from three
independent sources.

**No more CVEs from C.** C Oniguruma has a track record of memory safety
vulnerabilities --
[CVE-2019-13224](https://nvd.nist.gov/vuln/detail/CVE-2019-13224) (CVSS 9.8),
[CVE-2019-19204](https://nvd.nist.gov/vuln/detail/CVE-2019-19204),
[CVE-2019-19246](https://nvd.nist.gov/vuln/detail/CVE-2019-19246),
[CVE-2019-19012](https://nvd.nist.gov/vuln/detail/CVE-2019-19012),
[CVE-2019-13225](https://nvd.nist.gov/vuln/detail/CVE-2019-13225) --
affecting Ruby, PHP, and anything linking against it. In safe Rust, buffer
overflows, use-after-free, and NULL dereferences are ruled out by the type
system and borrow checker. The remaining 0.4% of the code is `unsafe`, all
of it documented in [ADR-002](docs/adr/002-unsafe-code-policy.md).

**No C toolchain required.** Pure `cargo build`. Cross-compiles to
`wasm32-unknown-unknown`. Ship it as a Node.js native module via
[napi-rs](https://napi.rs/) without `node-gyp` or a C compiler on the
user's machine.

**Built-in multi-pattern scanner.** For syntax highlighting with TextMate
grammars, Ferroni includes a
[vscode-oniguruma-compatible Scanner API](#scanner-api) — regex engine and
scanner in a single dependency. `cargo add ferroni` and you're done.

## Quick start

Add to your `Cargo.toml`:

```toml
[dependencies]
ferroni = "1"
```

### Regex

```rust
use ferroni::prelude::*;

fn main() -> Result<(), RegexError> {
    let re = Regex::new(r"(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})")?;

    let caps = re.captures("Date: 2026-02-12").unwrap();
    assert_eq!(caps.get(0).unwrap().as_str(), "2026-02-12");
    assert_eq!(caps.name("year").unwrap().as_str(), "2026");
    assert_eq!(caps.name("month").unwrap().as_str(), "02");
    Ok(())
}
```

### Scanner API

The Scanner matches multiple patterns simultaneously -- the core operation
behind TextMate-based syntax highlighting. Results include UTF-16 position
mapping for direct use with vscode-textmate and Shiki.

```rust
use ferroni::scanner::{Scanner, ScannerFindOptions};

let mut scanner = Scanner::new(&[
    r"\b(function|const|let|var)\b",  // keywords
    r#""[^"]*""#,                      // strings
    r"//.*$",                          // comments
]).unwrap();

let code = r#"const x = "hello" // greeting"#;
let m = scanner.find_next_match(code, 0, ScannerFindOptions::NONE).unwrap();

assert_eq!(m.index, 0); // pattern 0 matched first ("const")
assert_eq!(m.capture_indices[0].start, 0);
assert_eq!(m.capture_indices[0].end, 5);
```
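The UTF-16 position mapping mentioned above can be illustrated with the standard library alone. The sketch below is not Ferroni's implementation; `utf16_offset` is a hypothetical helper that shows what converting a UTF-8 byte offset into a UTF-16 code-unit offset involves, which is the unit JavaScript-based consumers such as vscode-textmate index strings by.

```rust
/// Convert a UTF-8 byte offset into a UTF-16 code-unit offset.
/// Returns `None` if `byte_offset` is out of range or falls inside a
/// multi-byte character. (Hypothetical helper, not part of Ferroni's API.)
fn utf16_offset(text: &str, byte_offset: usize) -> Option<usize> {
    if !text.is_char_boundary(byte_offset) {
        return None;
    }
    // Each char contributes 1 UTF-16 unit, or 2 if it needs a surrogate pair.
    Some(text[..byte_offset].chars().map(char::len_utf16).sum())
}

fn main() {
    let line = "let s = \"héllo 😀\"; // ok";
    let byte_pos = line.find("//").unwrap();
    let utf16_pos = utf16_offset(line, byte_pos).unwrap();
    println!("byte offset {byte_pos} maps to UTF-16 offset {utf16_pos}");
}
```

For pure-ASCII lines the two offsets are identical, so a real scanner could plausibly skip the conversion in that case.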

For fine-grained control, use `RegexBuilder`:

```rust
use ferroni::prelude::*;

let re = Regex::builder(r"hello")
    .case_insensitive(true)
    .build()
    .unwrap();
assert!(re.is_match("Hello World"));
```

<details>
<summary><strong>Low-level C-style API</strong></summary>

The full C-ported API is also available for advanced usage:

```rust
use ferroni::regcomp::onig_new;
use ferroni::regexec::onig_search;
use ferroni::oniguruma::*;
use ferroni::regsyntax::OnigSyntaxOniguruma;

let reg = onig_new(
    b"\\d{4}-\\d{2}-\\d{2}",
    ONIG_OPTION_NONE,
    &ferroni::encodings::utf8::ONIG_ENCODING_UTF8,
    &OnigSyntaxOniguruma,
).unwrap();

let input = b"Date: 2026-02-12";
let (result, region) = onig_search(
    &reg, input, input.len(), 0, input.len(),
    Some(OnigRegion::new()), ONIG_OPTION_NONE,
);

assert!(result >= 0);
assert_eq!(result, 6); // match starts at byte 6
```

</details>

## Supported features

**Scanner** -- multi-pattern matching with result caching, two search
strategies (RegSet for short strings, per-regex for long strings), and
automatic UTF-16 position mapping. API-compatible with
[vscode-oniguruma](https://github.com/nicolo-ribaudo/vscode-oniguruma).
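To make the per-regex strategy concrete, here is a minimal sketch that uses plain substrings as stand-ins for compiled regexes. It assumes the selection rule is "earliest start wins, ties go to the pattern listed first"; `Hit` and `find_next` are illustrative names, not Ferroni types, and result caching is left out.

```rust
/// Result of a multi-pattern search: which pattern won and where.
#[derive(Debug, PartialEq)]
struct Hit {
    index: usize, // position of the winning pattern in the input slice
    start: usize, // byte offset where the match begins
    end: usize,   // byte offset one past the match
}

/// Try every needle from `start` and keep the earliest match.
/// A later needle only replaces the current best when it starts strictly
/// earlier, so on a tie the needle listed first wins.
fn find_next(needles: &[&str], haystack: &str, start: usize) -> Option<Hit> {
    let tail = haystack.get(start..)?; // None if `start` is invalid
    let mut best: Option<Hit> = None;
    for (index, needle) in needles.iter().enumerate() {
        if let Some(pos) = tail.find(*needle) {
            let begin = start + pos;
            if best.as_ref().map_or(true, |b| begin < b.start) {
                best = Some(Hit { index, start: begin, end: begin + needle.len() });
            }
        }
    }
    best
}

fn main() {
    let needles = ["const", "\"hello\"", "//"];
    let code = "const x = \"hello\" // greeting";
    // Walk the line token by token, resuming after each match.
    let mut pos = 0;
    while let Some(hit) = find_next(&needles, code, pos) {
        println!("pattern {} at {}..{}", hit.index, hit.start, hit.end);
        pos = hit.end;
    }
}
```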

**Full Oniguruma regex** -- every feature from the C engine:

- All Perl/Ruby/Python syntax -- `(?:...)`, `(?=...)`, `(?!...)`, `(?<=...)`, `(?<!...)`, `(?>...)`
- Named captures -- `(?<name>...)`, `(?'name'...)`, `(?P<name>...)`
- Backreferences and subexpression calls -- `\k<name>`, `\g<name>`, relative `\g<-1>`
- Conditionals -- `(?(cond)T|F)`
- Absent expressions -- `(?~...)`
- Unicode properties -- `\p{Script_Extensions=Greek}`, `\p{Lu}`, `\p{Emoji}` (886 names)
- Grapheme clusters -- `\X`, text segment boundaries `\y`, `\Y`
- Callouts -- `(?{...})`, `(*FAIL)`, `(*MAX{n})`, `(*COUNT)`, `(*CMP)`
- 12 syntax modes -- Oniguruma, Ruby, Perl, Perl_NG, Python, Java, Emacs, Grep, GNU, POSIX Basic/Extended, ASIS
- Safety limits -- retry, time, stack, subexp call depth (global + per-search)

## Performance

All numbers compare Ferroni against C Oniguruma at `-O3`,
measured with [Criterion](https://github.com/bheisler/criterion.rs) on
Apple M1 Ultra. **Bold** = faster engine. See
[full tables](docs/perf/benchmark-results.md) for all benchmarks.

### Syntax highlighting

Syntax highlighters like [Shiki](https://shiki.style/) compile a full
TextMate grammar -- hundreds of regex patterns -- and scan each line
token by token. We benchmark against complete, unmodified Shiki grammars
for TypeScript (279 patterns), CSS (117 patterns), and Rust (81 patterns).
No cherry-picked subsets.

| Scenario | Ferroni | C Oniguruma | Speedup |
|----------|--------:|------------:|--------:|
| **TypeScript** (279 patterns) | | | |
| Compile | **10.1 ms** | 16.8 ms | **1.7x** |
| First match | **414 ns** | 25.3 us | **61x** |
| Tokenize full line | **7.0 us** | 221 us | **32x** |
| **Rust** (81 patterns) | | | |
| Compile | 257 us | **181 us** | 0.7x |
| First match | **181 ns** | 5.6 us | **31x** |
| Tokenize full line | **8.2 us** | 82.2 us | **10x** |
| **CSS** (117 patterns) | | | |
| Compile | **13.7 ms** | 19.0 ms | **1.4x** |
| Tokenize full line | **1.60 ms** | 14.9 ms | **9.3x** |
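The Speedup column appears to be the C Oniguruma time divided by the Ferroni time, so values below 1 mean Ferroni is slower. The short sketch below recomputes a few rows under that assumption; the figures are copied from the table and converted to nanoseconds.

```rust
/// Speedup as baseline time divided by candidate time (same unit for both).
fn speedup(baseline_ns: f64, candidate_ns: f64) -> f64 {
    baseline_ns / candidate_ns
}

fn main() {
    // (scenario, Ferroni ns, C Oniguruma ns), taken from the table above.
    let rows = [
        ("TypeScript first match", 414.0, 25_300.0),
        ("TypeScript full line", 7_000.0, 221_000.0),
        ("Rust compile", 257_000.0, 181_000.0),
    ];
    for (name, ferroni_ns, c_ns) in rows {
        println!("{name}: {:.1}x", speedup(c_ns, ferroni_ns));
    }
}
```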

### Text search and log scanning

First-match latency and full-scan rejection on log-sized inputs. The
[`regex`](https://crates.io/crates/regex) crate is included where the
pattern is compatible with its syntax.

| Scenario | Ferroni | C Oniguruma | `regex` |
|----------|--------:|------------:|--------:|
| Literal in 50 KB | 74 ns | 150 ns | **10 ns** |
| No match, 50 KB | 1.53 us | 9.5 us | **1.46 us** |
| No match, 10 KB | 357 ns | 1.96 us | **298 ns** |
| Field extract, 50 KB | 127 ns | 172 ns | **56 ns** |
| Timestamp, 50 KB | 120 ns | 177 ns | **54 ns** |
| RegSet multi-pattern (5) | **101 ns** | 395 ns | — |

The `regex` crate's DFA engine gives it a clear advantage on text search
workloads. [`memchr`](https://crates.io/crates/memchr) (shared by both
Ferroni and `regex`) enables SIMD-accelerated literal scans, but `regex`
goes further with full DFA-based matching that avoids per-character
backtracking. RegSet multi-pattern has no direct `regex` equivalent.
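As a rough illustration of a literal prefilter, the sketch below scans for the first byte of a required literal and runs the full comparison only at candidate positions. It uses a plain iterator where a real engine would call `memchr::memchr(first, haystack)`. `find_literal` is a hypothetical name, and this is not how Ferroni or `regex` structure their search loops.

```rust
/// Find the first occurrence of `needle` in `haystack`.
/// Step 1 scans for the needle's first byte (the part `memchr` vectorizes).
/// Step 2 verifies the whole needle only at those candidate positions.
fn find_literal(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    let first = match needle.first() {
        Some(&b) => b,
        None => return Some(0), // an empty needle matches at the start
    };
    let mut from = 0;
    while from + needle.len() <= haystack.len() {
        // Stand-in for `memchr::memchr(first, &haystack[from..])`.
        let candidate = from + haystack[from..].iter().position(|&b| b == first)?;
        if haystack[candidate..].starts_with(needle) {
            return Some(candidate);
        }
        from = candidate + 1; // false positive, keep scanning
    }
    None
}

fn main() {
    let log = b"GET /index ERROR disk full";
    println!("{:?}", find_literal(log, b"ERROR"));
}
```

The benefit comes from step 1 rejecting most positions cheaply, so the more expensive matcher runs only where a match is possible.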

### Pattern matching

One representative pattern per regex feature. **Bold** = fastest engine.
`regex` is omitted for features it does not support (lookaround,
backreferences).

| Category | Ferroni | C Oniguruma | `regex` |
|----------|--------:|------------:|--------:|
| Literal exact | 104 ns | 159 ns | **11 ns** |
| Quantifier greedy | 183 ns | 261 ns | **65 ns** |
| Lookaround combined | **83 ns** | 292 ns | — |
| Unicode `\p{Greek}+` | 96 ns | 251 ns | **60 ns** |
| Backref `(\w+) \1` | **79 ns** | 199 ns | — |
| Case-insensitive phrase | 101 ns | 188 ns | **62 ns** |
| Alternation, 2 branches | 62 ns | 157 ns | **48 ns** |
| Alternation, 10 branches | 49 ns | 225 ns | **21 ns** |
| Named capture date | 361 ns | 277 ns | **44 ns** |

### Compilation

Literal patterns compile within 5% of C and named captures compile faster,
while lookbehind compiles about 1.8x slower. The `regex` crate compiles
significantly slower due to DFA construction -- the cost of its faster
matching. Lookbehind is not supported by `regex`.

| Pattern | Ferroni | C Oniguruma | `regex` |
|---------|--------:|------------:|--------:|
| Literal | **439 ns** | 448 ns | 2.33 us |
| Named capture | **4.67 us** | 5.78 us | 193 us |
| Lookbehind | 992 ns | **556 ns** | — |

### Where Ferroni is slower

- **vs `regex` crate** -- for patterns that `regex` supports, its DFA engine
  is 2-10x faster at matching (but 5-40x slower to compile)
- **Named capture extraction** -- 1.3x slower than C (region bookkeeping overhead)
- **Scanner warm cache** -- 2.2x slower than C (C's pointer comparison vs hash lookup)

### Ferroni vs the `regex` crate

The `regex` crate is faster at matching for all patterns it supports, thanks
to its DFA-based engine with guaranteed linear time. However, it compiles
5-40x slower and does not support variable-length lookbehind,
backreferences, conditional patterns, absent expressions, subexpression
calls, or named captures with multiple syntaxes (`(?<n>)`, `(?'n')`,
`(?P<n>)`). It also has no TextMate grammar scanner and is not a drop-in
replacement for Ruby/PHP regex behavior. Use
[`regex`](https://crates.io/crates/regex) when your patterns fit its syntax
and compilation cost is amortized. Use Ferroni when you need full Oniguruma
compatibility.

<details>
<summary><strong>Running benchmarks</strong></summary>

```bash
cargo bench --features ffi                          # full suite (~8 min)
cargo bench --features ffi -- scanner_highlighting  # tier 1: highlighting
cargo bench --features ffi -- text_scanning         # tier 1: log scanning
cargo bench --features ffi -- single_pattern        # tier 1: per-feature
cargo bench --features ffi -- compilation           # tier 1: compile time
cargo bench --features ffi -- regression_           # tier 2: all regression
# HTML report: target/criterion/report/index.html
```

</details>

## Architecture

Each C source file maps 1:1 to a Rust module ([ADR-001](docs/adr/001-one-to-one-parity-with-c-original.md)):

| C File | Rust Module | Purpose |
|--------|-------------|---------|
| regparse.c | `regparse.rs` | Pattern parser |
| regcomp.c | `regcomp.rs` | AST-to-bytecode compiler |
| regexec.c | `regexec.rs` | VM executor |
| regint.h | `regint.rs` | Internal types and opcodes |
| oniguruma.h | `oniguruma.rs` | Public types and constants |
| regenc.c | `regenc.rs` | Encoding trait |
| regsyntax.c | `regsyntax.rs` | 12 syntax definitions |
| regset.c | `regset.rs` | Multi-regex search (RegSet) |
| regerror.c | `regerror.rs` | Error messages |
| regtrav.c | `regtrav.rs` | Capture tree traversal |
| unicode.c | `unicode/mod.rs` | Unicode tables and segmentation |
| -- | `scanner.rs` | Multi-pattern scanner for syntax highlighting |

**Compilation pipeline** (same as C):

```
onig_new() -> onig_compile()
  -> onig_parse_tree()     (pattern -> AST)
  -> reduce_string_list()  (merge adjacent strings)
  -> tune_tree()           (6 optimization sub-passes)
  -> compile_tree()        (AST -> VM bytecode)
  -> set_optimize_info()   (extract search strategy)
```
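As an illustration of what a pass like `reduce_string_list()` does, the toy sketch below merges runs of adjacent literal nodes in a concatenation. The `Node` enum and `merge_adjacent_strings` are invented for this example and are much simpler than Ferroni's actual AST and pass.

```rust
/// Toy AST node, invented for illustration only.
#[derive(Debug, PartialEq, Clone)]
enum Node {
    Str(String), // literal text
    AnyChar,     // `.`
}

/// Merge runs of adjacent `Str` nodes in a concatenation into one node,
/// so later stages see one long literal instead of many short ones.
fn merge_adjacent_strings(nodes: Vec<Node>) -> Vec<Node> {
    let mut out: Vec<Node> = Vec::with_capacity(nodes.len());
    for node in nodes {
        if let Node::Str(next) = &node {
            if let Some(Node::Str(prev)) = out.last_mut() {
                prev.push_str(next); // extend the previous literal in place
                continue;
            }
        }
        out.push(node);
    }
    out
}

fn main() {
    // Concatenation for a pattern like `abc.de`, parsed as small pieces.
    let parsed = vec![
        Node::Str("ab".into()),
        Node::Str("c".into()),
        Node::AnyChar,
        Node::Str("d".into()),
        Node::Str("e".into()),
    ];
    println!("{:?}", merge_adjacent_strings(parsed));
}
```

Longer literals matter for the final `set_optimize_info()` step, since a longer required string generally makes a better search prefilter.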

## Scope

Ferroni targets ASCII/UTF-8 workloads. The following are intentionally not included:

- **27 of 29 encodings** -- only ASCII and UTF-8 ([ADR-003](docs/adr/003-encoding-scope-ascii-and-utf8-only.md))
- **POSIX/GNU API** -- `regcomp`/`regexec`/`regfree` ([ADR-012](docs/adr/012-posix-and-gnu-api-not-ported.md))
- **C memory management** -- replaced by Rust's `Drop` trait
- **`onig_new_deluxe`** -- C-specific allocation, use `onig_new()` instead

## Running tests

```bash
# Full UTF-8 suite (requires increased stack for debug builds)
RUST_MIN_STACK=268435456 cargo test --test compat_utf8 -- --test-threads=1

# Other suites
cargo test --test compat_syntax
cargo test --test compat_options
cargo test --test compat_regset
RUST_MIN_STACK=268435456 cargo test --test compat_back -- --test-threads=1
```

> **Warning:** Never run `cargo test -- --ignored` -- the
> `conditional_recursion_complex` test intentionally hangs.
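The `RUST_MIN_STACK=268435456` value in the commands above is 256 MiB. The variable sets the default stack size for threads spawned through the standard library. When setting an environment variable is inconvenient, the same effect can be had in code with `std::thread::Builder::stack_size`. The sketch below assumes a deeply recursive workload; `depth` and `with_big_stack` are illustrative names, not part of Ferroni.

```rust
use std::thread;

/// 256 MiB, the same value as the RUST_MIN_STACK setting.
const STACK_BYTES: usize = 256 * 1024 * 1024;

/// Stand-in for a deeply recursive match: one stack frame per step.
fn depth(n: u64) -> u64 {
    if n == 0 { 0 } else { 1 + depth(n - 1) }
}

/// Run `f` on a dedicated thread with a large stack and return its result.
fn with_big_stack<T, F>(f: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .stack_size(STACK_BYTES)
        .spawn(f)
        .expect("failed to spawn thread")
        .join()
        .expect("worker thread panicked")
}

fn main() {
    let reached = with_big_stack(|| depth(100_000));
    println!("recursed {reached} levels");
}
```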

## Test coverage

1,882 tests from three independent sources:

- **1,554** ported 1:1 from C Oniguruma's test suite
- **25** from [vscode-oniguruma](https://github.com/nicolo-ribaudo/vscode-oniguruma)'s
  TypeScript tests (Scanner API, UTF-16 mapping)
- **303** Rust-specific tests for edge cases, error paths, and gaps in the
  upstream suites

C Oniguruma has no coverage reporting. Ferroni's test suite is a strict
superset of the C suite: every upstream test is ported, and the Scanner and
Rust-specific tests come on top.

| Metric | Value | Notes |
|--------|------:|-------|
| Function coverage | >94% | All reachable API and internal functions |
| Line coverage | ~82% | 42 deeply recursive tests overflow under LLVM instrumentation |
| Tests executed | 1,840 of 1,882 | All 1,882 pass in normal `cargo test` |

Coverage measured with
[cargo-llvm-cov](https://github.com/taiki-e/cargo-llvm-cov), reported to
[Codecov](https://codecov.io/gh/sebastian-software/ferroni).

## Architecture decision records

| ADR | Decision |
|-----|----------|
| [001](docs/adr/001-one-to-one-parity-with-c-original.md) | 1:1 structural parity with C original |
| [002](docs/adr/002-unsafe-code-policy.md) | Unsafe code policy |
| [003](docs/adr/003-encoding-scope-ascii-and-utf8-only.md) | Encoding scope: ASCII and UTF-8 only |
| [004](docs/adr/004-c-to-rust-translation-patterns.md) | C-to-Rust translation patterns |
| [005](docs/adr/005-idiomatic-rust-api-layer.md) | Idiomatic Rust API layer |
| [006](docs/adr/006-scanner-api.md) | Scanner API for TextMate tokenization |
| [007](docs/adr/007-simd-accelerated-search.md) | SIMD-accelerated search via memchr |
| [008](docs/adr/008-rust-only-optimizations.md) | Rust-only optimizations and performance philosophy |
| [009](docs/adr/009-dependency-philosophy.md) | Dependency philosophy |
| [010](docs/adr/010-benchmark-strategy.md) | Benchmark strategy |
| [011](docs/adr/011-test-strategy-and-c-test-parity.md) | Test strategy and C test suite parity |
| [012](docs/adr/012-posix-and-gnu-api-not-ported.md) | POSIX and GNU API not ported |
| [013](docs/adr/013-stack-overflow-debug-builds.md) | Stack overflow mitigation in debug builds |
| [014](docs/adr/014-porting-bugs-lessons-learned.md) | Porting bugs: lessons learned |

## Contributing

Contributions are welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md)
and review the ADRs before submitting a PR.

## Acknowledgments

Ferroni is built on the work of [K. Kosako](https://github.com/kkos) and
the Oniguruma contributors. The C original powers regex in
[Ruby](https://www.ruby-lang.org/), [PHP](https://www.php.net/),
[TextMate](https://macromates.com/), [jq](https://jqlang.github.io/jq/),
and many other projects. The Scanner API and its test suite are based on
[vscode-oniguruma](https://github.com/nicolo-ribaudo/vscode-oniguruma)
by [Nicolò Ribaudo](https://github.com/nicolo-ribaudo) and the
VS Code team.

## License

[BSD-2-Clause](LICENSE) (same as Oniguruma)

---

Copyright 2026 [Sebastian Software GmbH](https://www.sebastian-software.de/)