Ferroni is a line-by-line port of Oniguruma's
C source into Rust -- same structure, same function names, same semantics. On
top of that foundation, the search pipeline uses SIMD-vectorized scanning
(NEON on ARM, SSE2/AVX2 on x86-64) via the memchr
crate, making it up to 6x faster than C Oniguruma on full-text search
workloads. No bindings, no FFI -- pure Rust.
Why Ferroni?
Security first. C Oniguruma has a history of memory safety CVEs
(CVE-2019-13224 CVSS 9.8,
CVE-2019-19204,
CVE-2019-19246,
CVE-2019-19012,
CVE-2019-13225)
affecting Ruby, PHP, and any application linking against it. Ferroni
eliminates buffer overflows, use-after-free, and NULL dereferences
structurally through Rust's type system. See ADR-005
for our unsafe policy.
Drop-in behavior. Every regex feature, every opcode, every optimization pass is ported 1:1 from C. If your pattern works in Oniguruma, it works in Ferroni -- verified by 1,695 tests ported directly from the C test suite.
No C toolchain required. Pure cargo build. Cross-compiles to
wasm32-unknown-unknown out of the box.
Easy Node.js bindings. Rust's napi-rs ecosystem
makes it straightforward to publish a native Node.js module -- no
node-gyp, no C compiler on the user's machine.
Quick Start
Add to your Cargo.toml:
[dependencies]
ferroni = { git = "https://github.com/sebastian-software/ferroni.git" }
use ferroni::regcomp::onig_new;
use ferroni::regexec::onig_search;
use ferroni::oniguruma::*;
use ferroni::regsyntax::OnigSyntaxOniguruma;
Supported Features
- All Perl/Ruby/Python syntax -- `(?:...)`, `(?=...)`, `(?!...)`, `(?<=...)`, `(?<!...)`, `(?>...)`
- Named captures -- `(?<name>...)`, `(?'name'...)`, `(?P<name>...)`
- Backreferences -- `\k<name>`, `\g<name>`, relative `\g<-1>`
- Conditionals -- `(?(cond)T|F)`
- Absent expressions -- `(?~...)`
- Unicode properties -- `\p{Script_Extensions=Greek}`, `\p{Lu}`, `\p{Emoji}` (886 names)
- Grapheme clusters -- `\X`, text segment boundaries `\y`, `\Y`
- Callouts -- `(?{...})`, `(*FAIL)`, `(*MAX{n})`, `(*COUNT)`, `(*CMP)`
- 12 syntax modes -- Oniguruma, Ruby, Perl, Perl_NG, Python, Java, Emacs, Grep, GNU, POSIX Basic/Extended, ASIS
- Safety limits -- retry, time, stack, subexp call depth (global + per-search)
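
To illustrate what a retry limit protects against, here is a minimal sketch of a backtracking matcher with a step budget. It is a toy under stated assumptions, not Ferroni's implementation: `Matcher`, `RetryLimitExceeded`, and `full_match` are hypothetical names, the pattern language covers only literals, `.` and `x*`, and the budget counts recursive calls rather than the engine's real retry counter.

```rust
/// Error returned when a search uses up its retry budget.
#[derive(Debug, PartialEq)]
pub struct RetryLimitExceeded;

/// Toy backtracking matcher (hypothetical, for illustration only).
pub struct Matcher {
    retries_left: u64,
}

impl Matcher {
    pub fn new(retry_limit: u64) -> Self {
        Matcher { retries_left: retry_limit }
    }

    /// Anchored match of `pat` against all of `text`.
    /// Supports literal bytes, `.` (any byte) and `x*` (zero or more).
    pub fn full_match(&mut self, pat: &[u8], text: &[u8]) -> Result<bool, RetryLimitExceeded> {
        // Every recursive step spends one unit of the budget.
        if self.retries_left == 0 {
            return Err(RetryLimitExceeded);
        }
        self.retries_left -= 1;

        match pat {
            [] => Ok(text.is_empty()),
            [c, b'*', rest @ ..] => {
                // Try zero repetitions first, then consume one byte and retry.
                if self.full_match(rest, text)? {
                    return Ok(true);
                }
                match text.split_first() {
                    Some((t, tail)) if *c == b'.' || c == t => self.full_match(pat, tail),
                    _ => Ok(false),
                }
            }
            [c, rest @ ..] => match text.split_first() {
                Some((t, tail)) if *c == b'.' || c == t => self.full_match(rest, tail),
                _ => Ok(false),
            },
        }
    }
}
```

With a budget of 1,000 steps, a simple pattern such as `ab*c` finishes well within the limit, while a pattern with several stacked `a*` groups against a long run of `a` with no final match is cut off with an error instead of running on.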
Performance
Criterion benchmarks vs. C Oniguruma at -O3 on Apple M1 Ultra.
In the full tables, Ratio is Rust time divided by C time, so values below 1.00 mean Ferroni is faster. The full tables follow the highlights.
Highlights
| Scenario | Rust | C | Speedup |
|---|---|---|---|
| No match, 50 KB haystack | 1.5 us | 9.3 us | 6.1x |
| No match, 10 KB haystack | 381 ns | 1.9 us | 4.9x |
| RegSet, 5 patterns (position) | 148 ns | 389 ns | 2.6x |
| 5-way alternation | 118 ns | 186 ns | 1.6x |
| Greedy quantifier | 222 ns | 257 ns | 1.2x |
| Possessive quantifier | 189 ns | 235 ns | 1.2x |
Ferroni wins 29 of 39 execution benchmarks. The SIMD-accelerated forward
search is the standout: memchr replaces
hand-written byte loops with vectorized scans, delivering 5-6x gains on
full-text scanning. See ADR-006.
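
The idea behind the forward search can be sketched with the standard library alone: skip to the next position where the first byte of a literal occurs, then verify the rest. This is an illustrative sketch, not Ferroni's code. `find_literal` is a hypothetical name, and step 1 uses a plain iterator search as a stand-in for the vectorized scan that the memchr crate provides.

```rust
/// Find the first occurrence of `needle` in `haystack`.
///
/// Sketch only: the byte scan in step 1 is scalar here; a vectorized
/// byte search would take its place in an optimized engine.
pub fn find_literal(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    let (&first, rest) = match needle.split_first() {
        Some(parts) => parts,
        None => return Some(0), // an empty needle matches at offset 0
    };

    let mut start = 0;
    while start + needle.len() <= haystack.len() {
        // Step 1: jump to the next candidate position (the hot loop).
        let offset = haystack[start..].iter().position(|&b| b == first)?;
        let candidate = start + offset;

        // Step 2: verify the remaining needle bytes at the candidate.
        if haystack[candidate + 1..].starts_with(rest) {
            return Some(candidate);
        }
        start = candidate + 1;
    }
    None
}
```

When the first byte never occurs, step 1 scans the haystack once and returns immediately. That is the no-match case where a faster byte scan pays off most.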
Compilation is typically 1.2-1.7x slower (Rust allocates more than C's pre-allocated buffers), with named captures an outlier at roughly 8x. Compilation is a one-time cost, though -- real-world consumers compile patterns once and match millions of times.
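
One common way to pay the compilation cost only once is to cache the compiled pattern in a static. The sketch below assumes a stand-in `CompiledPattern` type and `compile` function. Both are hypothetical placeholders, not Ferroni APIs, and the "matching" is a trivial literal search. `COMPILE_CALLS` exists only to show that compilation runs a single time.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for a compiled regex (not a Ferroni type).
pub struct CompiledPattern {
    literal: Vec<u8>,
}

/// Counts how often the pretend-expensive compile step runs.
pub static COMPILE_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Placeholder for a real compile call.
fn compile(pattern: &str) -> CompiledPattern {
    COMPILE_CALLS.fetch_add(1, Ordering::SeqCst);
    CompiledPattern { literal: pattern.as_bytes().to_vec() }
}

impl CompiledPattern {
    /// Toy matcher: true if the (non-empty) literal occurs in `haystack`.
    pub fn is_match(&self, haystack: &[u8]) -> bool {
        haystack
            .windows(self.literal.len())
            .any(|window| window == self.literal.as_slice())
    }
}

/// Returns the shared pattern, compiling it on first use only.
pub fn date_separator() -> &'static CompiledPattern {
    static PATTERN: OnceLock<CompiledPattern> = OnceLock::new();
    PATTERN.get_or_init(|| compile("-"))
}
```

Every call to `date_separator()` after the first returns the cached value, so the compile cost is amortized across all later matches.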
Regex Execution
| Benchmark | Rust | C | Ratio |
|---|---|---|---|
| Literal match | |||
| exact string | 134 ns | 149 ns | 0.90 |
| anchored start | 103 ns | 147 ns | 0.70 |
| anchored end | 165 ns | 160 ns | 1.03 |
| word boundary | 118 ns | 155 ns | 0.76 |
| Quantifiers | |||
| greedy | 222 ns | 257 ns | 0.86 |
| lazy | 194 ns | 222 ns | 0.87 |
| possessive | 189 ns | 235 ns | 0.80 |
| nested | 180 ns | 234 ns | 0.77 |
| Alternation | |||
| 2 branches | 106 ns | 153 ns | 0.69 |
| 5 branches | 118 ns | 186 ns | 0.64 |
| 10 branches | 255 ns | 223 ns | 1.14 |
| nested | 125 ns | 176 ns | 0.71 |
| Backreferences | |||
| simple `(\w+) \1` | 143 ns | 188 ns | 0.76 |
| nested | 148 ns | 201 ns | 0.74 |
| named | 143 ns | 189 ns | 0.76 |
| Lookaround | |||
| positive lookahead | 124 ns | 163 ns | 0.76 |
| negative lookahead | 137 ns | 179 ns | 0.77 |
| positive lookbehind | 276 ns | 264 ns | 1.05 |
| negative lookbehind | 353 ns | 332 ns | 1.06 |
| combined | 299 ns | 286 ns | 1.05 |
| Unicode properties | |||
| `\p{Lu}+` | 93 ns | 143 ns | 0.65 |
| `\p{Letter}+` | 126 ns | 170 ns | 0.74 |
| `\p{Greek}+` | 320 ns | 239 ns | 1.34 |
| `\p{Cyrillic}+` | 435 ns | 324 ns | 1.34 |
| Case-insensitive | |||
| single word | 106 ns | 154 ns | 0.69 |
| phrase | 161 ns | 185 ns | 0.87 |
| alternation | 112 ns | 157 ns | 0.71 |
| Named captures | |||
| date extraction | 454 ns | 272 ns | 1.67 |
| Large text (first match) | |||
| literal 10 KB | 112 ns | 147 ns | 0.76 |
| literal 50 KB | 112 ns | 142 ns | 0.79 |
| timestamp 10 KB | 230 ns | 180 ns | 1.28 |
| timestamp 50 KB | 230 ns | 179 ns | 1.28 |
| field extract 10 KB | 160 ns | 176 ns | 0.91 |
| field extract 50 KB | 158 ns | 173 ns | 0.91 |
| no match 10 KB | 381 ns | 1.9 us | 0.20 |
| no match 50 KB | 1.5 us | 9.3 us | 0.16 |
| RegSet | |||
| position-lead (5 patterns) | 148 ns | 389 ns | 0.38 |
| regex-lead (5 patterns) | 162 ns | 234 ns | 0.69 |
| Match at position | |||
| `\d+` at offset 4 | 117 ns | 152 ns | 0.77 |
Regex Compilation
| Pattern | Rust | C | Ratio |
|---|---|---|---|
| literal | 429 ns | 466 ns | 0.92 |
| `.*` | 769 ns | 532 ns | 1.45 |
| alternation | 1,791 ns | 1,449 ns | 1.24 |
| char class | 673 ns | 636 ns | 1.06 |
| quantifier | 1,403 ns | 1,049 ns | 1.34 |
| group | 1,076 ns | 786 ns | 1.37 |
| backref | 1,631 ns | 967 ns | 1.69 |
| lookahead | 763 ns | 482 ns | 1.58 |
| lookbehind | 721 ns | 552 ns | 1.31 |
| named capture | 46,849 ns | 5,751 ns | 8.15 |
Running Benchmarks
# HTML report: target/criterion/report/index.html
Architecture
Each C source file maps 1:1 to a Rust module (ADR-001):
| C File | Rust Module | Purpose |
|---|---|---|
| regparse.c | regparse.rs | Pattern parser |
| regcomp.c | regcomp.rs | AST-to-bytecode compiler |
| regexec.c | regexec.rs | VM executor |
| regint.h | regint.rs | Internal types and opcodes |
| oniguruma.h | oniguruma.rs | Public types and constants |
| regenc.c | regenc.rs | Encoding trait |
| regsyntax.c | regsyntax.rs | 12 syntax definitions |
| regset.c | regset.rs | Multi-regex search (RegSet) |
| regerror.c | regerror.rs | Error messages |
| regtrav.c | regtrav.rs | Capture tree traversal |
| unicode.c | unicode/mod.rs | Unicode tables and segmentation |
Compilation pipeline (same as C):
onig_new() -> onig_compile()
-> onig_parse_tree() (pattern -> AST)
-> reduce_string_list() (merge adjacent strings)
-> tune_tree() (6 optimization sub-passes)
-> compile_tree() (AST -> VM bytecode)
-> set_optimize_info() (extract search strategy)
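
As an illustration of one pipeline stage, the string-merging pass can be sketched on a toy AST. This is a simplified sketch under stated assumptions: the `Node` enum and the `merge_adjacent_strings` function are hypothetical and far smaller than Ferroni's real node type and `reduce_string_list()`. They show only the general shape of the transformation.

```rust
/// Toy AST node (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Str(String),
    AnyChar,
    List(Vec<Node>),
}

/// Merge runs of adjacent `Str` nodes inside every `List`, recursively.
pub fn merge_adjacent_strings(node: Node) -> Node {
    match node {
        Node::List(children) => {
            let mut merged: Vec<Node> = Vec::with_capacity(children.len());
            for child in children {
                match merge_adjacent_strings(child) {
                    Node::Str(next) => {
                        // Append to the previous node if it is also a string.
                        if let Some(Node::Str(prev)) = merged.last_mut() {
                            prev.push_str(&next);
                        } else {
                            merged.push(Node::Str(next));
                        }
                    }
                    other => merged.push(other),
                }
            }
            Node::List(merged)
        }
        // Leaf nodes are returned unchanged.
        other => other,
    }
}
```

Fewer, longer string nodes give the later stages less to walk and expose longer literals for the search-strategy step to exploit.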
Scope
Ferroni targets ASCII/UTF-8 workloads. The following are intentionally not included:
- 27 of 29 encodings -- only ASCII and UTF-8 (ADR-002)
- POSIX/GNU API -- `regcomp`/`regexec`/`regfree` (ADR-007)
- C memory management -- replaced by Rust's `Drop` trait
- `onig_new_deluxe` -- C-specific allocation, use `onig_new()` instead
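
To show what replacing manual memory management with `Drop` means in practice, here is a minimal sketch. `RegexHandle` is a hypothetical stand-in, not a Ferroni type. The flag it sets only makes the cleanup observable, where a real type would release its buffers.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical handle standing in for a compiled regex.
pub struct RegexHandle {
    freed: Rc<Cell<bool>>,
}

impl Drop for RegexHandle {
    fn drop(&mut self) {
        // Cleanup that a C caller would have to trigger with an explicit free call.
        self.freed.set(true);
    }
}

/// Creates a handle, lets it leave scope, and reports whether cleanup ran.
pub fn cleanup_ran_after_scope() -> bool {
    let freed = Rc::new(Cell::new(false));
    {
        let _handle = RegexHandle { freed: Rc::clone(&freed) };
        // `_handle` is usable here; no explicit free call appears anywhere.
    } // `_handle` is dropped at this closing brace.
    freed.get()
}
```

Because cleanup is tied to scope, forgetting to free a regex or freeing it twice is not something calling code can express.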
Running Tests
# Full UTF-8 suite (requires increased stack for debug builds)
RUST_MIN_STACK=268435456
# Other suites
RUST_MIN_STACK=268435456
Warning: Never run `cargo test -- --ignored` -- the `conditional_recursion_complex` test intentionally hangs.
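
`RUST_MIN_STACK` raises the default stack size of threads spawned by the standard library, which includes the threads the test harness uses. The same effect can be requested for a single thread in code. The sketch below is illustrative only: `depth` and `run_with_big_stack` are hypothetical helpers, not part of Ferroni, and the recursion merely simulates a deep call chain.

```rust
use std::thread;

/// Deliberately recursive helper that simulates a deep call chain.
fn depth(n: u64) -> u64 {
    if n == 0 {
        0
    } else {
        1 + depth(n - 1)
    }
}

/// Runs `depth(levels)` on a thread with a 256 MiB stack.
pub fn run_with_big_stack(levels: u64) -> u64 {
    thread::Builder::new()
        // 256 * 1024 * 1024 = 268435456 bytes, the RUST_MIN_STACK value above.
        .stack_size(256 * 1024 * 1024)
        .spawn(move || depth(levels))
        .expect("failed to spawn thread")
        .join()
        .expect("thread panicked")
}
```

An explicit `stack_size` applies only to the one thread it is set on, whereas the environment variable covers every thread that does not set its own size.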
Architecture Decision Records
| ADR | Decision |
|---|---|
| 001 | 1:1 structural parity with C original |
| 002 | ASCII and UTF-8 only |
| 003 | Stack overflow mitigation in debug builds |
| 004 | C-to-Rust translation patterns |
| 005 | Unsafe code policy |
| 006 | SIMD-accelerated search via memchr |
| 007 | POSIX and GNU API not ported |
| 008 | Test strategy and C test suite parity |
| 009 | Porting bugs: lessons learned |
Contributing
Contributions are welcome! Please read CONTRIBUTING.md and review the ADRs before submitting a PR.
Acknowledgments
Ferroni is built on the work of K. Kosako and the Oniguruma contributors. The C original powers regex in Ruby, PHP, TextMate, and many other projects.
License
BSD-2-Clause (same as Oniguruma)