ferroni 1.0.1

Pure-Rust Oniguruma regex engine with SIMD-accelerated search
Documentation

Ferroni started as a line-by-line port of Oniguruma's C source into Rust -- same structure, same function names, same semantics. From there it gained SIMD-vectorized search (NEON on ARM, SSE2/AVX2 on x86-64) via memchr, making it up to 6x faster than C Oniguruma on full-text scanning. Finally, an idiomatic Rust API layer (Regex::new(), typed errors, Match/Captures) was added on top -- so you get ergonomic Rust with battle-tested Oniguruma semantics underneath. No bindings, no FFI -- pure Rust.

Why Ferroni?

Security first. C Oniguruma has a history of memory safety CVEs (CVE-2019-13224 CVSS 9.8, CVE-2019-19204, CVE-2019-19246, CVE-2019-19012, CVE-2019-13225) affecting Ruby, PHP, and any application linking against it. Ferroni eliminates buffer overflows, use-after-free, and NULL dereferences structurally through Rust's type system. See ADR-005 for our unsafe policy.

Drop-in behavior. Every regex feature, every opcode, every optimization pass is ported 1:1 from C. If your pattern works in Oniguruma, it works in Ferroni -- verified by 1,695 tests ported directly from the C test suite.

No C toolchain required. Pure cargo build. Cross-compiles to wasm32-unknown-unknown out of the box.

Easy Node.js bindings. Rust's napi-rs ecosystem makes it straightforward to publish a native Node.js module -- no node-gyp, no C compiler on the user's machine.

Quick Start

Add to your Cargo.toml:

[dependencies]
ferroni = { git = "https://github.com/sebastian-software/ferroni.git" }
use ferroni::prelude::*;

fn main() -> Result<(), RegexError> {
    let re = Regex::new(r"(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})")?;

    let caps = re.captures("Date: 2026-02-12").unwrap();
    assert_eq!(caps.get(0).unwrap().as_str(), "2026-02-12");
    assert_eq!(caps.name("year").unwrap().as_str(), "2026");
    assert_eq!(caps.name("month").unwrap().as_str(), "02");
    Ok(())
}

For fine-grained control, use RegexBuilder:

use ferroni::prelude::*;

let re = Regex::builder(r"hello")
    .case_insensitive(true)
    .build()
    .unwrap();
assert!(re.is_match("Hello World"));

The full C-ported API is also available for advanced usage:

use ferroni::regcomp::onig_new;
use ferroni::regexec::onig_search;
use ferroni::oniguruma::*;
use ferroni::regsyntax::OnigSyntaxOniguruma;

let reg = onig_new(
    b"\\d{4}-\\d{2}-\\d{2}",
    ONIG_OPTION_NONE,
    &ferroni::encodings::utf8::ONIG_ENCODING_UTF8,
    &OnigSyntaxOniguruma,
).unwrap();

let input = b"Date: 2026-02-12";
let (result, region) = onig_search(
    &reg, input, input.len(), 0, input.len(),
    Some(OnigRegion::new()), ONIG_OPTION_NONE,
);

assert!(result >= 0);
assert_eq!(result, 6); // match starts at byte 6

Supported Features

  • All Perl/Ruby/Python syntax -- (?:...), (?=...), (?!...), (?<=...), (?<!...), (?>...)
  • Named captures -- (?<name>...), (?'name'...), (?P<name>...)
  • Backreferences -- \k<name>, \g<name>, relative \g<-1>
  • Conditionals -- (?(cond)T|F)
  • Absent expressions -- (?~...)
  • Unicode properties -- \p{Script_Extensions=Greek}, \p{Lu}, \p{Emoji} (886 names)
  • Grapheme clusters -- \X, text segment boundaries \y, \Y
  • Callouts -- (?{...}), (*FAIL), (*MAX{n}), (*COUNT), (*CMP)
  • 12 syntax modes -- Oniguruma, Ruby, Perl, Perl_NG, Python, Java, Emacs, Grep, GNU, POSIX Basic/Extended, ASIS
  • Safety limits -- retry, time, stack, subexp call depth (global + per-search)

Performance

Criterion benchmarks vs. C Oniguruma at -O3 on Apple M1 Ultra. Bold = faster engine. Full tables in the expandable section below.

Highlights

Scenario Rust C Speedup
No match, 50 KB haystack 1.5 us 9.3 us 6.1x
No match, 10 KB haystack 381 ns 1.9 us 4.9x
RegSet, 5 patterns (position) 148 ns 389 ns 2.6x
5-way alternation 118 ns 186 ns 1.6x
Greedy quantifier 222 ns 257 ns 1.2x
Possessive quantifier 189 ns 235 ns 1.2x

Ferroni wins 29 of 39 execution benchmarks. The SIMD-accelerated forward search is the standout: memchr replaces hand-written byte loops with vectorized scans, delivering 5-6x gains on full-text scanning. See ADR-006.

Compilation is 1.2-1.7x slower (Rust allocates more than C's pre-allocated buffers), but compilation is a one-time cost -- real-world consumers compile patterns once and match millions of times.

Regex Execution

Benchmark Rust C Ratio
Literal match
exact string 134 ns 149 ns 0.90
anchored start 103 ns 147 ns 0.70
anchored end 165 ns 160 ns 1.03
word boundary 118 ns 155 ns 0.76
Quantifiers
greedy 222 ns 257 ns 0.86
lazy 194 ns 222 ns 0.87
possessive 189 ns 235 ns 0.80
nested 180 ns 234 ns 0.77
Alternation
2 branches 106 ns 153 ns 0.69
5 branches 118 ns 186 ns 0.64
10 branches 255 ns 223 ns 1.14
nested 125 ns 176 ns 0.71
Backreferences
simple (\w+) \1 143 ns 188 ns 0.76
nested 148 ns 201 ns 0.74
named 143 ns 189 ns 0.76
Lookaround
positive lookahead 124 ns 163 ns 0.76
negative lookahead 137 ns 179 ns 0.77
positive lookbehind 276 ns 264 ns 1.05
negative lookbehind 353 ns 332 ns 1.06
combined 299 ns 286 ns 1.05
Unicode properties
\p{Lu}+ 93 ns 143 ns 0.65
\p{Letter}+ 126 ns 170 ns 0.74
\p{Greek}+ 320 ns 239 ns 1.34
\p{Cyrillic}+ 435 ns 324 ns 1.34
Case-insensitive
single word 106 ns 154 ns 0.69
phrase 161 ns 185 ns 0.87
alternation 112 ns 157 ns 0.71
Named captures
date extraction 454 ns 272 ns 1.67
Large text (first match)
literal 10 KB 112 ns 147 ns 0.76
literal 50 KB 112 ns 142 ns 0.79
timestamp 10 KB 230 ns 180 ns 1.28
timestamp 50 KB 230 ns 179 ns 1.28
field extract 10 KB 160 ns 176 ns 0.91
field extract 50 KB 158 ns 173 ns 0.91
no match 10 KB 381 ns 1.9 us 0.20
no match 50 KB 1.5 us 9.3 us 0.16
RegSet
position-lead (5 patterns) 148 ns 389 ns 0.38
regex-lead (5 patterns) 162 ns 234 ns 0.69
Match at position
\d+ at offset 4 117 ns 152 ns 0.77

Regex Compilation

Pattern Rust C Ratio
literal 429 ns 466 ns 0.92
.* 769 ns 532 ns 1.45
alternation 1,791 ns 1,449 ns 1.24
char class 673 ns 636 ns 1.06
quantifier 1,403 ns 1,049 ns 1.34
group 1,076 ns 786 ns 1.37
backref 1,631 ns 967 ns 1.69
lookahead 763 ns 482 ns 1.58
lookbehind 721 ns 552 ns 1.31
named capture 46,849 ns 5,751 ns 8.15

Running Benchmarks

cargo bench --features ffi               # full suite (~8 min)
cargo bench --features ffi -- compile    # specific group
cargo bench --features ffi -- "large_"   # pattern filter
# HTML report: target/criterion/report/index.html

Architecture

Each C source file maps 1:1 to a Rust module (ADR-001):

C File Rust Module Purpose
regparse.c regparse.rs Pattern parser
regcomp.c regcomp.rs AST-to-bytecode compiler
regexec.c regexec.rs VM executor
regint.h regint.rs Internal types and opcodes
oniguruma.h oniguruma.rs Public types and constants
regenc.c regenc.rs Encoding trait
regsyntax.c regsyntax.rs 12 syntax definitions
regset.c regset.rs Multi-regex search (RegSet)
regerror.c regerror.rs Error messages
regtrav.c regtrav.rs Capture tree traversal
unicode.c unicode/mod.rs Unicode tables and segmentation

Compilation pipeline (same as C):

onig_new() -> onig_compile()
  -> onig_parse_tree()     (pattern -> AST)
  -> reduce_string_list()  (merge adjacent strings)
  -> tune_tree()           (6 optimization sub-passes)
  -> compile_tree()        (AST -> VM bytecode)
  -> set_optimize_info()   (extract search strategy)

Scope

Ferroni targets ASCII/UTF-8 workloads. The following are intentionally not included:

  • 27 of 29 encodings -- only ASCII and UTF-8 (ADR-002)
  • POSIX/GNU API -- regcomp/regexec/regfree (ADR-007)
  • C memory management -- replaced by Rust's Drop trait
  • onig_new_deluxe -- C-specific allocation, use onig_new() instead

Running Tests

# Full UTF-8 suite (requires increased stack for debug builds)
RUST_MIN_STACK=268435456 cargo test --test compat_utf8 -- --test-threads=1

# Other suites
cargo test --test compat_syntax
cargo test --test compat_options
cargo test --test compat_regset
RUST_MIN_STACK=268435456 cargo test --test compat_back -- --test-threads=1

Warning: Never run cargo test -- --ignored -- the conditional_recursion_complex test intentionally hangs.

Test Coverage

Ferroni ships 1,695 tests -- 1,554 ported 1:1 from C Oniguruma's test suite plus 141 additional Rust-specific tests covering edge cases, error paths, and features that the C suite leaves untested. The original C project has no coverage reporting; Ferroni's test suite is strictly a superset of it. Combined with Rust's compile-time guarantees against buffer overflows, use-after-free, and data races, this makes Ferroni's long-term quality posture stronger than the C original.

Coverage is measured with cargo-llvm-cov on nightly and reported to Codecov.

Metric Value Notes
Function coverage >94% All reachable API and internal functions
Line coverage ~82% Limited by stack overflow under instrumentation
Tests executed 1,653 of 1,695 42 skipped during coverage runs only

Why not higher? Oniguruma supports deeply recursive patterns -- backreferences like (\w+)\1 that backtrack through thousands of stack frames, and subexpression calls like \g<name> that recurse into the VM. These tests pass normally but cause stack overflows under LLVM's coverage instrumentation, which significantly increases each stack frame's size. Since macOS and Linux cap thread stacks at ~1 GB, roughly 42 of the most recursive tests must be skipped during coverage runs.

The skipped tests exercise branches deep inside the three largest modules (parser, compiler, VM executor -- together ~18,000 lines). The functions themselves are covered, but specific branches within them (e.g., nested backref matching, mutual recursion, conditional recursion) remain unexercised under instrumentation. All 1,695 tests pass in regular cargo test builds without any skips.

Functions that cannot be meaningfully tested -- global configuration setters, encoding stubs for the unused ASCII path, error message formatting, callout argument accessors -- are excluded via #[coverage(off)] on nightly to avoid inflating the denominator.

Architecture Decision Records

ADR Decision
001 1:1 structural parity with C original
002 ASCII and UTF-8 only
003 Stack overflow mitigation in debug builds
004 C-to-Rust translation patterns
005 Unsafe code policy
006 SIMD-accelerated search via memchr
007 POSIX and GNU API not ported
008 Test strategy and C test suite parity
009 Porting bugs: lessons learned
010 Idiomatic Rust API layer

Contributing

Contributions are welcome! Please read CONTRIBUTING.md and review the ADRs before submitting a PR.

Acknowledgments

Ferroni is built on the work of K. Kosako and the Oniguruma contributors. The C original powers regex in Ruby, PHP, TextMate, and many other projects.

License

BSD-2-Clause (same as Oniguruma)