ferroni 0.1.0

Ferroni is a line-by-line port of Oniguruma's C source into Rust -- same structure, same function names, same semantics. On top of that foundation, the search pipeline uses SIMD-vectorized scanning (NEON on ARM, SSE2/AVX2 on x86-64) via the memchr crate, making it up to 6x faster than C Oniguruma on full-text search workloads. No bindings, no FFI -- pure Rust.

Why Ferroni?

Security first. C Oniguruma has a history of memory safety CVEs (CVE-2019-13224 CVSS 9.8, CVE-2019-19204, CVE-2019-19246, CVE-2019-19012, CVE-2019-13225) affecting Ruby, PHP, and any application linking against it. Ferroni eliminates buffer overflows, use-after-free, and NULL dereferences structurally through Rust's type system. See ADR-005 for our unsafe policy.

Drop-in behavior. Every regex feature, every opcode, every optimization pass is ported 1:1 from C. If your pattern works in Oniguruma, it works in Ferroni -- verified by 1,695 tests ported directly from the C test suite.

No C toolchain required. Pure cargo build. Cross-compiles to wasm32-unknown-unknown out of the box.

Easy Node.js bindings. Rust's napi-rs ecosystem makes it straightforward to publish a native Node.js module -- no node-gyp, no C compiler on the user's machine.

Quick Start

Add to your Cargo.toml:

[dependencies]
ferroni = { git = "https://github.com/sebastian-software/ferroni.git" }

use ferroni::regcomp::onig_new;
use ferroni::regexec::onig_search;
use ferroni::oniguruma::*;
use ferroni::regsyntax::OnigSyntaxOniguruma;

fn main() {
    let reg = onig_new(
        b"(?<year>\\d{4})-(?<month>\\d{2})-(?<day>\\d{2})",
        ONIG_OPTION_NONE,
        &ferroni::encodings::utf8::ONIG_ENCODING_UTF8,
        &OnigSyntaxOniguruma as *const OnigSyntaxType,
    ).unwrap();

    let input = b"Date: 2026-02-12";
    let (result, region) = onig_search(
        &reg, input, input.len(), input.len(), 0,
        Some(OnigRegion::new()), ONIG_OPTION_NONE,
    );

    let region = region.unwrap();
    assert!(result >= 0);
    assert_eq!(region.beg[0], 6);  // "2026-02-12" starts at byte 6
    assert_eq!(region.end[0], 16);
}
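
The region returned by the search holds one begin/end byte offset pair per capture group. As a minimal sketch of how those offsets map back to the input, the helper below slices a single group out of a byte string. It is standalone and does not call Ferroni: the `capture` function is hypothetical, and it assumes the offsets are `i32` values with `-1` marking a group that did not participate in the match (as with ONIG_REGION_NOTPOS in C Oniguruma). Check the actual OnigRegion field types before adapting it.

```rust
/// Return the bytes of capture group `group`, or `None` if the group index is
/// out of range or the group did not participate in the match.
/// Assumption (not confirmed by the Ferroni docs): offsets are i32 byte
/// positions and -1 means "no match for this group".
fn capture<'a>(input: &'a [u8], beg: &[i32], end: &[i32], group: usize) -> Option<&'a [u8]> {
    let b = *beg.get(group)?;
    let e = *end.get(group)?;
    if b < 0 || e < b {
        return None;
    }
    // `get` with a range returns None instead of panicking on bad offsets.
    input.get(b as usize..e as usize)
}
```

For the date pattern above, and assuming the three named groups are numbered 1 to 3 in order of appearance, group 1 (`year`) would cover bytes 6..10 of `Date: 2026-02-12`, group 2 (`month`) bytes 11..13, and group 3 (`day`) bytes 14..16.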

Supported Features

Perl/Ruby/Python group syntax -- non-capturing (?:...), lookahead (?=...), (?!...), lookbehind (?<=...), (?<!...), atomic (?>...)
Named captures -- (?<name>...), (?'name'...), (?P<name>...)
Backreferences and subexpression calls -- \k<name>, \g<name>, relative \g<-1>
Conditionals -- (?(cond)T|F)
Absent expressions -- (?~...)
Unicode properties -- \p{Script_Extensions=Greek}, \p{Lu}, \p{Emoji} (886 names)
Grapheme clusters -- \X, text segment boundaries \y, \Y
Callouts -- (?{...}), (*FAIL), (*MAX{n}), (*COUNT), (*CMP)
12 syntax modes -- Oniguruma, Ruby, Perl, Perl_NG, Python, Java, Emacs, Grep, GNU, POSIX Basic/Extended, ASIS
Safety limits -- retry, time, stack, subexp call depth (global + per-search)

Performance

Criterion benchmarks vs. C Oniguruma compiled at -O3, measured on an Apple M1 Ultra. In the full tables, Ratio is Rust time divided by C time, so values below 1.00 mean Ferroni is faster.

Highlights

Scenario	Rust	C	Speedup
No match, 50 KB haystack	1.5 us	9.3 us	6.1x
No match, 10 KB haystack	381 ns	1.9 us	4.9x
RegSet, 5 patterns (position)	148 ns	389 ns	2.6x
5-way alternation	118 ns	186 ns	1.6x
Greedy quantifier	222 ns	257 ns	1.2x
Possessive quantifier	189 ns	235 ns	1.2x

Ferroni wins 29 of 39 execution benchmarks. The SIMD-accelerated forward search is the standout: memchr replaces hand-written byte loops with vectorized scans, delivering 5-6x gains on full-text scanning. See ADR-006.
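
To illustrate why a byte-scan prefilter helps most on no-match inputs, here is a minimal sketch of the general technique. It is not Ferroni's implementation (ADR-006 describes the real design). `find_byte` is a scalar stand-in with the same contract as `memchr::memchr(needle, haystack)`, and `find_literal` is a hypothetical helper: the expensive comparison runs only where the first byte already matches, so a haystack without that byte is rejected by the scan alone.

```rust
/// Scalar stand-in for `memchr::memchr`: index of the first `needle` byte.
/// The real crate vectorizes this scan; this version only mirrors its contract.
fn find_byte(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

/// Find `literal` in `haystack`, using its first byte as a prefilter so the
/// full comparison runs only at candidate positions.
/// An empty literal has no first byte to scan for and yields `None` here.
fn find_literal(haystack: &[u8], literal: &[u8]) -> Option<usize> {
    let (&first, _) = literal.split_first()?;
    let mut start = 0;
    while start < haystack.len() {
        // Jump straight to the next candidate; stop if there is none.
        let at = start + find_byte(first, &haystack[start..])?;
        if haystack[at..].starts_with(literal) {
            return Some(at);
        }
        start = at + 1;
    }
    None
}
```

In a regex engine the `starts_with` check would be replaced by a VM match attempt at the candidate position; the fewer candidates the scan produces, the less often the VM is entered.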

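Compilation is typically 1.2-1.7x slower (Rust allocates more than C's pre-allocated buffers); in the Regex Compilation table, literal patterns are slightly faster (0.92) and named captures are the outlier at 8.15x. Compilation is a one-time cost, though -- real-world consumers compile patterns once and match millions of times.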

Regex Execution

Benchmark	Rust	C	Ratio
Literal match
exact string	134 ns	149 ns	0.90
anchored start	103 ns	147 ns	0.70
anchored end	165 ns	160 ns	1.03
word boundary	118 ns	155 ns	0.76
Quantifiers
greedy	222 ns	257 ns	0.86
lazy	194 ns	222 ns	0.87
possessive	189 ns	235 ns	0.80
nested	180 ns	234 ns	0.77
Alternation
2 branches	106 ns	153 ns	0.69
5 branches	118 ns	186 ns	0.64
10 branches	255 ns	223 ns	1.14
nested	125 ns	176 ns	0.71
Backreferences
simple `(\w+) \1`	143 ns	188 ns	0.76
nested	148 ns	201 ns	0.74
named	143 ns	189 ns	0.76
Lookaround
positive lookahead	124 ns	163 ns	0.76
negative lookahead	137 ns	179 ns	0.77
positive lookbehind	276 ns	264 ns	1.05
negative lookbehind	353 ns	332 ns	1.06
combined	299 ns	286 ns	1.05
Unicode properties
`\p{Lu}+`	93 ns	143 ns	0.65
`\p{Letter}+`	126 ns	170 ns	0.74
`\p{Greek}+`	320 ns	239 ns	1.34
`\p{Cyrillic}+`	435 ns	324 ns	1.34
Case-insensitive
single word	106 ns	154 ns	0.69
phrase	161 ns	185 ns	0.87
alternation	112 ns	157 ns	0.71
Named captures
date extraction	454 ns	272 ns	1.67
Large text (first match)
literal 10 KB	112 ns	147 ns	0.76
literal 50 KB	112 ns	142 ns	0.79
timestamp 10 KB	230 ns	180 ns	1.28
timestamp 50 KB	230 ns	179 ns	1.28
field extract 10 KB	160 ns	176 ns	0.91
field extract 50 KB	158 ns	173 ns	0.91
no match 10 KB	381 ns	1.9 us	0.20
no match 50 KB	1.5 us	9.3 us	0.16
RegSet
position-lead (5 patterns)	148 ns	389 ns	0.38
regex-lead (5 patterns)	162 ns	234 ns	0.69
Match at position
`\d+` at offset 4	117 ns	152 ns	0.77
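
As a quick check on how to read the two tables, the sketch below recomputes Ratio (Rust time divided by C time) and Speedup (the inverse) from the published timings. The function names are made up for this example. Because the tables show rounded timings, a recomputed value can differ from the published one in the last digit.

```rust
/// Ratio as in the full tables: Rust time / C time (below 1.0 means Rust is faster).
fn ratio(rust_ns: f64, c_ns: f64) -> f64 {
    rust_ns / c_ns
}

/// Speedup as in the Highlights table: C time / Rust time.
fn speedup(rust_ns: f64, c_ns: f64) -> f64 {
    c_ns / rust_ns
}

/// Count the rows in which the Rust timing is lower than the C timing.
/// Each row is a (rust_ns, c_ns) pair.
fn rust_wins(rows: &[(f64, f64)]) -> usize {
    rows.iter().filter(|&&(rust_ns, c_ns)| rust_ns < c_ns).count()
}
```

For example, the "exact string" row (134 ns vs. 149 ns) gives a ratio of about 0.90, and the RegSet position-lead row (148 ns vs. 389 ns) gives a speedup of about 2.6x. Applying `rust_wins` to the four "Literal match" rows counts three wins, since "anchored end" is slightly slower in Rust.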

Regex Compilation

Pattern	Rust	C	Ratio
literal	429 ns	466 ns	0.92
`.*`	769 ns	532 ns	1.45
alternation	1,791 ns	1,449 ns	1.24
char class	673 ns	636 ns	1.06
quantifier	1,403 ns	1,049 ns	1.34
group	1,076 ns	786 ns	1.37
backref	1,631 ns	967 ns	1.69
lookahead	763 ns	482 ns	1.58
lookbehind	721 ns	552 ns	1.31
named capture	46,849 ns	5,751 ns	8.15

Running Benchmarks

cargo bench --features ffi               # full suite (~8 min)
cargo bench --features ffi -- compile    # specific group
cargo bench --features ffi -- "large_"   # pattern filter
# HTML report: target/criterion/report/index.html

Architecture

Each C source file maps 1:1 to a Rust module (ADR-001):

C File	Rust Module	Purpose
regparse.c	`regparse.rs`	Pattern parser
regcomp.c	`regcomp.rs`	AST-to-bytecode compiler
regexec.c	`regexec.rs`	VM executor
regint.h	`regint.rs`	Internal types and opcodes
oniguruma.h	`oniguruma.rs`	Public types and constants
regenc.c	`regenc.rs`	Encoding trait
regsyntax.c	`regsyntax.rs`	12 syntax definitions
regset.c	`regset.rs`	Multi-regex search (RegSet)
regerror.c	`regerror.rs`	Error messages
regtrav.c	`regtrav.rs`	Capture tree traversal
unicode.c	`unicode/mod.rs`	Unicode tables and segmentation

Compilation pipeline (same as C):

onig_new() -> onig_compile()
  -> onig_parse_tree()     (pattern -> AST)
  -> reduce_string_list()  (merge adjacent strings)
  -> tune_tree()           (6 optimization sub-passes)
  -> compile_tree()        (AST -> VM bytecode)
  -> set_optimize_info()   (extract search strategy)
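
To make the "merge adjacent strings" step concrete, here is a toy model of that kind of pass. The `Node` enum and `merge_adjacent_strings` are hypothetical and far simpler than the real AST in regparse.rs; the sketch only shows the idea of folding neighbouring literal nodes into one, so later passes see a single longer string.

```rust
/// Toy AST for illustration only; not Ferroni's node type.
#[derive(Debug, PartialEq)]
enum Node {
    Str(Vec<u8>),       // literal bytes
    AnyChar,            // `.`
    Concat(Vec<Node>),  // sequence of sub-nodes
}

/// Merge runs of adjacent `Str` nodes inside every `Concat`, recursively.
fn merge_adjacent_strings(node: Node) -> Node {
    match node {
        Node::Concat(children) => {
            let mut merged: Vec<Node> = Vec::new();
            for child in children {
                // Take the previous node back out so it can be combined by value.
                match (merged.pop(), merge_adjacent_strings(child)) {
                    (Some(Node::Str(mut prev)), Node::Str(next)) => {
                        prev.extend_from_slice(&next);
                        merged.push(Node::Str(prev));
                    }
                    (Some(prev), next) => {
                        merged.push(prev);
                        merged.push(next);
                    }
                    (None, next) => merged.push(next),
                }
            }
            Node::Concat(merged)
        }
        other => other,
    }
}
```

With this model, a sequence `Str("ab"), Str("c"), AnyChar, Str("d")` becomes `Str("abc"), AnyChar, Str("d")`: the first two literals are fused, while the `.` node keeps the last literal separate.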

Scope

Ferroni targets ASCII/UTF-8 workloads. The following are intentionally not included:

27 of 29 encodings -- only ASCII and UTF-8 (ADR-002)
POSIX/GNU API -- regcomp/regexec/regfree (ADR-007)
C memory management -- replaced by Rust's Drop trait
onig_new_deluxe -- C-specific allocation, use onig_new() instead

Running Tests

# Full UTF-8 suite (requires increased stack for debug builds)
RUST_MIN_STACK=268435456 cargo test --test compat_utf8 -- --test-threads=1

# Other suites
cargo test --test compat_syntax
cargo test --test compat_options
cargo test --test compat_regset
RUST_MIN_STACK=268435456 cargo test --test compat_back -- --test-threads=1
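
The value 268435456 is 256 MiB. RUST_MIN_STACK raises the default stack size of threads spawned by the Rust standard library, which includes the threads the test harness runs tests on. Application code that hits the same limit can instead request a larger stack for one worker thread. The sketch below uses only `std::thread::Builder`; `run_with_big_stack` and `depth` are hypothetical names, and `depth` merely stands in for any deeply recursive computation. This is one possible approach, not something the Ferroni docs prescribe.

```rust
use std::thread;

/// 256 MiB, the same number of bytes as RUST_MIN_STACK=268435456.
const STACK_BYTES: usize = 256 * 1024 * 1024;

/// Deliberately recursive counter standing in for a deeply recursive workload.
fn depth(n: u64) -> u64 {
    if n == 0 { 0 } else { 1 + depth(n - 1) }
}

/// Run `f` on a freshly spawned thread with the requested stack size and
/// return its result. Panics if the thread cannot be spawned or if `f` panics.
fn run_with_big_stack<T, F>(stack_bytes: usize, f: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::Builder::new()
        .stack_size(stack_bytes)
        .spawn(f)
        .expect("failed to spawn worker thread")
        .join()
        .expect("worker thread panicked")
}
```

A call such as `run_with_big_stack(STACK_BYTES, || depth(200_000))` keeps the larger stack local to the one thread that needs it, rather than changing the default for every thread in the process.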

Warning: Never run `cargo test -- --ignored`: the conditional_recursion_complex test intentionally hangs.

Architecture Decision Records

ADR	Decision
001	1:1 structural parity with C original
002	ASCII and UTF-8 only
003	Stack overflow mitigation in debug builds
004	C-to-Rust translation patterns
005	Unsafe code policy
006	SIMD-accelerated search via memchr
007	POSIX and GNU API not ported
008	Test strategy and C test suite parity
009	Porting bugs: lessons learned

Contributing

Contributions are welcome! Please read CONTRIBUTING.md and review the ADRs before submitting a PR.

Acknowledgments

Ferroni is built on the work of K. Kosako and the Oniguruma contributors. The C original powers regex in Ruby, PHP, TextMate, and many other projects.

License

BSD-2-Clause (same as Oniguruma)