fuzzy-regex 0.1.0

High-performance fuzzy regular expression engine combining regex with Damerau-Levenshtein distance
Documentation

CI crates.io docs API Docs MSRV

fuzzy-regex

A high-performance fuzzy regular expression engine written in Rust that combines traditional regex constructs with approximate string matching using Damerau-Levenshtein automata and the Bitap algorithm.

Features

  • Fuzzy Matching: Match strings with configurable edit distance tolerance (insertions, deletions, substitutions, transpositions)
  • Full Regex Support: Character classes, quantifiers, groups, alternation, anchors, lookahead
  • Per-Segment Fuzziness: Control fuzziness for individual parts of a pattern
  • Capture Groups: Named and numbered capture groups with fuzzy matching
  • Similarity Scoring: Get match quality scores (0.0 - 1.0)
  • Streaming API: Process large files and network streams incrementally
  • High Performance: Bitap algorithm for patterns ≤64 chars, SIMD optimizations
  • Unicode Support: Full Unicode support including case-insensitive matching

Installation

Add to your Cargo.toml:

[dependencies]
fuzzy-regex = "0.1"

Feature Flags

  • simd (default): Enable SIMD optimizations for faster matching
  • mimalloc: Use mimalloc allocator for better performance

Documentation

  • Book - Full documentation and guide
  • API Docs - Rust API reference

MSRV

The minimum supported Rust version is 1.88.

Quick Start

use fuzzy_regex::FuzzyRegex;

// Simple fuzzy matching - allows up to 2 edits
let re = FuzzyRegex::new("(?:hello){e<=2}").unwrap();

assert!(re.is_match("hello"));   // Exact match
assert!(re.is_match("helo"));    // 1 deletion
assert!(re.is_match("helllo"));  // 1 insertion
assert!(re.is_match("hallo"));   // 1 substitution
assert!(re.is_match("hlelo"));   // 1 transposition

Unicode Support

By default, \w, \d, \s match ASCII characters only. Enable Unicode mode with (?u):

use fuzzy_regex::FuzzyRegexBuilder;

// Without Unicode (ASCII only)
let re1 = FuzzyRegex::new(r"\w+").unwrap();
re1.is_match("hello"); // true
re1.is_match("привет"); // false (Cyrillic not matched)

// With Unicode mode
let re2 = FuzzyRegex::new("(?u)\\w+").unwrap();
re2.is_match("hello"); // true
re2.is_match("привет"); // true (Cyrillic matched)

// Or via builder
let re3 = FuzzyRegexBuilder::new(r"\w+")
    .unicode(true)
    .build()
    .unwrap();

Pattern Syntax

Fuzziness Markers

Syntax Description
(?:text){e<=2} Allow up to 2 total edits
(?:text){i<=1} Allow up to 1 insertion
(?:text){d<=1} Allow up to 1 deletion
(?:text){s<=1} Allow up to 1 substitution
(?:text){t<=1} Allow up to 1 transposition
(?:text){i<=1,d<=1,s<=1} Separate limits per edit type
(?:text){e<=1:[a-z]} Restrict edits to character class
(?:text){c<=3} Cost-based: total cost ≤ 3
(?:text){2i+1d+1s+1t<=4} Weighted cost constraint
text~2 Shorthand: allow up to 2 edits

Standard Regex Constructs

Syntax Description
[a-z], [^0-9] Character classes
\d, \w, \s Predefined classes (digit, word, whitespace)
Note: By default matches ASCII only. Use (?u) for Unicode
\D, \W, \S Negated predefined classes
. Any character (except newline by default)
*, +, ? Quantifiers (0+, 1+, 0-1)
*?, +?, ?? Lazy quantifiers
{n}, {n,}, {n,m} Repetition counts
(group) Capture group
(?:group) Non-capturing group
(?<name>...) Named capture group
a|b Alternation
^, $ Start/end anchors
\b, \B Word boundary / non-boundary
(?=...), (?!...) Positive/negative lookahead
(?<=...), (?<!...) Positive/negative lookbehind (fixed-length)
\1, \2 Backreferences

Inline Flags

Flag Description
(?i) Case insensitive
(?m) Multi-line (^ and $ match line boundaries)
(?s) Dot-all (. matches newlines)
(?x) Verbose mode (ignore whitespace, allow comments)
(?U) Ungreedy (swap greedy/lazy behavior)
(?u) Unicode mode (\w, \d, \s match Unicode chars)
(?b) BESTMATCH - find best match within alternatives
(?e) ENHANCEMATCH - enhance match with more edits
(?p) POSIX - find longest match at leftmost position

Advanced Features

Syntax Description
\K Reset match start position
(?>...) Atomic group (no backtracking)
*+, ++, ?+ Possessive quantifiers
(?R), (?1) Recursive patterns
(?<name>...) Named capture group with fuzzy modifier
\L<name> Named list reference (via set_word_list)
\G Match at end of previous match position

API Methods

  • find() - Find first match
  • find_iter() - Find all matches iteratively
  • find_rev() - Find rightmost match
  • find_iter_rev() - Find all matches in reverse order
  • captures() - Get capture groups with positions
  • fullmatch() - Match entire string from start to end
  • is_full_match() - Check if entire string matches
  • split() - Split by matches
  • replace() - Replace matches with replacement string
  • find_with_timeout() - Match with timeout support

Partial Matches

Match incomplete/streaming input with partial matching:

use fuzzy_regex::FuzzyRegexBuilder;

let re = FuzzyRegexBuilder::new("(?:hello){e<=1}")
    .partial(true)
    .build()
    .unwrap();

// Match at end of text - marked as partial
let m = re.find("hello").unwrap();
assert!(m.partial());  // Match reached end of input

Examples

Basic Fuzzy Search

use fuzzy_regex::FuzzyRegexBuilder;

let re = FuzzyRegexBuilder::new("(?:teh){e<=1}")
.similarity(0.7)
.build()
.unwrap();

// Finds "the" even when misspelled as "teh"
let m = re.find("I saw teh cat").unwrap();
assert_eq!(m.as_str(), "teh");

Fuzzy Search with Context

use fuzzy_regex::FuzzyRegex;

// Mix exact and fuzzy matching
let re = FuzzyRegex::new(r"The (?:quick){e<=1} brown").unwrap();

assert!(re.is_match("The quick brown fox"));  // Exact
assert!(re.is_match("The quikc brown fox"));  // Typo in "quick"

Capture Groups

use fuzzy_regex::FuzzyRegex;

let re = FuzzyRegex::new(r"(?<user>\w+)@(?<domain>\w+\.\w+)").unwrap();
let caps = re.captures("john@example.com").unwrap();

assert_eq!(caps.name("user").unwrap().as_str(), "john");
assert_eq!(caps.name("domain").unwrap().as_str(), "example.com");

Find All Matches

use fuzzy_regex::FuzzyRegex;

let re = FuzzyRegex::new(r"\b\w+\b").unwrap();
let matches: Vec<_ > = re.find_iter("hello world").collect();

assert_eq!(matches.len(), 2);
assert_eq!(matches[0].as_str(), "hello");
assert_eq!(matches[1].as_str(), "world");

Replace

use fuzzy_regex::FuzzyRegex;

let re = FuzzyRegex::new("world").unwrap();

// Replace first occurrence
let result = re.replace("hello world world", "rust");
assert_eq!(result, "hello rust world");

// Replace all occurrences
let result = re.replace_all("hello world world", "rust");
assert_eq!(result, "hello rust rust");

Streaming API

Process large files or network streams without loading everything into memory:

use fuzzy_regex::FuzzyRegex;
use std::io::BufReader;
use std::fs::File;

let re = FuzzyRegex::new("(?:needle){e<=1}").unwrap();
let mut stream = re.stream();

// Feed data in chunks
for m in stream.feed(b"hay") {
println ! ("Match at {}-{}", m.start(), m.end());
}
for m in stream.feed(b"stack needle here") {
println ! ("Match at {}-{}", m.start(), m.end());
}

// Or process a reader directly
let file = File::open("large_file.txt").unwrap();
for m in re.stream().search_reader(BufReader::new(file)) {
    println!("Match at byte offset {}", m.start());
}

Byte-Level Search

use fuzzy_regex::FuzzyRegex;

let re = FuzzyRegex::new("(?:hello){e<=1}").unwrap();

// Search byte slices directly
if let Some(m) = re.find_bytes(b"hello world") {
    println!("Match at {}..{}", m.start(), m.end());
}

// Check if pattern supports fast streaming
if re.supports_streaming() {
println ! ("Pattern uses optimized Bitap algorithm");
}

Case Insensitive Matching

use fuzzy_regex::FuzzyRegexBuilder;

let re = FuzzyRegexBuilder::new("(?:hello){e<=1}")
.case_insensitive(true)
.build()
.unwrap();

assert!(re.is_match("HELLO"));
assert!(re.is_match("HeLLo"));
assert!(re.is_match("HALLO")); // Case insensitive + fuzzy

DNA Sequence Matching

use fuzzy_regex::FuzzyRegex;

// Find DNA sequences allowing for sequencing errors
let re = FuzzyRegex::new("(?:ACGTACGT){e<=2}").unwrap();

let dna = "NNNNACGTACGTNNNN";
let m = re.find(dna).unwrap();
assert_eq!(m.as_str(), "ACGTACGT");

// Also matches with up to 2 errors
assert!(re.is_match("ACGTAGGT")); // 1 substitution

Character Class Restrictions

Restrict which characters can be used for edits:

use fuzzy_regex::FuzzyRegex;

// Only allow lowercase letters as substitutions
let re = FuzzyRegex::new(r"(?:hello){s<=1:[a-z]}").unwrap();

assert!(re.is_match("hallo"));  // 'a' is in [a-z]
assert!(!re.is_match("h3llo")); // '3' is not in [a-z]

Lookbehind

Match only after a specific pattern:

use fuzzy_regex::FuzzyRegex;

// Positive lookbehind - match "world" only when preceded by "hello "
let re = FuzzyRegex::new(r"(?<=hello )world").unwrap();

assert!(re.is_match("hello world"));
assert!(!re.is_match("bye world"));

// Negative lookbehind - match "world" only when NOT preceded by "hello "
let re = FuzzyRegex::new(r"(?<!hello )world").unwrap();

assert!(re.is_match("bye world"));
assert!(!re.is_match("hello world"));

// Variable-length lookbehind (fuzzy)
let re = FuzzyRegex::new(r"(?<=(?:hello){e<=1})world").unwrap();
assert!(re.is_match("hallo world")); // "hello" with 1 error before "world"

Cost-Based Matching

Assign different costs to edit operations for fine-grained control:

use fuzzy_regex::FuzzyRegex;

// Simple cost: all operations cost 1, total cost <= 2
let re = FuzzyRegex::new("(?:hello){c<=2}").unwrap();
assert!(re.is_match("hallo")); // 1 sub, cost=1
assert!(re.is_match("helo"));  // 1 del, cost=1

// Weighted costs: insertions cost 2, others cost 1
let re = FuzzyRegex::new("(?:ab){2i+1d+1s+1t<=3}").unwrap();
assert!(re.is_match("abc")); // 1 ins, cost=2
assert!(re.is_match("a"));   // 1 del, cost=1
assert!(re.is_match("ba"));  // 1 transposition, cost=1

Performance

Algorithm Selection

The library automatically selects the best algorithm:

  • Bitap Algorithm: Used for patterns ≤64 characters. O(n×k) time complexity where n is text length and k is max edits.
  • Levenshtein NFA: Used for longer patterns or complex regex features.
  • DFA Fast Path: For patterns without capture groups or lazy quantifiers.

Optimization Tips

  1. Use specific edit limits: {e<=1} is faster than {e<=5}
  2. Prefer shorter patterns: Bitap is very fast for short patterns
  3. Use streaming for large texts: Avoids loading entire file into memory
  4. Enable SIMD: Enabled by default, provides ~2x speedup on supported platforms
  5. Consider mimalloc: Enable the mimalloc feature for memory-intensive workloads

API Overview

Main Types

Type Description
FuzzyRegex Compiled regex pattern
FuzzyRegexBuilder Builder for customized regex construction
Match A single match with position and similarity
Captures Match with capture group information
StreamingMatcher Stateful matcher for incremental processing
StreamingMatch Match result from streaming search

Key Methods

// Construction
FuzzyRegex::new(pattern) -> Result<FuzzyRegex>
FuzzyRegex::builder(pattern) -> FuzzyRegexBuilder

// Searching
re.is_match(text) -> bool
re.find(text) -> Option<Match>
re.find_iter(text) -> impl Iterator<Item=Match>
re.captures(text) -> Option<Captures>

// Replacing
re.replace(text, replacement) -> String
re.replace_all(text, replacement) -> String

// Streaming
re.stream() -> StreamingMatcher
re.find_bytes(bytes) -> Option<StreamingMatch>
re.find_iter_bytes(bytes) -> impl Iterator<Item=StreamingMatch>

// Builder options
builder.similarity(threshold: f32)
builder.case_insensitive(yes: bool)
builder.multi_line(yes: bool)
builder.dot_all(yes: bool)

Compatibility

This crate provides a compatibility layer for projects migrating from fuzzy-aho-corasick:

use fuzzy_regex::compat::fac::FuzzyAhoCorasickBuilder;
use fuzzy_regex::types::FuzzyLimits;

let searcher = FuzzyAhoCorasickBuilder::new()
    .fuzzy(FuzzyLimits::new().edits(1))
    .build(["hello", "world"])
    .unwrap();

for m in searcher.find_iter("helo wrld") {
    println!("Pattern {} matched", m.pattern_index());
}

License

MIT License