Expand description
🐎 daachorse: Double-Array Aho-Corasick
A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure.
Overview
Daachorse is a crate for fast multiple pattern matching using the Aho-Corasick algorithm, running in linear time over the length of the input text. This crate uses the compact double-array data structure for implementing the pattern match automaton for time and memory efficiency. The data structure not only supports constant-time state-to-state traversal but also represents each state in the space of only 12 bytes.
The main technical ideas behind this library appear in this paper.
Example: Finding overlapped occurrences
To search for all occurrences of registered patterns that allow for positional overlap in the
input text, use DoubleArrayAhoCorasick::find_overlapping_iter()
.
When you use DoubleArrayAhoCorasick::new()
for construction, the library assigns a unique
identifier to each pattern in the input order. The match result has the byte positions of the
occurrence and its identifier.
use daachorse::DoubleArrayAhoCorasick;
let patterns = vec!["bcd", "ab", "a"];
let pma = DoubleArrayAhoCorasick::new(patterns).unwrap();
let mut it = pma.find_overlapping_iter("abcd");
let m = it.next().unwrap();
assert_eq!((0, 1, 2), (m.start(), m.end(), m.value()));
let m = it.next().unwrap();
assert_eq!((0, 2, 1), (m.start(), m.end(), m.value()));
let m = it.next().unwrap();
assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));
assert_eq!(None, it.next());
Example: Finding non-overlapped occurrences with standard matching
If you do not want to allow positional overlap, use DoubleArrayAhoCorasick::find_iter()
instead.
This function performs the search on the Aho-Corasick automaton and reports patterns first found in each iteration.
use daachorse::DoubleArrayAhoCorasick;
let patterns = vec!["bcd", "ab", "a"];
let pma = DoubleArrayAhoCorasick::new(patterns).unwrap();
let mut it = pma.find_iter("abcd");
let m = it.next().unwrap();
assert_eq!((0, 1, 2), (m.start(), m.end(), m.value()));
let m = it.next().unwrap();
assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));
assert_eq!(None, it.next());
Example: Finding non-overlapped occurrences with longest matching
If you want to search for the longest pattern without positional overlap in each iteration, use
DoubleArrayAhoCorasick::leftmost_find_iter()
with specifying
MatchKind::LeftmostLongest
in the construction.
use daachorse::{DoubleArrayAhoCorasickBuilder, MatchKind};
let patterns = vec!["ab", "a", "abcd"];
let pma = DoubleArrayAhoCorasickBuilder::new()
.match_kind(MatchKind::LeftmostLongest)
.build(&patterns)
.unwrap();
let mut it = pma.leftmost_find_iter("abcd");
let m = it.next().unwrap();
assert_eq!((0, 4, 2), (m.start(), m.end(), m.value()));
assert_eq!(None, it.next());
Example: Finding non-overlapped occurrences with leftmost-first matching
If you want to find the earliest registered pattern among ones starting from the search
position, use DoubleArrayAhoCorasick::leftmost_find_iter()
with specifying
MatchKind::LeftmostFirst
.
This semantic is the so-called leftmost first match, a tricky search option supported in the
aho-corasick crate. For example, in the
following code, ab
is reported because it is the earliest registered one.
use daachorse::{DoubleArrayAhoCorasickBuilder, MatchKind};
let patterns = vec!["ab", "a", "abcd"];
let pma = DoubleArrayAhoCorasickBuilder::new()
.match_kind(MatchKind::LeftmostFirst)
.build(&patterns)
.unwrap();
let mut it = pma.leftmost_find_iter("abcd");
let m = it.next().unwrap();
assert_eq!((0, 2, 0), (m.start(), m.end(), m.value()));
assert_eq!(None, it.next());
Example: Associating arbitrary values with patterns
To build the automaton from pairs of a pattern and user-defined value, instead of assigning
identifiers automatically, use DoubleArrayAhoCorasick::with_values()
.
use daachorse::DoubleArrayAhoCorasick;
let patvals = vec![("bcd", 0), ("ab", 10), ("a", 20)];
let pma = DoubleArrayAhoCorasick::with_values(patvals).unwrap();
let mut it = pma.find_overlapping_iter("abcd");
let m = it.next().unwrap();
assert_eq!((0, 1, 20), (m.start(), m.end(), m.value()));
let m = it.next().unwrap();
assert_eq!((0, 2, 10), (m.start(), m.end(), m.value()));
let m = it.next().unwrap();
assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));
assert_eq!(None, it.next());
Example: Building faster automaton on multibyte characters
To build a faster automaton on multibyte characters, use CharwiseDoubleArrayAhoCorasick
instead.
The standard version DoubleArrayAhoCorasick
handles strings as UTF-8 sequences and defines
transition labels using byte values. On the other hand, CharwiseDoubleArrayAhoCorasick
uses
Unicode code point values, reducing the number of transitions and faster matching.
use daachorse::CharwiseDoubleArrayAhoCorasick;
let patterns = vec!["全世界", "世界", "に"];
let pma = CharwiseDoubleArrayAhoCorasick::new(patterns).unwrap();
let mut it = pma.find_iter("全世界中に");
let m = it.next().unwrap();
assert_eq!((0, 9, 0), (m.start(), m.end(), m.value()));
let m = it.next().unwrap();
assert_eq!((12, 15, 2), (m.start(), m.end(), m.value()));
assert_eq!(None, it.next());
Re-exports
pub use bytewise::DoubleArrayAhoCorasick;
pub use charwise::CharwiseDoubleArrayAhoCorasick;
Modules
A byte-wise version of the Double-Array Aho-Corasick.
A character-wise version for faster matching on multibyte characters.
Definition of errors.
Structs
Match result.
Enums
A search option of the Aho-Corasick automaton
specified in DoubleArrayAhoCorasickBuilder::match_kind
.
Traits
Trait indicating serializability.