Expand description
§🐎 daachorse: Double-Array Aho-Corasick
A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure.
§Overview
Daachorse (pronounced “dark horse”) is a crate for fast multiple pattern matching using the Aho-Corasick algorithm, running in linear time over the length of the input text. This crate uses the compact double-array data structure for implementing the pattern match automaton for time and memory efficiency. The data structure not only supports constant-time state-to-state traversal but also represents each state using only 12 bytes of memory.
The main technical ideas behind this library appear in the following paper:
Shunsuke Kanda, Koichi Akabe, and Yusuke Oda. Engineering faster double-array Aho-Corasick automata. Software: Practice and Experience (SPE), 53(6): 1332–1361, 2023 (arXiv)
§Example: Finding overlapping occurrences
To search for all occurrences of registered patterns that allow for positional overlap in the
input text, use DoubleArrayAhoCorasick::find_overlapping_iter().
When you use DoubleArrayAhoCorasick::new() for construction, the library assigns a unique
identifier to each pattern in the input order. The match result has the byte positions of the
occurrence and its identifier.
use daachorse::DoubleArrayAhoCorasick;
let patterns = vec!["bcd", "ab", "a"];
let pma = DoubleArrayAhoCorasick::new(patterns).unwrap();
let mut it = pma.find_overlapping_iter("abcd");
let m = it.next().unwrap();
assert_eq!((0, 1, 2), (m.start(), m.end(), m.value()));
let m = it.next().unwrap();
assert_eq!((0, 2, 1), (m.start(), m.end(), m.value()));
let m = it.next().unwrap();
assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));
assert_eq!(None, it.next());§Example: Finding non-overlapping occurrences with standard matching
To disallow positional overlap, use DoubleArrayAhoCorasick::find_iter() instead.
This function performs the search on the Aho-Corasick automaton and reports the first matching pattern at each search position.
use daachorse::DoubleArrayAhoCorasick;
let patterns = vec!["bcd", "ab", "a"];
let pma = DoubleArrayAhoCorasick::new(patterns).unwrap();
let mut it = pma.find_iter("abcd");
let m = it.next().unwrap();
assert_eq!((0, 1, 2), (m.start(), m.end(), m.value()));
let m = it.next().unwrap();
assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));
assert_eq!(None, it.next());§Example: Finding non-overlapping occurrences with longest matching
To search for the longest pattern without positional overlap in each iteration, specify
MatchKind::LeftmostLongest during construction and use
DoubleArrayAhoCorasick::leftmost_find_iter().
use daachorse::{DoubleArrayAhoCorasickBuilder, MatchKind};
let patterns = vec!["ab", "a", "abcd"];
let pma = DoubleArrayAhoCorasickBuilder::new()
.match_kind(MatchKind::LeftmostLongest)
.build(&patterns)
.unwrap();
let mut it = pma.leftmost_find_iter("abcd");
let m = it.next().unwrap();
assert_eq!((0, 4, 2), (m.start(), m.end(), m.value()));
assert_eq!(None, it.next());§Example: Finding non-overlapping occurrences with leftmost-first matching
To search for the earliest registered pattern among those starting from the search position,
specify MatchKind::LeftmostFirst during construction and use
DoubleArrayAhoCorasick::leftmost_find_iter().
This semantics is the so-called leftmost first match, a tricky search option supported in the
aho-corasick crate. For example, in the
following code, ab is reported because it is the earliest registered one.
use daachorse::{DoubleArrayAhoCorasickBuilder, MatchKind};
let patterns = vec!["ab", "a", "abcd"];
let pma = DoubleArrayAhoCorasickBuilder::new()
.match_kind(MatchKind::LeftmostFirst)
.build(&patterns)
.unwrap();
let mut it = pma.leftmost_find_iter("abcd");
let m = it.next().unwrap();
assert_eq!((0, 2, 0), (m.start(), m.end(), m.value()));
assert_eq!(None, it.next());§Example: Associating arbitrary values with patterns
To build the automaton from pattern-value pairs, instead of assigning identifiers automatically,
use DoubleArrayAhoCorasick::with_values().
use daachorse::DoubleArrayAhoCorasick;
let patvals = vec![("bcd", 0), ("ab", 10), ("a", 20)];
let pma = DoubleArrayAhoCorasick::with_values(patvals).unwrap();
let mut it = pma.find_overlapping_iter("abcd");
let m = it.next().unwrap();
assert_eq!((0, 1, 20), (m.start(), m.end(), m.value()));
let m = it.next().unwrap();
assert_eq!((0, 2, 10), (m.start(), m.end(), m.value()));
let m = it.next().unwrap();
assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));
assert_eq!(None, it.next());§Example: Building faster automaton on multibyte characters
To build a faster automaton on multibyte characters, use CharwiseDoubleArrayAhoCorasick
instead.
The standard version DoubleArrayAhoCorasick handles strings as UTF-8 sequences and defines
transition labels using byte values. In contrast, CharwiseDoubleArrayAhoCorasick uses
Unicode code point values, reducing the number of transitions and enabling faster matching on
multibyte characters.
use daachorse::CharwiseDoubleArrayAhoCorasick;
let patterns = vec!["全世界", "世界", "に"];
let pma = CharwiseDoubleArrayAhoCorasick::new(patterns).unwrap();
let mut it = pma.find_iter("全世界中に");
let m = it.next().unwrap();
assert_eq!((0, 9, 0), (m.start(), m.end(), m.value()));
let m = it.next().unwrap();
assert_eq!((12, 15, 2), (m.start(), m.end(), m.value()));
assert_eq!(None, it.next());Re-exports§
pub use crate::bytewise::DoubleArrayAhoCorasick;pub use crate::bytewise::DoubleArrayAhoCorasickBuilder;pub use crate::charwise::CharwiseDoubleArrayAhoCorasick;pub use crate::charwise::CharwiseDoubleArrayAhoCorasickBuilder;pub use crate::errors::Result;
Modules§
- bytewise
- A byte-wise version of the Double-Array Aho-Corasick.
- charwise
- A character-wise version for faster matching on multibyte characters.
- errors
- Definition of errors.
Structs§
Enums§
- Match
Kind - A search option of the Aho-Corasick automaton
specified in
DoubleArrayAhoCorasickBuilder::match_kind.
Traits§
- Serializable
- Trait indicating serializability.