Expand description

🐎 daachorse: Double-Array Aho-Corasick

A fast implementation of the Aho-Corasick algorithm using the compact double-array data structure.

Overview

Daachorse is a crate for fast multiple pattern matching using the Aho-Corasick algorithm, running in linear time over the length of the input text. This crate uses the compact double-array data structure for implementing the pattern match automaton for time and memory efficiency. The data structure not only supports constant-time state-to-state traversal but also represents each state in the space of only 12 bytes.

The main technical ideas behind this library appear in this paper.

Example: Finding overlapped occurrences

To search for all occurrences of registered patterns that allow for positional overlap in the input text, use DoubleArrayAhoCorasick::find_overlapping_iter().

When you use DoubleArrayAhoCorasick::new() for construction, the library assigns a unique identifier to each pattern in the input order. The match result has the byte positions of the occurrence and its identifier.

use daachorse::DoubleArrayAhoCorasick;

let patterns = vec!["bcd", "ab", "a"];
let pma = DoubleArrayAhoCorasick::new(patterns).unwrap();

let mut it = pma.find_overlapping_iter("abcd");

let m = it.next().unwrap();
assert_eq!((0, 1, 2), (m.start(), m.end(), m.value()));

let m = it.next().unwrap();
assert_eq!((0, 2, 1), (m.start(), m.end(), m.value()));

let m = it.next().unwrap();
assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));

assert_eq!(None, it.next());

Example: Finding non-overlapped occurrences with standard matching

If you do not want to allow positional overlap, use DoubleArrayAhoCorasick::find_iter() instead.

This function performs the search on the Aho-Corasick automaton and reports patterns first found in each iteration.

use daachorse::DoubleArrayAhoCorasick;

let patterns = vec!["bcd", "ab", "a"];
let pma = DoubleArrayAhoCorasick::new(patterns).unwrap();

let mut it = pma.find_iter("abcd");

let m = it.next().unwrap();
assert_eq!((0, 1, 2), (m.start(), m.end(), m.value()));

let m = it.next().unwrap();
assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));

assert_eq!(None, it.next());

Example: Finding non-overlapped occurrences with longest matching

If you want to search for the longest pattern without positional overlap in each iteration, use DoubleArrayAhoCorasick::leftmost_find_iter() with specifying MatchKind::LeftmostLongest in the construction.

use daachorse::{DoubleArrayAhoCorasickBuilder, MatchKind};

let patterns = vec!["ab", "a", "abcd"];
let pma = DoubleArrayAhoCorasickBuilder::new()
    .match_kind(MatchKind::LeftmostLongest)
    .build(&patterns)
    .unwrap();

let mut it = pma.leftmost_find_iter("abcd");

let m = it.next().unwrap();
assert_eq!((0, 4, 2), (m.start(), m.end(), m.value()));

assert_eq!(None, it.next());

Example: Finding non-overlapped occurrences with leftmost-first matching

If you want to find the earliest registered pattern among ones starting from the search position, use DoubleArrayAhoCorasick::leftmost_find_iter() with specifying MatchKind::LeftmostFirst.

This semantic is the so-called leftmost first match, a tricky search option supported in the aho-corasick crate. For example, in the following code, ab is reported because it is the earliest registered one.

use daachorse::{DoubleArrayAhoCorasickBuilder, MatchKind};

let patterns = vec!["ab", "a", "abcd"];
let pma = DoubleArrayAhoCorasickBuilder::new()
    .match_kind(MatchKind::LeftmostFirst)
    .build(&patterns)
    .unwrap();

let mut it = pma.leftmost_find_iter("abcd");

let m = it.next().unwrap();
assert_eq!((0, 2, 0), (m.start(), m.end(), m.value()));

assert_eq!(None, it.next());

Example: Associating arbitrary values with patterns

To build the automaton from pairs of a pattern and user-defined value, instead of assigning identifiers automatically, use DoubleArrayAhoCorasick::with_values().

use daachorse::DoubleArrayAhoCorasick;

let patvals = vec![("bcd", 0), ("ab", 10), ("a", 20)];
let pma = DoubleArrayAhoCorasick::with_values(patvals).unwrap();

let mut it = pma.find_overlapping_iter("abcd");

let m = it.next().unwrap();
assert_eq!((0, 1, 20), (m.start(), m.end(), m.value()));

let m = it.next().unwrap();
assert_eq!((0, 2, 10), (m.start(), m.end(), m.value()));

let m = it.next().unwrap();
assert_eq!((1, 4, 0), (m.start(), m.end(), m.value()));

assert_eq!(None, it.next());

Example: Building faster automaton on multibyte characters

To build a faster automaton on multibyte characters, use CharwiseDoubleArrayAhoCorasick instead.

The standard version DoubleArrayAhoCorasick handles strings as UTF-8 sequences and defines transition labels using byte values. On the other hand, CharwiseDoubleArrayAhoCorasick uses Unicode code point values, reducing the number of transitions and faster matching.

use daachorse::CharwiseDoubleArrayAhoCorasick;

let patterns = vec!["全世界", "世界", "に"];
let pma = CharwiseDoubleArrayAhoCorasick::new(patterns).unwrap();

let mut it = pma.find_iter("全世界中に");

let m = it.next().unwrap();
assert_eq!((0, 9, 0), (m.start(), m.end(), m.value()));

let m = it.next().unwrap();
assert_eq!((12, 15, 2), (m.start(), m.end(), m.value()));

assert_eq!(None, it.next());

Re-exports

pub use bytewise::DoubleArrayAhoCorasick;

Modules

A byte-wise version of the Double-Array Aho-Corasick.

A character-wise version for faster matching on multibyte characters.

Definition of errors.

Structs

Enums

A search option of the Aho-Corasick automaton specified in DoubleArrayAhoCorasickBuilder::match_kind.

Traits

Trait indicating serializability.