Module aho_corasick::packed

source ·
Expand description

Provides packed multiple substring search, principally for a small number of patterns.

This sub-module provides vectorized routines for quickly finding matches of a small number of patterns. In general, users of this crate shouldn’t need to interface with this module directly, as the primary AhoCorasick searcher will use these routines automatically as a prefilter when applicable. However, in some cases, callers may want to bypass the Aho-Corasick machinery entirely and use this vectorized searcher directly.

Overview

The primary types in this sub-module are:

  • Searcher executes the actual search algorithm to report matches in a haystack.
  • Builder accumulates patterns incrementally and can construct a Searcher.
  • Config permits tuning the searcher, and itself will produce a Builder (which can then be used to build a Searcher). Currently, the only tuneable knob are the match semantics, but this may be expanded in the future.

Examples

This example shows how to create a searcher from an iterator of patterns. By default, leftmost-first match semantics are used. (See the top-level MatchKind type for more details about match semantics, which apply similarly to packed substring search.)

use aho_corasick::{packed::{MatchKind, Searcher}, PatternID};

let searcher = Searcher::new(["foobar", "foo"].iter().cloned())?;
let matches: Vec<PatternID> = searcher
    .find_iter("foobar")
    .map(|mat| mat.pattern())
    .collect();
assert_eq!(vec![PatternID::ZERO], matches);

This example shows how to use Config to change the match semantics to leftmost-longest:

use aho_corasick::{packed::{Config, MatchKind}, PatternID};

let searcher = Config::new()
    .match_kind(MatchKind::LeftmostLongest)
    .builder()
    .add("foo")
    .add("foobar")
    .build()?;
let matches: Vec<PatternID> = searcher
    .find_iter("foobar")
    .map(|mat| mat.pattern())
    .collect();
assert_eq!(vec![PatternID::must(1)], matches);

Packed substring searching

Packed substring searching refers to the use of SIMD (Single Instruction, Multiple Data) to accelerate the detection of matches in a haystack. Unlike conventional algorithms, such as Aho-Corasick, SIMD algorithms for substring search tend to do better with a small number of patterns, where as Aho-Corasick generally maintains reasonably consistent performance regardless of the number of patterns you give it. Because of this, the vectorized searcher in this sub-module cannot be used as a general purpose searcher, since building the searcher may fail even when given a small number of patterns. However, in exchange, when searching for a small number of patterns, searching can be quite a bit faster than Aho-Corasick (sometimes by an order of magnitude).

The key take away here is that constructing a searcher from a list of patterns is a fallible operation with no clear rules for when it will fail. While the precise conditions under which building a searcher can fail is specifically an implementation detail, here are some common reasons:

  • Too many patterns were given. Typically, the limit is on the order of 100 or so, but this limit may fluctuate based on available CPU features.
  • The available packed algorithms require CPU features that aren’t available. For example, currently, this crate only provides packed algorithms for x86_64 and aarch64. Therefore, constructing a packed searcher on any other target will always fail.
  • Zero patterns were given, or one of the patterns given was empty. Packed searchers require at least one pattern and that all patterns are non-empty.
  • Something else about the nature of the patterns (typically based on heuristics) suggests that a packed searcher would perform very poorly, so no searcher is built.

Structs

  • A builder for constructing a packed searcher from a collection of patterns.
  • The configuration for a packed multiple pattern searcher.
  • An iterator over non-overlapping matches from a packed searcher.
  • A packed searcher for quickly finding occurrences of multiple patterns.

Enums

  • A knob for controlling the match semantics of a packed multiple string searcher.