regex_dfa 0.2.0

A crate for turning regexes into DFAs.
docs.rs failed to build regex_dfa-0.2.0
Please check the build logs for more information.
See Builds for ideas on how to fix a failed build, or Metadata for how to configure docs.rs builds.
If you believe this is docs.rs' fault, open an issue.
Visit the last successful build: regex_dfa-0.5.0

regex_dfa

A crate for creating and manipulating deterministic finite automata (DFAs). Currently, the implementation is somewhat biased towards building DFAs from regular expressions.

Build status Coverage Status

Documentation

Why

Some regular expression implementations (e.g. rust's regex library) are based on non-deterministic finite automata (NFAs). By turning NFAs into DFAs, we can sometimes get a speed boost, at the cost of some compilation time. Preliminary benchmarks show that we can get a 5x improvement over rust's built-in regular expressions for very simple regexes.

For a favorable example, compare the results of regex_dfa::Dfa:

test bench::anchored_literal_long_match      ... bench:          51 ns/iter (+/- 2)
test bench::anchored_literal_long_non_match  ... bench:          31 ns/iter (+/- 1)
test bench::anchored_literal_short_match     ... bench:          51 ns/iter (+/- 1)
test bench::anchored_literal_short_non_match ... bench:          31 ns/iter (+/- 1)

to the same results for regex::Regex:

test bench::anchored_literal_long_match      ... bench:         353 ns/iter (+/- 7)
test bench::anchored_literal_long_non_match  ... bench:         230 ns/iter (+/- 13)
test bench::anchored_literal_short_match     ... bench:         329 ns/iter (+/- 24)
test bench::anchored_literal_short_non_match ... bench:         206 ns/iter (+/- 9)

There are also less favorable examples, however: bench::medium_1K has a throughput of 271 MB/s with regex_dfa, but 325 MB/s with regex.

Limitations

Some regex features don't map well to DFAs. Specifically, this crate does not support (nor does it plan to support) lazy repetition or subgroup captures.

regex_dfa is not universally faster than the implementation in the regex crate. For example, the character matching code is faster in regex than regex_dfa (see bench::match_class_unicode) for an example. Moreover, regex has a much more sophisticated optimization for regexes that begin with one of a small set of strings (e.g. the regex-dna benchmark in the benchmark shootout game).

regex_dfa currently only works on nightly rust.

Status/TODO

  • For good regex performance, it is very important to quickly scan the text for plausible starting points. We currently have a fairly half-hearted implementation of this, which overlaps somewhat with the implementation in the regex crate. We should split off a better implementation into a separate crate.
  • Evaluate the usefulness of this crate as part of a full regex implementation (e.g. using regex-dfa to find matches and then running regex on those matches in order to extract groups).
  • Implement a compiler plugin, and see if it improves performance.

License

regex_dfa is distributed under the MIT license and the Apache license (version 2.0). See LICENSE-APACHE and LICENSE-MIT for details.