Struct regex_automata::dfa::dense::DFA

source ·
pub struct DFA<T> { /* private fields */ }
Available on (crate features dfa-search or dfa-onepass) and crate feature dfa-search only.
Expand description

A dense table-based deterministic finite automaton (DFA).

All dense DFAs have one or more start states, zero or more match states and a transition table that maps the current state and the current byte of input to the next state. A DFA can use this information to implement fast searching. In particular, the use of a dense DFA generally makes the trade off that match speed is the most valuable characteristic, even if building the DFA may take significant time and space. (More concretely, building a DFA takes time and space that is exponential in the size of the pattern in the worst case.) As such, the processing of every byte of input is done with a small constant number of operations that does not vary with the pattern, its size or the size of the alphabet. If your needs don’t line up with this trade off, then a dense DFA may not be an adequate solution to your problem.

In contrast, a sparse::DFA makes the opposite trade off: it uses less space but will execute a variable number of instructions per byte at match time, which makes it slower for matching. (Note that space usage is still exponential in the size of the pattern in the worst case.)

A DFA can be built using the default configuration via the DFA::new constructor. Otherwise, one can configure various aspects via dense::Builder.

A single DFA fundamentally supports the following operations:

  1. Detection of a match.
  2. Location of the end of a match.
  3. In the case of a DFA with multiple patterns, which pattern matched is reported as well.

A notable absence from the above list of capabilities is the location of the start of a match. In order to provide both the start and end of a match, two DFAs are required. This functionality is provided by a Regex.

§Type parameters

A DFA has one type parameter, T, which is used to represent state IDs, pattern IDs and accelerators. T is typically a Vec<u32> or a &[u32].

§The Automaton trait

This type implements the Automaton trait, which means it can be used for searching. For example:

use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};

let dfa = DFA::new("foo[0-9]+")?;
let expected = HalfMatch::must(0, 8);
assert_eq!(Some(expected), dfa.try_search_fwd(&Input::new("foo12345"))?);

Implementations§

source§

impl DFA<Vec<u32>>

source

pub fn new(pattern: &str) -> Result<DFA<Vec<u32>>, BuildError>

Available on crate features dfa-build and syntax only.

Parse the given regular expression using a default configuration and return the corresponding DFA.

If you want a non-default configuration, then use the dense::Builder to set your own configuration.

§Example
use regex_automata::{dfa::{Automaton, dense}, HalfMatch, Input};

let dfa = dense::DFA::new("foo[0-9]+bar")?;
let expected = Some(HalfMatch::must(0, 11));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345bar"))?);
source

pub fn new_many<P: AsRef<str>>( patterns: &[P] ) -> Result<DFA<Vec<u32>>, BuildError>

Available on crate features dfa-build and syntax only.

Parse the given regular expressions using a default configuration and return the corresponding multi-DFA.

If you want a non-default configuration, then use the dense::Builder to set your own configuration.

§Example
use regex_automata::{dfa::{Automaton, dense}, HalfMatch, Input};

let dfa = dense::DFA::new_many(&["[0-9]+", "[a-z]+"])?;
let expected = Some(HalfMatch::must(1, 3));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345bar"))?);
source§

impl DFA<Vec<u32>>

source

pub fn always_match() -> Result<DFA<Vec<u32>>, BuildError>

Available on crate feature dfa-build only.

Create a new DFA that matches every input.

§Example
use regex_automata::{dfa::{Automaton, dense}, HalfMatch, Input};

let dfa = dense::DFA::always_match()?;

let expected = Some(HalfMatch::must(0, 0));
assert_eq!(expected, dfa.try_search_fwd(&Input::new(""))?);
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo"))?);
source

pub fn never_match() -> Result<DFA<Vec<u32>>, BuildError>

Available on crate feature dfa-build only.

Create a new DFA that never matches any input.

§Example
use regex_automata::{dfa::{Automaton, dense}, Input};

let dfa = dense::DFA::never_match()?;
assert_eq!(None, dfa.try_search_fwd(&Input::new(""))?);
assert_eq!(None, dfa.try_search_fwd(&Input::new("foo"))?);
source§

impl DFA<&[u32]>

source

pub fn config() -> Config

Available on crate feature dfa-build only.

Return a new default dense DFA compiler configuration.

This is a convenience routine to avoid needing to import the Config type when customizing the construction of a dense DFA.

source

pub fn builder() -> Builder

Available on crate feature dfa-build only.

Create a new dense DFA builder with the default configuration.

This is a convenience routine to avoid needing to import the Builder type in common cases.

source§

impl<T: AsRef<[u32]>> DFA<T>

source

pub fn as_ref(&self) -> DFA<&[u32]>

Cheaply return a borrowed version of this dense DFA. Specifically, the DFA returned always uses &[u32] for its transition table.

source

pub fn to_owned(&self) -> DFA<Vec<u32>>

Available on crate feature alloc only.

Return an owned version of this sparse DFA. Specifically, the DFA returned always uses Vec<u32> for its transition table.

Effectively, this returns a dense DFA whose transition table lives on the heap.

source

pub fn start_kind(&self) -> StartKind

Returns the starting state configuration for this DFA.

The default is StartKind::Both, which means the DFA supports both unanchored and anchored searches. However, this can generally lead to bigger DFAs. Therefore, a DFA might be compiled with support for just unanchored or anchored searches. In that case, running a search with an unsupported configuration will panic.

source

pub fn starts_for_each_pattern(&self) -> bool

Returns true only if this DFA has starting states for each pattern.

When a DFA has starting states for each pattern, then a search with the DFA can be configured to only look for anchored matches of a specific pattern. Specifically, APIs like Automaton::try_search_fwd can accept a non-None pattern_id if and only if this method returns true. Otherwise, calling try_search_fwd will panic.

Note that if the DFA has no patterns, this always returns false.

source

pub fn byte_classes(&self) -> &ByteClasses

Returns the equivalence classes that make up the alphabet for this DFA.

Unless Config::byte_classes was disabled, it is possible that multiple distinct bytes are grouped into the same equivalence class if it is impossible for them to discriminate between a match and a non-match. This has the effect of reducing the overall alphabet size and in turn potentially substantially reducing the size of the DFA’s transition table.

The downside of using equivalence classes like this is that every state transition will automatically use this map to convert an arbitrary byte to its corresponding equivalence class. In practice this has a negligible impact on performance.

source

pub fn alphabet_len(&self) -> usize

Returns the total number of elements in the alphabet for this DFA.

That is, this returns the total number of transitions that each state in this DFA must have. Typically, a normal byte oriented DFA would always have an alphabet size of 256, corresponding to the number of unique values in a single byte. However, this implementation has two peculiarities that impact the alphabet length:

  • Every state has a special “EOI” transition that is only followed after the end of some haystack is reached. This EOI transition is necessary to account for one byte of look-ahead when implementing things like \b and $.
  • Bytes are grouped into equivalence classes such that no two bytes in the same class can distinguish a match from a non-match. For example, in the regex ^[a-z]+$, the ASCII bytes a-z could all be in the same equivalence class. This leads to a massive space savings.

Note though that the alphabet length does not necessarily equal the total stride space taken up by a single DFA state in the transition table. Namely, for performance reasons, the stride is always the smallest power of two that is greater than or equal to the alphabet length. For this reason, DFA::stride or DFA::stride2 are often more useful. The alphabet length is typically useful only for informational purposes.

source

pub fn stride2(&self) -> usize

Returns the total stride for every state in this DFA, expressed as the exponent of a power of 2. The stride is the amount of space each state takes up in the transition table, expressed as a number of transitions. (Unused transitions map to dead states.)

The stride of a DFA is always equivalent to the smallest power of 2 that is greater than or equal to the DFA’s alphabet length. This definition uses extra space, but permits faster translation between premultiplied state identifiers and contiguous indices (by using shifts instead of relying on integer division).

For example, if the DFA’s stride is 16 transitions, then its stride2 is 4 since 2^4 = 16.

The minimum stride2 value is 1 (corresponding to a stride of 2) while the maximum stride2 value is 9 (corresponding to a stride of 512). The maximum is not 8 since the maximum alphabet size is 257 when accounting for the special EOI transition. However, an alphabet length of that size is exceptionally rare since the alphabet is shrunk into equivalence classes.

source

pub fn stride(&self) -> usize

Returns the total stride for every state in this DFA. This corresponds to the total number of transitions used by each state in this DFA’s transition table.

Please see DFA::stride2 for more information. In particular, this returns the stride as the number of transitions, where as stride2 returns it as the exponent of a power of 2.

source

pub fn memory_usage(&self) -> usize

Returns the memory usage, in bytes, of this DFA.

The memory usage is computed based on the number of bytes used to represent this DFA.

This does not include the stack size used up by this DFA. To compute that, use std::mem::size_of::<dense::DFA>().

source§

impl<T: AsRef<[u32]>> DFA<T>

Routines for converting a dense DFA to other representations, such as sparse DFAs or raw bytes suitable for persistent storage.

source

pub fn to_sparse(&self) -> Result<DFA<Vec<u8>>, BuildError>

Available on crate feature dfa-build only.

Convert this dense DFA to a sparse DFA.

If a StateID is too small to represent all states in the sparse DFA, then this returns an error. In most cases, if a dense DFA is constructable with StateID then a sparse DFA will be as well. However, it is not guaranteed.

§Example
use regex_automata::{dfa::{Automaton, dense}, HalfMatch, Input};

let dense = dense::DFA::new("foo[0-9]+")?;
let sparse = dense.to_sparse()?;

let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, sparse.try_search_fwd(&Input::new("foo12345"))?);
source

pub fn to_bytes_little_endian(&self) -> (Vec<u8>, usize)

Available on crate feature dfa-build only.

Serialize this DFA as raw bytes to a Vec<u8> in little endian format. Upon success, the Vec<u8> and the initial padding length are returned.

The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs (assuming all other criteria for the deserialization APIs has been satisfied):

The padding returned is non-zero if the returned Vec<u8> starts at an address that does not have the same alignment as u32. The padding corresponds to the number of leading bytes written to the returned Vec<u8>.

§Example

This example shows how to serialize and deserialize a DFA:

use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};

// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;

// N.B. We use native endianness here to make the example work, but
// using to_bytes_little_endian would work on a little endian target.
let (buf, _) = original_dfa.to_bytes_native_endian();
// Even if buf has initial padding, DFA::from_bytes will automatically
// ignore it.
let dfa: DFA<&[u32]> = DFA::from_bytes(&buf)?.0;

let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
source

pub fn to_bytes_big_endian(&self) -> (Vec<u8>, usize)

Available on crate feature dfa-build only.

Serialize this DFA as raw bytes to a Vec<u8> in big endian format. Upon success, the Vec<u8> and the initial padding length are returned.

The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs (assuming all other criteria for the deserialization APIs has been satisfied):

The padding returned is non-zero if the returned Vec<u8> starts at an address that does not have the same alignment as u32. The padding corresponds to the number of leading bytes written to the returned Vec<u8>.

§Example

This example shows how to serialize and deserialize a DFA:

use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};

// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;

// N.B. We use native endianness here to make the example work, but
// using to_bytes_big_endian would work on a big endian target.
let (buf, _) = original_dfa.to_bytes_native_endian();
// Even if buf has initial padding, DFA::from_bytes will automatically
// ignore it.
let dfa: DFA<&[u32]> = DFA::from_bytes(&buf)?.0;

let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
source

pub fn to_bytes_native_endian(&self) -> (Vec<u8>, usize)

Available on crate feature dfa-build only.

Serialize this DFA as raw bytes to a Vec<u8> in native endian format. Upon success, the Vec<u8> and the initial padding length are returned.

The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs (assuming all other criteria for the deserialization APIs has been satisfied):

The padding returned is non-zero if the returned Vec<u8> starts at an address that does not have the same alignment as u32. The padding corresponds to the number of leading bytes written to the returned Vec<u8>.

Generally speaking, native endian format should only be used when you know that the target you’re compiling the DFA for matches the endianness of the target on which you’re compiling DFA. For example, if serialization and deserialization happen in the same process or on the same machine. Otherwise, when serializing a DFA for use in a portable environment, you’ll almost certainly want to serialize both a little endian and a big endian version and then load the correct one based on the target’s configuration.

§Example

This example shows how to serialize and deserialize a DFA:

use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};

// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;

let (buf, _) = original_dfa.to_bytes_native_endian();
// Even if buf has initial padding, DFA::from_bytes will automatically
// ignore it.
let dfa: DFA<&[u32]> = DFA::from_bytes(&buf)?.0;

let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
source

pub fn write_to_little_endian( &self, dst: &mut [u8] ) -> Result<usize, SerializeError>

Serialize this DFA as raw bytes to the given slice, in little endian format. Upon success, the total number of bytes written to dst is returned.

The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs (assuming all other criteria for the deserialization APIs has been satisfied):

Note that unlike the various to_byte_* routines, this does not write any padding. Callers are responsible for handling alignment correctly.

§Errors

This returns an error if the given destination slice is not big enough to contain the full serialized DFA. If an error occurs, then nothing is written to dst.

§Example

This example shows how to serialize and deserialize a DFA without dynamic memory allocation.

use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};

// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;

// Create a 4KB buffer on the stack to store our serialized DFA. We
// need to use a special type to force the alignment of our [u8; N]
// array to be aligned to a 4 byte boundary. Otherwise, deserializing
// the DFA may fail because of an alignment mismatch.
#[repr(C)]
struct Aligned<B: ?Sized> {
    _align: [u32; 0],
    bytes: B,
}
let mut buf = Aligned { _align: [], bytes: [0u8; 4 * (1<<10)] };
// N.B. We use native endianness here to make the example work, but
// using write_to_little_endian would work on a little endian target.
let written = original_dfa.write_to_native_endian(&mut buf.bytes)?;
let dfa: DFA<&[u32]> = DFA::from_bytes(&buf.bytes[..written])?.0;

let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
source

pub fn write_to_big_endian( &self, dst: &mut [u8] ) -> Result<usize, SerializeError>

Serialize this DFA as raw bytes to the given slice, in big endian format. Upon success, the total number of bytes written to dst is returned.

The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs (assuming all other criteria for the deserialization APIs has been satisfied):

Note that unlike the various to_byte_* routines, this does not write any padding. Callers are responsible for handling alignment correctly.

§Errors

This returns an error if the given destination slice is not big enough to contain the full serialized DFA. If an error occurs, then nothing is written to dst.

§Example

This example shows how to serialize and deserialize a DFA without dynamic memory allocation.

use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};

// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;

// Create a 4KB buffer on the stack to store our serialized DFA. We
// need to use a special type to force the alignment of our [u8; N]
// array to be aligned to a 4 byte boundary. Otherwise, deserializing
// the DFA may fail because of an alignment mismatch.
#[repr(C)]
struct Aligned<B: ?Sized> {
    _align: [u32; 0],
    bytes: B,
}
let mut buf = Aligned { _align: [], bytes: [0u8; 4 * (1<<10)] };
// N.B. We use native endianness here to make the example work, but
// using write_to_big_endian would work on a big endian target.
let written = original_dfa.write_to_native_endian(&mut buf.bytes)?;
let dfa: DFA<&[u32]> = DFA::from_bytes(&buf.bytes[..written])?.0;

let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
source

pub fn write_to_native_endian( &self, dst: &mut [u8] ) -> Result<usize, SerializeError>

Serialize this DFA as raw bytes to the given slice, in native endian format. Upon success, the total number of bytes written to dst is returned.

The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs (assuming all other criteria for the deserialization APIs has been satisfied):

Generally speaking, native endian format should only be used when you know that the target you’re compiling the DFA for matches the endianness of the target on which you’re compiling DFA. For example, if serialization and deserialization happen in the same process or on the same machine. Otherwise, when serializing a DFA for use in a portable environment, you’ll almost certainly want to serialize both a little endian and a big endian version and then load the correct one based on the target’s configuration.

Note that unlike the various to_byte_* routines, this does not write any padding. Callers are responsible for handling alignment correctly.

§Errors

This returns an error if the given destination slice is not big enough to contain the full serialized DFA. If an error occurs, then nothing is written to dst.

§Example

This example shows how to serialize and deserialize a DFA without dynamic memory allocation.

use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};

// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;

// Create a 4KB buffer on the stack to store our serialized DFA. We
// need to use a special type to force the alignment of our [u8; N]
// array to be aligned to a 4 byte boundary. Otherwise, deserializing
// the DFA may fail because of an alignment mismatch.
#[repr(C)]
struct Aligned<B: ?Sized> {
    _align: [u32; 0],
    bytes: B,
}
let mut buf = Aligned { _align: [], bytes: [0u8; 4 * (1<<10)] };
let written = original_dfa.write_to_native_endian(&mut buf.bytes)?;
let dfa: DFA<&[u32]> = DFA::from_bytes(&buf.bytes[..written])?.0;

let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
source

pub fn write_to_len(&self) -> usize

Return the total number of bytes required to serialize this DFA.

This is useful for determining the size of the buffer required to pass to one of the serialization routines:

Passing a buffer smaller than the size returned by this method will result in a serialization error. Serialization routines are guaranteed to succeed when the buffer is big enough.

§Example

This example shows how to dynamically allocate enough room to serialize a DFA.

use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};

let original_dfa = DFA::new("foo[0-9]+")?;

let mut buf = vec![0; original_dfa.write_to_len()];
// This is guaranteed to succeed, because the only serialization error
// that can occur is when the provided buffer is too small. But
// write_to_len guarantees a correct size.
let written = original_dfa.write_to_native_endian(&mut buf).unwrap();
// But this is not guaranteed to succeed! In particular,
// deserialization requires proper alignment for &[u32], but our buffer
// was allocated as a &[u8] whose required alignment is smaller than
// &[u32]. However, it's likely to work in practice because of how most
// allocators work. So if you write code like this, make sure to either
// handle the error correctly and/or run it under Miri since Miri will
// likely provoke the error by returning Vec<u8> buffers with alignment
// less than &[u32].
let dfa: DFA<&[u32]> = match DFA::from_bytes(&buf[..written]) {
    // As mentioned above, it is legal for an error to be returned
    // here. It is quite difficult to get a Vec<u8> with a guaranteed
    // alignment equivalent to Vec<u32>.
    Err(_) => return Ok(()),
    Ok((dfa, _)) => dfa,
};

let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);

Note that this example isn’t actually guaranteed to work! In particular, if buf is not aligned to a 4-byte boundary, then the DFA::from_bytes call will fail. If you need this to work, then you either need to deal with adding some initial padding yourself, or use one of the to_bytes methods, which will do it for you.

source§

impl<'a> DFA<&'a [u32]>

source

pub fn from_bytes( slice: &'a [u8] ) -> Result<(DFA<&'a [u32]>, usize), DeserializeError>

Safely deserialize a DFA with a specific state identifier representation. Upon success, this returns both the deserialized DFA and the number of bytes read from the given slice. Namely, the contents of the slice beyond the DFA are not read.

Deserializing a DFA using this routine will never allocate heap memory. For safety purposes, the DFA’s transition table will be verified such that every transition points to a valid state. If this verification is too costly, then a DFA::from_bytes_unchecked API is provided, which will always execute in constant time.

The bytes given must be generated by one of the serialization APIs of a DFA using a semver compatible release of this crate. Those include:

The to_bytes methods allocate and return a Vec<u8> for you, along with handling alignment correctly. The write_to methods do not allocate and write to an existing slice (which may be on the stack). Since deserialization always uses the native endianness of the target platform, the serialization API you use should match the endianness of the target platform. (It’s often a good idea to generate serialized DFAs for both forms of endianness and then load the correct one based on endianness.)

§Errors

Generally speaking, it’s easier to state the conditions in which an error is not returned. All of the following must be true:

  • The bytes given must be produced by one of the serialization APIs on this DFA, as mentioned above.
  • The endianness of the target platform matches the endianness used to serialized the provided DFA.
  • The slice given must have the same alignment as u32.

If any of the above are not true, then an error will be returned.

§Panics

This routine will never panic for any input.

§Example

This example shows how to serialize a DFA to raw bytes, deserialize it and then use it for searching.

use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};

let initial = DFA::new("foo[0-9]+")?;
let (bytes, _) = initial.to_bytes_native_endian();
let dfa: DFA<&[u32]> = DFA::from_bytes(&bytes)?.0;

let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
§Example: dealing with alignment and padding

In the above example, we used the to_bytes_native_endian method to serialize a DFA, but we ignored part of its return value corresponding to padding added to the beginning of the serialized DFA. This is OK because deserialization will skip this initial padding. What matters is that the address immediately following the padding has an alignment that matches u32. That is, the following is an equivalent but alternative way to write the above example:

use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};

let initial = DFA::new("foo[0-9]+")?;
// Serialization returns the number of leading padding bytes added to
// the returned Vec<u8>.
let (bytes, pad) = initial.to_bytes_native_endian();
let dfa: DFA<&[u32]> = DFA::from_bytes(&bytes[pad..])?.0;

let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);

This padding is necessary because Rust’s standard library does not expose any safe and robust way of creating a Vec<u8> with a guaranteed alignment other than 1. Now, in practice, the underlying allocator is likely to provide a Vec<u8> that meets our alignment requirements, which means pad is zero in practice most of the time.

The purpose of exposing the padding like this is flexibility for the caller. For example, if one wants to embed a serialized DFA into a compiled program, then it’s important to guarantee that it starts at a u32-aligned address. The simplest way to do this is to discard the padding bytes and set it up so that the serialized DFA itself begins at a properly aligned address. We can show this in two parts. The first part is serializing the DFA to a file:

use regex_automata::dfa::dense::DFA;

let dfa = DFA::new("foo[0-9]+")?;

let (bytes, pad) = dfa.to_bytes_big_endian();
// Write the contents of the DFA *without* the initial padding.
std::fs::write("foo.bigendian.dfa", &bytes[pad..])?;

// Do it again, but this time for little endian.
let (bytes, pad) = dfa.to_bytes_little_endian();
std::fs::write("foo.littleendian.dfa", &bytes[pad..])?;

And now the second part is embedding the DFA into the compiled program and deserializing it at runtime on first use. We use conditional compilation to choose the correct endianness.

use regex_automata::{
    dfa::{Automaton, dense::DFA},
    util::{lazy::Lazy, wire::AlignAs},
    HalfMatch, Input,
};

// This crate provides its own "lazy" type, kind of like
// lazy_static! or once_cell::sync::Lazy. But it works in no-alloc
// no-std environments and let's us write this using completely
// safe code.
static RE: Lazy<DFA<&'static [u32]>> = Lazy::new(|| {
    // This assignment is made possible (implicitly) via the
    // CoerceUnsized trait. This is what guarantees that our
    // bytes are stored in memory on a 4 byte boundary. You
    // *must* do this or something equivalent for correct
    // deserialization.
    static ALIGNED: &AlignAs<[u8], u32> = &AlignAs {
        _align: [],
        #[cfg(target_endian = "big")]
        bytes: *include_bytes!("foo.bigendian.dfa"),
        #[cfg(target_endian = "little")]
        bytes: *include_bytes!("foo.littleendian.dfa"),
    };

    let (dfa, _) = DFA::from_bytes(&ALIGNED.bytes)
        .expect("serialized DFA should be valid");
    dfa
});

let expected = Ok(Some(HalfMatch::must(0, 8)));
assert_eq!(expected, RE.try_search_fwd(&Input::new("foo12345")));

An alternative to util::lazy::Lazy is lazy_static or once_cell, which provide stronger guarantees (like the initialization function only being executed once). And once_cell in particular provides a more expressive API. But a Lazy value from this crate is likely just fine in most circumstances.

Note that regardless of which initialization method you use, you will still need to use the AlignAs trick above to force correct alignment, but this is safe to do and from_bytes will return an error if you get it wrong.

source

pub unsafe fn from_bytes_unchecked( slice: &'a [u8] ) -> Result<(DFA<&'a [u32]>, usize), DeserializeError>

Deserialize a DFA with a specific state identifier representation in constant time by omitting the verification of the validity of the transition table and other data inside the DFA.

This is just like DFA::from_bytes, except it can potentially return a DFA that exhibits undefined behavior if its transition table contains invalid state identifiers.

This routine is useful if you need to deserialize a DFA cheaply and cannot afford the transition table validation performed by from_bytes.

§Example
use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};

let initial = DFA::new("foo[0-9]+")?;
let (bytes, _) = initial.to_bytes_native_endian();
// SAFETY: This is guaranteed to be safe since the bytes given come
// directly from a compatible serialization routine.
let dfa: DFA<&[u32]> = unsafe { DFA::from_bytes_unchecked(&bytes)?.0 };

let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);

Trait Implementations§

source§

impl<T: AsRef<[u32]>> Automaton for DFA<T>

source§

fn is_special_state(&self, id: StateID) -> bool

Returns true if and only if the given identifier corresponds to a “special” state. A special state is one or more of the following: a dead state, a quit state, a match state, a start state or an accelerated state. Read more
source§

fn is_dead_state(&self, id: StateID) -> bool

Returns true if and only if the given identifier corresponds to a dead state. When a DFA enters a dead state, it is impossible to leave. That is, every transition on a dead state by definition leads back to the same dead state. Read more
source§

fn is_quit_state(&self, id: StateID) -> bool

Returns true if and only if the given identifier corresponds to a quit state. A quit state is like a dead state (it has no transitions other than to itself), except it indicates that the DFA failed to complete the search. When this occurs, callers can neither accept or reject that a match occurred. Read more
source§

fn is_match_state(&self, id: StateID) -> bool

Returns true if and only if the given identifier corresponds to a match state. A match state is also referred to as a “final” state and indicates that a match has been found. Read more
source§

fn is_start_state(&self, id: StateID) -> bool

Returns true only if the given identifier corresponds to a start state Read more
source§

fn is_accel_state(&self, id: StateID) -> bool

Returns true if and only if the given identifier corresponds to an accelerated state. Read more
source§

fn next_state(&self, current: StateID, input: u8) -> StateID

Transitions from the current state to the next state, given the next byte of input. Read more
source§

unsafe fn next_state_unchecked(&self, current: StateID, byte: u8) -> StateID

Transitions from the current state to the next state, given the next byte of input. Read more
source§

fn next_eoi_state(&self, current: StateID) -> StateID

Transitions from the current state to the next state for the special EOI symbol. Read more
source§

fn pattern_len(&self) -> usize

Returns the total number of patterns compiled into this DFA. Read more
source§

fn match_len(&self, id: StateID) -> usize

Returns the total number of patterns that match in this state. Read more
source§

fn match_pattern(&self, id: StateID, match_index: usize) -> PatternID

Returns the pattern ID corresponding to the given match index in the given state. Read more
source§

fn has_empty(&self) -> bool

Returns true if and only if this automaton can match the empty string. When it returns false, all possible matches are guaranteed to have a non-zero length. Read more
source§

fn is_utf8(&self) -> bool

Whether UTF-8 mode is enabled for this DFA or not. Read more
source§

fn is_always_start_anchored(&self) -> bool

Returns true if and only if this DFA is limited to returning matches whose start position is 0. Read more
source§

fn start_state(&self, config: &Config) -> Result<StateID, StartError>

Return the ID of the start state for this DFA for the given starting configuration. Read more
source§

fn universal_start_state(&self, mode: Anchored) -> Option<StateID>

If this DFA has a universal starting state for the given anchor mode and the DFA supports universal starting states, then this returns that state’s identifier. Read more
source§

fn accelerator(&self, id: StateID) -> &[u8]

Return a slice of bytes to accelerate for the given state, if possible. Read more
source§

fn get_prefilter(&self) -> Option<&Prefilter>

Returns the prefilter associated with a DFA, if one exists. Read more
source§

fn start_state_forward(&self, input: &Input<'_>) -> Result<StateID, MatchError>

Return the ID of the start state for this DFA when executing a forward search. Read more
source§

fn start_state_reverse(&self, input: &Input<'_>) -> Result<StateID, MatchError>

Return the ID of the start state for this DFA when executing a reverse search. Read more
source§

fn try_search_fwd( &self, input: &Input<'_> ) -> Result<Option<HalfMatch>, MatchError>

Executes a forward search and returns the end position of the leftmost match that is found. If no match exists, then None is returned. Read more
source§

fn try_search_rev( &self, input: &Input<'_> ) -> Result<Option<HalfMatch>, MatchError>

Executes a reverse search and returns the start of the position of the leftmost match that is found. If no match exists, then None is returned. Read more
source§

fn try_search_overlapping_fwd( &self, input: &Input<'_>, state: &mut OverlappingState ) -> Result<(), MatchError>

Executes an overlapping forward search. Matches, if one exists, can be obtained via the OverlappingState::get_match method. Read more
source§

fn try_search_overlapping_rev( &self, input: &Input<'_>, state: &mut OverlappingState ) -> Result<(), MatchError>

Executes a reverse overlapping forward search. Matches, if one exists, can be obtained via the OverlappingState::get_match method. Read more
source§

fn try_which_overlapping_matches( &self, input: &Input<'_>, patset: &mut PatternSet ) -> Result<(), MatchError>

Available on crate feature alloc only.
Writes the set of patterns that match anywhere in the given search configuration to patset. If multiple patterns match at the same position and the underlying DFA supports overlapping matches, then all matching patterns are written to the given set. Read more
source§

impl<T: Clone> Clone for DFA<T>

source§

fn clone(&self) -> DFA<T>

Returns a copy of the value. Read more
1.0.0 · source§

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source. Read more
source§

impl<T: AsRef<[u32]>> Debug for DFA<T>

source§

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter. Read more

Auto Trait Implementations§

§

impl<T> RefUnwindSafe for DFA<T>
where T: RefUnwindSafe,

§

impl<T> Send for DFA<T>
where T: Send,

§

impl<T> Sync for DFA<T>
where T: Sync,

§

impl<T> Unpin for DFA<T>
where T: Unpin,

§

impl<T> UnwindSafe for DFA<T>
where T: UnwindSafe,

Blanket Implementations§

source§

impl<T> Any for T
where T: 'static + ?Sized,

source§

fn type_id(&self) -> TypeId

Gets the TypeId of self. Read more
source§

impl<T> Borrow<T> for T
where T: ?Sized,

source§

fn borrow(&self) -> &T

Immutably borrows from an owned value. Read more
source§

impl<T> BorrowMut<T> for T
where T: ?Sized,

source§

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value. Read more
source§

impl<T> From<T> for T

source§

fn from(t: T) -> T

Returns the argument unchanged.

source§

impl<T, U> Into<U> for T
where U: From<T>,

source§

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

source§

impl<T> ToOwned for T
where T: Clone,

§

type Owned = T

The resulting type after obtaining ownership.
source§

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning. Read more
source§

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning. Read more
source§

impl<T, U> TryFrom<U> for T
where U: Into<T>,

§

type Error = Infallible

The type returned in the event of a conversion error.
source§

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.
source§

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

§

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.
source§

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.