Struct regex_automata::dfa::dense::DFA
pub struct DFA<T> { /* private fields */ }
A dense table-based deterministic finite automaton (DFA).
All dense DFAs have one or more start states, zero or more match states and a transition table that maps the current state and the current byte of input to the next state. A DFA can use this information to implement fast searching. In particular, the use of a dense DFA generally makes the trade off that match speed is the most valuable characteristic, even if building the DFA may take significant time and space. (More concretely, building a DFA takes time and space that is exponential in the size of the pattern in the worst case.) As such, the processing of every byte of input is done with a small constant number of operations that does not vary with the pattern, its size or the size of the alphabet. If your needs don’t line up with this trade off, then a dense DFA may not be an adequate solution to your problem.
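The table lookup described above can be sketched with a toy automaton. This is a minimal illustration, assuming a hypothetical `ToyDense` type and a hand-written table for the pattern `ab+`; the names, state numbering and layout are invented for the example and are not this crate's internal representation.

```rust
/// A toy dense DFA for the pattern `ab+` over a 3-symbol alphabet:
/// class 0 = b'a', class 1 = b'b', class 2 = every other byte.
/// Hypothetical illustration only; not regex-automata's real layout.
struct ToyDense {
    /// Row-major table: `table[state * STRIDE + class]` is the next state.
    table: Vec<u32>,
}

const STRIDE: usize = 3;
const DEAD: u32 = 0;
const START: u32 = 1;
const MATCH: u32 = 3;

/// Maps a haystack byte to its column in the transition table.
fn class_of(byte: u8) -> usize {
    match byte {
        b'a' => 0,
        b'b' => 1,
        _ => 2,
    }
}

impl ToyDense {
    fn new() -> ToyDense {
        // States: 0 = dead, 1 = start, 2 = saw "a", 3 = saw "ab+" (match).
        let table = vec![
            0, 0, 0, // dead: every byte stays dead
            2, 0, 0, // start: 'a' advances, anything else dies
            0, 3, 0, // saw "a": 'b' reaches the match state
            0, 3, 0, // match: further 'b's keep matching
        ];
        ToyDense { table }
    }

    /// Anchored search from offset 0. Returns the end offset of the
    /// longest match, doing one table lookup per byte of input.
    fn find_end(&self, haystack: &[u8]) -> Option<usize> {
        let mut state = START;
        let mut last_match = None;
        for (i, &byte) in haystack.iter().enumerate() {
            state = self.table[state as usize * STRIDE + class_of(byte)];
            if state == DEAD {
                break;
            }
            if state == MATCH {
                last_match = Some(i + 1);
            }
        }
        last_match
    }
}
```

Note that the per-byte work is a single indexed load regardless of how complicated the pattern is; only the size of the table grows.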
In contrast, a sparse::DFA makes the opposite trade off: it uses less space but will execute a variable number of instructions per byte at match time, which makes it slower for matching. (Note that space usage is still exponential in the size of the pattern in the worst case.)
A DFA can be built using the default configuration via the DFA::new constructor. Otherwise, one can configure various aspects via dense::Builder.
A single DFA fundamentally supports the following operations:
- Detection of a match.
- Location of the end of a match.
- In the case of a DFA with multiple patterns, which pattern matched is reported as well.
A notable absence from the above list of capabilities is the location of the start of a match. In order to provide both the start and end of a match, two DFAs are required. This functionality is provided by a Regex.
Type parameters
A DFA has one type parameter, T, which is used to represent state IDs, pattern IDs and accelerators. T is typically a Vec<u32> or a &[u32].
The Automaton trait
This type implements the Automaton trait, which means it can be used for searching. For example:
use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};
let dfa = DFA::new("foo[0-9]+")?;
let expected = HalfMatch::must(0, 8);
assert_eq!(Some(expected), dfa.try_search_fwd(&Input::new("foo12345"))?);
Implementations
impl DFA<Vec<u32>>
pub fn new(pattern: &str) -> Result<DFA<Vec<u32>>, BuildError>
Parse the given regular expression using a default configuration and return the corresponding DFA.
If you want a non-default configuration, then use the dense::Builder to set your own configuration.
Example
use regex_automata::{dfa::{Automaton, dense}, HalfMatch, Input};
let dfa = dense::DFA::new("foo[0-9]+bar")?;
let expected = Some(HalfMatch::must(0, 11));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345bar"))?);
pub fn new_many<P: AsRef<str>>(patterns: &[P]) -> Result<DFA<Vec<u32>>, BuildError>
Parse the given regular expressions using a default configuration and return the corresponding multi-DFA.
If you want a non-default configuration, then use the dense::Builder to set your own configuration.
Example
use regex_automata::{dfa::{Automaton, dense}, HalfMatch, Input};
let dfa = dense::DFA::new_many(&["[0-9]+", "[a-z]+"])?;
let expected = Some(HalfMatch::must(1, 3));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345bar"))?);
impl DFA<Vec<u32>>
pub fn always_match() -> Result<DFA<Vec<u32>>, BuildError>
Create a new DFA that matches every input.
Example
use regex_automata::{dfa::{Automaton, dense}, HalfMatch, Input};
let dfa = dense::DFA::always_match()?;
let expected = Some(HalfMatch::must(0, 0));
assert_eq!(expected, dfa.try_search_fwd(&Input::new(""))?);
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo"))?);
pub fn never_match() -> Result<DFA<Vec<u32>>, BuildError>
Create a new DFA that never matches any input.
Example
use regex_automata::{dfa::{Automaton, dense}, Input};
let dfa = dense::DFA::never_match()?;
assert_eq!(None, dfa.try_search_fwd(&Input::new(""))?);
assert_eq!(None, dfa.try_search_fwd(&Input::new("foo"))?);
impl DFA<&[u32]>
impl<T: AsRef<[u32]>> DFA<T>
pub fn as_ref(&self) -> DFA<&[u32]>
Cheaply return a borrowed version of this dense DFA. Specifically, the DFA returned always uses &[u32] for its transition table.
pub fn to_owned(&self) -> DFA<Vec<u32>>
Return an owned version of this dense DFA. Specifically, the DFA returned always uses Vec<u32> for its transition table.
Effectively, this returns a dense DFA whose transition table lives on the heap.
pub fn start_kind(&self) -> StartKind
Returns the starting state configuration for this DFA.
The default is StartKind::Both, which means the DFA supports both unanchored and anchored searches. However, this can generally lead to bigger DFAs. Therefore, a DFA might be compiled with support for just unanchored or anchored searches. In that case, running a search with an unsupported configuration will panic.
pub fn starts_for_each_pattern(&self) -> bool
Returns true only if this DFA has starting states for each pattern.
When a DFA has starting states for each pattern, then a search with the DFA can be configured to only look for anchored matches of a specific pattern. Specifically, APIs like Automaton::try_search_fwd can accept a non-None pattern_id if and only if this method returns true. Otherwise, calling try_search_fwd will panic.
Note that if the DFA has no patterns, this always returns false.
pub fn byte_classes(&self) -> &ByteClasses
Returns the equivalence classes that make up the alphabet for this DFA.
Unless Config::byte_classes was disabled, it is possible that multiple distinct bytes are grouped into the same equivalence class if it is impossible for them to discriminate between a match and a non-match. This has the effect of reducing the overall alphabet size and in turn potentially substantially reducing the size of the DFA’s transition table.
The downside of using equivalence classes like this is that every state transition will automatically use this map to convert an arbitrary byte to its corresponding equivalence class. In practice this has a negligible impact on performance.
pub fn alphabet_len(&self) -> usize
Returns the total number of elements in the alphabet for this DFA.
That is, this returns the total number of transitions that each state in this DFA must have. Typically, a normal byte oriented DFA would always have an alphabet size of 256, corresponding to the number of unique values in a single byte. However, this implementation has two peculiarities that impact the alphabet length:
- Every state has a special “EOI” transition that is only followed after the end of some haystack is reached. This EOI transition is necessary to account for one byte of look-ahead when implementing things like \b and $.
- Bytes are grouped into equivalence classes such that no two bytes in the same class can distinguish a match from a non-match. For example, in the regex ^[a-z]+$, the ASCII bytes a-z could all be in the same equivalence class. This leads to a massive space savings.
Note though that the alphabet length does not necessarily equal the total stride space taken up by a single DFA state in the transition table. Namely, for performance reasons, the stride is always the smallest power of two that is greater than or equal to the alphabet length. For this reason, DFA::stride or DFA::stride2 are often more useful. The alphabet length is typically useful only for informational purposes.
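To make the two peculiarities concrete, here is a minimal sketch of how an alphabet could shrink from 256 symbols to a handful. It assumes a hypothetical three-way partition for a pattern like ^[a-z]+$ (bytes below a, bytes in a-z, bytes above z); the function names are invented for illustration and the partition the crate actually computes may differ.

```rust
/// Builds a hypothetical byte-to-class map for a pattern that only needs
/// to tell `a-z` apart from the bytes below and above that range.
fn toy_byte_classes() -> [u8; 256] {
    let mut map = [0u8; 256];
    for b in 0..=255u8 {
        map[b as usize] = if b < b'a' {
            0
        } else if b <= b'z' {
            1
        } else {
            2
        };
    }
    map
}

/// Number of distinct classes, i.e. the highest class ID plus one.
fn class_count(map: &[u8; 256]) -> usize {
    map.iter().copied().max().map_or(0, |max| max as usize + 1)
}

/// Alphabet length in this sketch: one column per byte class, plus one
/// extra column for the special EOI transition.
fn toy_alphabet_len(map: &[u8; 256]) -> usize {
    class_count(map) + 1
}
```

Under these assumptions each state needs 4 transitions rather than 257, which is where the space savings come from.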
pub fn stride2(&self) -> usize
Returns the total stride for every state in this DFA, expressed as the exponent of a power of 2. The stride is the amount of space each state takes up in the transition table, expressed as a number of transitions. (Unused transitions map to dead states.)
The stride of a DFA is always equivalent to the smallest power of 2 that is greater than or equal to the DFA’s alphabet length. This definition uses extra space, but permits faster translation between premultiplied state identifiers and contiguous indices (by using shifts instead of relying on integer division).
For example, if the DFA’s stride is 16 transitions, then its stride2 is 4 since 2^4 = 16.
The minimum stride2 value is 1 (corresponding to a stride of 2) while the maximum stride2 value is 9 (corresponding to a stride of 512). The maximum is not 8 since the maximum alphabet size is 257 when accounting for the special EOI transition. However, an alphabet length of that size is exceptionally rare since the alphabet is shrunk into equivalence classes.
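The arithmetic above can be checked with a few lines of standard-library Rust. This is a sketch under the definitions given in this section; the helper names (`stride2_for`, `stride_for`, `to_index`, `to_premultiplied`) are hypothetical and are not part of this crate's API.

```rust
/// Exponent of the smallest power of two that is >= `alphabet_len`,
/// with a floor of 1 so that the minimum stride is 2.
fn stride2_for(alphabet_len: usize) -> usize {
    let stride = alphabet_len.next_power_of_two().max(2);
    stride.trailing_zeros() as usize
}

/// The stride itself, expressed as a number of transitions.
fn stride_for(alphabet_len: usize) -> usize {
    1 << stride2_for(alphabet_len)
}

/// A premultiplied state ID is `index * stride`. Because the stride is a
/// power of two, converting back is a shift rather than a division.
fn to_index(premultiplied_id: usize, stride2: usize) -> usize {
    premultiplied_id >> stride2
}

/// The inverse conversion: contiguous index to premultiplied ID.
fn to_premultiplied(index: usize, stride2: usize) -> usize {
    index << stride2
}
```

For instance, an alphabet length of 257 rounds up to a stride of 512 (stride2 of 9), while an alphabet length of 5 rounds up to a stride of 8 and leaves three unused transitions per state.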
pub fn stride(&self) -> usize
Returns the total stride for every state in this DFA. This corresponds to the total number of transitions used by each state in this DFA’s transition table.
Please see DFA::stride2 for more information. In particular, this returns the stride as the number of transitions, whereas stride2 returns it as the exponent of a power of 2.
pub fn memory_usage(&self) -> usize
Returns the memory usage, in bytes, of this DFA.
The memory usage is computed based on the number of bytes used to represent this DFA.
This does not include the stack size used up by this DFA. To compute that, use std::mem::size_of::<dense::DFA>().
impl<T: AsRef<[u32]>> DFA<T>
Routines for converting a dense DFA to other representations, such as sparse DFAs or raw bytes suitable for persistent storage.
pub fn to_sparse(&self) -> Result<sparse::DFA<Vec<u8>>, BuildError>
Convert this dense DFA to a sparse DFA.
If a StateID is too small to represent all states in the sparse DFA, then this returns an error. In most cases, if a dense DFA is constructable with StateID then a sparse DFA will be as well. However, it is not guaranteed.
Example
use regex_automata::{dfa::{Automaton, dense}, HalfMatch, Input};
let dense = dense::DFA::new("foo[0-9]+")?;
let sparse = dense.to_sparse()?;
let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, sparse.try_search_fwd(&Input::new("foo12345"))?);
pub fn to_bytes_little_endian(&self) -> (Vec<u8>, usize)
Serialize this DFA as raw bytes to a Vec<u8> in little endian format. Upon success, the Vec<u8> and the initial padding length are returned.
The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs, DFA::from_bytes and DFA::from_bytes_unchecked (assuming all other criteria for the deserialization APIs have been satisfied).
The padding returned is non-zero if the returned Vec<u8> starts at an address that does not have the same alignment as u32. The padding corresponds to the number of leading bytes written to the returned Vec<u8>.
Example
This example shows how to serialize and deserialize a DFA:
use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};
// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;
// N.B. We use native endianness here to make the example work, but
// using to_bytes_little_endian would work on a little endian target.
let (buf, _) = original_dfa.to_bytes_native_endian();
// Even if buf has initial padding, DFA::from_bytes will automatically
// ignore it.
let dfa: DFA<&[u32]> = DFA::from_bytes(&buf)?.0;
let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
pub fn to_bytes_big_endian(&self) -> (Vec<u8>, usize)
Serialize this DFA as raw bytes to a Vec<u8> in big endian format. Upon success, the Vec<u8> and the initial padding length are returned.
The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs, DFA::from_bytes and DFA::from_bytes_unchecked (assuming all other criteria for the deserialization APIs have been satisfied).
The padding returned is non-zero if the returned Vec<u8> starts at an address that does not have the same alignment as u32. The padding corresponds to the number of leading bytes written to the returned Vec<u8>.
Example
This example shows how to serialize and deserialize a DFA:
use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};
// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;
// N.B. We use native endianness here to make the example work, but
// using to_bytes_big_endian would work on a big endian target.
let (buf, _) = original_dfa.to_bytes_native_endian();
// Even if buf has initial padding, DFA::from_bytes will automatically
// ignore it.
let dfa: DFA<&[u32]> = DFA::from_bytes(&buf)?.0;
let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
pub fn to_bytes_native_endian(&self) -> (Vec<u8>, usize)
Serialize this DFA as raw bytes to a Vec<u8> in native endian format. Upon success, the Vec<u8> and the initial padding length are returned.
The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs, DFA::from_bytes and DFA::from_bytes_unchecked (assuming all other criteria for the deserialization APIs have been satisfied).
The padding returned is non-zero if the returned Vec<u8> starts at an address that does not have the same alignment as u32. The padding corresponds to the number of leading bytes written to the returned Vec<u8>.
Generally speaking, native endian format should only be used when you know that the target you’re compiling the DFA for matches the endianness of the target on which you’re compiling the DFA, for example, if serialization and deserialization happen in the same process or on the same machine. Otherwise, when serializing a DFA for use in a portable environment, you’ll almost certainly want to serialize both a little endian and a big endian version and then load the correct one based on the target’s configuration.
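The reason the endianness has to match can be seen with plain u32 values, independent of this crate. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the byte layout shown is that of bare integers, not the serialized DFA format.

```rust
/// Encodes each word as 4 bytes, least significant byte first.
fn encode_le(words: &[u32]) -> Vec<u8> {
    words.iter().flat_map(|w| w.to_le_bytes()).collect()
}

/// Encodes each word as 4 bytes, most significant byte first.
fn encode_be(words: &[u32]) -> Vec<u8> {
    words.iter().flat_map(|w| w.to_be_bytes()).collect()
}

/// Decodes using the endianness of the target this code is compiled for.
/// Trailing bytes that do not form a full word are ignored.
fn decode_native(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks_exact(4)
        .map(|c| u32::from_ne_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}

/// Picks the encoding that a native-endian reader on this target expects.
fn encode_for_this_target(words: &[u32]) -> Vec<u8> {
    if cfg!(target_endian = "little") {
        encode_le(words)
    } else {
        encode_be(words)
    }
}
```

Decoding bytes written in the other byte order would silently produce different integers, which is why a portable build typically ships both encodings and selects one with `cfg(target_endian = "...")`.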
Example
This example shows how to serialize and deserialize a DFA:
use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};
// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;
let (buf, _) = original_dfa.to_bytes_native_endian();
// Even if buf has initial padding, DFA::from_bytes will automatically
// ignore it.
let dfa: DFA<&[u32]> = DFA::from_bytes(&buf)?.0;
let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
pub fn write_to_little_endian(&self, dst: &mut [u8]) -> Result<usize, SerializeError>
Serialize this DFA as raw bytes to the given slice, in little endian format. Upon success, the total number of bytes written to dst is returned.
The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs, DFA::from_bytes and DFA::from_bytes_unchecked (assuming all other criteria for the deserialization APIs have been satisfied).
Note that unlike the various to_bytes_* routines, this does not write any padding. Callers are responsible for handling alignment correctly.
Errors
This returns an error if the given destination slice is not big enough to contain the full serialized DFA. If an error occurs, then nothing is written to dst.
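The "check the size first, then write" contract described under Errors can be illustrated with a toy writer. This is only a sketch of the general technique over bare u32 words; `write_words_le` is a hypothetical helper, and it reports the required size as a plain `usize` rather than this crate's SerializeError.

```rust
/// Writes `words` into `dst` in little endian order without allocating.
/// Returns the number of bytes written, or `Err(needed)` if `dst` is too
/// small, in which case `dst` is left completely untouched.
fn write_words_le(words: &[u32], dst: &mut [u8]) -> Result<usize, usize> {
    let needed = words.len() * 4;
    // Validate the length up front so that a failure never leaves a
    // partially written buffer behind.
    if dst.len() < needed {
        return Err(needed);
    }
    for (chunk, word) in dst.chunks_exact_mut(4).zip(words) {
        chunk.copy_from_slice(&word.to_le_bytes());
    }
    Ok(needed)
}
```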
Example
This example shows how to serialize and deserialize a DFA without dynamic memory allocation.
use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};
// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;
// Create a 4KB buffer on the stack to store our serialized DFA. We
// need to use a special type to force the alignment of our [u8; N]
// array to be aligned to a 4 byte boundary. Otherwise, deserializing
// the DFA may fail because of an alignment mismatch.
#[repr(C)]
struct Aligned<B: ?Sized> {
    _align: [u32; 0],
    bytes: B,
}
let mut buf = Aligned { _align: [], bytes: [0u8; 4 * (1 << 10)] };
// N.B. We use native endianness here to make the example work, but
// using write_to_little_endian would work on a little endian target.
let written = original_dfa.write_to_native_endian(&mut buf.bytes)?;
let dfa: DFA<&[u32]> = DFA::from_bytes(&buf.bytes[..written])?.0;
let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
pub fn write_to_big_endian(&self, dst: &mut [u8]) -> Result<usize, SerializeError>
Serialize this DFA as raw bytes to the given slice, in big endian format. Upon success, the total number of bytes written to dst is returned.
The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs, DFA::from_bytes and DFA::from_bytes_unchecked (assuming all other criteria for the deserialization APIs have been satisfied).
Note that unlike the various to_bytes_* routines, this does not write any padding. Callers are responsible for handling alignment correctly.
Errors
This returns an error if the given destination slice is not big enough to contain the full serialized DFA. If an error occurs, then nothing is written to dst.
Example
This example shows how to serialize and deserialize a DFA without dynamic memory allocation.
use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};
// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;
// Create a 4KB buffer on the stack to store our serialized DFA. We
// need to use a special type to force the alignment of our [u8; N]
// array to be aligned to a 4 byte boundary. Otherwise, deserializing
// the DFA may fail because of an alignment mismatch.
#[repr(C)]
struct Aligned<B: ?Sized> {
    _align: [u32; 0],
    bytes: B,
}
let mut buf = Aligned { _align: [], bytes: [0u8; 4 * (1 << 10)] };
// N.B. We use native endianness here to make the example work, but
// using write_to_big_endian would work on a big endian target.
let written = original_dfa.write_to_native_endian(&mut buf.bytes)?;
let dfa: DFA<&[u32]> = DFA::from_bytes(&buf.bytes[..written])?.0;
let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
pub fn write_to_native_endian(&self, dst: &mut [u8]) -> Result<usize, SerializeError>
Serialize this DFA as raw bytes to the given slice, in native endian format. Upon success, the total number of bytes written to dst is returned.
The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs, DFA::from_bytes and DFA::from_bytes_unchecked (assuming all other criteria for the deserialization APIs have been satisfied).
Generally speaking, native endian format should only be used when you know that the target you’re compiling the DFA for matches the endianness of the target on which you’re compiling the DFA, for example, if serialization and deserialization happen in the same process or on the same machine. Otherwise, when serializing a DFA for use in a portable environment, you’ll almost certainly want to serialize both a little endian and a big endian version and then load the correct one based on the target’s configuration.
Note that unlike the various to_bytes_* routines, this does not write any padding. Callers are responsible for handling alignment correctly.
Errors
This returns an error if the given destination slice is not big enough to contain the full serialized DFA. If an error occurs, then nothing is written to dst.
Example
This example shows how to serialize and deserialize a DFA without dynamic memory allocation.
use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};
// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;
// Create a 4KB buffer on the stack to store our serialized DFA. We
// need to use a special type to force the alignment of our [u8; N]
// array to be aligned to a 4 byte boundary. Otherwise, deserializing
// the DFA may fail because of an alignment mismatch.
#[repr(C)]
struct Aligned<B: ?Sized> {
    _align: [u32; 0],
    bytes: B,
}
let mut buf = Aligned { _align: [], bytes: [0u8; 4 * (1 << 10)] };
let written = original_dfa.write_to_native_endian(&mut buf.bytes)?;
let dfa: DFA<&[u32]> = DFA::from_bytes(&buf.bytes[..written])?.0;
let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
pub fn write_to_len(&self) -> usize
Return the total number of bytes required to serialize this DFA.
This is useful for determining the size of the buffer required to pass to one of the serialization routines: DFA::write_to_little_endian, DFA::write_to_big_endian or DFA::write_to_native_endian.
Passing a buffer smaller than the size returned by this method will result in a serialization error. Serialization routines are guaranteed to succeed when the buffer is big enough.
Example
This example shows how to dynamically allocate enough room to serialize a DFA.
use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};
let original_dfa = DFA::new("foo[0-9]+")?;
let mut buf = vec![0; original_dfa.write_to_len()];
// This is guaranteed to succeed, because the only serialization error
// that can occur is when the provided buffer is too small. But
// write_to_len guarantees a correct size.
let written = original_dfa.write_to_native_endian(&mut buf).unwrap();
// But this is not guaranteed to succeed! In particular,
// deserialization requires proper alignment for &[u32], but our buffer
// was allocated as a &[u8] whose required alignment is smaller than
// &[u32]. However, it's likely to work in practice because of how most
// allocators work. So if you write code like this, make sure to either
// handle the error correctly and/or run it under Miri since Miri will
// likely provoke the error by returning Vec<u8> buffers with alignment
// less than &[u32].
let dfa: DFA<&[u32]> = match DFA::from_bytes(&buf[..written]) {
    // As mentioned above, it is legal for an error to be returned
    // here. It is quite difficult to get a Vec<u8> with a guaranteed
    // alignment equivalent to Vec<u32>.
    Err(_) => return Ok(()),
    Ok((dfa, _)) => dfa,
};
let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
Note that this example isn’t actually guaranteed to work! In particular, if buf is not aligned to a 4-byte boundary, then the DFA::from_bytes call will fail. If you need this to work, then you either need to deal with adding some initial padding yourself, or use one of the to_bytes methods, which will do it for you.
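Adding the initial padding yourself could look roughly like the following. This is a standard-library-only sketch of the general technique; `copy_with_padding` is a hypothetical helper, and it is not how the to_bytes methods are implemented.

```rust
/// Copies `payload` into a fresh Vec<u8>, preceded by however many
/// padding bytes are needed for the payload to start at an address that
/// is aligned for u32. Returns the buffer and the padding length.
fn copy_with_padding(payload: &[u8]) -> (Vec<u8>, usize) {
    let align = std::mem::align_of::<u32>();
    // Allocate the worst case up front so the buffer never reallocates,
    // which would invalidate the address the padding was computed from.
    let mut buf = vec![0u8; payload.len() + align - 1];
    let addr = buf.as_ptr() as usize;
    let pad = (align - (addr % align)) % align;
    buf[pad..pad + payload.len()].copy_from_slice(payload);
    // Shrinking the length does not move the allocation.
    buf.truncate(pad + payload.len());
    (buf, pad)
}
```

The aligned payload is then `&buf[pad..]`, which mirrors the (Vec<u8>, usize) shape that the to_bytes methods return.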
impl<'a> DFA<&'a [u32]>
pub fn from_bytes(slice: &'a [u8]) -> Result<(DFA<&'a [u32]>, usize), DeserializeError>
Safely deserialize a DFA with a specific state identifier representation. Upon success, this returns both the deserialized DFA and the number of bytes read from the given slice. Namely, the contents of the slice beyond the DFA are not read.
Deserializing a DFA using this routine will never allocate heap memory.
For safety purposes, the DFA’s transition table will be verified such that every transition points to a valid state. If this verification is too costly, then a DFA::from_bytes_unchecked API is provided, which will always execute in constant time.
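To see why this verification costs time proportional to the size of the table, consider a toy version of the check. The sketch below assumes a hypothetical row-major table of plain (not premultiplied) state indices; `validate_transitions` is an invented name and the crate's real validation covers more than this.

```rust
/// Checks that every entry of a row-major transition table refers to a
/// state that actually exists. The table holds `stride` entries per state.
fn validate_transitions(table: &[u32], stride: usize) -> Result<(), String> {
    if stride == 0 || table.len() % stride != 0 {
        return Err(format!(
            "table length {} is not a multiple of stride {}",
            table.len(),
            stride
        ));
    }
    let state_len = table.len() / stride;
    // Every transition must be inspected, so this is linear in table size.
    for (i, &next) in table.iter().enumerate() {
        if next as usize >= state_len {
            return Err(format!("transition {} points to invalid state {}", i, next));
        }
    }
    Ok(())
}
```

Skipping a loop like this is what makes an unchecked constructor constant time, at the cost of trusting the input completely.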
The bytes given must be generated by one of the serialization APIs of a DFA using a semver compatible release of this crate. Those include:
- DFA::to_bytes_little_endian
- DFA::to_bytes_big_endian
- DFA::to_bytes_native_endian
- DFA::write_to_little_endian
- DFA::write_to_big_endian
- DFA::write_to_native_endian
The to_bytes methods allocate and return a Vec<u8> for you, along with handling alignment correctly. The write_to methods do not allocate and write to an existing slice (which may be on the stack). Since deserialization always uses the native endianness of the target platform, the serialization API you use should match the endianness of the target platform. (It’s often a good idea to generate serialized DFAs for both forms of endianness and then load the correct one based on endianness.)
Errors
Generally speaking, it’s easier to state the conditions in which an error is not returned. All of the following must be true:
- The bytes given must be produced by one of the serialization APIs on this DFA, as mentioned above.
- The endianness of the target platform matches the endianness used to serialize the provided DFA.
- The slice given must have the same alignment as u32.
If any of the above are not true, then an error will be returned.
Panics
This routine will never panic for any input.
Example
This example shows how to serialize a DFA to raw bytes, deserialize it and then use it for searching.
use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};
let initial = DFA::new("foo[0-9]+")?;
let (bytes, _) = initial.to_bytes_native_endian();
let dfa: DFA<&[u32]> = DFA::from_bytes(&bytes)?.0;
let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
Example: dealing with alignment and padding
In the above example, we used the to_bytes_native_endian method to serialize a DFA, but we ignored part of its return value corresponding to padding added to the beginning of the serialized DFA. This is OK because deserialization will skip this initial padding. What matters is that the address immediately following the padding has an alignment that matches u32. That is, the following is an equivalent but alternative way to write the above example:
use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};
let initial = DFA::new("foo[0-9]+")?;
// Serialization returns the number of leading padding bytes added to
// the returned Vec<u8>.
let (bytes, pad) = initial.to_bytes_native_endian();
let dfa: DFA<&[u32]> = DFA::from_bytes(&bytes[pad..])?.0;
let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
This padding is necessary because Rust’s standard library does not expose any safe and robust way of creating a Vec<u8> with a guaranteed alignment other than 1. Now, in practice, the underlying allocator is likely to provide a Vec<u8> that meets our alignment requirements, which means pad is zero in practice most of the time.
The purpose of exposing the padding like this is flexibility for the caller. For example, if one wants to embed a serialized DFA into a compiled program, then it’s important to guarantee that it starts at a u32-aligned address. The simplest way to do this is to discard the padding bytes and set it up so that the serialized DFA itself begins at a properly aligned address. We can show this in two parts. The first part is serializing the DFA to a file:
use regex_automata::dfa::dense::DFA;
let dfa = DFA::new("foo[0-9]+")?;
let (bytes, pad) = dfa.to_bytes_big_endian();
// Write the contents of the DFA *without* the initial padding.
std::fs::write("foo.bigendian.dfa", &bytes[pad..])?;
// Do it again, but this time for little endian.
let (bytes, pad) = dfa.to_bytes_little_endian();
std::fs::write("foo.littleendian.dfa", &bytes[pad..])?;
And now the second part is embedding the DFA into the compiled program and deserializing it at runtime on first use. We use conditional compilation to choose the correct endianness.
use regex_automata::{
    dfa::{Automaton, dense::DFA},
    util::{lazy::Lazy, wire::AlignAs},
    HalfMatch, Input,
};
// This crate provides its own "lazy" type, kind of like
// lazy_static! or once_cell::sync::Lazy. But it works in no-alloc
// no-std environments and lets us write this using completely
// safe code.
static RE: Lazy<DFA<&'static [u32]>> = Lazy::new(|| {
    // This assignment is made possible (implicitly) via the
    // CoerceUnsized trait. This is what guarantees that our
    // bytes are stored in memory on a 4 byte boundary. You
    // *must* do this or something equivalent for correct
    // deserialization.
    static ALIGNED: &AlignAs<[u8], u32> = &AlignAs {
        _align: [],
        #[cfg(target_endian = "big")]
        bytes: *include_bytes!("foo.bigendian.dfa"),
        #[cfg(target_endian = "little")]
        bytes: *include_bytes!("foo.littleendian.dfa"),
    };
    let (dfa, _) = DFA::from_bytes(&ALIGNED.bytes)
        .expect("serialized DFA should be valid");
    dfa
});
let expected = Ok(Some(HalfMatch::must(0, 8)));
assert_eq!(expected, RE.try_search_fwd(&Input::new("foo12345")));
An alternative to util::lazy::Lazy is lazy_static or once_cell, which provide stronger guarantees (like the initialization function only being executed once). And once_cell in particular provides a more expressive API. But a Lazy value from this crate is likely just fine in most circumstances.
Note that regardless of which initialization method you use, you will still need to use the AlignAs trick above to force correct alignment, but this is safe to do and from_bytes will return an error if you get it wrong.
pub unsafe fn from_bytes_unchecked(slice: &'a [u8]) -> Result<(DFA<&'a [u32]>, usize), DeserializeError>
Deserialize a DFA with a specific state identifier representation in constant time by omitting the verification of the validity of the transition table and other data inside the DFA.
This is just like DFA::from_bytes, except it can potentially return a DFA that exhibits undefined behavior if its transition table contains invalid state identifiers.
This routine is useful if you need to deserialize a DFA cheaply and cannot afford the transition table validation performed by from_bytes.
Example
use regex_automata::{dfa::{Automaton, dense::DFA}, HalfMatch, Input};
let initial = DFA::new("foo[0-9]+")?;
let (bytes, _) = initial.to_bytes_native_endian();
// SAFETY: This is guaranteed to be safe since the bytes given come
// directly from a compatible serialization routine.
let dfa: DFA<&[u32]> = unsafe { DFA::from_bytes_unchecked(&bytes)?.0 };
let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
Trait Implementations
impl<T: AsRef<[u32]>> Automaton for DFA<T>
fn is_special_state(&self, id: StateID) -> bool
fn is_dead_state(&self, id: StateID) -> bool
fn is_quit_state(&self, id: StateID) -> bool
fn is_match_state(&self, id: StateID) -> bool
fn is_start_state(&self, id: StateID) -> bool
fn is_accel_state(&self, id: StateID) -> bool
fn next_state(&self, current: StateID, input: u8) -> StateID
unsafe fn next_state_unchecked(&self, current: StateID, byte: u8) -> StateID
fn next_eoi_state(&self, current: StateID) -> StateID
fn pattern_len(&self) -> usize
fn match_len(&self, id: StateID) -> usize
fn match_pattern(&self, id: StateID, match_index: usize) -> PatternID
fn has_empty(&self) -> bool
fn is_always_start_anchored(&self) -> bool
fn start_state_forward(&self, input: &Input<'_>) -> Result<StateID, MatchError>
fn start_state_reverse(&self, input: &Input<'_>) -> Result<StateID, MatchError>
fn universal_start_state(&self, mode: Anchored) -> Option<StateID>
fn accelerator(&self, id: StateID) -> &[u8]
fn get_prefilter(&self) -> Option<&Prefilter>
fn try_search_fwd(&self, input: &Input<'_>) -> Result<Option<HalfMatch>, MatchError>
fn try_search_rev(&self, input: &Input<'_>) -> Result<Option<HalfMatch>, MatchError>
fn try_search_overlapping_fwd(&self, input: &Input<'_>, state: &mut OverlappingState) -> Result<(), MatchError>
fn try_search_overlapping_rev(&self, input: &Input<'_>, state: &mut OverlappingState) -> Result<(), MatchError>
fn try_which_overlapping_matches(&self, input: &Input<'_>, patset: &mut PatternSet) -> Result<(), MatchError>
If multiple patterns match at the same position and the underlying DFA supports overlapping matches, then all matching patterns are written to the given set.