pub enum Inst {
    Match(usize),
    Save(InstSave),
    Split(InstSplit),
    EmptyLook(InstEmptyLook),
    Char(InstChar),
    Ranges(InstRanges),
    Bytes(InstBytes),
}
Inst is an instruction code in a Regex program.
Regrettably, a regex program either contains Unicode codepoint instructions (Char and Ranges) or it contains byte instructions (Bytes). A regex program can never contain both.
It would be worth investigating splitting this into two distinct types and then figuring out how to make the matching engines polymorphic over those types without sacrificing performance.
Besides moving invariants into the type system, another benefit is decreased size. If we remove the Char and Ranges instructions from the Inst enum, then its size shrinks from 32 bytes to 24 bytes. (This is because of the removal of a Box<[]> in the Ranges variant.) Given that byte-based machines are typically much bigger than their Unicode analogues (because they can decode UTF-8 directly), this ends up being a pretty significant savings.
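As a rough illustration of the size argument, the sketch below compares a simplified enum that carries both kinds of instruction with a hypothetical byte-only one, using `std::mem::size_of`. `FullInst`, `ByteInst` and `inst_sizes` are stand-ins invented for this example, not the crate's actual types, and the exact sizes depend on the target and compiler, so only the relative difference is meaningful here.

```rust
use std::mem::size_of;

// Simplified stand-in for an enum carrying both Unicode and byte instructions.
#[allow(dead_code)]
enum FullInst {
    Match(usize),
    Char { goto: usize, c: char },
    // The boxed slice is a fat pointer (data pointer plus length).
    Ranges { goto: usize, ranges: Box<[(char, char)]> },
    Bytes { goto: usize, start: u8, end: u8 },
}

// Hypothetical byte-only version of the same enum (an assumption, not a real type).
#[allow(dead_code)]
enum ByteInst {
    Match(usize),
    Bytes { goto: usize, start: u8, end: u8 },
}

/// Returns (size of FullInst, size of ByteInst) in bytes.
fn inst_sizes() -> (usize, usize) {
    (size_of::<FullInst>(), size_of::<ByteInst>())
}
```

Calling `inst_sizes()` and printing the pair shows how much each instruction shrinks once the boxed slice is gone; multiplied over a large byte-based program, that difference adds up.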
Variants
Match(usize)
Match indicates that the program has reached a match state.
The number in the match corresponds to the Nth logical regular expression in this program. This index is always 0 for normal regex programs. Values greater than 0 appear when compiling regex sets, and each match instruction gets its own unique value. The value corresponds to the Nth regex in the set.
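To make the role of the match index concrete, here is a minimal sketch of a hand-assembled program for the set {"a", "ab"}. The `Inst` enum below is a cut-down stand-in with inlined fields, and `explore`, `set_program` and `which_match` are hypothetical helpers written for this example; the real engines are considerably more involved.

```rust
// Cut-down stand-in for the real instruction type (fields inlined for brevity).
#[derive(Clone, Copy)]
enum Inst {
    Match(usize),
    Split { goto1: usize, goto2: usize },
    Char { c: char, goto: usize },
}

/// Follows every path from `pc`, recording the index of each `Match` reached.
fn explore(prog: &[Inst], input: &[char], pc: usize, at: usize, matched: &mut [bool]) {
    match prog[pc] {
        Inst::Match(i) => matched[i] = true,
        Inst::Split { goto1, goto2 } => {
            explore(prog, input, goto1, at, matched);
            explore(prog, input, goto2, at, matched);
        }
        Inst::Char { c, goto } => {
            if input.get(at) == Some(&c) {
                explore(prog, input, goto, at + 1, matched);
            }
        }
    }
}

/// Hand-assembled program for the set {"a", "ab"}, anchored at the start.
fn set_program() -> Vec<Inst> {
    vec![
        Inst::Split { goto1: 1, goto2: 3 },
        Inst::Char { c: 'a', goto: 2 },
        Inst::Match(0), // first regex in the set
        Inst::Char { c: 'a', goto: 4 },
        Inst::Char { c: 'b', goto: 5 },
        Inst::Match(1), // second regex in the set
    ]
}

/// Reports which regexes in the set match a prefix of `text`.
fn which_match(text: &str) -> Vec<bool> {
    let input: Vec<char> = text.chars().collect();
    let mut matched = vec![false; 2];
    explore(&set_program(), &input, 0, 0, &mut matched);
    matched
}
```

With this program, `which_match("ab")` reports both regexes as matching, while `which_match("ac")` reports only the first.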
Save(InstSave)
Save causes the program to save the current location of the input in the slot indicated by InstSave.
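The following sketch shows how Save instructions could record capture positions for the pattern `a(b)c`. It assumes a convention in which slots 0 and 1 hold the start and end of the overall match and slots 2 and 3 hold those of the first group; that layout, the simplified `Inst` enum, and the helpers `capture_program` and `run` are assumptions made for this example only. The executor is deliberately linear (no Split) and only matches at the start of the input.

```rust
// Cut-down stand-in for the real instruction type (fields inlined for brevity).
#[derive(Clone, Copy)]
enum Inst {
    Match(usize),
    Save { slot: usize, goto: usize },
    Char { c: char, goto: usize },
}

/// Hand-assembled program for `a(b)c`, using four capture slots.
fn capture_program() -> Vec<Inst> {
    vec![
        Inst::Save { slot: 0, goto: 1 }, // start of overall match
        Inst::Char { c: 'a', goto: 2 },
        Inst::Save { slot: 2, goto: 3 }, // start of group 1
        Inst::Char { c: 'b', goto: 4 },
        Inst::Save { slot: 3, goto: 5 }, // end of group 1
        Inst::Char { c: 'c', goto: 6 },
        Inst::Save { slot: 1, goto: 7 }, // end of overall match
        Inst::Match(0),
    ]
}

/// Runs a Split-free program from the start of `text`.
/// Returns the saved byte offsets on a match, or `None` otherwise.
fn run(prog: &[Inst], text: &str, nslots: usize) -> Option<Vec<Option<usize>>> {
    let mut slots = vec![None; nslots];
    let (mut pc, mut at) = (0, 0);
    loop {
        match prog[pc] {
            Inst::Match(_) => return Some(slots),
            Inst::Save { slot, goto } => {
                slots[slot] = Some(at); // record the current input position
                pc = goto;
            }
            Inst::Char { c, goto } => {
                if text[at..].chars().next() != Some(c) {
                    return None;
                }
                at += c.len_utf8();
                pc = goto;
            }
        }
    }
}
```

For the input `"abc"`, `run(&capture_program(), "abc", 4)` yields slots `[Some(0), Some(3), Some(1), Some(2)]`: the whole match spans bytes 0..3 and the group spans 1..2.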
Split(InstSplit)
Split causes the program to diverge to one of two paths in the program, preferring goto1 in InstSplit.
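The preference for goto1 is what distinguishes greedy from lazy repetition. The sketch below uses a tiny recursive backtracking executor over a simplified `Inst` enum to run `a*` and `a*?`; the enum layout and the helpers `star_a`, `run` and `match_end` are illustrative assumptions, not the crate's own matching engines.

```rust
// Cut-down stand-in for the real instruction type (fields inlined for brevity).
#[derive(Clone, Copy)]
enum Inst {
    Match(usize),
    Split { goto1: usize, goto2: usize },
    Char { c: char, goto: usize },
}

/// Builds `a*` (greedy) or `a*?` (lazy). Only the order of the two
/// Split targets differs between the two programs.
fn star_a(greedy: bool) -> Vec<Inst> {
    let split = if greedy {
        Inst::Split { goto1: 1, goto2: 2 } // prefer consuming another 'a'
    } else {
        Inst::Split { goto1: 2, goto2: 1 } // prefer stopping
    };
    vec![split, Inst::Char { c: 'a', goto: 0 }, Inst::Match(0)]
}

/// Backtracking executor: tries goto1 first and falls back to goto2.
/// Returns the input position (in chars) at which the first match ends.
fn run(prog: &[Inst], input: &[char], pc: usize, at: usize) -> Option<usize> {
    match prog[pc] {
        Inst::Match(_) => Some(at),
        Inst::Split { goto1, goto2 } => {
            run(prog, input, goto1, at).or_else(|| run(prog, input, goto2, at))
        }
        Inst::Char { c, goto } => {
            if input.get(at) == Some(&c) {
                run(prog, input, goto, at + 1)
            } else {
                None
            }
        }
    }
}

/// Convenience wrapper that matches from the start of `text`.
fn match_end(prog: &[Inst], text: &str) -> Option<usize> {
    let input: Vec<char> = text.chars().collect();
    run(prog, &input, 0, 0)
}
```

On the input `"aaab"`, the greedy program ends its match at position 3 and the lazy one at position 0, even though both contain the same three instructions.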
EmptyLook(InstEmptyLook)
EmptyLook represents a zero-width assertion in a regex program. A zero-width assertion does not consume any of the input text.
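A zero-width assertion can be thought of as a predicate on the current position that never advances it. The sketch below checks four common assertions over a byte slice; the `Look` enum and `look_holds` are hypothetical names chosen for this example, and the set of assertions shown is not meant to be exhaustive.

```rust
// Hypothetical set of assertion kinds used for this sketch.
#[derive(Clone, Copy)]
enum Look {
    StartText,
    EndText,
    StartLine,
    EndLine,
}

/// Tests whether the assertion holds at byte offset `at` (0 <= at <= text.len()).
/// The position itself is never changed: no input is consumed.
fn look_holds(look: Look, text: &[u8], at: usize) -> bool {
    match look {
        Look::StartText => at == 0,
        Look::EndText => at == text.len(),
        Look::StartLine => at == 0 || text[at - 1] == b'\n',
        Look::EndLine => at == text.len() || text[at] == b'\n',
    }
}
```

For the text `b"ab\ncd"`, `Look::StartLine` holds at offsets 0 and 3, and `Look::EndLine` holds at offsets 2 and 5.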
Char(InstChar)
Char requires the regex program to match the character in InstChar at the current position in the input.
Ranges(InstRanges)
Ranges requires the regex program to match the character at the current position in the input with one of the ranges specified in InstRanges.
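One way to test a character against a list of ranges is a binary search, sketched below. This assumes the ranges are inclusive, sorted and non-overlapping; that assumption and the function name `ranges_match` are introduced for this example rather than taken from the documentation above.

```rust
use std::cmp::Ordering;

/// Returns true if `c` falls inside one of the inclusive ranges.
/// Assumes `ranges` is sorted by start and contains no overlaps.
fn ranges_match(ranges: &[(char, char)], c: char) -> bool {
    ranges
        .binary_search_by(|&(lo, hi)| {
            if hi < c {
                Ordering::Less // this range lies entirely before `c`
            } else if lo > c {
                Ordering::Greater // this range lies entirely after `c`
            } else {
                Ordering::Equal // `c` is inside this range
            }
        })
        .is_ok()
}
```

With the ranges `[('0', '9'), ('a', 'z')]`, the characters `'5'` and `'q'` match and `'A'` does not.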
Bytes(InstBytes)
Bytes is like Ranges, except it expresses a single byte range. It is used in conjunction with Split instructions to implement multi-byte character classes.
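As an illustration of how Bytes and Split can combine, the sketch below hand-assembles a program for a class covering `a` to `z` plus U+00E0 to U+00FF. The second part is encoded in UTF-8 as the lead byte 0xC3 followed by a continuation byte in 0xA0..=0xBF, so it needs two Bytes instructions in sequence, with a Split choosing between the one-byte and two-byte alternatives. The simplified `Inst` enum and the helpers `class_program`, `run` and `match_len` are assumptions made for this example.

```rust
// Cut-down stand-in for the real instruction type (fields inlined for brevity).
#[derive(Clone, Copy)]
enum Inst {
    Match(usize),
    Split { goto1: usize, goto2: usize },
    Bytes { start: u8, end: u8, goto: usize },
}

/// Program for the class [a-z\u{E0}-\u{FF}] expressed over UTF-8 bytes.
fn class_program() -> Vec<Inst> {
    vec![
        Inst::Split { goto1: 1, goto2: 2 },
        Inst::Bytes { start: b'a', end: b'z', goto: 4 },  // one-byte alternative
        Inst::Bytes { start: 0xC3, end: 0xC3, goto: 3 },  // lead byte
        Inst::Bytes { start: 0xA0, end: 0xBF, goto: 4 },  // continuation byte
        Inst::Match(0),
    ]
}

/// Backtracking executor over raw bytes; returns the number of bytes matched.
fn run(prog: &[Inst], input: &[u8], pc: usize, at: usize) -> Option<usize> {
    match prog[pc] {
        Inst::Match(_) => Some(at),
        Inst::Split { goto1, goto2 } => {
            run(prog, input, goto1, at).or_else(|| run(prog, input, goto2, at))
        }
        Inst::Bytes { start, end, goto } => match input.get(at) {
            Some(&b) if start <= b && b <= end => run(prog, input, goto, at + 1),
            _ => None,
        },
    }
}

/// Matches the class at the start of `text`.
fn match_len(text: &str) -> Option<usize> {
    run(&class_program(), text.as_bytes(), 0, 0)
}
```

Here `match_len("q")` consumes one byte and `match_len("é")` consumes two, while `match_len("ā")` (U+0101, encoded as 0xC4 0x81) does not match.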