pub enum Utf32Str<'a> {
Ascii(&'a [u8]),
Unicode(&'a [char]),
}
A UTF32 encoded (char array) string that is used as an input to (fuzzy) matching.
Usually Rust’s UTF-8 encoded strings are great. However, fuzzy matching operates on codepoints (it should operate on graphemes, but that’s too much hassle to deal with). We want to iterate over these codepoints quickly and repeatedly (up to 5 times) during matching.
Doing codepoint segmentation on the fly not only blows through the cache (lookup tables and I-cache) but also has nontrivial runtime compared to the matching itself. Furthermore, there are a lot of extra optimizations available for ASCII-only text (but checking for ASCII during each match has too much overhead).
Of course, this comes at extra memory cost, as we usually still need the UTF-8 encoded variant for rendering. In the (dominant) case of ASCII-only text we don’t require a copy. Furthermore, fuzzy matching is usually applied on the fly while the user is typing, so the same item is potentially matched many times (making the upfront cost more worthwhile). That means that it’s basically always worth it to presegment the string.
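The trade-off described above can be sketched with the standard library alone. This is a minimal illustration, not the crate's implementation: `Presegmented` and `presegment` are hypothetical names that only mirror the shape of the enum declared above, and plain `str::chars` is assumed as the segmentation step.

```rust
/// Local stand-in that mirrors the shape of the enum above (hypothetical name).
#[derive(Debug, PartialEq)]
pub enum Presegmented<'a> {
    Ascii(&'a [u8]),
    Unicode(&'a [char]),
}

/// Borrows ASCII input as-is and decodes anything else into `buf` exactly once.
pub fn presegment<'a>(text: &'a str, buf: &'a mut Vec<char>) -> Presegmented<'a> {
    if text.is_ascii() {
        // Every ASCII byte is already one codepoint, so no copy is needed.
        Presegmented::Ascii(text.as_bytes())
    } else {
        // Pay the UTF-8 decoding cost up front; later passes index `buf` directly.
        buf.clear();
        buf.extend(text.chars());
        Presegmented::Unicode(buf.as_slice())
    }
}
```

In this sketch the ASCII check runs once per string rather than once per match, which is the point of choosing the variant up front.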
For use cases that only match (a lot of) strings once, it’s possible to keep a char buffer around that is filled with the presegmented chars.
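A rough sketch of that buffer-reuse pattern, using only the standard library: `count_containing` is a hypothetical helper, and `contains` merely stands in for a real matcher call.

```rust
/// Counts the haystacks that contain `needle`, reusing one scratch buffer for all of them.
pub fn count_containing(haystacks: &[&str], needle: char) -> usize {
    // Allocated once; its capacity is reused for every haystack.
    let mut buf: Vec<char> = Vec::new();
    let mut hits = 0;
    for haystack in haystacks {
        // `clear` drops the old contents but keeps the allocation.
        buf.clear();
        buf.extend(haystack.chars());
        // Stand-in for the actual matching step over the presegmented chars.
        if buf.contains(&needle) {
            hits += 1;
        }
    }
    hits
}
```

Because the buffer is only cleared between items, the allocation grows to the longest haystack seen and is not repeated afterwards.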
Another advantage of this approach is that the matcher will naturally produce char indices (instead of UTF-8 offsets) anyway. With a codepoint-based representation like this, the indices can be used directly.
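To make the difference between char indices and UTF-8 offsets concrete, here is a small standard-library sketch. `first_position` is a hypothetical helper that reports both values for the same character.

```rust
/// Returns `(char index, UTF-8 byte offset)` of the first occurrence of `needle`.
pub fn first_position(text: &str, needle: char) -> Option<(usize, usize)> {
    text.char_indices() // yields (byte offset, char)
        .enumerate() // adds the running char index
        .find(|&(_, (_, c))| c == needle)
        .map(|(char_idx, (byte_off, _))| (char_idx, byte_off))
}
```

For ASCII-only text the two numbers coincide. As soon as a multi-byte codepoint precedes the match they diverge, and only the char index can be used to index a `&[char]` directly.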
Variants§
Ascii(&'a [u8])
A string represented as ASCII encoded bytes. Correctness invariant: must only contain valid ASCII (<=127)
Unicode(&'a [char])
A string represented as an array of unicode codepoints (basically UTF-32).
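The correctness invariant on the Ascii variant can be checked with the standard library before such a value is built by hand. This is only a sketch, and `checked_ascii` is a hypothetical helper rather than part of this API.

```rust
/// Returns the bytes unchanged if every byte is valid ASCII (<= 127), otherwise `None`.
pub fn checked_ascii(bytes: &[u8]) -> Option<&[u8]> {
    if bytes.is_ascii() {
        Some(bytes)
    } else {
        None
    }
}
```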
Implementations§
impl<'a> Utf32Str<'a>
pub fn new(str: &'a str, buf: &'a mut Vec<char>) -> Self
Convenience method to construct a Utf32Str from a normal UTF-8 str.
pub fn slice(self, range: impl RangeBounds<usize>) -> Utf32Str<'a>
Creates a slice of this string that contains the characters in the specified character range.
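One way slicing by character range could work on an already segmented buffer is sketched below. `slice_chars` is a hypothetical free function over `&[char]`, not the method documented here. It shows how any `RangeBounds<usize>` can be resolved to concrete start and end positions.

```rust
use std::ops::{Bound, RangeBounds};

/// Resolves any `RangeBounds<usize>` against `chars` and returns the sub-slice.
/// Panics if the range is out of bounds, like ordinary slice indexing.
pub fn slice_chars<'a>(chars: &'a [char], range: impl RangeBounds<usize>) -> &'a [char] {
    let start = match range.start_bound() {
        Bound::Included(&s) => s,
        Bound::Excluded(&s) => s + 1,
        Bound::Unbounded => 0,
    };
    let end = match range.end_bound() {
        Bound::Included(&e) => e + 1,
        Bound::Excluded(&e) => e,
        Bound::Unbounded => chars.len(),
    };
    &chars[start..end]
}
```

Since each element is one codepoint, the range counts characters rather than bytes, so it can never split a character in the middle.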
pub fn slice_u32(self, range: impl RangeBounds<u32>) -> Utf32Str<'a>
Same as slice but accepts a u32 range for convenience, since those are the indices returned by the matcher.
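A u32 range can be widened to usize bounds before indexing, as in the sketch below. `widen` and `slice_chars_u32` are hypothetical helpers operating on a plain `&[char]`, and a target where usize is at least 32 bits wide is assumed.

```rust
use std::ops::{Bound, RangeBounds};

/// Widens one `u32` bound to `usize` (lossless where usize is at least 32 bits).
fn widen(bound: Bound<&u32>) -> Bound<usize> {
    match bound {
        Bound::Included(&i) => Bound::Included(i as usize),
        Bound::Excluded(&i) => Bound::Excluded(i as usize),
        Bound::Unbounded => Bound::Unbounded,
    }
}

/// Slices `chars` with a `u32` range by converting both bounds to `usize`.
pub fn slice_chars_u32<'a>(chars: &'a [char], range: impl RangeBounds<u32>) -> &'a [char] {
    // A pair of `Bound<usize>` values can index a slice directly.
    let bounds = (widen(range.start_bound()), widen(range.end_bound()));
    &chars[bounds]
}
```

This keeps call sites free of `as usize` casts when the positions arrive as u32 values.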