pub enum Utf32Str<'a> {
Ascii(&'a [u8]),
Unicode(&'a [char]),
}
A UTF-32 encoded (char array) string that is used as an input to (fuzzy) matching.
This is mostly intended as an internal string type, but some methods are exposed for
convenience. We make the following API guarantees for Utf32Str(ing)s produced from a string
using one of its From<T> constructors for string types T or from the
Utf32Str::new method.
- The Ascii variant contains a byte buffer which is guaranteed to be a valid string slice.
- It is guaranteed that the string slice internal to the Ascii variant is identical to the original string.
- The length of a Utf32Str(ing) is exactly the number of graphemes in the original string.
Since the variants of Utf32Str(ing)s may be constructed directly, you must not make these
assumptions when handling Utf32Str(ing)s of unknown origin.
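Because the variants are public, code that receives a value of unknown origin may want to check the ASCII invariant before treating the bytes as a string slice. The following is a minimal std-only sketch. `Utf32StrMirror` is a local stand-in with the same shape as the enum and `ascii_as_str` is a hypothetical helper; neither is part of the crate.

```rust
/// Local stand-in with the same shape as the documented enum (not the crate's type).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Utf32StrMirror<'a> {
    Ascii(&'a [u8]),
    Unicode(&'a [char]),
}

/// Hypothetical helper: returns the `Ascii` payload as `&str` only if the
/// bytes really are ASCII, so a hand-built variant with invalid bytes is rejected.
pub fn ascii_as_str<'a>(s: Utf32StrMirror<'a>) -> Option<&'a str> {
    match s {
        // ASCII bytes are always valid UTF-8, so `from_utf8` succeeds here.
        Utf32StrMirror::Ascii(bytes) if bytes.is_ascii() => std::str::from_utf8(bytes).ok(),
        // Non-ASCII bytes or the `Unicode` variant: no borrowed `&str` is available.
        _ => None,
    }
}
```

Values produced by the documented constructors would always pass this check; it only matters for variants assembled by hand.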
§Caveats
Despite the name, this type is quite far from being a true string type. Here are some examples demonstrating this.
§String conversions are not round-trip
In the presence of a multi-codepoint grapheme (e.g. "u\u{0308}" which is u + COMBINING_DIAERESIS), the trailing codepoints are truncated.
assert_eq!(Utf32String::from("u\u{0308}").to_string(), "u");
§Indexing is done by grapheme
Indexing into a string is done by grapheme rather than by codepoint.
assert!(Utf32String::from("au\u{0308}").len() == 2);
§A Unicode variant may be produced by all-ASCII characters.
Since the Windows-style newline \r\n is ASCII only but considered to be a single grapheme,
strings containing \r\n will still result in a Unicode variant.
let s = Utf32String::from("\r\n");
assert!(!s.slice(..).is_ascii());
assert!(s.len() == 1);
assert!(s.slice(..).get(0) == '\n');
§Design rationale
Usually Rust’s UTF-8 encoded strings are great. However, since fuzzy matching operates on codepoints (ideally, it should operate on graphemes but that’s too much hassle to deal with), we want to quickly iterate over codepoints (up to 5 times) during matching.
Doing codepoint segmentation on the fly not only blows through the cache (lookup tables and I-cache) but also has nontrivial runtime compared to the matching itself. Furthermore, there are many extra optimizations available for ASCII-only text, but checking for ASCII during each match has too much overhead.
Of course, this comes at extra memory cost as we usually still need the UTF-8 encoded variant for rendering. In the (dominant) case of ASCII-only text we don’t require a copy. Furthermore, fuzzy matching is usually applied on the fly while the user is typing, so the same item is potentially matched many times (making the up-front cost more worthwhile). That means that it’s basically always worth it to pre-segment the string.
For use cases that match (a lot of) strings only once, it’s possible to keep a char buffer around that is filled with the presegmented chars.
Another advantage of this approach is that the matcher will naturally produce grapheme indices (instead of UTF-8 offsets) anyway. With a codepoint-based representation like this, the indices can be used directly.
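The trade-off can be sketched with std only. The hypothetical `presegment` helper below fills a reusable `Vec<char>` once, after which a position reported by a matcher indexes the buffer directly. `byte_offset_of`, also hypothetical, shows the extra walk needed to turn the same position into a UTF-8 offset for rendering. Splitting by `char` is a simplification made for this sketch: it does not merge multi-codepoint graphemes the way the crate is documented to do.

```rust
/// Fill `buf` with the codepoints of `s`, reusing the buffer's allocation.
/// (Simplification: one entry per `char`, no grapheme merging.)
pub fn presegment<'a>(s: &str, buf: &'a mut Vec<char>) -> &'a [char] {
    buf.clear();
    buf.extend(s.chars());
    buf.as_slice()
}

/// Map a codepoint index back to a UTF-8 byte offset.
/// This walks the string from the start, which is the per-access cost that
/// pre-segmentation avoids during matching.
pub fn byte_offset_of(s: &str, char_idx: usize) -> Option<usize> {
    s.char_indices().nth(char_idx).map(|(offset, _)| offset)
}
```

Indexing the pre-segmented slice is constant time, while `byte_offset_of` is linear in the index, which is why the conversion is best left to the rendering step.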
Variants§
Ascii(&'a [u8])
A string represented as ASCII encoded bytes.
Correctness invariant: must only contain valid ASCII (<= 127)
Unicode(&'a [char])
A string represented as an array of unicode codepoints (basically UTF-32).
Implementations§
impl<'a> Utf32Str<'a>
pub fn new(str: &'a str, buf: &'a mut Vec<char>) -> Utf32Str<'a>
Convenience method to construct a Utf32Str from a normal UTF-8 str
pub fn slice(self, range: impl RangeBounds<usize>) -> Utf32Str<'a>
Creates a slice containing the characters in the specified character range.
pub fn slice_u32(self, range: impl RangeBounds<u32>) -> Utf32Str<'a>
Same as slice but accepts a u32 range for convenience since
those are the indices returned by the matcher.
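To illustrate what accepting `impl RangeBounds<u32>` involves, here is a std-only sketch that resolves an arbitrary `u32` range into `usize` bounds and slices a plain `&[char]`. The names `resolve_u32_range` and `slice_chars` are hypothetical, and this is not claimed to be how the crate implements `slice_u32`.

```rust
use std::ops::{Bound, RangeBounds};

/// Resolve any `u32` range against a buffer of length `len`,
/// returning half-open `(start, end)` bounds as `usize`.
pub fn resolve_u32_range(range: impl RangeBounds<u32>, len: usize) -> (usize, usize) {
    let start = match range.start_bound() {
        Bound::Included(&s) => s as usize,
        Bound::Excluded(&s) => s as usize + 1,
        Bound::Unbounded => 0,
    };
    let end = match range.end_bound() {
        Bound::Included(&e) => e as usize + 1,
        Bound::Excluded(&e) => e as usize,
        Bound::Unbounded => len,
    };
    (start, end)
}

/// Slice a pre-segmented buffer with `u32` indices.
/// Panics if the resolved range is out of bounds, like ordinary slice indexing.
pub fn slice_chars(chars: &[char], range: impl RangeBounds<u32>) -> &[char] {
    let (start, end) = resolve_u32_range(range, chars.len());
    &chars[start..end]
}
```

With this shape, a caller can pass `a..b`, `a..=b`, `..b`, `a..` or `..` without converting the matcher's `u32` indices first.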
pub fn is_ascii(self) -> bool
Returns whether this string only contains graphemes which are single ASCII chars.
This is almost equivalent to the string being ASCII, except with the additional requirement
that the string cannot contain a Windows-style newline \r\n which is treated as a single
grapheme.
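The condition described above can be written as a std-only predicate over the original `&str`. This is a sketch of the documented rule under a hypothetical function name, not the crate's implementation.

```rust
/// True when every grapheme of `s` is a single ASCII char:
/// the text is pure ASCII and contains no `\r\n` pair.
pub fn has_only_ascii_graphemes(s: &str) -> bool {
    s.is_ascii() && !s.contains("\r\n")
}
```

A lone `\r` or `\n` still passes the check; only the two-character `\r\n` sequence does not.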
Trait Implementations§
impl<'a> Ord for Utf32Str<'a>
fn max(self, other: Self) -> Self
where
    Self: Sized,
impl<'a> PartialOrd for Utf32Str<'a>
impl<'a> Copy for Utf32Str<'a>
impl<'a> Eq for Utf32Str<'a>
impl<'a> StructuralPartialEq for Utf32Str<'a>
Auto Trait Implementations§
impl<'a> Freeze for Utf32Str<'a>
impl<'a> RefUnwindSafe for Utf32Str<'a>
impl<'a> Send for Utf32Str<'a>
impl<'a> Sync for Utf32Str<'a>
impl<'a> Unpin for Utf32Str<'a>
impl<'a> UnwindSafe for Utf32Str<'a>
Blanket Implementations§
impl<T> BorrowMut<T> for T
where
    T: ?Sized,
fn borrow_mut(&mut self) -> &mut T
impl<T> CloneToUninit for T
where
    T: Clone,
impl<T> IntoEither for T
fn into_either(self, into_left: bool) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left is true. Converts self into a Right variant of Either<Self, Self> otherwise.
fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
Converts self into a Left variant of Either<Self, Self> if into_left(&self) returns true. Converts self into a Right variant of Either<Self, Self> otherwise.