Enum Utf32Str

pub enum Utf32Str<'a> {
    Ascii(&'a [u8]),
    Unicode(&'a [char]),
}

A UTF-32 encoded (char array) string that is used as an input to (fuzzy) matching.

This is mostly intended as an internal string type, but some methods are exposed for convenience. We make the following API guarantees for Utf32Str(ing)s produced from a string using one of its From<T> constructors for string types T or from the Utf32Str::new method.

The Ascii variant contains a byte buffer which is guaranteed to be a valid string slice.
It is guaranteed that the string slice internal to the Ascii variant is identical to the original string.
The length of a Utf32Str(ing) is exactly the number of graphemes in the original string.

Since Utf32Str(ing) variants may be constructed directly, you must not make these assumptions when handling Utf32Str(ing)s of unknown origin.

§Caveats

Despite the name, this type is quite far from being a true string type. Here are some examples demonstrating this.

§String conversions are not round-trip

In the presence of a multi-codepoint grapheme (e.g. "u\u{0308}" which is u + COMBINING_DIAERESIS), the trailing codepoints are truncated.

assert_eq!(Utf32String::from("u\u{0308}").to_string(), "u");

§Indexing is done by grapheme

Indexing into a string is done by grapheme rather than by codepoint.

assert!(Utf32String::from("au\u{0308}").len() == 2);

§A `Unicode` variant may be produced by all-ASCII characters.

Since the Windows-style newline \r\n is ASCII-only but considered to be a single grapheme, strings containing \r\n will still result in a Unicode variant.

let s = Utf32String::from("\r\n");
assert!(!s.slice(..).is_ascii());
assert!(s.len() == 1);
assert!(s.slice(..).get(0) == '\n');

§Design rationale

Usually Rust’s UTF-8 encoded strings are great. However, since fuzzy matching operates on codepoints (ideally, it should operate on graphemes but that’s too much hassle to deal with), we want to quickly iterate over codepoints (up to 5 times) during matching.

Doing codepoint segmentation on the fly not only blows through the cache (lookup tables and I-cache) but also has nontrivial runtime compared to the matching itself. Furthermore, there are many extra optimizations available for ASCII-only text, but checking for ASCII on each match has too much overhead.

Of course, this comes at an extra memory cost, as we usually still need the UTF-8 encoded variant for rendering. In the (dominant) case of ASCII-only text we don’t require a copy. Furthermore, fuzzy matching is usually applied on the fly while the user is typing, so the same item is potentially matched many times (making the up-front cost more worth it). That means that it’s basically always worth it to pre-segment the string.

For use cases that only match (a lot of) strings once, it’s possible to keep a char buffer around that is filled with the presegmented chars.
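The buffer reuse described above could look roughly like the following. This is a standard-library-only sketch, not the crate's implementation: `Segmented`, `segment`, and `segmented_lengths` are hypothetical names, and for simplicity it splits on code points rather than graphemes.

```rust
/// Hypothetical stand-in for a borrowed, pre-segmented string.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Segmented<'a> {
    Ascii(&'a [u8]),
    Unicode(&'a [char]),
}

/// Borrow ASCII input directly; otherwise decode it into `buf`.
/// Assumption: one `char` per unit (no grapheme clustering).
fn segment<'a>(s: &'a str, buf: &'a mut Vec<char>) -> Segmented<'a> {
    if s.is_ascii() {
        // No copy needed: the UTF-8 bytes are already one byte per character.
        Segmented::Ascii(s.as_bytes())
    } else {
        // Reuse the caller's allocation instead of creating a new Vec.
        buf.clear();
        buf.extend(s.chars());
        Segmented::Unicode(buf.as_slice())
    }
}

/// Segment many strings once each while sharing a single buffer.
fn segmented_lengths(items: &[&str]) -> Vec<usize> {
    let mut buf = Vec::new(); // allocated once, reused for every item
    let mut lengths = Vec::with_capacity(items.len());
    for item in items {
        let len = match segment(item, &mut buf) {
            Segmented::Ascii(bytes) => bytes.len(),
            Segmented::Unicode(chars) => chars.len(),
        };
        lengths.push(len);
    }
    lengths
}
```

Because the returned value borrows the buffer, each segmented string has to be consumed before the next one is produced, which is the case for match-once workloads.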

Another advantage of this approach is that the matcher will naturally produce grapheme indices (instead of UTF-8 offsets) anyway. With a codepoint-based representation like this, the indices can be used directly.
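To illustrate the difference between the two kinds of index, here is a small standard-library sketch. The helper names `byte_offset` and `char_at` are hypothetical, and the sketch assumes one `char` per matched unit: with a UTF-8 string, a character index must first be translated into a byte offset by scanning, whereas a pre-segmented buffer can be indexed directly.

```rust
/// Translate a character index into a UTF-8 byte offset.
/// This walks the string, so it costs O(n) per lookup.
/// Returns `None` if `char_idx` is past the end of `s`.
fn byte_offset(s: &str, char_idx: usize) -> Option<usize> {
    s.char_indices().nth(char_idx).map(|(offset, _)| offset)
}

/// Look up the same index in a pre-segmented buffer in O(1).
fn char_at(chars: &[char], char_idx: usize) -> Option<char> {
    chars.get(char_idx).copied()
}
```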

Variants§

§

Ascii(&'a [u8])

A string represented as ASCII encoded bytes.

Correctness invariant: must only contain valid ASCII (<= 127)
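Since the variant can be constructed by hand, code that does so may want to check the invariant first. The following is a minimal sketch using a local mirror of the enum: `Segmented` and `checked_ascii` are hypothetical names and not part of this crate's API.

```rust
/// Hypothetical local mirror of the two variants, for illustration only.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Segmented<'a> {
    Ascii(&'a [u8]),
    Unicode(&'a [char]),
}

/// Build the ASCII variant only if every byte is valid ASCII (<= 127).
fn checked_ascii(bytes: &[u8]) -> Option<Segmented<'_>> {
    if bytes.is_ascii() {
        Some(Segmented::Ascii(bytes))
    } else {
        None
    }
}
```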

§

Unicode(&'a [char])

A string represented as an array of unicode codepoints (basically UTF-32).

Implementations§

impl<'a> Utf32Str<'a>

pub fn new(str: &'a str, buf: &'a mut Vec<char>) -> Utf32Str<'a>

pub fn len(self) -> usize

pub fn is_empty(self) -> bool

pub fn slice(self, range: impl RangeBounds<usize>) -> Utf32Str<'a>

pub fn slice_u32(self, range: impl RangeBounds<u32>) -> Utf32Str<'a>

pub fn is_ascii(self) -> bool

pub fn get(self, n: u32) -> char

pub fn chars(self) -> Chars<'a>

Trait Implementations§

impl<'a> Clone for Utf32Str<'a>

fn clone(&self) -> Utf32Str<'a>

fn clone_from(&mut self, source: &Self)

impl Debug for Utf32Str<'_>

fn fmt(&self, f: &mut Formatter<'_>) -> Result<(), Error>

impl Display for Utf32Str<'_>

fn fmt(&self, f: &mut Formatter<'_>) -> Result<(), Error>

impl<'a> Hash for Utf32Str<'a>

fn hash<__H>(&self, state: &mut __H)
where __H: Hasher,

fn hash_slice<H>(data: &[Self], state: &mut H)
where H: Hasher, Self: Sized,

impl<'a> Ord for Utf32Str<'a>

fn cmp(&self, other: &Utf32Str<'a>) -> Ordering

fn max(self, other: Self) -> Self
where Self: Sized,

fn min(self, other: Self) -> Self
where Self: Sized,

fn clamp(self, min: Self, max: Self) -> Self
where Self: Sized,

impl<'a> PartialEq for Utf32Str<'a>

fn eq(&self, other: &Utf32Str<'a>) -> bool

fn ne(&self, other: &Rhs) -> bool

impl<'a> PartialOrd for Utf32Str<'a>

fn partial_cmp(&self, other: &Utf32Str<'a>) -> Option<Ordering>

fn lt(&self, other: &Rhs) -> bool

fn le(&self, other: &Rhs) -> bool

fn gt(&self, other: &Rhs) -> bool

fn ge(&self, other: &Rhs) -> bool

impl<'a> Copy for Utf32Str<'a>

impl<'a> Eq for Utf32Str<'a>

impl<'a> StructuralPartialEq for Utf32Str<'a>

Auto Trait Implementations§

impl<'a> Freeze for Utf32Str<'a>

impl<'a> RefUnwindSafe for Utf32Str<'a>

impl<'a> Send for Utf32Str<'a>

impl<'a> Sync for Utf32Str<'a>

impl<'a> Unpin for Utf32Str<'a>

impl<'a> UnwindSafe for Utf32Str<'a>

Blanket Implementations§

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

impl<T> CloneToUninit for T
where T: Clone,

unsafe fn clone_to_uninit(&self, dest: *mut u8)

impl<T> From<T> for T

fn from(t: T) -> T

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

impl<T> IntoEither for T

fn into_either(self, into_left: bool) -> Either<Self, Self>

fn into_either_with<F>(self, into_left: F) -> Either<Self, Self>
where F: FnOnce(&Self) -> bool,

impl<T> Pointable for T

const ALIGN: usize

type Init = T

unsafe fn init(init: <T as Pointable>::Init) -> usize

unsafe fn deref<'a>(ptr: usize) -> &'a T

unsafe fn deref_mut<'a>(ptr: usize) -> &'a mut T

unsafe fn drop(ptr: usize)

impl<T> ToOwned for T
where T: Clone,

type Owned = T

fn to_owned(&self) -> T

impl<T> ToString for T
where T: Display + ?Sized,

impl<T, U> TryFrom<U> for T
where U: Into<T>,

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,