Crate umbra_slice

Source
Expand description

An optimized owned slice type styled after “Umbra strings.”

This is similar to a Box<[T]> for T = u8 or T = u16 with a small-slice optimization and further optimizations for Eq and prefix comparisons.

Some background on the memory layout of &[T]/Box<[T]> is important. References to slices are “fat pointers.” They’re wider than a straight-up pointer to a memory location (hence “fat”) because they carry metadata along with the pointer. Imagine a tuple (*const T, Metadata). For slices this metadata is the slice’s length - so you can say my_slice.len() rather than passing around the length for every function call where you pass a slice. A pointer is of size usize (by definition) and the length is also represented as a usize. On a 64-bit machine (where size_of::<usize>() == 8 - 8 bytes / 64 bits) this means that your slice fat pointer takes 16 bytes to represent.

usize is rather large to represent the size of a slice though. Your typical slice is very likely to be much much smaller. Think of all of the bits, (no, bytes!) of the length usize you waste per slice.

“Umbra strings” or “German strings” or “German-style strings” (all the same thing) aim to take advantage of the common cases of strings:

  • Strings are usually short.
  • Strings are usually immutable.
  • You usually care only about a small part of the string (prefix or equality).

We’ll take advantage of these common cases by (ab)using these wasted bytes. To do this we steal bytes from the length usize and also introduce a small-string (“small-slice”) optimization (SSO).

The first change is to represent the length with a u16. That’s 6 bytes saved for future use, with the restriction that the slice is limited to 2^16 (65,536) elements - more than enough for our use-case. We’ll use these six bytes to store a prefix of the slice instead and we can use that prefix for fast equality checking between fellow Umbra slices.

We have 6 bytes of the slice already stored in the prefix, so for the common case of short strings let’s just store the rest of the string in the struct and forget about the pointer. So for, say, a UTF-8 encoded ASCII string with 14 or fewer characters we don’t need pointer indirection or allocation at all. The string contents are stored entirely in the struct itself.

+---------+---------------------------+-------------------------------------+
+   len   +          prefix                          suffix                 +
+---------+---------------------------+-------------------------------------+
    u16                            14x u8

For the rare longer slice we’ll allocate the slice as a “thin” pointer. We already store the length earlier in the struct so there’s no need to use a fat pointer.

+---------+---------------------------+-------------------------------------+
+   len   +          prefix           +              pointer                +
+---------+---------------------------+-------------------------------------+
    u16               6x u8                          8x u8

Note that the pointer here points to an allocation for the entire slice rather than just the portion not covered by the prefix - this is important so that we can reconstruct the slice type in contiguous memory (i.e. create a &str from a UmbraSlice<u8>) without reallocation. This means that this slice type duplicates the prefix Ts for larger slices. On the other hand you avoid allocating any Ts though for the small variant, so when this type is usually short it should save a noticeable amount of memory.

Also note that this type is cheaper to compare to itself (Eq) but a UmbraSlice<u8> is slightly slower to compare to a &[u8] because the len must be read and the slice must be reconstructed before comparison. (This requires an extra comparison operation.)

For further reading: https://the-mikedavis.github.io/posts/german-string-optimizations-in-spellbook/.

Structs§

TooLongError
An error used in the TryFrom implementation for UmbraSlice which rejects input slices which are too long.
UmbraSlice
An owned slice type with “German string” optimizations.
UmbraString
A UTF-8 string representation, like String but backed by an Umbra slice.

Functions§

prefix_len
Computes the number of Ts that can be stored in the prefix of the Umbra slice.
suffix_len
Computes the number of Ts that can be stored in the suffix of the Umbra slice inline.