Expand description
An optimized owned slice type styled after “Umbra strings.”
This is similar to a Box<[T]>
for T = u8
or T = u16
with a small-slice optimization and
further optimizations for Eq
and prefix comparisons.
Some background on the memory layout of &[T]
/Box<[T]>
is important. References to slices
are “fat pointers.” They’re wider than a straight-up pointer to a memory location (hence
“fat”) because they carry metadata along with the pointer. Imagine a tuple (*const T, Metadata)
. For slices this metadata is the slice’s length - so you can say my_slice.len()
rather than passing around the length for every function call where you pass a slice. A
pointer is of size usize
(by definition) and the length is also represented as a usize
. On
a 64-bit machine (where size_of::<usize>() == 8
- 8 bytes / 64 bits) this means that your
slice fat pointer takes 16 bytes to represent.
usize
is rather large to represent the size of a slice though. Your typical slice is very
likely to be much much smaller. Think of all of the bits, (no, bytes!) of the length usize
you waste per slice.
“Umbra strings” or “German strings” or “German-style strings” (all the same thing) aim to take advantage of the common cases of strings:
- Strings are usually short.
- Strings are usually immutable.
- You usually care only about a small part of the string (prefix or equality).
We’ll take advantage of these common cases by (ab)using these wasted bytes. To do this we
steal bytes from the length usize
and also introduce a small-string (“small-slice”)
optimization (SSO).
The first change is to represent the length with a u16
. That’s 6 bytes saved for future use,
with the restriction that the slice is limited to 2^16 (65,536) elements - more than enough
for our use-case. We’ll use these six bytes to store a prefix of the slice instead and we can
use that prefix for fast equality checking between fellow Umbra slices.
We have 6 bytes of the slice already stored in the prefix, so for the common case of short strings let’s just store the rest of the string in the struct and forget about the pointer. So for, say, a UTF-8 encoded ASCII string with 14 or fewer characters we don’t need pointer indirection or allocation at all. The string contents are stored entirely in the struct itself.
+---------+---------------------------+-------------------------------------+
+ len + prefix suffix +
+---------+---------------------------+-------------------------------------+
u16 14x u8
For the rare longer slice we’ll allocate the slice as a “thin” pointer. We already store the length earlier in the struct so there’s no need to use a fat pointer.
+---------+---------------------------+-------------------------------------+
+ len + prefix + pointer +
+---------+---------------------------+-------------------------------------+
u16 6x u8 8x u8
Note that the pointer
here points to an allocation for the entire slice rather than just the
portion not covered by the prefix
- this is important so that we can reconstruct the slice
type in contiguous memory (i.e. create a &str
from a UmbraSlice<u8>
) without reallocation.
This means that this slice type duplicates the prefix T
s for larger slices. On the other
hand you avoid allocating any T
s though for the small variant, so when this type is usually
short it should save a noticeable amount of memory.
Also note that this type is cheaper to compare to itself (Eq
) but a UmbraSlice<u8>
is
slightly slower to compare to a &[u8]
because the len
must be read and the slice must be
reconstructed before comparison. (This requires an extra comparison operation.)
For further reading: https://the-mikedavis.github.io/posts/german-string-optimizations-in-spellbook/.
Structs§
- TooLong
Error - An error used in the
TryFrom
implementation forUmbraSlice
which rejects input slices which are too long. - Umbra
Slice - An owned slice type with “German string” optimizations.
- Umbra
String - A UTF-8 string representation, like
String
but backed by an Umbra slice.
Functions§
- prefix_
len - Computes the number of
T
s that can be stored in the prefix of the Umbra slice. - suffix_
len - Computes the number of
T
s that can be stored in the suffix of the Umbra slice inline.