A variable-byte encoding of gaps, used to represent a string.
This crate is mainly intended for large non-English text whose characters need two or more bytes each in UTF-8. It encodes a string by iterating over its characters, converting each one to a u32, and calculating the distance from that u32 to the previous character's value. This distance is called a "gap". Each gap is then compressed with a variable-byte encoding scheme.
It assumes that text usually comes in clusters, where many contiguous characters have code points close to each other. In such cases, a character is likely to take only a single byte, with one extra bit for a sign flag. See README.md for the reasoning behind this.
To recover a character, decoding must iterate from the very first character. This is similar to typical UTF-derived encodings, where each char may occupy a different number of bytes.
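The gap idea described above can be sketched in a few lines. This is an illustrative sketch only, not the crate's implementation: the function names `to_gaps` and `from_gaps` are hypothetical, and the gaps are kept as plain i64 values without the variable-byte compression step.

```rust
use std::convert::TryFrom;

/// Hypothetical helper: compute the gap sequence of `text`.
/// The first value is the code point of the first character; every later
/// value is the signed distance from the previous character's code point.
fn to_gaps(text: &str) -> Vec<i64> {
    let mut prev: i64 = 0;
    text.chars()
        .map(|c| {
            let code = i64::from(u32::from(c));
            let gap = code - prev;
            prev = code;
            gap
        })
        .collect()
}

/// Hypothetical helper: rebuild the string by accumulating gaps, starting
/// from the very first one. Returns `None` if a running sum is not a valid
/// Unicode scalar value.
fn from_gaps(gaps: &[i64]) -> Option<String> {
    let mut code: i64 = 0;
    gaps.iter()
        .map(|&gap| {
            code += gap;
            u32::try_from(code).ok().and_then(char::from_u32)
        })
        .collect()
}
```

For the Thai string "กขค" (U+0E01, U+0E02, U+0E04), `to_gaps` yields 3585, 1, 2: only the first value is large, and the rest are small enough to fit in a single byte each. Because every value after the first depends on the running sum, `from_gaps` cannot jump to an arbitrary character without walking from the start.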
To serialize the encoded string, the feature flag serialize must be enabled. For example, in Cargo.toml:
var_byte_str = { version = "*", features = ["serialize"], default-features = false }
Structs
- Chars: An iterator that returns a char using an unsafe cast.
- Gaps: An iterator that returns the gap of each character as an i64 offset. The first returned value can be cast to char. Each subsequent value must be added to the preceding decoded value, starting from the first, to obtain an i64 that can be cast to char.
- GapsBytes: An iterator that returns each gap as a copy of its variable-byte encoding, along with a sign boolean. Each iteration returns a tuple of bool and SmallVec<[u8; 5]>. The bool is true when the bytes it accompanies should be converted into a negative value. The SmallVec<[u8; 5]> contains the absolute value of the difference in least-significant-byte-first order.
- VarByteString: The core struct that represents the variable-byte encoding of the gaps of a string.
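
To make the (sign, bytes) pair concrete, here is a minimal sketch of splitting one gap into a sign flag and the variable-byte encoding of its absolute value. It rests on several assumptions: the function names `gap_to_bytes` and `bytes_to_gap` are hypothetical, the byte layout (7 payload bits per byte, high bit set when more bytes follow) is a common variable-byte convention rather than the crate's confirmed format, and a plain Vec<u8> stands in for SmallVec<[u8; 5]> so the example needs no dependencies.

```rust
/// Hypothetical helper: split `gap` into a sign flag and the variable-byte
/// encoding of its absolute value, least significant 7-bit group first.
/// Assumed layout: the high bit of a byte is set when more bytes follow.
fn gap_to_bytes(gap: i64) -> (bool, Vec<u8>) {
    let negative = gap < 0;
    let mut rest = gap.unsigned_abs();
    let mut bytes = Vec::new();
    loop {
        let low = (rest & 0x7F) as u8;
        rest >>= 7;
        if rest == 0 {
            // Last group: leave the continuation bit clear.
            bytes.push(low);
            return (negative, bytes);
        }
        bytes.push(low | 0x80);
    }
}

/// Hypothetical helper: inverse of `gap_to_bytes` under the same assumed
/// layout. Intended for short inputs such as the 1 to 5 bytes of one gap.
fn bytes_to_gap(negative: bool, bytes: &[u8]) -> i64 {
    let mut magnitude: i64 = 0;
    for (i, &byte) in bytes.iter().enumerate() {
        magnitude |= i64::from(byte & 0x7F) << (7 * i);
    }
    if negative {
        -magnitude
    } else {
        magnitude
    }
}
```

Under this layout a gap of -1 becomes `(true, [0x01])`, while a first-character value such as 3585 (U+0E01) needs two bytes, `[0x81, 0x1C]`. Any gap between two valid code points fits in three bytes here, which is comfortably within a five-byte inline buffer.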