§Enumerated Latin
Enumerated Latin is a crate that maps strings made of the 26 letters a to z or A to Z (case insensitive) to integers by treating the text as a base-26 number plus an end marker. Within each string length the resulting values form a contiguous range.
Example:
use enumerated_latin::EnumeratedLatinEncode;
use enumerated_latin::EnumeratedLatinDecode;
let encoded: u64 = "Example".enumerated_latin_encode().unwrap();
assert_eq!(encoded, 9540966270);
let decoded_again = encoded.enumerated_latin_decode_lowercase().unwrap();
assert_eq!(decoded_again, "example".to_string());
§Intended use
The intended use is to generate numeric identifiers for short pieces of text while still allowing comparisons against ranges in fixed-length scenarios.
This arises, for example, when working with ISO codes for languages, scripts, countries, etc. Preserving the order within the same length helps with efficiently checking against private-use and similar ranges.
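As an illustration of such a range check, the following sketch tests whether a language code falls into the `qaa` to `qtz` range that ISO 639-2 reserves for local use. It is only a sketch under assumptions: `sketch_encode`, `is_private_use_language` and the two constants are hypothetical names made up for this example, and `sketch_encode` is a standalone stand-in for the crate's encoder so that the example runs without any dependency.

```rust
// Standalone sketch: `sketch_encode` is a hypothetical stand-in for the crate's
// encoder, not the crate's own implementation.
fn sketch_encode(text: &str) -> Option<u32> {
    // Start at 1 (the leading `b`), then append one base-26 digit per letter.
    text.bytes().try_fold(1u32, |acc, byte| {
        if !byte.is_ascii_alphabetic() {
            return None; // only a-z and A-Z are valid input
        }
        let digit = u32::from(byte.to_ascii_lowercase() - b'a');
        acc.checked_mul(26)?.checked_add(digit)
    })
}

// Encoded bounds of the ISO 639-2 local-use range `qaa`..=`qtz`.
const PRIVATE_USE_FIRST: u32 = 28_392; // "qaa"
const PRIVATE_USE_LAST: u32 = 28_911; // "qtz"

/// True if `code` encodes into the local-use range. No separate length check is
/// needed because strings of other lengths occupy disjoint numeric ranges.
fn is_private_use_language(code: &str) -> bool {
    match sketch_encode(code) {
        Some(value) => (PRIVATE_USE_FIRST..=PRIVATE_USE_LAST).contains(&value),
        None => false,
    }
}
```

Once the bounds are known, the check is a single integer range comparison, which is the property the order-preserving encoding is meant to provide.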
The intended area of use is the backend of applications, where the difference between a string and a number actually matters.
For frontends it is recommended to prefer readability over performance whenever possible.
§How the encoding works
In short: the string is prefixed with a `b` and then parsed as a most-significant-digit-first base-26 number (the same digit order as everyday numbers), where `a` maps to 0 and `z` to 25.
Example: az would be encoded as baz: (26^2)*1 + (26^1)*0 + (26^0)*25 = 701
use enumerated_latin::EnumeratedLatinEncode;
assert_eq!("az".enumerated_latin_encode(), Ok(701u16));
The `b` at the start is needed because `a` maps to zero: leading `a`s act like leading 0s in everyday base-10 numbers, so the numeric value alone cannot tell how many of them were present. The leading `b` ensures that the original length can always be deduced from the numeric value.
The everyday base-10 equivalent of prepending the `b` would be prepending a 1, i.e. 000 becomes 1000 and 00 becomes 100.
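The scheme described above can be sketched in a few lines of plain Rust. This is a minimal illustration under assumptions, not the crate's actual implementation: `sketch_encode` and `sketch_decode` are hypothetical helper names, and the target type is fixed to `u64` for simplicity.

```rust
// Standalone sketch of the scheme; not the crate's own code.

/// Encodes ASCII letters (case insensitive) as a base-26 number
/// with an implicit leading `b` (digit 1).
fn sketch_encode(text: &str) -> Option<u64> {
    let mut value: u64 = 1; // the implicit leading `b`
    for byte in text.bytes() {
        if !byte.is_ascii_alphabetic() {
            return None; // only a-z and A-Z are valid input
        }
        let digit = u64::from(byte.to_ascii_lowercase() - b'a');
        // Checked arithmetic turns overflow into `None` instead of wrapping.
        value = value.checked_mul(26)?.checked_add(digit)?;
    }
    Some(value)
}

/// Reverses `sketch_encode`, producing lowercase letters.
fn sketch_decode(mut value: u64) -> Option<String> {
    let mut letters = Vec::new();
    // Peel off base-26 digits until only the leading marker remains.
    while value > 1 {
        letters.push(b'a' + (value % 26) as u8);
        value /= 26;
    }
    if value != 1 {
        return None; // no leading `b`: the value lies in a gap
    }
    letters.reverse();
    String::from_utf8(letters).ok()
}
```

With these helpers, `sketch_encode("az")` yields `Some(701)` and `sketch_decode(701)` yields `Some("az".to_string())`, matching the worked example above. A value such as 2, which has no valid leading marker, decodes to `None`.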
This results in the following facts about the encoding:
- An empty string encodes to `1`
- The first valid non-empty string is `a` with a value of `26`
- Within the same length, the encoded strings sort alphabetically
- Longer string means bigger number
- There is a gap in the encoding space between different length strings
- Assuming a length `l`, the first value is `26^l` and the last one is `((26^l)*2)-1`.
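The range formula and the gaps between lengths can be checked with a small arithmetic sketch. The function names `value_range` and `encoded_len` are hypothetical and only serve this illustration; nothing here relies on the crate itself.

```rust
/// First and last encoded value for strings of `len` letters:
/// 26^len and 2 * 26^len - 1. Returns `None` if they do not fit in a u64.
fn value_range(len: u32) -> Option<(u64, u64)> {
    let first = 26u64.checked_pow(len)?;
    let last = first.checked_mul(2)?.checked_sub(1)?;
    Some((first, last))
}

/// Recovers the letter count from an encoded value, or `None` if the
/// value falls into a gap between two lengths (or is out of reach).
fn encoded_len(value: u64) -> Option<u32> {
    let mut len = 0;
    let mut first = 1u64; // 26^len
    loop {
        let last = first.checked_mul(2)? - 1;
        if value < first {
            return None; // between the previous range and this one
        }
        if value <= last {
            return Some(len);
        }
        first = first.checked_mul(26)?;
        len += 1;
    }
}
```

For example, `value_range(1)` gives `(26, 51)` and `value_range(2)` gives `(676, 1351)`, so the values 52 to 675 are unused and `encoded_len` returns `None` for them.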
§Encoding targets
Encoding each letter takes roughly 5 bits of information, plus one bit for the end cap (the leading `b`). You can use this to roughly estimate which data type you’ll need.
Valid encoding target types are:
| Type | Supported length (letters) |
|---|---|
| u8 | 1 |
| i16 | 2 |
| u16 | 3 |
| i32 | 6 |
| u32 | 6 |
| i64 | 13 |
| u64 | 13 |
| i128 | 26 |
| u128 | 26 |
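For a given maximum value, the longest string that fits can also be worked out from the rule that the last value for length `l` is `2 * 26^l - 1`. The following is only a sketch: `max_letters` is a hypothetical helper that does pure arithmetic, and the table above remains the reference for what the crate actually implements.

```rust
/// Largest letter count whose last value, 2 * 26^len - 1, does not exceed `max`.
/// Assumes `max >= 1`, since even the empty string encodes to 1.
fn max_letters(max: u128) -> u32 {
    let mut len = 0;
    let mut first: u128 = 1; // 26^len, the first value for the current length
    loop {
        // Last value of the next length; `None` if it overflows even u128.
        let next_last = first
            .checked_mul(26)
            .and_then(|next_first| next_first.checked_mul(2))
            .map(|doubled| doubled - 1);
        match next_last {
            Some(last) if last <= max => {
                first *= 26;
                len += 1;
            }
            _ => return len,
        }
    }
}
```

Calling it with `u128::from(u16::MAX)` returns 3 and with `u128::from(u64::MAX)` returns 13, in line with the corresponding table rows.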
§Licensing
enumerated_latin is licensed under LGPL-3.0-only and is REUSE 3.3 compliant.
When contributing, add yourself as a copyright holder to the files you modified.
Enums§
- EncodingError - Problem that can occur when using enumerated_latin_encode on invalid input.
Traits§
- EncodingTarget - Trait to implement for integer types that a string can be encoded into.
- EnumeratedLatinDecode - Trait for implementing the number to letters part of enumerated latin.
- EnumeratedLatinEncode - Trait for implementing the letters to number part of enumerated latin.