Crate id30

source ·
Expand description

Id30 is an encoding scheme for 30 bit identifiers that look like the following: bpv3uq, zvaec2, rfmbyz, jwygvk, 000000, zzzzzz. It is designed for use as opaque identifiers in URLs, that can be read and written comfortably, and is a good choice for user-facing IDs when you don’t need a key space of more than 30 bits, giving you just over 10⁹ different IDs. For comparison, YouTube surpassed this only in 2019, and is estimated to be hosting about 4*10⁹ videos in 2024. User-facing IDs should generally be chosen such that they do not reveal an underlying sequence, and, indeed, Id30 looks best for randomly generated IDs.

In machine readable form, Id30 is represented as a 32 bit integer, of which the two most significant bits are always zero. This makes u32 and i32 equally well suited.

This crate defines the type Id30 which implements this encoding scheme through various traits. To avoid introducing excessive dependencies, several of these trait implementations are opt-in via feature selection, see Features and Id30 for details.

The crate also provides a command line utility, installable with cargo (cargo install id30 --features=rand_std), for converting between integral and Id30 representations, and for generating new IDs on the command line.

§Id30 Encoding

The Id30 encoding is a case-insensitive base 32 encoding that handles some confusable characters to compensate for some common misreadings and mishearings. The canonical encoding alphabet is 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, a, b, c, d, e, f, g, h, j, k, m, n, p, q, r, t, u, v, w, x, y, z, i.e. the digits and then most, but not all, of the characters of the English alphabet. Id30 decoding handles case-insensitivity and confusable characters by mapping multiple input characters to the same values.

The confusable characters handled by Id30 are:

  • 0 and o/O, for visual similarity
  • 1, i/I and l/L, for visual similarity
  • s/S and f/F, for verbal similarity

§Encoding

The algorithm for encoding a 30 bit integer i is:

  1. Group the integer in blocks of five bits, i.e. let blocks = [(i >> 25) & 0x1f, (i >> 20) & 0x1f, (i >> 15) & 0x1f, (i >> 10) & 0x1f, (i >> 5) & 0x1f, i & 0x1f]
  2. Starting with the most significant group of bits, map each group according to the canonical encoding alphabet, i.e. let id30: String = blocks.iter().map(|x| ENCODING[x]).collect()

This yields a string of six characters. An Id30 string is always six characters long, it is not valid to strip leading zeros as is normally done for regular base 10 numbers.

This encoding scheme ensures that the numerical ordering of the 30 bit integers is the same as the lexicographical ordering of the encoded strings.

§Decoding

Decoding is more involved because of error handling and resolution of alternative encodings.

The algorithm for decoding an Id30 string to a 30 bit integer is:

  1. The given string is six characters long. Map each character according to the reverse mapping of the encoding. Additionally, map upper-case letters like their lower-case equivalents, and map the following confusable characters:

    • o, O → 0 (like 0)
    • i, I, l, L → 1 (like 1)
    • s, S → 15 (like f)

    Characters that match the canonical encoding are mapped directly to the corresponding value, eg '1' → 1. For alternative encodings, either upper case or one of the confusables, add an additional bit (1 << 5 == 0x20) to signify that the input was non-canonical, eg 'I' → 1 + 0x20. Map all other characters to a sentinel value (1 << 6 == 0x40) to signify decoding error, eg '!' → 0x40.

    This can be encoded in a lookup table, so, given the input id30: [u8; 6], we can implement this step as

    let decoded: Vec<_> = id30.iter().map(|&c| DECODING[c as usize]).collect();
  2. If any of the mapped values are the error sentinel, terminate decoding and yield an error result.

  3. If any of the mapped values have the alternative encoding bit set, let is_canonical = false. Otherwise, is_canonical = true.

  4. Compose the identifier value from the low five bits of all the mapped values:

    let masked: Vec<_> = decoded.iter().map(|x| x & 0x1f).collect();
    let value = (masked[0] << 25) + (masked[1] << 20) + (masked[2] << 15) +
        (masked[3] << 10) + (masked[4] << 5) + masked[5];

    Yield the composed value along with is_canonical.

For use in URLs, a request to a non-canonical Id30 should be redirected to the canonical encoding of the same Id30, to avoid exposing the same resource at multiple different URLs. This is valuable for caching purposes and search engine ranking. The redirection logic only needs to consider is_canonical and does not need any costly operations such as querying a database, since redirection can be done regardless of whether or not the target URL resolves to anything sensible.

§Features

This crate uses features for selection of integrations with other crates. Keeping the feature selection minimal serves to minimize the dependency tree.

To be able to support different major versions of third party crates, the exposed features are named for the dependency version. For example, to integrate with the rand crate in version 0.8.z, use the feature rand08. This scheme allows future versions of id30 to offer integrations with multiple versions of rand simultaneously.

As a convenience, there are unversioned aliases of each integration that point to the latest supported major version of the given crate. For example, using the feature rand enables integration with rand at the currently supported latest version. id30 does not consider it a breaking change to update these aliases to point to newer versions, so you may want to consider using the versioned feature names instead.

There is one default feature, rand.

The available integration features are:

  • rand08 (alias rand), for integration with rand 0.8.z
  • serde1 (alias serde), for integration with serde 1.y.z
  • diesel2 (alias diesel), for integration with diesel 2.y.z

See Id30 for details about each integration.

Structs§

  • An implementation of the Id30 encoding scheme as documented at the crate root.
  • Id30Parse represents the successful result of parsing an id30 string:
  • The given value was out of range for an Id30. The valid range is [0, 1 << 30).

Enums§