ratproto-did 0.0.3

# Overview


To describe the encoding and decoding process and how individual bits are manipulated,
placeholder letters will be used as follows:

```text
MSB ____________________________________________________________________________________
| 01100100 | 01101001 | 01100100 | 00111010 | 01110000 | 01101100 | 01100011 | 00111010 | // did:plc:
| aaaaaaaa | bbbbbbbb | cccccccc | dddddddd | eeeeeeee | ffffffff | gggggggg | hhhhhhhh |
 _______________________________________________________________________________________
| iiiiiiii | jjjjjjjj | kkkkkkkk | llllllll | mmmmmmmm | nnnnnnnn | oooooooo | pppppppp |
| qqqqqqqq | rrrrrrrr | ssssssss | tttttttt | uuuuuuuu | vvvvvvvv | wwwwwwww | xxxxxxxx |
 ____________________________________________________________________________________ LSB
```

This represents the following DID: `did:plc:abcdefghijklmnopqrstuvwx`

This module implements optimizations via x86 SIMD, so registers store data in LSB order
(little-endian, least-significant-byte). In a byte array (or a string), identifiers are stored
in MSB order. Whenever data is transferred between raw memory and registers, this order is flipped.

The optimizations outlined below work with AVX2, which supports registers up to 256 bits large.
Because many operations treat these registers as two 128-bit halves, a divider line is added in visualizations.

Base32 encodes 5 bits of data as one character. The goal of the decoding and encoding steps is to convert
between a string of 8-bit characters and packed 5-bit values.

---

# Decoding


## AVX2


### Outline


1. Load 32 bytes into a 256-bit register
2. Validate `did:plc:` prefix
3. Validate base32
    - Create lane masks for lowercase alphabetic and 2..=7 numeric chars
    - If any bytes satisfy neither, an error can be returned.
4. Convert chars into 5-bit values
    - Use byte masks from 3a.
    - If any bytes were not valid base32, a garbage value is produced.
    - Every byte is now a 5-bit value (or any char was non-base32, so the result would be invalid anyway).
5. Shuffle and bit-shift bytes.
    - Bytes can be shuffled to re-use the 256bit register and give each value 16 bits of space
    - After the u8 values are extended to u16, bits can be shifted across byte boundaries
6. Shuffle bytes again and prepare them for packing
7. Combine 8-bit values with `or` operations
    - The first byte needs to be 0, this can be done during shuffling
    - The effective data now spans 16 bytes, or 128 bits
8. Write the 16-byte value as the result
    - If we ensure that the first byte of [`DidInner`] is the discriminant, the result can be transmuted. Beware of byte
      order!

### Notes


- base32 conversion is branchless, so an invalid value will produce garbage data during the process
    - This is detected during the validation process
    - If validation fails, the first byte is set to a non-zero value
- Some bytes need bits from 3 different base32 chars
    - This makes the packing process require three `or` operations
    - Possible optimization with a different way to shuffle & shift bits?
      Something with aligning the bits at a byte boundary, thus bringing two values together

### Steps


#### Goal


We start with the following data:

```text
MSB ____________________________________________________________________________________
| 01100100 | 01101001 | 01100100 | 00111010 | 01110000 | 01101100 | 01100011 | 00111010 | // did:plc:
| aaaaaaaa | bbbbbbbb | cccccccc | dddddddd | eeeeeeee | ffffffff | gggggggg | hhhhhhhh |
 _______________________________________________________________________________________
| iiiiiiii | jjjjjjjj | kkkkkkkk | llllllll | mmmmmmmm | nnnnnnnn | oooooooo | pppppppp |
| qqqqqqqq | rrrrrrrr | ssssssss | tttttttt | uuuuuuuu | vvvvvvvv | wwwwwwww | xxxxxxxx |
 ____________________________________________________________________________________ LSB
```

This is loaded into the SIMD register in reverse order:

```text
MSB ____________________________________________________________________________________
| xxxxxxxx | wwwwwwww | vvvvvvvv | uuuuuuuu | tttttttt | ssssssss | rrrrrrrr | qqqqqqqq |
| pppppppp | oooooooo | nnnnnnnn | mmmmmmmm | llllllll | kkkkkkkk | jjjjjjjj | iiiiiiii |
 _______________________________________________________________________________________
| hhhhhhhh | gggggggg | ffffffff | eeeeeeee | dddddddd | cccccccc | bbbbbbbb | aaaaaaaa |
| 00111010 | 01100011 | 01101100 | 01110000 | 00111010 | 01100100 | 01101001 | 01100100 |
 ____________________________________________________________________________________ LSB
```

The goal is to arrange the values as such:

```text
 _______________________________________________________________________________________
| wwwxxxxx | uvvvvvww | ttttuuuu | rrssssst | qqqqqrrr | oooppppp | mnnnnnoo | llllmmmm |
| jjkkkkkl | iiiiijjj | ggghhhhh | efffffgg | ddddeeee | bbcccccd | aaaaabbb | 00000000 |
 _______________________________________________________________________________________
```

#### Convert base32


First, base32 characters are converted to their 5-bit values.
This is done by subtracting either `0x61` (to place `a` at 0), or `0x18`) (to place `2` at 26).

Validation happens in parallel, and will set a validity bit at the end. If any of the characters are not base32,
the procedure will work with garbage values, but the end result will be discarded anyway.

Assuming the identifier _is_ valid base32, the result is the following.
The "did:plc:" bytes are removed, as they're no longer relevant.

```text
 _______________________________________________________________________________________
| ...xxxxx | ...wwwww | ...vvvvv | ...uuuuu | ...ttttt | ...sssss | ...rrrrr | ...qqqqq |
| ...ppppp | ...ooooo | ...nnnnn | ...mmmmm | ...lllll | ...kkkkk | ...jjjjj | ...iiiii |
 _______________________________________________________________________________________
| ...hhhhh | ...ggggg | ...fffff | ...eeeee | ...ddddd | ...ccccc | ...bbbbb | ...aaaaa |
| ........ | ........ | ........ | ........ | ........ | ........ | ........ | ........ |
 _______________________________________________________________________________________
```

#### Swizzling & permutation process


After each character is converted into a 5-bit value,
the bytes are swizzled and permuted to finally be OR-reduced.

During swizzling, bytes cannot cross the 128-bit boundary - permutation is required for that.
These values are part of the same 128-bit half:

- a..=h
- i..=x

Values need to be shifted as follows:

Within their register:

- h, p, x << 0
- c, k, s << 1
- f, n, v << 2
- a, i, q << 3

Crossing a byte boundary:

- e, m, u >> 1
- b, j, r >> 2
- g, o, w >> 3
- d, l, t >> 4

**Starting point** (as shown above)

```text
 _______________________________________________________________________________________
| ...xxxxx | ...wwwww | ...vvvvv | ...uuuuu | ...ttttt | ...sssss | ...rrrrr | ...qqqqq |
| ...ppppp | ...ooooo | ...nnnnn | ...mmmmm | ...lllll | ...kkkkk | ...jjjjj | ...iiiii |
 _______________________________________________________________________________________
| ...hhhhh | ...ggggg | ...fffff | ...eeeee | ...ddddd | ...ccccc | ...bbbbb | ...aaaaa |
| ........ | ........ | ........ | ........ | ........ | ........ | ........ | ........ |
 _______________________________________________________________________________________
```

**Permute**

We need some bytes in the lower half of the 256-bit register to allow for shuffling later.

```text
 _______________________________________________________________________________________
| ...xxxxx | ...wwwww | ...vvvvv | ...uuuuu | ...ttttt | ...sssss | ...rrrrr | ...qqqqq |
| ...ppppp | ...ooooo | ...nnnnn | ...mmmmm | ...lllll | ...kkkkk | ...jjjjj | ...iiiii |
 _______________________________________________________________________________________
| ...hhhhh | ...ggggg | ...fffff | ...eeeee | ...ddddd | ...ccccc | ...bbbbb | ...aaaaa |
| ...ppppp | ...ooooo | ...nnnnn | ...mmmmm | ...lllll | ...kkkkk | ...jjjjj | ...iiiii |
 _______________________________________________________________________________________
```

**Swizzle**

AVX2 does not support variable (vector) shifts for 16-bit chunks, only 32 or larger.

Vertically-adjacent values need to be shifted by the same amount, so they will have to be aligned
into the same 32-bit chunk. In addition, values will cross byte boundaries, odd and even columns
will be split into two separate registers.

```text
 _______________________________________________________________________________________
| ...xxxxx | ...wwwww | ...ppppp | ...ooooo | ...vvvvv | ...uuuuu | ...nnnnn | ...mmmmm |
| ...ttttt | ...sssss | ...lllll | ...kkkkk | ...rrrrr | ...qqqqq | ...jjjjj | ...iiiii |
 _______________________________________________________________________________________
| ...hhhhh | ...ggggg | ...ppppp | ...ooooo | ...fffff | ...eeeee | ...nnnnn | ...mmmmm |
| ...ddddd | ...ccccc | ...lllll | ...kkkkk | ...bbbbb | ...aaaaa | ...jjjjj | ...iiiii |
 _______________________________________________________________________________________
```

**Split**

This has to be done via masked blending.
`unpackhi`/`unpacklo` doesn't work here, because the split has to happen vertically.

```text
 _______________________________________________________________________________________
| ...xxxxx | ........ | ...ppppp | ........ | ...vvvvv | ........ | ...nnnnn | ........ |
| ...ttttt | ........ | ...lllll | ........ | ...rrrrr | ........ | ...jjjjj | ........ |
 _______________________________________________________________________________________
| ...hhhhh | ........ | ...ppppp | ........ | ...fffff | ........ | ...nnnnn | ........ |
| ...ddddd | ........ | ...lllll | ........ | ...bbbbb | ........ | ...jjjjj | ........ |
 _______________________________________________________________________________________
 
 _______________________________________________________________________________________
| ........ | ...wwwww | ........ | ...ooooo | ........ | ...uuuuu | ........ | ...mmmmm |
| ........ | ...sssss | ........ | ...kkkkk | ........ | ...qqqqq | ........ | ...iiiii |
 _______________________________________________________________________________________
| ........ | ...ggggg | ........ | ...ooooo | ........ | ...eeeee | ........ | ...mmmmm |
| ........ | ...ccccc | ........ | ...kkkkk | ........ | ...aaaaa | ........ | ...iiiii |
 _______________________________________________________________________________________
```

**Bit-shift**

```text
 _______________________________________________________________________________________
| ........ | ...xxxxx | ........ | ...ppppp | ........ | .vvvvv.. | ........ | .nnnnn.. |
| .......t | tttt.... | .......l | llll.... | .....rrr | rr...... | .....jjj | jj...... |
 _______________________________________________________________________________________
| ........ | ...hhhhh | ........ | ...ppppp | ........ | .fffff.. | ........ | .nnnnn.. |
| .......d | dddd.... | .......l | llll.... | .....bbb | bb...... | .....jjj | jj...... |
 _______________________________________________________________________________________

 _______________________________________________________________________________________
| ......ww | www..... | ......oo | ooo..... | ....uuuu | u....... | ....mmmm | m....... |
| ........ | ..sssss. | ........ | ..kkkkk. | ........ | qqqqq... | ........ | iiiii... |
 _______________________________________________________________________________________
| ......gg | ggg..... | ......oo | ooo..... | ....eeee | e....... | ....mmmm | m....... |
| ........ | ..ccccc. | ........ | ..kkkkk. | ........ | aaaaa... | ........ | iiiii... |
 _______________________________________________________________________________________
```

**Shuffle**

The values are now aligned correctly bit-wise, so now they need to be moved into the correct bytes.

```text
 _______________________________________________________________________________________
| ...xxxxx | .vvvvv.. | tttt.... | rr...... | .....rrr | ...ppppp | .nnnnn.. | llll.... |
| ........ | ........ | ........ | .......t | ........ | ........ | ........ | ........ |
 _______________________________________________________________________________________
| .......l | ........ | ........ | ........ | ........ | .......d | ........ | ........ |
| jj...... | .....jjj | ...hhhhh | .fffff.. | dddd.... | bb...... | .....bbb | 00000000 |
 _______________________________________________________________________________________
 
 _______________________________________________________________________________________
| www..... | ......ww | ....uuuu | ..sssss. | qqqqq... | ooo..... | ......oo | ....mmmm |
| ........ | u....... | ........ | ........ | ........ | ........ | m....... | 00000000 |
 _______________________________________________________________________________________
| ........ | ........ | ........ | e....... | ........ | ........ | ........ | ........ |
| ..kkkkk. | iiiii... | ggg..... | ......gg | ....eeee | ..ccccc. | aaaaa... | 00000000 |
 _______________________________________________________________________________________
```

**OR-reduce**

Both 256-bit registers are ORed together.

```text
 _______________________________________________________________________________________
| wwwxxxxx | .vvvvvww | ttttuuuu | rrsssss. | qqqqqrrr | oooppppp | .nnnnnoo | llllmmmm |
| ........ | u....... | ........ | .......t | ........ | ........ | m....... | 00000000 |
 _______________________________________________________________________________________
| .......l | ........ | ........ | e....... | ........ | .......d | ........ | ........ |
| jjkkkkk. | iiiiijjj | ggghhhhh | .fffffgg | ddddeeee | bbccccc. | aaaaabbb | 00000000 |
 _______________________________________________________________________________________
```

Because some bytes take bits from 3 different sources, two 128-bit values are prepared using permutation,
and finally ORed together again.

```text
 _______________________________________________________________________________________
| wwwxxxxx | .vvvvvww | ttttuuuu | rrsssss. | qqqqqrrr | oooppppp | .nnnnnoo | llllmmmm |
| jjkkkkk. | iiiiijjj | ggghhhhh | .fffffgg | ddddeeee | bbccccc. | aaaaabbb | 00000000 |
 _______________________________________________________________________________________
```

```text
 _______________________________________________________________________________________
| ........ | u....... | ........ | .......t | ........ | ........ | m....... | ........ |
| .......l | ........ | ........ | e....... | ........ | .......d | ........ | ........ |
 _______________________________________________________________________________________
```

```text
 _______________________________________________________________________________________
| wwwxxxxx | uvvvvvww | ttttuuuu | rrssssst | qqqqqrrr | oooppppp | mnnnnnoo | llllmmmm |
| jjkkkkkl | iiiiijjj | ggghhhhh | efffffgg | ddddeeee | bbcccccd | aaaaabbb | 00000000 |
 _______________________________________________________________________________________
```

#### Output


Finally, the 128-bit register is written into memory. If validation failed, the first byte is set to a non-zero value.
In this case, the rest of the data will contain garbage, but the `OptionDidPlc` wrapper will then discard the data.

---

## Without AVX


If AVX is not detected, the decoding process uses bit manipulation in a manner similar to the AVX2 method above.

All bytes are first validated (`did:plc:` prefix & base32 characters).

The identifier is parsed 8 characters at a time (yielding 40 bits each time).
The characters are converted into their 5-bit values based on bit `0x40`
(characters `a..=z` have this bit set, characters `2..=7` do not).

The values are then packed, either by bit-shifting, or using the BMI2 intrinsic `_pext_u64`, if detected.

---

# Encoding


This is a bit simpler, because we don't need to validate base32.
We still need to validate that the first byte (the discriminant) is 0,
but that is handled outside the encoder.

## AVX2


With AVX, we can make use of a little bit of vectorization again:

1. Load the 16-bit value of `Did` (128 bits), duplicate to a 256-bit register.
2. Shuffle alternating 5-bit values into two 256-bit registers
    - Some values are spread across 2 bytes
3. Bit-shift packed 32-bit integers to LSb-align the values
4. OR the two registers together
5. AND-mask the lower 5 bits of every byte
6. Convert values to their base32 chars

### Steps


#### Loading data


Data starts out in MSB order:

```text
MSB ____________________________________________________________________________________
| 00000000 | aaaaabbb | bbcccccd | ddddeeee | efffffgg | ggghhhhh | iiiiijjj | jjkkkkkl |
| llllmmmm | mnnnnnoo | oooppppp | qqqqqrrr | rrssssst | ttttuuuu | uvvvvvww | wwwxxxxx |
 ____________________________________________________________________________________ LSB
```

This is loaded into registers in little-endian order:

```text
MSB ____________________________________________________________________________________
| wwwxxxxx | uvvvvvww | ttttuuuu | rrssssst | qqqqqrrr | oooppppp | mnnnnnoo | llllmmmm |
| jjkkkkkl | iiiiijjj | ggghhhhh | efffffgg | ddddeeee | bbcccccd | aaaaabbb | 00000000 |
 ____________________________________________________________________________________ LSB
```

These 128 bits are broadcast into a 256-bit register.

This is the register layout we want to reach before converting values to base32:

```text
MSB ____________________________________________________________________________________
| ...xxxxx | ...wwwww | ...vvvvv | ...uuuuu | ...ttttt | ...sssss | ...rrrrr | ...qqqqq |
| ...ppppp | ...ooooo | ...nnnnn | ...mmmmm | ...lllll | ...kkkkk | ...jjjjj | ...iiiii |
 _______________________________________________________________________________________
| ...hhhhh | ...ggggg | ...fffff | ...eeeee | ...ddddd | ...ccccc | ...bbbbb | ...aaaaa |
| 00111010 | 01100011 | 01101100 | 01110000 | 00111010 | 01100100 | 01101001 | 01100100 |
 ____________________________________________________________________________________ LSB
```

Odd and even values are swizzled in order to be bit-shifted and isolated.
Keep in mind that AVX2 only supports bit-shifting for 32-bit integers (not 16!),
and swizzling cannot happen across byte boundaries.
We initially have all values available in both halves of the 256-bit register,
but we have to prepare and isolate the correct values in each half to avoid having to permute.

One half of the values is processed as follows:

#### Swizzle & prepare


```text
MSB ____________________________________________________________________________________
| ........ | wwwxxxxx | ........ | oooppppp | rrssssst | ttttuuuu | jjkkkkkl | llllmmmm |
| ........ | uvvvvvww | ........ | mnnnnnoo | qqqqqrrr | rrssssst | iiiiijjj | jjkkkkkl |
 _______________________________________________________________________________________
| ........ | ggghhhhh | ........ | ........ | bbcccccd | ddddeeee | ........ | ........ |
| ........ | efffffgg | ........ | ........ | aaaaabbb | bbcccccd | ........ | ........ |
 ____________________________________________________________________________________ LSB
```

Desired value bits isolated for clarity _(the masking itself happens later)_

```text
MSB ____________________________________________________________________________________
| ........ | ...xxxxx | ........ | ...ppppp | .......t | tttt.... | .......l | llll.... |
| ........ | .vvvvv.. | ........ | .nnnnn.. | .....rrr | rr...... | .....jjj | jj...... |
 _______________________________________________________________________________________
| ........ | ...hhhhh | ........ | ........ | .......d | dddd.... | ........ | ........ |
| ........ | .fffff.. | ........ | ........ | .....bbb | bb...... | ........ | ........ |
 ____________________________________________________________________________________ LSB
```

#### Bit-shift 32-bit values:


This aligns each 5-bit value to a byte boundary.
Because we can only bit-shift every 32 bits, the values need to be arranged differently first.

```text
MSB ____________________________________________________________________________________
| ...xxxxx | ........ | ...ppppp | ........ | ...ttttt | ........ | ...lllll | ........ |
| ...vvvvv | ........ | ...nnnnn | ........ | ...rrrrr | ........ | ...jjjjj | ........ |
 _______________________________________________________________________________________
| ...hhhhh | ........ | ........ | ........ | ...ddddd | ........ | ........ | ........ |
| ...fffff | ........ | ........ | ........ | ...bbbbb | ........ | ........ | ........ |
 ____________________________________________________________________________________ LSB
```

#### Swizzle to align columns:


The values are now byte-aligned, and can be swizzled into the right place.
Note that, earlier, the correct bits already had to end up on the right half of the register.

```text
MSB ____________________________________________________________________________________
| ...xxxxx | ........ | ...vvvvv | ........ | ...ttttt | ........ | ...rrrrr | ........ |
| ...ppppp | ........ | ...nnnnn | ........ | ...lllll | ........ | ...jjjjj | ........ |
 _______________________________________________________________________________________
| ...hhhhh | ........ | ...fffff | ........ | ...ddddd | ........ | ...bbbbb | ........ |
| ........ | ........ | ........ | ........ | ........ | ........ | ........ | ........ |
 ____________________________________________________________________________________ LSB
```

#### Combine


The other half of the values is processed as follows:

```text
MSB ____________________________________________________________________________________
| ......ww | www..... | ......oo | ooo..... | ..sssss. | ........ | ..kkkkk. | ........ |
| ....uuuu | u....... | ....mmmm | m....... | qqqqq... | ........ | iiiii... | ........ |
 _______________________________________________________________________________________
| ......gg | ggg..... | ........ | ........ | ..ccccc. | ........ | ........ | ........ |
| ....eeee | e....... | ........ | ........ | aaaaa... | ........ | ........ | ........ |
 ____________________________________________________________________________________ LSB
 
MSB ____________________________________________________________________________________
| ........ | ...wwwww | ........ | ...ooooo | ........ | ...sssss | ........ | ...kkkkk |
| ........ | ...uuuuu | ........ | ...mmmmm | ........ | ...qqqqq | ........ | ...iiiii |
 _______________________________________________________________________________________
| ........ | ...ggggg | ........ | ........ | ........ | ...ccccc | ........ | ........ |
| ........ | ...eeeee | ........ | ........ | ........ | ...aaaaa | ........ | ........ |
 ____________________________________________________________________________________ LSB
 
MSB ____________________________________________________________________________________
| ........ | ...wwwww | ........ | ...uuuuu | ........ | ...sssss | ........ | ...qqqqq |
| ........ | ...ooooo | ........ | ...mmmmm | ........ | ...kkkkk | ........ | ...iiiii |
 _______________________________________________________________________________________
| ........ | ...ggggg | ........ | ...eeeee | ........ | ...ccccc | ........ | ...aaaaa |
| ........ | ........ | ........ | ........ | ........ | ........ | ........ | ........ |
 ____________________________________________________________________________________ LSB
```

The two registers are then ORed together.

Note that the examples above show the desired values already isolated - this is just for clarity, the actual masking
happens after combining the two registers.

```text
MSB ____________________________________________________________________________________
| ...xxxxx | ...wwwww | ...vvvvv | ...uuuuu | ...ttttt | ...sssss | ...rrrrr | ...qqqqq |
| ...ppppp | ...ooooo | ...nnnnn | ...mmmmm | ...lllll | ...kkkkk | ...jjjjj | ...iiiii |
 _______________________________________________________________________________________
| ...hhhhh | ...ggggg | ...fffff | ...eeeee | ...ddddd | ...ccccc | ...bbbbb | ...aaaaa |
| ........ | ........ | ........ | ........ | ........ | ........ | ........ | ........ |
 ____________________________________________________________________________________ LSB
```

#### Convert to base32


Finally, the values can be converted to base32 by adding the right value (0-25 → 'a'-'z', 26-31 → '2'-'7').
This is done with a `blendv` based on a `cmpgt` mask.

The final value is written into the provided 32 byte `out` slice.

Currently, the first 8 bytes are then overwritten with "did:plc:". Trying to cram these bytes into the vector register
is likely not worth it, as the current method simply uses a single `movabs` instruction.

## Without AVX2


The non-AVX2 implementation simply converts each value to its base32 character.
Each 5-bit value is isolated via bit-shifting and ANDing.

This is done 8 values at a time, similar to decoding, since 8 5-bit values are aligned
with the byte boundary at 40 bits.