ratproto-did 0.0.3

A highly-optimized library for atproto DIDs.
# Overview


To describe the encoding and decoding process and how individual bits are manipulated,
placeholder letters will be used as follows:

```text
MSB ____________________________________________________________________________________
| 01100100 | 01101001 | 01100100 | 00111010 | 01110000 | 01101100 | 01100011 | 00111010 | // did:plc:
| aaaaaaaa | bbbbbbbb | cccccccc | dddddddd | eeeeeeee | ffffffff | gggggggg | hhhhhhhh |
 _______________________________________________________________________________________
| iiiiiiii | jjjjjjjj | kkkkkkkk | llllllll | mmmmmmmm | nnnnnnnn | oooooooo | pppppppp |
| qqqqqqqq | rrrrrrrr | ssssssss | tttttttt | uuuuuuuu | vvvvvvvv | wwwwwwww | xxxxxxxx |
 ____________________________________________________________________________________ LSB
```

This represents the following DID: `did:plc:abcdefghijklmnopqrstuvwx`

This module implements its optimizations with x86 SIMD, so registers hold data in little-endian order
(least-significant byte first). In a byte array (or a string), identifiers are stored
most-significant byte first. Whenever data moves between raw memory and registers, the byte order is flipped.
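
The flip can be seen with the standard library alone. This is a minimal sketch, not part of the crate; `prefix_as_le` is a hypothetical helper name. Reading the 8 prefix bytes as a little-endian integer puts the first character of the string into the least-significant byte, which is the order used in the register diagrams.

```rust
/// Interpret the first 8 bytes of a DID string the way a little-endian
/// register load would: byte 0 of memory becomes the least-significant byte.
/// Panics if `did` is shorter than 8 bytes (acceptable for a sketch).
fn prefix_as_le(did: &[u8]) -> u64 {
    let mut first8 = [0u8; 8];
    first8.copy_from_slice(&did[..8]);
    u64::from_le_bytes(first8)
}
```

For `did:plc:...` the result is `0x3a636c703a646964`: the leading `d` (`0x64`) sits in the lowest byte and the trailing `:` (`0x3a`) in the highest.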

The optimizations outlined below work with AVX2, which supports registers up to 256 bits wide.
Because many operations treat these registers as two 128-bit halves, a divider line is added in visualizations.

Base32 encodes 5 bits of data as one character. The goal of the decoding and encoding steps is to convert
between a string of 8-bit characters and packed 5-bit values.
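
As a scalar illustration of that mapping, here is a small sketch built on a lookup table. The names `ALPHABET`, `char_to_value` and `value_to_char` are made up for this example and are not the crate's API. The 24 identifier characters carry 24 × 5 = 120 bits, which is 15 bytes of packed data.

```rust
/// The lowercase base32 alphabet: the index of a character is its 5-bit value.
const ALPHABET: &[u8; 32] = b"abcdefghijklmnopqrstuvwxyz234567";

/// Character to 5-bit value, or `None` if the byte is not in the alphabet.
fn char_to_value(c: u8) -> Option<u8> {
    ALPHABET.iter().position(|&a| a == c).map(|i| i as u8)
}

/// 5-bit value to character; only the low 5 bits of `v` are used.
fn value_to_char(v: u8) -> u8 {
    ALPHABET[(v & 0x1f) as usize]
}
```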

---

# Decoding


## AVX2


### Outline


1. Load 32 bytes into a 256-bit register
2. Validate `did:plc:` prefix
3. Validate base32
    - Create lane masks for lowercase alphabetic and 2..=7 numeric chars
    - If any bytes satisfy neither, an error can be returned.
4. Convert chars into 5-bit values
    - Use byte masks from 3a.
    - If any bytes were not valid base32, a garbage value is produced.
    - Every byte is now a 5-bit value (unless some char was not base32, in which case the result is invalid anyway).
5. Shuffle and bit-shift bytes.
    - Bytes can be shuffled to re-use the 256-bit register and give each value 16 bits of space
    - After the u8 values are extended to u16, bits can be shifted across byte boundaries
6. Shuffle bytes again and prepare them for packing
7. Combine 8-bit values with `or` operations
    - The first byte needs to be 0; this can be done during shuffling
    - The effective data now spans 16 bytes, or 128 bits
8. Write the 16-byte value as the result
    - If we ensure that the first byte of [`DidInner`] is the discriminant, the result can be transmuted. Beware of byte
      order!
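
For comparison, the same outline can be written as a plain scalar reference. This is only a sketch of the expected input and output, not the SIMD implementation; the function name `decode_did_plc` and the choice to return the bytes most-significant first are assumptions of this example.

```rust
/// Scalar reference for the outline above (not the SIMD implementation).
/// Returns 16 bytes, most-significant first: byte 0 is zero, followed by
/// the 24 five-bit values packed into 120 bits.
fn decode_did_plc(input: &[u8]) -> Option<[u8; 16]> {
    // Steps 1-2: length and prefix check.
    if input.len() != 32 || !input.starts_with(b"did:plc:") {
        return None;
    }
    let mut acc: u128 = 0;
    for &c in &input[8..] {
        // Steps 3-4: validate and convert each character.
        let v = match c {
            b'a'..=b'z' => c - b'a',
            b'2'..=b'7' => c - b'2' + 26,
            _ => return None,
        };
        // Steps 5-7: append 5 bits; the top 8 bits of `acc` stay zero.
        acc = (acc << 5) | v as u128;
    }
    // Step 8: emit the 16-byte value.
    Some(acc.to_be_bytes())
}
```

A scalar version like this is mainly useful as a test oracle for the vectorized path.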

### Notes


- base32 conversion is branchless, so an invalid value will produce garbage data during the process
    - This is detected during the validation process
    - If validation fails, the first byte is set to a non-zero value
- Some bytes need bits from 3 different base32 chars
    - This makes the packing process require three `or` operations
    - Possible optimization with a different way to shuffle & shift bits?
      Something with aligning the bits at a byte boundary, thus bringing two values together

### Steps


#### Goal


We start with the following data:

```text
MSB ____________________________________________________________________________________
| 01100100 | 01101001 | 01100100 | 00111010 | 01110000 | 01101100 | 01100011 | 00111010 | // did:plc:
| aaaaaaaa | bbbbbbbb | cccccccc | dddddddd | eeeeeeee | ffffffff | gggggggg | hhhhhhhh |
 _______________________________________________________________________________________
| iiiiiiii | jjjjjjjj | kkkkkkkk | llllllll | mmmmmmmm | nnnnnnnn | oooooooo | pppppppp |
| qqqqqqqq | rrrrrrrr | ssssssss | tttttttt | uuuuuuuu | vvvvvvvv | wwwwwwww | xxxxxxxx |
 ____________________________________________________________________________________ LSB
```

This is loaded into the SIMD register in reverse order:

```text
MSB ____________________________________________________________________________________
| xxxxxxxx | wwwwwwww | vvvvvvvv | uuuuuuuu | tttttttt | ssssssss | rrrrrrrr | qqqqqqqq |
| pppppppp | oooooooo | nnnnnnnn | mmmmmmmm | llllllll | kkkkkkkk | jjjjjjjj | iiiiiiii |
 _______________________________________________________________________________________
| hhhhhhhh | gggggggg | ffffffff | eeeeeeee | dddddddd | cccccccc | bbbbbbbb | aaaaaaaa |
| 00111010 | 01100011 | 01101100 | 01110000 | 00111010 | 01100100 | 01101001 | 01100100 |
 ____________________________________________________________________________________ LSB
```

The goal is to arrange the values as such:

```text
 _______________________________________________________________________________________
| wwwxxxxx | uvvvvvww | ttttuuuu | rrssssst | qqqqqrrr | oooppppp | mnnnnnoo | llllmmmm |
| jjkkkkkl | iiiiijjj | ggghhhhh | efffffgg | ddddeeee | bbcccccd | aaaaabbb | 00000000 |
 _______________________________________________________________________________________
```

#### Convert base32


First, base32 characters are converted to their 5-bit values.
This is done by subtracting either `0x61` (to place `a` at 0), or `0x18` (to place `2` at 26).

Validation happens in parallel, and will set a validity bit at the end. If any of the characters are not base32,
the procedure will work with garbage values, but the end result will be discarded anyway.

Assuming the identifier _is_ valid base32, the result is the following.
The "did:plc:" bytes are removed, as they're no longer relevant.

```text
 _______________________________________________________________________________________
| ...xxxxx | ...wwwww | ...vvvvv | ...uuuuu | ...ttttt | ...sssss | ...rrrrr | ...qqqqq |
| ...ppppp | ...ooooo | ...nnnnn | ...mmmmm | ...lllll | ...kkkkk | ...jjjjj | ...iiiii |
 _______________________________________________________________________________________
| ...hhhhh | ...ggggg | ...fffff | ...eeeee | ...ddddd | ...ccccc | ...bbbbb | ...aaaaa |
| ........ | ........ | ........ | ........ | ........ | ........ | ........ | ........ |
 _______________________________________________________________________________________
```
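
The conversion described in this section can be sketched for a single byte as follows. This is a scalar analogue under stated assumptions: `convert_byte` is a hypothetical name, the masks stand in for the SIMD lane masks, and whether the compiler emits branch-free machine code for it is not guaranteed.

```rust
/// Mask-based per-byte conversion. Returns `(value, is_valid)`.
/// For invalid input the value is garbage, mirroring the SIMD approach
/// where validation runs alongside the conversion.
fn convert_byte(c: u8) -> (u8, bool) {
    // 0xFF when the byte is in range, 0x00 otherwise.
    let is_alpha = 0u8.wrapping_sub((c.wrapping_sub(b'a') < 26) as u8);
    let is_digit = 0u8.wrapping_sub((c.wrapping_sub(b'2') < 6) as u8);
    // Select the offset with the mask instead of an if/else.
    let offset = (is_alpha & 0x61) | (!is_alpha & 0x18);
    let value = c.wrapping_sub(offset);
    (value, (is_alpha | is_digit) != 0)
}
```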

#### Swizzling & permutation process


After each character is converted into a 5-bit value,
the bytes are swizzled and permuted to finally be OR-reduced.

During swizzling, bytes cannot cross the 128-bit boundary - permutation is required for that.
These values are part of the same 128-bit half:

- a..=h
- i..=x

Values need to be shifted as follows:

Within their byte:

- h, p, x << 0
- c, k, s << 1
- f, n, v << 2
- a, i, q << 3

Crossing a byte boundary:

- e, m, u >> 1
- b, j, r >> 2
- g, o, w >> 3
- d, l, t >> 4
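
To make the shift amounts concrete, here is a scalar sketch that packs one group of eight values (`a..=h`) into five bytes using exactly those shifts. `pack8` is a hypothetical helper and the inputs are assumed to already be 5-bit values. A value that crosses a byte boundary appears twice: shifted right in the earlier byte and shifted left in the following one.

```rust
/// Pack eight 5-bit values (a..=h) into five bytes.
/// Inputs are assumed to be in 0..32. Left shifts on `u8` drop the bits
/// that leave the byte, which is the intended behavior here.
fn pack8(v: [u8; 8]) -> [u8; 5] {
    let [a, b, c, d, e, f, g, h] = v;
    [
        (a << 3) | (b >> 2),            // aaaaabbb
        (b << 6) | (c << 1) | (d >> 4), // bbcccccd (three sources)
        (d << 4) | (e >> 1),            // ddddeeee
        (e << 7) | (f << 2) | (g >> 3), // efffffgg (three sources)
        (g << 5) | h,                   // ggghhhhh
    ]
}
```

The two bytes with three sources are the reason the vectorized version needs an extra `or` step.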

**Starting point** (as shown above)

```text
 _______________________________________________________________________________________
| ...xxxxx | ...wwwww | ...vvvvv | ...uuuuu | ...ttttt | ...sssss | ...rrrrr | ...qqqqq |
| ...ppppp | ...ooooo | ...nnnnn | ...mmmmm | ...lllll | ...kkkkk | ...jjjjj | ...iiiii |
 _______________________________________________________________________________________
| ...hhhhh | ...ggggg | ...fffff | ...eeeee | ...ddddd | ...ccccc | ...bbbbb | ...aaaaa |
| ........ | ........ | ........ | ........ | ........ | ........ | ........ | ........ |
 _______________________________________________________________________________________
```

**Permute**

We need some bytes in the lower half of the 256-bit register to allow for shuffling later.

```text
 _______________________________________________________________________________________
| ...xxxxx | ...wwwww | ...vvvvv | ...uuuuu | ...ttttt | ...sssss | ...rrrrr | ...qqqqq |
| ...ppppp | ...ooooo | ...nnnnn | ...mmmmm | ...lllll | ...kkkkk | ...jjjjj | ...iiiii |
 _______________________________________________________________________________________
| ...hhhhh | ...ggggg | ...fffff | ...eeeee | ...ddddd | ...ccccc | ...bbbbb | ...aaaaa |
| ...ppppp | ...ooooo | ...nnnnn | ...mmmmm | ...lllll | ...kkkkk | ...jjjjj | ...iiiii |
 _______________________________________________________________________________________
```

**Swizzle**

AVX2 does not support variable (vector) shifts for 16-bit chunks, only 32 or larger.

Vertically-adjacent values need to be shifted by the same amount, so they have to be aligned
into the same 32-bit chunk. In addition, because values will cross byte boundaries, odd and even columns
are split into two separate registers.

```text
 _______________________________________________________________________________________
| ...xxxxx | ...wwwww | ...ppppp | ...ooooo | ...vvvvv | ...uuuuu | ...nnnnn | ...mmmmm |
| ...ttttt | ...sssss | ...lllll | ...kkkkk | ...rrrrr | ...qqqqq | ...jjjjj | ...iiiii |
 _______________________________________________________________________________________
| ...hhhhh | ...ggggg | ...ppppp | ...ooooo | ...fffff | ...eeeee | ...nnnnn | ...mmmmm |
| ...ddddd | ...ccccc | ...lllll | ...kkkkk | ...bbbbb | ...aaaaa | ...jjjjj | ...iiiii |
 _______________________________________________________________________________________
```

**Split**

This has to be done via masked blending.
`unpackhi`/`unpacklo` don't work here, because the split has to happen vertically.

```text
 _______________________________________________________________________________________
| ...xxxxx | ........ | ...ppppp | ........ | ...vvvvv | ........ | ...nnnnn | ........ |
| ...ttttt | ........ | ...lllll | ........ | ...rrrrr | ........ | ...jjjjj | ........ |
 _______________________________________________________________________________________
| ...hhhhh | ........ | ...ppppp | ........ | ...fffff | ........ | ...nnnnn | ........ |
| ...ddddd | ........ | ...lllll | ........ | ...bbbbb | ........ | ...jjjjj | ........ |
 _______________________________________________________________________________________
 
 _______________________________________________________________________________________
| ........ | ...wwwww | ........ | ...ooooo | ........ | ...uuuuu | ........ | ...mmmmm |
| ........ | ...sssss | ........ | ...kkkkk | ........ | ...qqqqq | ........ | ...iiiii |
 _______________________________________________________________________________________
| ........ | ...ggggg | ........ | ...ooooo | ........ | ...eeeee | ........ | ...mmmmm |
| ........ | ...ccccc | ........ | ...kkkkk | ........ | ...aaaaa | ........ | ...iiiii |
 _______________________________________________________________________________________
```

**Bit-shift**

```text
 _______________________________________________________________________________________
| ........ | ...xxxxx | ........ | ...ppppp | ........ | .vvvvv.. | ........ | .nnnnn.. |
| .......t | tttt.... | .......l | llll.... | .....rrr | rr...... | .....jjj | jj...... |
 _______________________________________________________________________________________
| ........ | ...hhhhh | ........ | ...ppppp | ........ | .fffff.. | ........ | .nnnnn.. |
| .......d | dddd.... | .......l | llll.... | .....bbb | bb...... | .....jjj | jj...... |
 _______________________________________________________________________________________

 _______________________________________________________________________________________
| ......ww | www..... | ......oo | ooo..... | ....uuuu | u....... | ....mmmm | m....... |
| ........ | ..sssss. | ........ | ..kkkkk. | ........ | qqqqq... | ........ | iiiii... |
 _______________________________________________________________________________________
| ......gg | ggg..... | ......oo | ooo..... | ....eeee | e....... | ....mmmm | m....... |
| ........ | ..ccccc. | ........ | ..kkkkk. | ........ | aaaaa... | ........ | iiiii... |
 _______________________________________________________________________________________
```

**Shuffle**

The values are now correctly aligned bit-wise, so they only need to be moved into the correct bytes.

```text
 _______________________________________________________________________________________
| ...xxxxx | .vvvvv.. | tttt.... | rr...... | .....rrr | ...ppppp | .nnnnn.. | llll.... |
| ........ | ........ | ........ | .......t | ........ | ........ | ........ | ........ |
 _______________________________________________________________________________________
| .......l | ........ | ........ | ........ | ........ | .......d | ........ | ........ |
| jj...... | .....jjj | ...hhhhh | .fffff.. | dddd.... | bb...... | .....bbb | 00000000 |
 _______________________________________________________________________________________
 
 _______________________________________________________________________________________
| www..... | ......ww | ....uuuu | ..sssss. | qqqqq... | ooo..... | ......oo | ....mmmm |
| ........ | u....... | ........ | ........ | ........ | ........ | m....... | 00000000 |
 _______________________________________________________________________________________
| ........ | ........ | ........ | e....... | ........ | ........ | ........ | ........ |
| ..kkkkk. | iiiii... | ggg..... | ......gg | ....eeee | ..ccccc. | aaaaa... | 00000000 |
 _______________________________________________________________________________________
```

**OR-reduce**

Both 256-bit registers are ORed together.

```text
 _______________________________________________________________________________________
| wwwxxxxx | .vvvvvww | ttttuuuu | rrsssss. | qqqqqrrr | oooppppp | .nnnnnoo | llllmmmm |
| ........ | u....... | ........ | .......t | ........ | ........ | m....... | 00000000 |
 _______________________________________________________________________________________
| .......l | ........ | ........ | e....... | ........ | .......d | ........ | ........ |
| jjkkkkk. | iiiiijjj | ggghhhhh | .fffffgg | ddddeeee | bbccccc. | aaaaabbb | 00000000 |
 _______________________________________________________________________________________
```

Because some bytes take bits from 3 different sources, two 128-bit values are prepared using permutation,
and finally ORed together again.

```text
 _______________________________________________________________________________________
| wwwxxxxx | .vvvvvww | ttttuuuu | rrsssss. | qqqqqrrr | oooppppp | .nnnnnoo | llllmmmm |
| jjkkkkk. | iiiiijjj | ggghhhhh | .fffffgg | ddddeeee | bbccccc. | aaaaabbb | 00000000 |
 _______________________________________________________________________________________
```

```text
 _______________________________________________________________________________________
| ........ | u....... | ........ | .......t | ........ | ........ | m....... | ........ |
| .......l | ........ | ........ | e....... | ........ | .......d | ........ | ........ |
 _______________________________________________________________________________________
```

```text
 _______________________________________________________________________________________
| wwwxxxxx | uvvvvvww | ttttuuuu | rrssssst | qqqqqrrr | oooppppp | mnnnnnoo | llllmmmm |
| jjkkkkkl | iiiiijjj | ggghhhhh | efffffgg | ddddeeee | bbcccccd | aaaaabbb | 00000000 |
 _______________________________________________________________________________________
```

#### Output


Finally, the 128-bit register is written into memory. If validation failed, the first byte is set to a non-zero value.
In this case, the rest of the data will contain garbage, but the `OptionDidPlc` wrapper will then discard the data.
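
The discriminant check can be pictured with a small stand-in type. Everything here is hypothetical and for illustration only: `RawDecoded` and `into_option` are not the crate's `OptionDidPlc` API, and the layout (discriminant in byte 0, 15 payload bytes after it) is an assumption based on the description above.

```rust
/// Hypothetical stand-in for a wrapper around the decoder's raw 16-byte output.
struct RawDecoded([u8; 16]);

impl RawDecoded {
    /// A zero first byte marks a successful decode. Anything else means the
    /// remaining 15 bytes are garbage and must not be used.
    fn into_option(self) -> Option<[u8; 15]> {
        if self.0[0] != 0 {
            return None;
        }
        let mut id = [0u8; 15];
        id.copy_from_slice(&self.0[1..]);
        Some(id)
    }
}
```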

---

## Without AVX2


If AVX2 is not detected, the decoding process uses bit manipulation in a manner similar to the AVX2 method above.

All bytes are first validated (`did:plc:` prefix & base32 characters).

The identifier is parsed 8 characters at a time (yielding 40 bits each time).
The characters are converted into their 5-bit values based on bit `0x40`
(characters `a..=z` have this bit set, characters `2..=7` do not).

The values are then packed, either by bit-shifting, or using the BMI2 intrinsic `_pext_u64`, if detected.
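
One way to picture the 8-characters-at-a-time step is the following sketch, which treats a `u64` as eight byte lanes. It is an illustration under assumptions, not the crate's code: `decode8` is a made-up name, the input is assumed to be validated already, and the packing is done with shifts. On CPUs with BMI2, `_pext_u64(values, 0x1f1f_1f1f_1f1f_1f1f)` would gather the same bits as the loop at the end.

```rust
/// Convert 8 already-validated base32 characters into a packed 40-bit value.
fn decode8(chars: [u8; 8]) -> u64 {
    const LANES: u64 = 0x0101_0101_0101_0101;
    // First character ends up in the most-significant byte lane.
    let chunk = u64::from_be_bytes(chars);
    // 1 in every byte lane whose character has bit 0x40 set (a letter).
    let is_alpha = (chunk >> 6) & LANES;
    // Per-lane offset: 0x61 for letters, 0x18 for digits (0x18 + 0x49 = 0x61).
    let offsets = 0x18 * LANES + 0x49 * is_alpha;
    // For valid input no lane borrows, since every byte is >= its offset.
    let values = chunk.wrapping_sub(offsets);
    // Gather the low 5 bits of each lane, first character first.
    let mut packed = 0u64;
    for lane in (0..8).rev() {
        packed = (packed << 5) | ((values >> (lane * 8)) & 0x1f);
    }
    packed
}
```

Calling this three times covers the 24 identifier characters and yields the 120 payload bits.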

---

# Encoding


This is a bit simpler, because we don't need to validate base32.
We still need to validate that the first byte (the discriminant) is 0,
but that is handled outside the encoder.

## AVX2


With AVX2, we can make use of a little bit of vectorization again:

1. Load the 16-byte value of `Did` (128 bits), duplicate to a 256-bit register.
2. Shuffle alternating 5-bit values into two 256-bit registers
    - Some values are spread across 2 bytes
3. Bit-shift packed 32-bit integers to LSb-align the values
4. OR the two registers together
5. AND-mask the lower 5 bits of every byte
6. Convert values to their base32 chars
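
Steps 2 to 5 have a simple scalar analogue for one group of five packed bytes. The sketch below is illustrative only; `unpack8` is a hypothetical helper. Values that are spread across two bytes are rebuilt by ORing a left-shifted part with a right-shifted part, and the final AND keeps the lower 5 bits.

```rust
/// Split five packed bytes back into eight 5-bit values (a..=h).
/// Byte layout: aaaaabbb bbcccccd ddddeeee efffffgg ggghhhhh.
fn unpack8(p: [u8; 5]) -> [u8; 8] {
    [
        p[0] >> 3,                          // a
        ((p[0] << 2) | (p[1] >> 6)) & 0x1f, // b, spread across 2 bytes
        (p[1] >> 1) & 0x1f,                 // c
        ((p[1] << 4) | (p[2] >> 4)) & 0x1f, // d, spread across 2 bytes
        ((p[2] << 1) | (p[3] >> 7)) & 0x1f, // e, spread across 2 bytes
        (p[3] >> 2) & 0x1f,                 // f
        ((p[3] << 3) | (p[4] >> 5)) & 0x1f, // g, spread across 2 bytes
        p[4] & 0x1f,                        // h
    ]
}
```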

### Steps


#### Loading data


Data starts out in MSB order:

```text
MSB ____________________________________________________________________________________
| 00000000 | aaaaabbb | bbcccccd | ddddeeee | efffffgg | ggghhhhh | iiiiijjj | jjkkkkkl |
| llllmmmm | mnnnnnoo | oooppppp | qqqqqrrr | rrssssst | ttttuuuu | uvvvvvww | wwwxxxxx |
 ____________________________________________________________________________________ LSB
```

This is loaded into registers in little-endian order:

```text
MSB ____________________________________________________________________________________
| wwwxxxxx | uvvvvvww | ttttuuuu | rrssssst | qqqqqrrr | oooppppp | mnnnnnoo | llllmmmm |
| jjkkkkkl | iiiiijjj | ggghhhhh | efffffgg | ddddeeee | bbcccccd | aaaaabbb | 00000000 |
 ____________________________________________________________________________________ LSB
```

These 128 bits are broadcast into a 256-bit register.

This is the register layout we want to reach before converting values to base32:

```text
MSB ____________________________________________________________________________________
| ...xxxxx | ...wwwww | ...vvvvv | ...uuuuu | ...ttttt | ...sssss | ...rrrrr | ...qqqqq |
| ...ppppp | ...ooooo | ...nnnnn | ...mmmmm | ...lllll | ...kkkkk | ...jjjjj | ...iiiii |
 _______________________________________________________________________________________
| ...hhhhh | ...ggggg | ...fffff | ...eeeee | ...ddddd | ...ccccc | ...bbbbb | ...aaaaa |
| 00111010 | 01100011 | 01101100 | 01110000 | 00111010 | 01100100 | 01101001 | 01100100 |
 ____________________________________________________________________________________ LSB
```

Odd and even values are swizzled in order to be bit-shifted and isolated.
Keep in mind that AVX2 only supports variable bit-shifting for 32-bit (or larger) integers, not 16-bit ones,
and swizzling cannot happen across the 128-bit boundary.
We initially have all values available in both halves of the 256-bit register,
but we have to prepare and isolate the correct values in each half to avoid having to permute.

One half of the values is processed as follows:

#### Swizzle & prepare


```text
MSB ____________________________________________________________________________________
| ........ | wwwxxxxx | ........ | oooppppp | rrssssst | ttttuuuu | jjkkkkkl | llllmmmm |
| ........ | uvvvvvww | ........ | mnnnnnoo | qqqqqrrr | rrssssst | iiiiijjj | jjkkkkkl |
 _______________________________________________________________________________________
| ........ | ggghhhhh | ........ | ........ | bbcccccd | ddddeeee | ........ | ........ |
| ........ | efffffgg | ........ | ........ | aaaaabbb | bbcccccd | ........ | ........ |
 ____________________________________________________________________________________ LSB
```

Desired value bits isolated for clarity _(the masking itself happens later)_

```text
MSB ____________________________________________________________________________________
| ........ | ...xxxxx | ........ | ...ppppp | .......t | tttt.... | .......l | llll.... |
| ........ | .vvvvv.. | ........ | .nnnnn.. | .....rrr | rr...... | .....jjj | jj...... |
 _______________________________________________________________________________________
| ........ | ...hhhhh | ........ | ........ | .......d | dddd.... | ........ | ........ |
| ........ | .fffff.. | ........ | ........ | .....bbb | bb...... | ........ | ........ |
 ____________________________________________________________________________________ LSB
```

#### Bit-shift 32-bit values


This aligns each 5-bit value to a byte boundary.
Because we can only bit-shift every 32 bits, the values need to be arranged differently first.

```text
MSB ____________________________________________________________________________________
| ...xxxxx | ........ | ...ppppp | ........ | ...ttttt | ........ | ...lllll | ........ |
| ...vvvvv | ........ | ...nnnnn | ........ | ...rrrrr | ........ | ...jjjjj | ........ |
 _______________________________________________________________________________________
| ...hhhhh | ........ | ........ | ........ | ...ddddd | ........ | ........ | ........ |
| ...fffff | ........ | ........ | ........ | ...bbbbb | ........ | ........ | ........ |
 ____________________________________________________________________________________ LSB
```

#### Swizzle to align columns


The values are now byte-aligned, and can be swizzled into the right place.
Note that, earlier, the correct bits already had to end up in the correct 128-bit half of the register.

```text
MSB ____________________________________________________________________________________
| ...xxxxx | ........ | ...vvvvv | ........ | ...ttttt | ........ | ...rrrrr | ........ |
| ...ppppp | ........ | ...nnnnn | ........ | ...lllll | ........ | ...jjjjj | ........ |
 _______________________________________________________________________________________
| ...hhhhh | ........ | ...fffff | ........ | ...ddddd | ........ | ...bbbbb | ........ |
| ........ | ........ | ........ | ........ | ........ | ........ | ........ | ........ |
 ____________________________________________________________________________________ LSB
```

#### Combine


The other half of the values is processed as follows:

```text
MSB ____________________________________________________________________________________
| ......ww | www..... | ......oo | ooo..... | ..sssss. | ........ | ..kkkkk. | ........ |
| ....uuuu | u....... | ....mmmm | m....... | qqqqq... | ........ | iiiii... | ........ |
 _______________________________________________________________________________________
| ......gg | ggg..... | ........ | ........ | ..ccccc. | ........ | ........ | ........ |
| ....eeee | e....... | ........ | ........ | aaaaa... | ........ | ........ | ........ |
 ____________________________________________________________________________________ LSB
 
MSB ____________________________________________________________________________________
| ........ | ...wwwww | ........ | ...ooooo | ........ | ...sssss | ........ | ...kkkkk |
| ........ | ...uuuuu | ........ | ...mmmmm | ........ | ...qqqqq | ........ | ...iiiii |
 _______________________________________________________________________________________
| ........ | ...ggggg | ........ | ........ | ........ | ...ccccc | ........ | ........ |
| ........ | ...eeeee | ........ | ........ | ........ | ...aaaaa | ........ | ........ |
 ____________________________________________________________________________________ LSB
 
MSB ____________________________________________________________________________________
| ........ | ...wwwww | ........ | ...uuuuu | ........ | ...sssss | ........ | ...qqqqq |
| ........ | ...ooooo | ........ | ...mmmmm | ........ | ...kkkkk | ........ | ...iiiii |
 _______________________________________________________________________________________
| ........ | ...ggggg | ........ | ...eeeee | ........ | ...ccccc | ........ | ...aaaaa |
| ........ | ........ | ........ | ........ | ........ | ........ | ........ | ........ |
 ____________________________________________________________________________________ LSB
```

The two registers are then ORed together.

Note that the examples above show the desired values already isolated. This is only for clarity;
the actual masking happens after combining the two registers.

```text
MSB ____________________________________________________________________________________
| ...xxxxx | ...wwwww | ...vvvvv | ...uuuuu | ...ttttt | ...sssss | ...rrrrr | ...qqqqq |
| ...ppppp | ...ooooo | ...nnnnn | ...mmmmm | ...lllll | ...kkkkk | ...jjjjj | ...iiiii |
 _______________________________________________________________________________________
| ...hhhhh | ...ggggg | ...fffff | ...eeeee | ...ddddd | ...ccccc | ...bbbbb | ...aaaaa |
| ........ | ........ | ........ | ........ | ........ | ........ | ........ | ........ |
 ____________________________________________________________________________________ LSB
```

#### Convert to base32


Finally, the values can be converted to base32 by adding the right value (0-25 → 'a'-'z', 26-31 → '2'-'7').
This is done with a `blendv` based on a `cmpgt` mask.
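
A minimal sketch of this step using the `std::arch` intrinsics could look as follows. The function names are made up for this example, the input is assumed to hold one 5-bit value per byte, and a scalar fallback is used when AVX2 is not available. It shows the `cmpgt` mask selecting between the two offsets via `blendv`; it is not the crate's actual code.

```rust
/// Scalar reference: map each 5-bit value to its base32 character.
fn to_base32_scalar(values: &[u8; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (o, &v) in out.iter_mut().zip(values) {
        *o = if v > 25 { v + 0x18 } else { v + 0x61 };
    }
    out
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn to_base32_avx2(values: &[u8; 32]) -> [u8; 32] {
    use std::arch::x86_64::*;
    let v = _mm256_loadu_si256(values.as_ptr() as *const __m256i);
    // 0xFF in every byte whose value is greater than 25
    // (a signed compare is fine for inputs in 0..=31).
    let is_digit = _mm256_cmpgt_epi8(v, _mm256_set1_epi8(25));
    // Per-byte offset: 0x61 for letters, 0x18 where the mask is set.
    let offset =
        _mm256_blendv_epi8(_mm256_set1_epi8(0x61), _mm256_set1_epi8(0x18), is_digit);
    let chars = _mm256_add_epi8(v, offset);
    let mut out = [0u8; 32];
    _mm256_storeu_si256(out.as_mut_ptr() as *mut __m256i, chars);
    out
}

/// Dispatch to the AVX2 path when the CPU supports it.
pub fn to_base32(values: &[u8; 32]) -> [u8; 32] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: AVX2 support was just checked at runtime.
            return unsafe { to_base32_avx2(values) };
        }
    }
    to_base32_scalar(values)
}
```

Feeding the values 0 through 31 in order produces the alphabet `abcdefghijklmnopqrstuvwxyz234567` on either path.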

The final value is written into the provided 32-byte `out` slice.

Currently, the first 8 bytes are then overwritten with "did:plc:". Trying to cram these bytes into the vector register
is likely not worth it, as the current method simply uses a single `movabs` instruction.

## Without AVX2


The non-AVX2 implementation simply converts each value to its base32 character.
Each 5-bit value is isolated via bit-shifting and ANDing.

This is done 8 values at a time, similar to decoding, since eight 5-bit values (40 bits)
end exactly on a byte boundary.
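
A scalar encoder along these lines might look like the sketch below. It is an illustration, not the crate's implementation: `encode_did_plc` is a hypothetical name, the input is assumed to be the 16-byte packed form in most-significant-first order, and the zero discriminant in byte 0 is assumed to have been checked by the caller.

```rust
/// Encode the 16-byte packed form (byte 0 is the discriminant and is
/// skipped here) as a 32-byte `did:plc:` string.
fn encode_did_plc(packed: &[u8; 16]) -> [u8; 32] {
    let mut out = [0u8; 32];
    out[..8].copy_from_slice(b"did:plc:");
    // 15 payload bytes = three 5-byte groups = three groups of 8 characters.
    for (group, chars) in packed[1..].chunks_exact(5).zip(out[8..].chunks_exact_mut(8)) {
        // Collect the 40 bits of this group into one integer.
        let mut bits = 0u64;
        for &b in group {
            bits = (bits << 8) | b as u64;
        }
        for (i, ch) in chars.iter_mut().enumerate() {
            // Isolate the i-th 5-bit value by shifting and masking.
            let v = ((bits >> (35 - 5 * i)) & 0x1f) as u8;
            *ch = if v < 26 { b'a' + v } else { b'2' + (v - 26) };
        }
    }
    out
}
```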