sixbit/lib.rs
// -*- mode: rust; bidi-display-reordering: nil -*-

//! # Sixbit - a crate for small packed strings.
//!
//! This crate provides densely-packed 8-bit, 16-bit, 32-bit, 64-bit, and
//! 128-bit "small strings", in a variety of custom script-specific
//! 6-bit-per-character encodings, as well as a special 15-bit-per-character
//! encoding for the set of Chinese characters in the "Unified Repertoire and
//! Ordering" (URO) block of Unicode.
//!
//! This sort of datatype is a low-level optimization to use in a system
//! processing a lot of small strings (e.g. single words): you can avoid
//! allocating them on the heap, and instead store them in string-structure
//! headers, NaN boxes, or other sorts of small-literal value types.
//!
//! Perhaps most enticingly: you can pack _more than one of them_ into a SIMD
//! vector, and operate on (e.g. search) multiple such strings at once, in
//! parallel, as a single register. Vector registers just keep growing these
//! days.
//!
//! Of course not every string is short enough to fit, and not every
//! short-enough string has codepoints that fall inside one of the "code pages"
//! that this crate provides. The encoding functions are therefore all partial.
//! But they should handle a large enough proportion of strings to make the
//! optimization worthwhile.
//!
//! ## Usage Summary
//!
//! Encoding is done via the `EncodeSixbit` trait attached to
//! `Iterator<Item = char>`, so you can just write
//! `let enc = "hello".chars().encode_sixbit::<u64>()`. If there is a failure
//! (say, the string spans pages or doesn't fit) you'll get back an
//! `Err(EncodeError)` with details; otherwise `Ok(n)` where `n` is a `u64`.
//!
//! Decoding is done via a `DecodeSixbitIter` iterator implementing
//! `Iterator<Item = char>`, attached to the various packed types, allowing you
//! to write `let s: String = someu64.decode_sixbit().collect()`, or any other
//! pattern that consumes an `Iterator<Item = char>`.
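//!
//! A round trip looks like this (an illustrative doc-test sketch, marked
//! `ignore`; it assumes the crate is in scope as `sixbit`):
//!
//! ```ignore
//! use sixbit::{EncodeSixbit, DecodeSixbit};
//!
//! // Pack a short ASCII identifier into a single u64...
//! let enc: u64 = "hello".chars().encode_sixbit().unwrap();
//! // ...and unpack it again by collecting the char iterator.
//! let dec: String = enc.decode_sixbit().collect();
//! assert_eq!(dec, "hello");
//! ```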
//!
//! In several cases you will need to normalize or decompose "standard" unicode
//! text before pushing it through these interfaces. For example, the Hangul
//! page only has compatibility jamo, so you have to decompose standard Korean
//! text to that form before encoding. Similarly the Halfwidth Kana are unlikely
//! to be the characters standard Japanese text arrives in, and Devanagari
//! strings with nuktas will need to be decomposed before mapping. This crate
//! does none of these tasks: it's a building block, not a complete solution.
//!
//! ## Code Pages
//!
//! Every packed string produced by this crate begins with a small tag
//! indicating the "code page" of the rest of the string. A code page here is a
//! set of 64 unicode character values that the 6-bit codes of the rest of the
//! string are interpreted as (or, as a special case, the URO block). Strings
//! that mix characters from multiple code pages are not supported. Again, think
//! "single words".
//!
//! I have chosen the contents of the code pages to the best of my abilities in
//! script knowledge, guesswork, consulting with friends and experts, scouring
//! wikipedia and so forth, subject to some constraints outlined below. I'm
//! happy to take PRs that improve their contents to better capture "many
//! words" in specific scripts; code page contents will therefore be slightly in
//! flux until we get to a 1.0 release, so if by some bizarre chance you are
//! reading this and choose to use the crate and are going to store values from
//! this crate in stable storage, you should lock your client to a specific
//! point-revision of the crate until 1.0.
//!
//! ### Constraints
//!
//! There is only enough room in the tag to reference a handful of code pages;
//! not every script will make it, but luckily only a few scripts account for
//! much of the text in the world.
//!
//! We want to avoid wasting bits, and the number of spare bits in a packed
//! value of N bits (N modulo 6) varies with its size: 2 spare bits for 8, 32
//! and 128-bit values; 4 spare bits for 16 and 64-bit values.
//!
//! The tag for Chinese is allocated in all cases, but there's only space for
//! a nonzero sequence of the 15-bit codes in 32, 64 and 128-bit values, so
//! only those widths use it; in the 8 and 16-bit cases the tag is just
//! reserved.
//!
//! We want to be able to sort these strings using machine-provided integer
//! comparison, and have that order correspond to unicode code-point
//! lexicographical string order (including "short strings sort before long").
//! This dictates a fair amount about the tag values, code repertoires, and
//! normal form of the encoded strings.
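//!
//! For example (an illustrative doc-test sketch, marked `ignore`; it assumes
//! the crate is in scope as `sixbit`):
//!
//! ```ignore
//! use sixbit::EncodeSixbit;
//! let ab: u64 = "ab".chars().encode_sixbit().unwrap();
//! let abc: u64 = "abc".chars().encode_sixbit().unwrap();
//! let abd: u64 = "abd".chars().encode_sixbit().unwrap();
//! // Plain integer comparison agrees with lexicographic string order.
//! assert!(ab < abc && abc < abd);
//! ```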
//!
//! ### Design
//!
//! Code pages are taken from (or in some cases, across) unicode blocks, and
//! tags are ordered by (initial) unicode block. Codes within each code page are
//! further ordered by unicode codepoint: each code page is essentially a
//! "likely-useful subsequence" of the contents of 1 or more corresponding
//! unicode block(s). This unfortunately means that digits are only available in
//! strings using the latin page (which also has underscore -- there was one
//! spare code, and underscore is common in many identifier repertoires). I've
//! tried to include some additional punctuation where it's available in blocks.
//! Since mixing pages is also not possible, "supplementary" blocks have been
//! mostly avoided.
//!
//! A script can only work if there's a "likely-useful subsequence" that fits
//! inside 63 code points (unless it's the URO block). The zero code in each
//! page is reserved as the string _terminator_ and padding value. Strings that
//! encode shorter than their packed value container are left-shifted to always
//! begin at the most-significant coding bit, and the trailing bits are all set
//! to zero (this does not mean you can encode nulls -- the zeroes here mean
//! "past end of string").
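//!
//! As a worked example of the layout: in the Latin page 'A' is code 11, so the
//! one-character string "A" packs into a `u8` as the 2-bit tag `0b00` followed
//! by the 6-bit code `0b001011`, giving `0b00_001011` = `0x0b` (with no
//! trailing bits left to pad at this width):
//!
//! ```
//! // Tag 0b00 (Latin) in the top 2 bits, code 11 ('A') in the low 6 bits.
//! assert_eq!((0b00u8 << 6) | 11, 0x0b);
//! ```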
//!
//! The high-order 2-bit tags select among 4 "primary" (most-likely) scripts
//! spread across the unicode range (in code block order). These are available
//! in all packed values:
//!
//! | tag | page contents                              |
//! |-----|--------------------------------------------|
//! | 00  | Latin (with digits and underscore)         |
//! | 01  | Arabic                                     |
//! | 10  | Devanagari                                 |
//! | 11  | Chinese                                    |
//!
//! When packing 64-bit and 16-bit values we get _4_ spare bits to use for a
//! tag, not just 2. In these cases we have 12 additional scripts available.
//! For simplicity's sake when casting up and down between value sizes, we keep
//! the high-order bits the same and add 2 bits below, picking the additional
//! scripts _from the block ranges between_ those of the primary scripts
//! (again, in unicode block order):
//!
//! | tag   | page contents                                 |
//! |-------|-----------------------------------------------|
//! | 00 00 | Latin (with digits and underscore)            |
//! | 00 01 | Greek                                         |
//! | 00 10 | Cyrillic                                      |
//! | 00 11 | Hebrew                                        |
//! |       |                                               |
//! | 01 00 | Arabic                                        |
//! | 01 01 | *reserved*                                    |
//! | 01 10 | *reserved*                                    |
//! | 01 11 | *reserved*                                    |
//! |       |                                               |
//! | 10 00 | Devanagari                                    |
//! | 10 01 | *reserved*                                    |
//! | 10 10 | *reserved*                                    |
//! | 10 11 | Hangul Compatibility Jamo                     |
//! |       |                                               |
//! | 11 00 | Chinese                                       |
//! | 11 01 | *reserved*                                    |
//! | 11 10 | *reserved*                                    |
//! | 11 11 | Halfwidth Kana                                |
//!
//! The *reserved* cases are where I either didn't know enough about the scripts
//! available in that range of unicode, or ran out of good candidates, or both.
//! I might assign them to something in the future, or "compact out" the gaps /
//! reassign the 4-bit codes so their high bits are _not_ the same as the 2-bit
//! cases, but I've already exceeded my realistic level of armchair-linguist
//! knowledge and I figured simplifying design choices would be better than
//! pretending I could do any better. Patches welcome!
//!
//! The overall assignment of bits is summarized as follows:
//!
//! | packed type | tag bits | coding bits | max 6-bit chars | max 15-bit chars |
//! |-------------|----------|-------------|-----------------|------------------|
//! | u128        | 2        | 126         | 21              | 8                |
//! | u64         | 4        | 60          | 10              | 4                |
//! | u32         | 2        | 30          | 5               | 2                |
//! | u16         | 4        | 12          | 2               | 0                |
//! | u8          | 2        | 6           | 1               | 0                |
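//!
//! These figures all follow from the same arithmetic, mirrored by the
//! `NBITS`/`NCHARS`/`NTAGBITS`/`NWIDECHARS` constants below (shown here for
//! the `u64` row):
//!
//! ```
//! let nbits = 64;                           // total bits in the packed type
//! let nchars = nbits / 6;                   // whole 6-bit codes that fit
//! let ntagbits = nbits - nchars * 6;        // leftover bits form the tag
//! let nwidechars = (nbits - ntagbits) / 15; // whole 15-bit Chinese codes
//! assert_eq!((nchars, ntagbits, nwidechars), (10, 4, 4));
//! ```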

use std::mem::size_of;
use std::ops::{BitOrAssign, ShlAssign};

use arbitrary::Unstructured;

#[rustfmt::skip]
mod consts {

// Page 00 00: U+0000, then U+0030-U+0039, U+0041-U+005A, U+005F, and U+0061-U+007A.
// Enough to encode the common [a-zA-Z0-9_] character class used in many programming
// language and data format identifier repertoires.
pub(crate) const LATIN : [char; 64] = [
    '\0', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E',
    'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U',
    'V', 'W', 'X', 'Y', 'Z', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
    'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'
];

// Page 00 01: U+0000, then upper and lowercase characters in order from
// U+0386-U+03CE, including stressed forms but omitting diaeresis forms.
pub(crate) const GREEK : [char; 64] = [
    '\0',
    // 7 stressed uppercase characters
    'Ά', 'Έ', 'Ή', 'Ί', 'Ό', 'Ύ', 'Ώ',
    // 24 uppercase characters
    'Α', 'Β', 'Γ', 'Δ', 'Ε', 'Ζ', 'Η', 'Θ', 'Ι', 'Κ', 'Λ', 'Μ', 'Ν', 'Ξ', 'Ο', 'Π',
    'Ρ', 'Σ', 'Τ', 'Υ', 'Φ', 'Χ', 'Ψ', 'Ω',
    // 4 stressed lowercase characters
    'ά', 'έ', 'ή', 'ί',
    // 25 lowercase characters (two variants of sigma)
    'α', 'β', 'γ', 'δ', 'ε', 'ζ', 'η', 'θ', 'ι', 'κ', 'λ', 'μ', 'ν', 'ξ', 'ο', 'π',
    'ρ', 'ς', 'σ', 'τ', 'υ', 'φ', 'χ', 'ψ', 'ω',
    // 3 stressed lowercase characters
    'ό', 'ύ', 'ώ'
];

// Page 00 10: U+0000, then U+0410-U+044F excepting the lowercase hard sign U+044A
pub(crate) const CYRILLIC : [char; 64] = [
    '\0', 'А', 'Б', 'В', 'Г', 'Д', 'Е', 'Ж', 'З', 'И', 'Й', 'К', 'Л', 'М', 'Н', 'О',
    'П', 'Р', 'С', 'Т', 'У', 'Ф', 'Х', 'Ц', 'Ч', 'Ш', 'Щ', 'Ъ', 'Ы', 'Ь', 'Э', 'Ю',
    'Я', 'а', 'б', 'в', 'г', 'д', 'е', 'ж', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о',
    'п', 'р', 'с', 'т', 'у', 'ф', 'х', 'ц', 'ч', 'ш', 'щ', 'ы', 'ь', 'э', 'ю', 'я'
];

// Page 00 11: U+0000, then a selection from U+05B0-U+05F4
pub(crate) const HEBREW : [char; 64] = [
    '\0', 'ְ', 'ֱ', 'ֲ', 'ֳ', 'ִ', 'ֵ', 'ֶ', 'ַ', 'ָ', 'ֹ', 'ֺ', 'ֻ', 'ּ', 'ֽ', '־',
    'ֿ', '׀', 'ׁ', 'ׂ', '׃', 'ׄ', 'ׅ', '׆', 'ׇ', 'א', 'ב', 'ג', 'ד', 'ה', 'ו', 'ז',
    'ח', 'ט', 'י', 'ך', 'כ', 'ל', 'ם', 'מ', 'ן', 'נ', 'ס', 'ע', 'ף', 'פ', 'ץ', 'צ',
    'ק', 'ר', 'ש', 'ת', 'װ', 'ױ', 'ײ', '׳', '״',
    // Space for 7 more, not sure which to include: expert help wanted!
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}',
];

// Page 01 00: U+0000, then a selection (leaning Perso-Arabic) from the Arabic
// block U+0600-U+06FF, detailed below. Characters selected on the advice of
// @Manishearth who, unlike me, knows something about Arabic script.
pub(crate) const ARABIC : [char; 64] = [
    '\0',

    // 3 punctuators
    // U+060C comma
    '،',
    // U+061B semicolon
    '؛',
    // U+061F question mark
    '؟',

    // 1 hamza
    'ء',

    // 29 main characters in range U+0627-U+0649
    'ا', 'ب', 'ة', 'ت', 'ث', 'ج', 'ح', 'خ',
    'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض',
    'ط', 'ظ', 'ع', 'غ',
    // omit: U+063B-U+063F "early Persian and Azerbaijani"
    // omit: U+0640 kashida
    'ف', 'ق',
    // omit: U+0643 isolated kaf
    'ل', 'م', 'ن', 'ه', 'و', 'ى', 'ي',

    // 3 short vowels and 1 shadda
    // (sorry my editor balked at displaying some literals here)
    // fatha damma kasra shadda
    '\u{064e}', '\u{064f}', '\u{0650}', '\u{0651}',
    // 2 combining forms of maddah and hamza
    // maddah hamza
    '\u{0653}', '\u{0654}',
    // 2 vowels used only in Urdu
    // subscript alef
    '\u{0656}',
    // inverted damma / ulta pesh
    '\u{0657}',

    // 1 superscript alef
    '\u{0670}',

    // 11 extended characters for Persian or Urdu
    // U+0679 tteh (Urdu)
    'ٹ',
    // U+067E peh (Persian, Urdu)
    'پ',
    // U+0686 tcheh (Persian, Urdu)
    'چ',
    // U+0688 ddal (Urdu)
    'ڈ',
    // U+0691 rreh (Urdu)
    'ڑ',
    // U+0698 jeh (Persian, Urdu)
    'ژ',
    // U+06A9 keheh / kaf (Persian, Urdu)
    'ک',
    // U+06AF gaf (Persian, Urdu)
    'گ',
    // U+06BA noon ghunna (Urdu)
    'ں',
    // U+06BE heh doachashmee (Urdu)
    'ھ',
    // U+06D2 yeh barree (Urdu)
    'ے',

    // U+06D4 full stop
    '۔',

    // Space for 9 more, not sure which to include; expert help wanted!
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}',
];

// Page 10 00: U+0000, then a selection detailed below from U+0901-U+0965;
// characters selected on the advice of @Manishearth who, unlike me, knows
// something about Devanagari script.
pub(crate) const DEVANAGARI : [char; 64] = [
    '\0',
    // 3 diacritics: U+0901 candrabindu, U+0902 anusvara, U+0903 visarga
    'ँ', 'ं', 'ः',
    // omit: U+0904 short A (Awadhi)
    // 11 standalone vowels (U+0905-U+0914)
    'अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ऋ',
    // omit: U+090C vocalic L (Sanskrit)
    // omit: U+090D candra E (transcription)
    // omit: U+090E short E (Kashmiri, Bihari, transcription)
    'ए', 'ऐ',
    // omit: U+0911 candra O (transcription)
    // omit: U+0912 short O (Kashmiri, Bihari)
    'ओ', 'औ',
    // 34 consonants (U+0915-U+0939)
    'क', 'ख', 'ग', 'घ', 'ङ', 'च', 'छ', 'ज', 'झ', 'ञ', 'ट', 'ठ', 'ड', 'ढ', 'ण', 'त',
    'थ', 'द', 'ध', 'न', 'प', 'फ', 'ब', 'भ', 'म', 'य', 'र', 'ल', 'ळ', 'व', 'श', 'ष',
    'स', 'ह',
    // omit: U+0934 letter LLLA (transcription)
    // 1 diacritic nukta
    '़',
    // 10 combining vowels (U+093E-U+094C)
    'ा', 'ि', 'ी', 'ु', 'ू', 'ृ',
    // omit: U+0944 vocalic RR (Sanskrit)
    // omit: U+0945 candra E (transcription)
    // omit: U+0946 short E (Kashmiri, Bihari, transcription)
    'े', 'ै',
    // omit: U+0949 candra O (transcription)
    // omit: U+094A short O (Kashmiri, Bihari, transcription)
    'ो', 'ौ',
    // omit: multiple diacritics for Kashmiri, Sanskrit, obsolete orthographies,
    // as well as old vocalics and Hindi nukta consonants (users should decompose them)
    // 1 diacritic virama
    '्',
    // 2 punctuators: danda and double danda
    '।', '॥',
    // Space for 1 more, not sure which to include: expert help wanted!
    '\u{ffff}',
];

// Page 10 11: U+0000, then U+3131-U+3163 (the initial part of KS X 1001 row 0x24 / 0xA4)
pub(crate) const HANGUL_COMPATIBILITY_JAMO : [char; 64] = [
    '\0', 'ㄱ', 'ㄲ', 'ㄳ', 'ㄴ', 'ㄵ', 'ㄶ', 'ㄷ', 'ㄸ', 'ㄹ', 'ㄺ', 'ㄻ', 'ㄼ', 'ㄽ', 'ㄾ', 'ㄿ',
    'ㅀ', 'ㅁ', 'ㅂ', 'ㅃ', 'ㅄ', 'ㅅ', 'ㅆ', 'ㅇ', 'ㅈ', 'ㅉ', 'ㅊ', 'ㅋ', 'ㅌ', 'ㅍ', 'ㅎ', 'ㅏ',
    'ㅐ', 'ㅑ', 'ㅒ', 'ㅓ', 'ㅔ', 'ㅕ', 'ㅖ', 'ㅗ', 'ㅘ', 'ㅙ', 'ㅚ', 'ㅛ', 'ㅜ', 'ㅝ', 'ㅞ', 'ㅟ',
    'ㅠ', 'ㅡ', 'ㅢ', 'ㅣ',
    // Space for 12 more, not sure which to include: expert help wanted!
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
];

// Page 11 11: U+0000, then U+FF61-U+FF9F (the latter part of JIS X 0201)
pub(crate) const HALFWIDTH_KANA : [char; 64] = [
    '\0', '。', '「', '」', '、', '・', 'ヲ', 'ァ', 'ィ', 'ゥ', 'ェ', 'ォ', 'ャ', 'ュ', 'ョ', 'ッ',
    'ー', 'ア', 'イ', 'ウ', 'エ', 'オ', 'カ', 'キ', 'ク', 'ケ', 'コ', 'サ', 'シ', 'ス', 'セ', 'ソ',
    'タ', 'チ', 'ツ', 'テ', 'ト', 'ナ', 'ニ', 'ヌ', 'ネ', 'ノ', 'ハ', 'ヒ', 'フ', 'ヘ', 'ホ', 'マ',
    'ミ', 'ム', 'メ', 'モ', 'ヤ', 'ユ', 'ヨ', 'ラ', 'リ', 'ル', 'レ', 'ロ', 'ワ', 'ン', '゙', '゚'
];

pub(crate) const RESERVED : [char; 64] = [
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
];

pub(crate) const CHINESE: [char; 64] = RESERVED;
pub(crate) const CHINESE_LO: char = '\u{4e00}';
pub(crate) const CHINESE_HI: char = '\u{9fff}';
pub(crate) const CHINESE_2BIT_TAG: usize = 0b11;
pub(crate) const CHINESE_4BIT_TAG: usize = 0b1100;

pub(crate) const PAGES: [[char; 64]; 16] = [
    LATIN,
    GREEK,
    CYRILLIC,
    HEBREW,

    ARABIC,
    RESERVED,
    RESERVED,
    RESERVED,

    DEVANAGARI,
    RESERVED,
    RESERVED,
    HANGUL_COMPATIBILITY_JAMO,

    CHINESE,
    RESERVED,
    RESERVED,
    HALFWIDTH_KANA,
];
}

use consts::*;

pub trait PackedValue
where
    Self: Copy,
    Self: ShlAssign<usize>,
    Self: BitOrAssign<Self>,
    Self: ::std::cmp::PartialOrd,
    Self: ::std::fmt::Debug,
    Self: ::std::fmt::LowerHex,
{
    const NBITS: usize = size_of::<Self>() * 8;
    const NCHARS: usize = Self::NBITS / 6;
    const NTAGBITS: usize = Self::NBITS - (Self::NCHARS * 6);
    const NCHARBITS: usize = Self::NBITS - Self::NTAGBITS;
    const NWIDECHARS: usize = (Self::NBITS - Self::NTAGBITS) / 15;
    // This is a bit ridiculous; I literally tried 4 different crates and every
    // trait I could find in the stdlib and it seems like there is some sort of
    // community-wide conspiracy to ensure the absence of generic truncating
    // casts.
    fn truncating_cast_from(i: usize) -> Self;

    // This also seems somewhat contorted to express via existing traits.
    fn most_significant_byte(self) -> u8;

    // This is to help generate random data in tests or fuzzers.
    fn arbitrary<'a>(u: &mut Unstructured<'a>) -> arbitrary::Result<Self> {
        let page_num = if Self::NTAGBITS == 2 {
            u.int_in_range::<usize>(0..=3)? * 4
        } else {
            *u.choose::<usize>(&[0, 1, 2, 3, 4, 8, 11, 12, 15])?
        };
        let mut chars: [char; 21] = ['\0'; 21];
        if page_num == CHINESE_4BIT_TAG {
            let len = u.int_in_range(0..=Self::NWIDECHARS)?;
            for i in 0..len {
                chars[i] = unsafe {
                    char::from_u32_unchecked(
                        u.int_in_range((CHINESE_LO as u32)..=(CHINESE_HI as u32))?,
                    )
                };
            }
        } else {
            let len = u.int_in_range(0..=Self::NCHARS)?;
            for i in 0..len {
                let ch = PAGES[page_num][u.int_in_range(1..=63)?];
                if ch == '\u{ffff}' {
                    break;
                }
                chars[i] = ch;
            }
        }

        // This should always succeed. There's a bug if not.
        Ok(
            encode::<Self, _>(chars.iter().cloned().take_while(|x| *x != '\0'))
                .expect("sixbit::PackedValue::arbitrary"),
        )
    }
}

impl PackedValue for u8 {
    fn truncating_cast_from(i: usize) -> u8 {
        i as u8
    }
    fn most_significant_byte(self) -> u8 {
        self
    }
}

impl PackedValue for u16 {
    fn truncating_cast_from(i: usize) -> u16 {
        i as u16
    }
    fn most_significant_byte(self) -> u8 {
        (self >> 8) as u8
    }
}

impl PackedValue for u32 {
    fn truncating_cast_from(i: usize) -> u32 {
        i as u32
    }
    fn most_significant_byte(self) -> u8 {
        (self >> 24) as u8
    }
}

impl PackedValue for u64 {
    fn truncating_cast_from(i: usize) -> u64 {
        i as u64
    }
    fn most_significant_byte(self) -> u8 {
        (self >> 56) as u8
    }
}

impl PackedValue for u128 {
    fn truncating_cast_from(i: usize) -> u128 {
        i as u128
    }
    fn most_significant_byte(self) -> u8 {
        (self >> 120) as u8
    }
}

#[derive(PartialEq, Debug)]
pub enum EncodeError {
    TooLong,
    NoCodePageFor(char),
    PageUnavailable(usize),
    MissingFromPage(char),
}

fn chinese_15bit_delta(c: char) -> Option<usize> {
    if CHINESE_LO <= c && c <= CHINESE_HI {
        Some((c as usize) - (CHINESE_LO as usize))
    } else {
        None
    }
}

pub fn encode<N, IT>(i: IT) -> Result<N, EncodeError>
where
    N: PackedValue,
    IT: Iterator<Item = char>,
{
    let mut pi = i.peekable();
    let mut out: N = N::truncating_cast_from(0);
    match pi.peek() {
        // Zero-length strings map to page 0, code 0.
        None => Ok(out),
        Some(&init) => {
            // First handle the special case of Chinese characters, which are
            // encoded as 15-bit deltas from the base of the URO block.
            if N::NCHARBITS > 0 && chinese_15bit_delta(init).is_some() {
                let tag = if N::NTAGBITS == 2 {
                    CHINESE_2BIT_TAG
                } else {
                    CHINESE_4BIT_TAG
                };
                out |= N::truncating_cast_from(tag);
                let mut rembits: usize = N::NCHARBITS;
                for c in pi {
                    if rembits < 15 {
                        return Err(EncodeError::TooLong);
                    }
                    match chinese_15bit_delta(c) {
                        None => {
                            return Err(EncodeError::MissingFromPage(c));
                        }
                        Some(delta) => {
                            out <<= 15;
                            // We encode delta+1 so that a delta of 0 is encoded as 1
                            // and we can still use 0-bytes to delimit the string.
                            out |= N::truncating_cast_from(delta + 1);
                            rembits -= 15;
                        }
                    }
                }
                // Pad remainder.
                out <<= rembits;
                return Ok(out);
            }

            // Pick page: just try each one, there are only 16.
            match PAGES.iter().position(|&p| p.binary_search(&init).is_ok()) {
                // No page means this string won't work.
                None => Err(EncodeError::NoCodePageFor(init)),
                Some(p) => {
                    let mut tag = p;
                    let mut rem: usize = N::NCHARS;
                    // Check and adjust tag by size.
                    if N::NTAGBITS == 2 {
                        // Tried a "secondary tag" when only
                        // using 2 tag bits, sorry!
                        if tag & 0b11 != 0 {
                            return Err(EncodeError::PageUnavailable(tag));
                        }
                        tag >>= 2;
                    }
                    // Set tag.
                    out |= N::truncating_cast_from(tag);
                    // Encode chars.
                    for c in pi {
                        if rem == 0 {
                            // String is too long.
                            return Err(EncodeError::TooLong);
                        }
                        match PAGES[p].binary_search(&c) {
                            // No code for c in page.
                            Err(_) => return Err(EncodeError::MissingFromPage(c)),
                            // Got a code, use it!
                            Ok(i) => {
                                out <<= 6;
                                out |= N::truncating_cast_from(i);
                                rem -= 1;
                            }
                        }
                    }
                    // Pad remainder.
                    out <<= 6 * rem;
                    Ok(out)
                }
            }
        }
    }
}

pub trait EncodeSixbit: Sized + Iterator<Item = char> {
    fn encode_sixbit<N: PackedValue>(self) -> Result<N, EncodeError>;
}

impl<T> EncodeSixbit for T
where
    T: Sized,
    T: Iterator<Item = char>,
{
    fn encode_sixbit<N: PackedValue>(self) -> Result<N, EncodeError> {
        encode::<N, Self>(self)
    }
}

pub struct DecodeSixbitIter<N: PackedValue> {
    tag: usize,
    tmp: N,
}

impl<N> Iterator for DecodeSixbitIter<N>
where
    N: PackedValue,
{
    type Item = char;
    fn next(self: &mut Self) -> Option<char> {
        if self.tag == CHINESE_4BIT_TAG {
            // Read the full 15-bit code before testing for the terminator:
            // small deltas (e.g. U+4E00, encoded as 1) have a zero high byte,
            // so testing only the top 8 bits would truncate the string early.
            let hi = self.tmp.most_significant_byte();
            let mut rest = self.tmp;
            rest <<= 8;
            let lo = rest.most_significant_byte() >> 1;
            let code = ((hi as u32) << 7) | (lo as u32);
            if code == 0 {
                return None;
            }
            self.tmp <<= 15;
            // We decode delta+1, mirroring the +1 applied during encoding.
            char::from_u32((CHINESE_LO as u32) + code - 1)
        } else {
            let ch = self.tmp.most_significant_byte() >> 2;
            match ch {
                0 => None,
                i => {
                    self.tmp <<= 6;
                    Some(PAGES[self.tag][i as usize])
                }
            }
        }
    }
}

pub trait DecodeSixbit
where
    Self: PackedValue,
{
    fn decode_sixbit(self) -> DecodeSixbitIter<Self>;
}

impl<N> DecodeSixbit for N
where
    N: PackedValue,
{
    fn decode_sixbit(self) -> DecodeSixbitIter<Self> {
        let mut tmp = self;
        let mut tag = self.most_significant_byte();
        tag >>= 8 - N::NTAGBITS;
        if N::NTAGBITS == 2 {
            // Widen a 2-bit primary tag into its 4-bit page index.
            tag <<= 2;
        }
        tmp <<= N::NTAGBITS;
        DecodeSixbitIter {
            tag: tag as usize,
            tmp,
        }
    }
}

#[cfg(test)]
mod tests {

    use rand::RngCore;

    use super::*;
    #[test]
    fn misc_invariants() {
        // Check that pages are ordered by unicode ranges.
        for pair in PAGES.windows(2) {
            if pair[0][1] != '\u{ffff}' && pair[1][1] != '\u{ffff}' {
                if pair[0][1] >= pair[1][1] {
                    println!(
                        "mis-ordered page pair: {:?} >= {:?}",
                        pair[0][1], pair[1][1]
                    );
                }
                assert!(pair[0][1] < pair[1][1]);
            }
        }
        for p in PAGES.iter() {
            // Check that every page has a zero code, or is invalid.
            assert!(p[0] == '\0' || p[0] == '\u{ffff}');
            // Check that every page is sorted.
            for pair in p.windows(2) {
                if pair[0] != '\0'
                    && pair[1] != '\0'
                    && pair[0] != '\u{ffff}'
                    && pair[1] != '\u{ffff}'
                {
                    if pair[0] >= pair[1] {
                        println!("mis-ordered char pair: {:?} >= {:?}", pair[0], pair[1]);
                    }
                    assert!(pair[0] < pair[1]);
                }
            }
        }
    }

    fn round_trip<N: PackedValue>(s: &str) -> Result<N, EncodeError> {
        match s.chars().encode_sixbit::<N>() {
            Ok(enc) => {
                let dec: String = enc.decode_sixbit().collect();
                println!("roundtrip: {:?} => {:x} => {:?}", s, enc, dec);
                assert!(dec == s);
                Ok(enc)
            }
            err => err,
        }
    }

    // For Latin we try a full-width string, a not-full-width string, and each
    // of the error conditions.
    #[test]
    fn test_latin() {
        // Full width.
        assert!(round_trip::<u128>("PRINTER_is_on_FIRE").is_ok());
        assert!(round_trip::<u64>("NO_CARRIER").is_ok());
        assert!(round_trip::<u32>("_CAT_").is_ok());
        assert!(round_trip::<u16>("OK").is_ok());
        assert!(round_trip::<u8>("1").is_ok());

        // Non-full-width.
        assert!(round_trip::<u128>("Printer_Working").is_ok());
        assert!(round_trip::<u64>("ATDT_123").is_ok());
        assert!(round_trip::<u32>("Uwu").is_ok());
        assert!(round_trip::<u16>("A").is_ok());
        assert!(round_trip::<u8>("").is_ok());

        // Error conditions: TooLong.
        assert!(round_trip::<u128>("PRINTER_FULLY_OPERATIONAL") == Err(EncodeError::TooLong));
        assert!(round_trip::<u64>("ATDT_123_4567") == Err(EncodeError::TooLong));
        assert!(round_trip::<u32>("aaaaaaa") == Err(EncodeError::TooLong));
        assert!(round_trip::<u16>("aba") == Err(EncodeError::TooLong));
        assert!(round_trip::<u8>("OOH") == Err(EncodeError::TooLong));

        // Error conditions: NoCodePageFor.
        assert!(round_trip::<u128>("©2018") == Err(EncodeError::NoCodePageFor('©')));

        // Error conditions: PageUnavailable.
        assert!(round_trip::<u128>("ΨΩ") == Err(EncodeError::PageUnavailable(1)));

        // Error conditions: MissingFromPage.
        assert!(round_trip::<u64>("sh@rk") == Err(EncodeError::MissingFromPage('@')));
    }

    fn check_order<N: PackedValue>(a: &str, b: &str) {
        assert!(a < b);
        assert!(a.chars().encode_sixbit::<N>().unwrap() < b.chars().encode_sixbit::<N>().unwrap());
    }

    #[test]
    fn sorting() {
        // Check encoding order preservation within pages.
        check_order::<u32>("", "AB");
        check_order::<u64>("abcd", "abcde");
        check_order::<u64>("abcde", "abcdf");
        check_order::<u64>("α", "αβγ");
        check_order::<u64>("αβ", "αβγ");
        check_order::<u64>("αβγ", "αβδ");
        check_order::<u64>("怎么", "怎么样");
        check_order::<u64>("以前", "以后");
        // Check encoding order preservation across pages.
        check_order::<u64>("abc", "αβγ");
        check_order::<u64>("αβγ", "абв");
        check_order::<u64>("абв", "אבג");
        check_order::<u64>("אבג", "ابة");
        check_order::<u64>("ابة", "कखग");
        check_order::<u64>("कखग", "ㄱㄲㄳ");
        check_order::<u64>("ㄱㄲㄳ", "合伙人");
        check_order::<u64>("合伙人", "ヲァィ");
    }

    // For non-Latin scripts we just check a word at each width
    // to make sure they work.

    #[test]
    fn test_greek() {
        // Non-primary tag: only available in u64 and u16 forms.
        assert!(round_trip::<u64>("αλήθεια").is_ok());
        assert!(round_trip::<u16>("γη").is_ok());
    }

    #[test]
    fn test_cyrillic() {
        // Non-primary tag: only available in u64 and u16 forms.
        assert!(round_trip::<u64>("содержать").is_ok());
        assert!(round_trip::<u16>("же").is_ok());
    }

    #[test]
    fn test_hebrew() {
        // Non-primary tag: only available in u64 and u16 forms.
        assert!(round_trip::<u64>("לעשות").is_ok());
        assert!(round_trip::<u16>("כל").is_ok());
    }

    #[test]
    fn test_arabic() {
        // Primary tag: available in all forms.
        assert!(round_trip::<u128>("محافظت").is_ok());
        assert!(round_trip::<u64>("العاصمة").is_ok());
        assert!(round_trip::<u32>("البعض").is_ok());
        assert!(round_trip::<u16>("از").is_ok());
        assert!(round_trip::<u8>("و").is_ok());
    }

    #[test]
    fn test_devanagari() {
        // Primary tag: available in all forms.
        assert!(round_trip::<u128>("किंकर्तव्यविमूढ़").is_ok());
        assert!(round_trip::<u64>("आवश्यकता").is_ok());
        assert!(round_trip::<u32>("सपना").is_ok());
        assert!(round_trip::<u16>("पल").is_ok());
        assert!(round_trip::<u8>("आ").is_ok());
    }

    #[test]
    fn test_chinese() {
        // Special-case 15-bit primary tag: only in forms of 32 bits or more.
        assert!(round_trip::<u128>("高速火车站").is_ok());
        assert!(round_trip::<u64>("合伙人").is_ok());
        assert!(round_trip::<u32>("同事").is_ok());
    }

    #[test]
    fn test_compatibility_hangul_jamo() {
        // Non-primary tag: only available in u64 and u16 forms.
        assert!(round_trip::<u64>("ㅇㅜㅁㅈㅣㄱㅇㅣㅁ").is_ok());
        assert!(round_trip::<u16>("ㅅㅜ").is_ok());
    }

    #[test]
    fn test_halfwidth_kana() {
        // Non-primary tag: only available in u64 and u16 forms.
        assert!(round_trip::<u64>("イクツカノ").is_ok());
        assert!(round_trip::<u16>("ヤル").is_ok());
    }

    #[test]
    fn test_arbitrary() {
        for _ in 0..64 {
            let mut bytes = [0u8; 1024];
            rand::thread_rng().fill_bytes(&mut bytes);
            let mut u = arbitrary::Unstructured::new(&bytes);
            let packed = <u128 as PackedValue>::arbitrary(&mut u).expect("arbitrary");
            dbg!(packed);
            let decoded: String = packed.decode_sixbit().collect();
            dbg!(decoded);
        }
    }
}
884}