sixbit/lib.rs
// -*- mode: rust; bidi-display-reordering: nil -*-

//! # Sixbit - a crate for small packed strings.
//!
//! This crate provides densely-packed 8-bit, 16-bit, 32-bit, 64-bit, and
//! 128-bit "small strings", in a variety of custom script-specific
//! 6-bit-per-character encodings, as well as a special 15-bit-per-character
//! encoding for the set of Chinese characters in the "Unified Repertoire and
//! Ordering" (URO) block of Unicode.
//!
//! This sort of datatype is a low-level optimization to use in a system
//! processing a lot of small strings (e.g. single words): you can avoid
//! allocating them on the heap, and instead store them in string-structure
//! headers, NaN boxes, or other sorts of small-literal value types.
//!
//! Perhaps most enticingly: you can pack _more than one of them_ into a SIMD
//! vector, and operate on (e.g. search) multiple such strings at once, in
//! parallel, as a single register. Vector registers just keep growing these
//! days.
//!
//! Of course not every string is short enough to fit, and not every
//! short-enough string has codepoints that fall inside one of the "code pages"
//! that this crate provides. The encoding functions are therefore all partial.
//! But they should handle a large enough proportion of strings to make the
//! optimization worthwhile.
//!
//! ## Usage Summary
//!
//! Encoding is done via the `EncodeSixbit` trait attached to
//! `Iterator<Item = char>`, so you can just write
//! `let enc = "hello".chars().encode_sixbit::<u64>()`. If there is a failure
//! (say, the string spans pages or doesn't fit) you'll get back an
//! `Err(EncodeError)` with details; otherwise `Ok(n)` where `n` is a `u64`.
//!
//! Decoding is done via a `DecodeSixbitIter` iterator implementing
//! `Iterator<Item = char>`, attached to the various packed types, allowing you
//! to write `let s: String = someu64.decode_sixbit().collect()`, or any other
//! pattern that consumes an `Iterator<Item = char>`.
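//!
//! A round trip looks like this (an illustrative doc-test sketch, marked
//! `ignore`; it assumes the crate is in scope as `sixbit`):
//!
//! ```ignore
//! use sixbit::{EncodeSixbit, DecodeSixbit};
//!
//! // Pack a short ASCII identifier into a single u64...
//! let enc: u64 = "hello".chars().encode_sixbit().unwrap();
//! // ...and unpack it again by collecting the char iterator.
//! let dec: String = enc.decode_sixbit().collect();
//! assert_eq!(dec, "hello");
//! ```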
//!
//! In several cases you will need to normalize or decompose "standard" unicode
//! text before pushing it through these interfaces. For example, the Hangul
//! page only has compatibility jamo, so you have to decompose standard Korean
//! text to that form before encoding. Similarly the Halfwidth Kana are unlikely
//! to be the characters standard Japanese text arrives in, and Devanagari
//! strings with nuktas will need to be decomposed before mapping. This crate
//! does none of these tasks: it's a building block, not a complete solution.
//!
//! ## Code Pages
//!
//! Every packed string produced by this crate begins with a small tag
//! indicating the "code page" of the rest of the string. A code page here is a
//! set of 64 unicode character values that the 6-bit codes of the rest of the
//! string are interpreted as (or, as a special case, the URO block). Strings
//! that mix characters from multiple code pages are not supported. Again, think
//! "single words".
//!
//! I have chosen the contents of the code pages to the best of my abilities in
//! script knowledge, guesswork, consulting with friends and experts, scouring
//! wikipedia and so forth, subject to some constraints outlined below. I'm
//! happy to take PRs that improve their contents to better capture "many
//! words" in specific scripts; code page contents will therefore be slightly in
//! flux until we get to a 1.0 release, so if by some bizarre chance you are
//! reading this and choose to use the crate and are going to store values from
//! this crate in stable storage, you should lock your client to a specific
//! point-revision of the crate until 1.0.
//!
//! ### Constraints
//!
//! There is only enough room in the tag to reference a handful of code pages;
//! not every script will make it, but luckily only a few scripts account for
//! much of the text in the world.
//!
//! We want to avoid wasting bits, and the number of spare bits in a packed
//! value of N bits (N modulo 6) varies with its size: 2 spare bits for 8, 32
//! and 128-bit values; 4 spare bits for 16 and 64-bit values.
//!
//! The tag for Chinese is allocated in all cases, but there's only space for
//! a nonzero sequence of the 15-bit codes in 32, 64 and 128-bit values, so
//! only those widths use it; in the 8 and 16-bit cases the tag is just
//! reserved.
//!
//! We want to be able to sort these strings using machine-provided integer
//! comparison, and have that order correspond to unicode code-point
//! lexicographical string order (including "short strings sort before long").
//! This dictates a fair amount about the tag values, code repertoires, and
//! normal form of the encoded strings.
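//!
//! For example (an illustrative doc-test sketch, marked `ignore`; it assumes
//! the crate is in scope as `sixbit`):
//!
//! ```ignore
//! use sixbit::EncodeSixbit;
//! let ab: u64 = "ab".chars().encode_sixbit().unwrap();
//! let abc: u64 = "abc".chars().encode_sixbit().unwrap();
//! let abd: u64 = "abd".chars().encode_sixbit().unwrap();
//! // Plain integer comparison agrees with lexicographic string order.
//! assert!(ab < abc && abc < abd);
//! ```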
//!
//! ### Design
//!
//! Code pages are taken from (or in some cases, across) unicode blocks, and
//! tags are ordered by (initial) unicode block. Codes within each code page are
//! further ordered by unicode codepoint: each code page is essentially a
//! "likely-useful subsequence" of the contents of 1 or more corresponding
//! unicode block(s). This unfortunately means that digits are only available in
//! strings using the latin page (which also has underscore -- there was one
//! spare code, and underscore is common in many identifier repertoires). I've
//! tried to include some additional punctuation where it's available in blocks.
//! Since mixing pages is also not possible, "supplementary" blocks have been
//! mostly avoided.
//!
//! A script can only work if there's a "likely-useful subsequence" that fits
//! inside 63 code points (unless it's the URO block). The zero code in each
//! page is reserved as the string _terminator_ and padding value. Strings that
//! encode shorter than their packed value container are left-shifted to always
//! begin at the most-significant coding bit, and the trailing bits are all set
//! to zero (this does not mean you can encode nulls -- the zeroes here mean
//! "past end of string").
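//!
//! As a worked example of the layout: in the Latin page 'A' is code 11, so the
//! one-character string "A" packs into a `u8` as the 2-bit tag `0b00` followed
//! by the 6-bit code `0b001011`, giving `0b00_001011` = `0x0b` (with no
//! trailing bits left to pad at this width):
//!
//! ```
//! // Tag 0b00 (Latin) in the top 2 bits, code 11 ('A') in the low 6 bits.
//! assert_eq!((0b00u8 << 6) | 11, 0x0b);
//! ```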
//!
//! The high-order 2-bit tags select among 4 "primary" (most-likely) scripts
//! spread across the unicode range (in code block order). These are available
//! in all packed values:
//!
//! | tag | page contents                              |
//! |-----|--------------------------------------------|
//! | 00  | Latin (with digits and underscore)         |
//! | 01  | Arabic                                     |
//! | 10  | Devanagari                                 |
//! | 11  | Chinese                                    |
//!
//! When packing 64-bit and 16-bit values we get _4_ spare bits to use for a
//! tag, not just 2. In these cases we have 12 additional scripts available.
//! For simplicity's sake when casting up and down between value sizes, we keep
//! the high-order bits the same and add 2 bits below, picking the additional
//! scripts _from the block ranges between_ those of the primary scripts
//! (again, in unicode block order):
//!
//! | tag   | page contents                                 |
//! |-------|-----------------------------------------------|
//! | 00 00 | Latin (with digits and underscore)            |
//! | 00 01 | Greek                                         |
//! | 00 10 | Cyrillic                                      |
//! | 00 11 | Hebrew                                        |
//! |       |                                               |
//! | 01 00 | Arabic                                        |
//! | 01 01 | *reserved*                                    |
//! | 01 10 | *reserved*                                    |
//! | 01 11 | *reserved*                                    |
//! |       |                                               |
//! | 10 00 | Devanagari                                    |
//! | 10 01 | *reserved*                                    |
//! | 10 10 | *reserved*                                    |
//! | 10 11 | Hangul Compatibility Jamo                     |
//! |       |                                               |
//! | 11 00 | Chinese                                       |
//! | 11 01 | *reserved*                                    |
//! | 11 10 | *reserved*                                    |
//! | 11 11 | Halfwidth Kana                                |
//!
//! The *reserved* cases are where I either didn't know enough about the scripts
//! available in that range of unicode, or ran out of good candidates, or both.
//! I might assign them to something in the future, or "compact out" the gaps /
//! reassign the 4-bit codes so their high bits are _not_ the same as the 2-bit
//! cases, but I've already exceeded my realistic level of armchair-linguist
//! knowledge and I figured simplifying design choices would be better than
//! pretending I could do any better. Patches welcome!
//!
//! The overall assignment of bits is summarized as follows:
//!
//! | packed type | tag bits | coding bits | max 6-bit chars | max 15-bit chars |
//! |-------------|----------|-------------|-----------------|------------------|
//! | u128        | 2        | 126         | 21              | 8                |
//! | u64         | 4        | 60          | 10              | 4                |
//! | u32         | 2        | 30          | 5               | 2                |
//! | u16         | 4        | 12          | 2               | 0                |
//! | u8          | 2        | 6           | 1               | 0                |
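//!
//! These figures all follow from the same arithmetic, mirrored by the
//! `NBITS`/`NCHARS`/`NTAGBITS`/`NWIDECHARS` constants below (shown here for
//! the `u64` row):
//!
//! ```
//! let nbits = 64;                           // total bits in the packed type
//! let nchars = nbits / 6;                   // whole 6-bit codes that fit
//! let ntagbits = nbits - nchars * 6;        // leftover bits form the tag
//! let nwidechars = (nbits - ntagbits) / 15; // whole 15-bit Chinese codes
//! assert_eq!((nchars, ntagbits, nwidechars), (10, 4, 4));
//! ```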

use std::mem::size_of;
use std::ops::{BitOrAssign, ShlAssign};

use arbitrary::Unstructured;

#[rustfmt::skip]
mod consts {

// Page 00 00: U+0000, then U+0030-U+0039, U+0041-U+005A, U+005F, and U+0061-U+007A.
// Enough to encode the common [a-zA-Z0-9_] character class used in many programming
// language and data format identifier repertoires.
pub(crate) const LATIN : [char; 64] = [
    '\0', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E',
    'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U',
    'V', 'W', 'X', 'Y', 'Z', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
    'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'
];

// Page 00 01: U+0000, then upper and lowercase characters in order from
// U+0386-U+03CE, including stressed forms but omitting diaeresis forms.
pub(crate) const GREEK : [char; 64] = [
    '\0',
    // 7 stressed uppercase characters
    'Ά', 'Έ', 'Ή', 'Ί', 'Ό', 'Ύ', 'Ώ',
    // 24 uppercase characters
    'Α', 'Β', 'Γ', 'Δ', 'Ε', 'Ζ', 'Η', 'Θ', 'Ι', 'Κ', 'Λ', 'Μ', 'Ν', 'Ξ', 'Ο', 'Π',
    'Ρ', 'Σ', 'Τ', 'Υ', 'Φ', 'Χ', 'Ψ', 'Ω',
    // 4 stressed lowercase characters
    'ά', 'έ', 'ή', 'ί',
    // 25 lowercase characters (two variants of sigma)
    'α', 'β', 'γ', 'δ', 'ε', 'ζ', 'η', 'θ', 'ι', 'κ', 'λ', 'μ', 'ν', 'ξ', 'ο', 'π',
    'ρ', 'ς', 'σ', 'τ', 'υ', 'φ', 'χ', 'ψ', 'ω',
    // 3 stressed lowercase characters
    'ό', 'ύ', 'ώ'
];

// Page 00 10: U+0000, then U+0410-U+044F excepting the lowercase hard sign U+044A
pub(crate) const CYRILLIC : [char; 64] = [
    '\0', 'А', 'Б', 'В', 'Г', 'Д', 'Е', 'Ж', 'З', 'И', 'Й', 'К', 'Л', 'М', 'Н', 'О',
    'П', 'Р', 'С', 'Т', 'У', 'Ф', 'Х', 'Ц', 'Ч', 'Ш', 'Щ', 'Ъ', 'Ы', 'Ь', 'Э', 'Ю',
    'Я', 'а', 'б', 'в', 'г', 'д', 'е', 'ж', 'з', 'и', 'й', 'к', 'л', 'м', 'н', 'о',
    'п', 'р', 'с', 'т', 'у', 'ф', 'х', 'ц', 'ч', 'ш', 'щ', 'ы', 'ь', 'э', 'ю', 'я'
];

// Page 00 11: U+0000, then a selection from U+05B0-U+05F4
pub(crate) const HEBREW : [char; 64] = [
    '\0', 'ְ', 'ֱ', 'ֲ', 'ֳ', 'ִ', 'ֵ', 'ֶ', 'ַ', 'ָ', 'ֹ', 'ֺ', 'ֻ', 'ּ', 'ֽ', '־',
    'ֿ', '׀', 'ׁ', 'ׂ', '׃', 'ׄ', 'ׅ', '׆', 'ׇ', 'א', 'ב', 'ג', 'ד', 'ה', 'ו', 'ז',
    'ח', 'ט', 'י', 'ך', 'כ', 'ל', 'ם', 'מ', 'ן', 'נ', 'ס', 'ע', 'ף', 'פ', 'ץ', 'צ',
    'ק', 'ר', 'ש', 'ת', 'װ', 'ױ', 'ײ', '׳', '״',
    // Space for 7 more, not sure which to include: expert help wanted!
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}',
];

// Page 01 00: U+0000, then a selection (leaning Perso-Arabic) from the Arabic
// block U+0600-U+06FF, detailed below. Characters selected on the advice of
// @Manishearth who, unlike me, knows something about Arabic script.
pub(crate) const ARABIC : [char; 64] = [
    '\0',

    // 3 punctuators
    // U+060C comma
    '،',
    // U+061B semicolon
    '؛',
    // U+061F question mark
    '؟',

    // 1 hamza
    'ء',

    // 29 main characters in range U+0627-U+0649
    'ا', 'ب', 'ة', 'ت', 'ث', 'ج', 'ح', 'خ',
    'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض',
    'ط', 'ظ', 'ع', 'غ',
    // omit: U+063B-U+063F "early Persian and Azerbaijani"
    // omit: U+0640 kashida
    'ف', 'ق',
    // omit: U+0643 isolated kaf
    'ل', 'م', 'ن', 'ه', 'و', 'ى', 'ي',

    // 3 short vowels and 1 shadda
    // (sorry my editor balked at displaying some literals here)
    // fatha damma kasra shadda
    '\u{064e}', '\u{064f}', '\u{0650}', '\u{0651}',
    // 2 combining forms of maddah and hamza
    // maddah hamza
    '\u{0653}', '\u{0654}',
    // 2 vowels used only in Urdu
    // subscript alef
    '\u{0656}',
    // inverted damma / ulta pesh
    '\u{0657}',

    // 1 superscript alef
    '\u{0670}',

    // 11 extended characters for Persian or Urdu
    // U+0679 tteh (Urdu)
    'ٹ',
    // U+067E peh (Persian, Urdu)
    'پ',
    // U+0686 tcheh (Persian, Urdu)
    'چ',
    // U+0688 ddal (Urdu)
    'ڈ',
    // U+0691 rreh (Urdu)
    'ڑ',
    // U+0698 jeh (Persian, Urdu)
    'ژ',
    // U+06A9 keheh / kaf (Persian, Urdu)
    'ک',
    // U+06AF gaf (Persian, Urdu)
    'گ',
    // U+06BA noon ghunna (Urdu)
    'ں',
    // U+06BE heh doachashmee (Urdu)
    'ھ',
    // U+06D2 yeh barree (Urdu)
    'ے',

    // U+06D4 full stop
    '۔',

    // Space for 9 more, not sure which to include; expert help wanted!
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}',
];

// Page 10 00: U+0000, then a selection detailed below from U+0901-U+0965;
// characters selected on the advice of @Manishearth who, unlike me, knows
// something about Devanagari script.
pub(crate) const DEVANAGARI : [char; 64] = [
    '\0',
    // 3 diacritics: U+0901 candrabindu, U+0902 anusvara, U+0903 visarga
    'ँ', 'ं', 'ः',
    // omit: U+0904 short A (Awadhi)
    // 11 standalone vowels (U+0905-U+0914)
    'अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ऋ',
    // omit: U+090C vocalic L (Sanskrit)
    // omit: U+090D candra E (transcription)
    // omit: U+090E short E (Kashmiri, Bihari, transcription)
    'ए', 'ऐ',
    // omit: U+0911 candra O (transcription)
    // omit: U+0912 short O (Kashmiri, Bihari)
    'ओ', 'औ',
    // 34 consonants (U+0915-U+0939)
    'क', 'ख', 'ग', 'घ', 'ङ', 'च', 'छ', 'ज', 'झ', 'ञ', 'ट', 'ठ', 'ड', 'ढ', 'ण', 'त',
    'थ', 'द', 'ध', 'न', 'प', 'फ', 'ब', 'भ', 'म', 'य', 'र', 'ल', 'ळ', 'व', 'श', 'ष',
    'स', 'ह',
    // omit: U+0934 letter LLLA (transcription)
    // 1 diacritic nukta
    '़',
    // 10 combining vowels (U+093E-U+094C)
    'ा', 'ि', 'ी', 'ु', 'ू', 'ृ',
    // omit: U+0944 vocalic RR (Sanskrit)
    // omit: U+0945 candra E (transcription)
    // omit: U+0946 short E (Kashmiri, Bihari, transcription)
    'े', 'ै',
    // omit: U+0949 candra O (transcription)
    // omit: U+094A short O (Kashmiri, Bihari, transcription)
    'ो', 'ौ',
    // omit: multiple diacritics for Kashmiri, Sanskrit, obsolete orthographies,
    // as well as old vocalics and Hindi nukta consonants (users should decompose them)
    // 1 diacritic virama
    '्',
    // 2 punctuators: danda and double danda
    '।', '॥',
    // Space for 1 more, not sure which to include: expert help wanted!
    '\u{ffff}',
];

// Page 10 11: U+0000, then U+3131-U+3163 (the initial part of KS X 1001 row 0x24 / 0xA4)
pub(crate) const HANGUL_COMPATIBILITY_JAMO : [char; 64] = [
    '\0', 'ㄱ', 'ㄲ', 'ㄳ', 'ㄴ', 'ㄵ', 'ㄶ', 'ㄷ', 'ㄸ', 'ㄹ', 'ㄺ', 'ㄻ', 'ㄼ', 'ㄽ', 'ㄾ', 'ㄿ',
    'ㅀ', 'ㅁ', 'ㅂ', 'ㅃ', 'ㅄ', 'ㅅ', 'ㅆ', 'ㅇ', 'ㅈ', 'ㅉ', 'ㅊ', 'ㅋ', 'ㅌ', 'ㅍ', 'ㅎ', 'ㅏ',
    'ㅐ', 'ㅑ', 'ㅒ', 'ㅓ', 'ㅔ', 'ㅕ', 'ㅖ', 'ㅗ', 'ㅘ', 'ㅙ', 'ㅚ', 'ㅛ', 'ㅜ', 'ㅝ', 'ㅞ', 'ㅟ',
    'ㅠ', 'ㅡ', 'ㅢ', 'ㅣ',
    // Space for 12 more, not sure which to include: expert help wanted!
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
];

// Page 11 11: U+0000, then U+FF61-U+FF9F (the latter part of JIS X 0201)
pub(crate) const HALFWIDTH_KANA : [char; 64] = [
    '\0', '。', '「', '」', '、', '・', 'ヲ', 'ァ', 'ィ', 'ゥ', 'ェ', 'ォ', 'ャ', 'ュ', 'ョ', 'ッ',
    'ー', 'ア', 'イ', 'ウ', 'エ', 'オ', 'カ', 'キ', 'ク', 'ケ', 'コ', 'サ', 'シ', 'ス', 'セ', 'ソ',
    'タ', 'チ', 'ツ', 'テ', 'ト', 'ナ', 'ニ', 'ヌ', 'ネ', 'ノ', 'ハ', 'ヒ', 'フ', 'ヘ', 'ホ', 'マ',
    'ミ', 'ム', 'メ', 'モ', 'ヤ', 'ユ', 'ヨ', 'ラ', 'リ', 'ル', 'レ', 'ロ', 'ワ', 'ン', '゙', '゚'
];

pub(crate) const RESERVED : [char; 64] = [
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
    '\u{ffff}', '\u{ffff}', '\u{ffff}', '\u{ffff}',
];

pub(crate) const CHINESE: [char; 64] = RESERVED;
pub(crate) const CHINESE_LO: char = '\u{4e00}';
pub(crate) const CHINESE_HI: char = '\u{9fff}';
pub(crate) const CHINESE_2BIT_TAG: usize = 0b11;
pub(crate) const CHINESE_4BIT_TAG: usize = 0b1100;

pub(crate) const PAGES: [[char; 64]; 16] = [
    LATIN,
    GREEK,
    CYRILLIC,
    HEBREW,

    ARABIC,
    RESERVED,
    RESERVED,
    RESERVED,

    DEVANAGARI,
    RESERVED,
    RESERVED,
    HANGUL_COMPATIBILITY_JAMO,

    CHINESE,
    RESERVED,
    RESERVED,
    HALFWIDTH_KANA,
];
}

use consts::*;

pub trait PackedValue
where
    Self: Copy,
    Self: ShlAssign<usize>,
    Self: BitOrAssign<Self>,
    Self: ::std::cmp::PartialOrd,
    Self: ::std::fmt::Debug,
    Self: ::std::fmt::LowerHex,
{
    const NBITS: usize = size_of::<Self>() * 8;
    const NCHARS: usize = Self::NBITS / 6;
    const NTAGBITS: usize = Self::NBITS - (Self::NCHARS * 6);
    const NCHARBITS: usize = Self::NBITS - Self::NTAGBITS;
    const NWIDECHARS: usize = (Self::NBITS - Self::NTAGBITS) / 15;
    // This is a bit ridiculous; I literally tried 4 different crates and every
    // trait I could find in the stdlib and it seems like there is some sort of
    // community-wide conspiracy to ensure the absence of generic truncating
    // casts.
    fn truncating_cast_from(i: usize) -> Self;

    // This also seems somewhat contorted to express via existing traits.
    fn most_significant_byte(self) -> u8;

    // This is to help generate random data in tests or fuzzers.
    fn arbitrary<'a>(u: &mut Unstructured<'a>) -> arbitrary::Result<Self> {
        let page_num = if Self::NTAGBITS == 2 {
            u.int_in_range::<usize>(0..=3)? * 4
        } else {
            *u.choose::<usize>(&[0, 1, 2, 3, 4, 8, 11, 12, 15])?
        };
        let mut chars: [char; 21] = ['\0'; 21];
        if page_num == CHINESE_4BIT_TAG {
            let len = u.int_in_range(0..=Self::NWIDECHARS)?;
            for i in 0..len {
                chars[i] = unsafe {
                    char::from_u32_unchecked(
                        u.int_in_range((CHINESE_LO as u32)..=(CHINESE_HI as u32))?,
                    )
                };
            }
        } else {
            let len = u.int_in_range(0..=Self::NCHARS)?;
            for i in 0..len {
                let ch = PAGES[page_num][u.int_in_range(1..=63)?];
                if ch == '\u{ffff}' {
                    break;
                }
                chars[i] = ch;
            }
        }

        // This should always succeed. There's a bug if not.
        Ok(
            encode::<Self, _>(chars.iter().cloned().take_while(|x| *x != '\0'))
                .expect("sixbit::PackedValue::arbitrary"),
        )
    }
}

impl PackedValue for u8 {
    fn truncating_cast_from(i: usize) -> u8 {
        i as u8
    }
    fn most_significant_byte(self) -> u8 {
        self
    }
}

impl PackedValue for u16 {
    fn truncating_cast_from(i: usize) -> u16 {
        i as u16
    }
    fn most_significant_byte(self) -> u8 {
        (self >> 8) as u8
    }
}

impl PackedValue for u32 {
    fn truncating_cast_from(i: usize) -> u32 {
        i as u32
    }
    fn most_significant_byte(self) -> u8 {
        (self >> 24) as u8
    }
}

impl PackedValue for u64 {
    fn truncating_cast_from(i: usize) -> u64 {
        i as u64
    }
    fn most_significant_byte(self) -> u8 {
        (self >> 56) as u8
    }
}

impl PackedValue for u128 {
    fn truncating_cast_from(i: usize) -> u128 {
        i as u128
    }
    fn most_significant_byte(self) -> u8 {
        (self >> 120) as u8
    }
}

#[derive(PartialEq, Debug)]
pub enum EncodeError {
    TooLong,
    NoCodePageFor(char),
    PageUnavailable(usize),
    MissingFromPage(char),
}

fn chinese_15bit_delta(c: char) -> Option<usize> {
    if CHINESE_LO <= c && c <= CHINESE_HI {
        Some((c as usize) - (CHINESE_LO as usize))
    } else {
        None
    }
}

pub fn encode<N, IT>(i: IT) -> Result<N, EncodeError>
where
    N: PackedValue,
    IT: Iterator<Item = char>,
{
    let mut pi = i.peekable();
    let mut out: N = N::truncating_cast_from(0);
    match pi.peek() {
        // Zero-length strings map to page 0, code 0.
        None => Ok(out),
        Some(&init) => {
            // First handle the special case of Chinese characters, which are
            // encoded as 15-bit deltas from the base of the URO block.
            if N::NCHARBITS > 0 && chinese_15bit_delta(init).is_some() {
                let tag = if N::NTAGBITS == 2 {
                    CHINESE_2BIT_TAG
                } else {
                    CHINESE_4BIT_TAG
                };
                out |= N::truncating_cast_from(tag);
                let mut rembits: usize = N::NCHARBITS;
                for c in pi {
                    if rembits < 15 {
                        return Err(EncodeError::TooLong);
                    }
                    match chinese_15bit_delta(c) {
                        None => {
                            return Err(EncodeError::MissingFromPage(c));
                        }
                        Some(delta) => {
                            out <<= 15;
                            // We encode delta+1 so that a delta of 0 is encoded as 1
                            // and we can still use 0-bytes to delimit the string.
                            out |= N::truncating_cast_from(delta + 1);
                            rembits -= 15;
                        }
                    }
                }
                // Pad remainder.
                out <<= rembits;
                return Ok(out);
            }

            // Pick page: just try each one, there are only 16.
            match PAGES.iter().position(|&p| p.binary_search(&init).is_ok()) {
                // No page means this string won't work.
                None => Err(EncodeError::NoCodePageFor(init)),
                Some(p) => {
                    let mut tag = p;
                    let mut rem: usize = N::NCHARS;
                    // Check and adjust tag by size.
                    if N::NTAGBITS == 2 {
                        // Tried a "secondary tag" when only
                        // using 2 tag bits, sorry!
                        if tag & 0b11 != 0 {
                            return Err(EncodeError::PageUnavailable(tag));
                        }
                        tag >>= 2;
                    }
                    // Set tag.
                    out |= N::truncating_cast_from(tag);
                    // Encode chars.
                    for c in pi {
                        if rem == 0 {
                            // String is too long.
                            return Err(EncodeError::TooLong);
                        }
                        match PAGES[p].binary_search(&c) {
                            // No code for c in page.
                            Err(_) => return Err(EncodeError::MissingFromPage(c)),
                            // Got a code, use it!
                            Ok(i) => {
                                out <<= 6;
                                out |= N::truncating_cast_from(i);
                                rem -= 1;
                            }
                        }
                    }
                    // Pad remainder.
                    out <<= 6 * rem;
                    Ok(out)
                }
            }
        }
    }
}

pub trait EncodeSixbit: Sized + Iterator<Item = char> {
    fn encode_sixbit<N: PackedValue>(self) -> Result<N, EncodeError>;
}

impl<T> EncodeSixbit for T
where
    T: Sized,
    T: Iterator<Item = char>,
{
    fn encode_sixbit<N: PackedValue>(self) -> Result<N, EncodeError> {
        encode::<N, Self>(self)
    }
}

pub struct DecodeSixbitIter<N: PackedValue> {
    tag: usize,
    tmp: N,
}

impl<N> Iterator for DecodeSixbitIter<N>
where
    N: PackedValue,
{
    type Item = char;
    fn next(self: &mut Self) -> Option<char> {
        if self.tag == CHINESE_4BIT_TAG {
            // Read the full 15-bit code before testing for the terminator:
            // small deltas (e.g. U+4E00, encoded as 1) have a zero high byte,
            // so testing only the top 8 bits would truncate the string early.
            let hi = self.tmp.most_significant_byte();
            let mut rest = self.tmp;
            rest <<= 8;
            let lo = rest.most_significant_byte() >> 1;
            let code = ((hi as u32) << 7) | (lo as u32);
            if code == 0 {
                return None;
            }
            self.tmp <<= 15;
            // We decode delta+1, mirroring the +1 applied during encoding.
            char::from_u32((CHINESE_LO as u32) + code - 1)
        } else {
            let ch = self.tmp.most_significant_byte() >> 2;
            match ch {
                0 => None,
                i => {
                    self.tmp <<= 6;
                    Some(PAGES[self.tag][i as usize])
                }
            }
        }
    }
}

pub trait DecodeSixbit
where
    Self: PackedValue,
{
    fn decode_sixbit(self) -> DecodeSixbitIter<Self>;
}

impl<N> DecodeSixbit for N
where
    N: PackedValue,
{
    fn decode_sixbit(self) -> DecodeSixbitIter<Self> {
        let mut tmp = self;
        let mut tag = self.most_significant_byte();
        tag >>= 8 - N::NTAGBITS;
        if N::NTAGBITS == 2 {
            // Widen a 2-bit primary tag into its 4-bit page index.
            tag <<= 2;
        }
        tmp <<= N::NTAGBITS;
        DecodeSixbitIter {
            tag: tag as usize,
            tmp,
        }
    }
}

#[cfg(test)]
mod tests {

    use rand::RngCore;

    use super::*;
    #[test]
    fn misc_invariants() {
        // Check that pages are ordered by unicode ranges.
        for pair in PAGES.windows(2) {
            if pair[0][1] != '\u{ffff}' && pair[1][1] != '\u{ffff}' {
                if pair[0][1] >= pair[1][1] {
                    println!(
                        "mis-ordered page pair: {:?} >= {:?}",
                        pair[0][1], pair[1][1]
                    );
                }
                assert!(pair[0][1] < pair[1][1]);
            }
        }
        for p in PAGES.iter() {
            // Check that every page has a zero code, or is invalid.
            assert!(p[0] == '\0' || p[0] == '\u{ffff}');
            // Check that every page is sorted.
            for pair in p.windows(2) {
                if pair[0] != '\0'
                    && pair[1] != '\0'
                    && pair[0] != '\u{ffff}'
                    && pair[1] != '\u{ffff}'
                {
                    if pair[0] >= pair[1] {
                        println!("mis-ordered char pair: {:?} >= {:?}", pair[0], pair[1]);
                    }
                    assert!(pair[0] < pair[1]);
                }
            }
        }
    }

    fn round_trip<N: PackedValue>(s: &str) -> Result<N, EncodeError> {
        match s.chars().encode_sixbit::<N>() {
            Ok(enc) => {
                let dec: String = enc.decode_sixbit().collect();
                println!("roundtrip: {:?} => {:x} => {:?}", s, enc, dec);
                assert!(dec == s);
                Ok(enc)
            }
            err => err,
        }
    }

    // For Latin we try a full-width string, a not-full-width string, and each
    // of the error conditions.
    #[test]
    fn test_latin() {
        // Full width.
        assert!(round_trip::<u128>("PRINTER_is_on_FIRE").is_ok());
        assert!(round_trip::<u64>("NO_CARRIER").is_ok());
        assert!(round_trip::<u32>("_CAT_").is_ok());
        assert!(round_trip::<u16>("OK").is_ok());
        assert!(round_trip::<u8>("1").is_ok());

        // Non-full-width.
        assert!(round_trip::<u128>("Printer_Working").is_ok());
        assert!(round_trip::<u64>("ATDT_123").is_ok());
        assert!(round_trip::<u32>("Uwu").is_ok());
        assert!(round_trip::<u16>("A").is_ok());
        assert!(round_trip::<u8>("").is_ok());

        // Error conditions: TooLong.
        assert!(round_trip::<u128>("PRINTER_FULLY_OPERATIONAL") == Err(EncodeError::TooLong));
        assert!(round_trip::<u64>("ATDT_123_4567") == Err(EncodeError::TooLong));
        assert!(round_trip::<u32>("aaaaaaa") == Err(EncodeError::TooLong));
        assert!(round_trip::<u16>("aba") == Err(EncodeError::TooLong));
        assert!(round_trip::<u8>("OOH") == Err(EncodeError::TooLong));

        // Error conditions: NoCodePageFor.
        assert!(round_trip::<u128>("©2018") == Err(EncodeError::NoCodePageFor('©')));

        // Error conditions: PageUnavailable.
        assert!(round_trip::<u128>("ΨΩ") == Err(EncodeError::PageUnavailable(1)));

        // Error conditions: MissingFromPage.
        assert!(round_trip::<u64>("sh@rk") == Err(EncodeError::MissingFromPage('@')));
    }

    fn check_order<N: PackedValue>(a: &str, b: &str) {
        assert!(a < b);
        assert!(a.chars().encode_sixbit::<N>().unwrap() < b.chars().encode_sixbit::<N>().unwrap());
    }

    #[test]
    fn sorting() {
        // Check encoding order preservation within pages.
        check_order::<u32>("", "AB");
        check_order::<u64>("abcd", "abcde");
        check_order::<u64>("abcde", "abcdf");
        check_order::<u64>("α", "αβγ");
        check_order::<u64>("αβ", "αβγ");
        check_order::<u64>("αβγ", "αβδ");
        check_order::<u64>("怎么", "怎么样");
        check_order::<u64>("以前", "以后");
        // Check encoding order preservation across pages.
        check_order::<u64>("abc", "αβγ");
        check_order::<u64>("αβγ", "абв");
        check_order::<u64>("абв", "אבג");
        check_order::<u64>("אבג", "ابة");
        check_order::<u64>("ابة", "कखग");
        check_order::<u64>("कखग", "ㄱㄲㄳ");
        check_order::<u64>("ㄱㄲㄳ", "合伙人");
        check_order::<u64>("合伙人", "ヲァィ");
    }

    // For non-Latin scripts we just check a word at each width
    // to make sure they work.

    #[test]
    fn test_greek() {
        // Non-primary tag: only available in u64 and u16 forms.
        assert!(round_trip::<u64>("αλήθεια").is_ok());
        assert!(round_trip::<u16>("γη").is_ok());
    }

    #[test]
    fn test_cyrillic() {
        // Non-primary tag: only available in u64 and u16 forms.
        assert!(round_trip::<u64>("содержать").is_ok());
        assert!(round_trip::<u16>("же").is_ok());
    }

    #[test]
    fn test_hebrew() {
        // Non-primary tag: only available in u64 and u16 forms.
        assert!(round_trip::<u64>("לעשות").is_ok());
        assert!(round_trip::<u16>("כל").is_ok());
    }

    #[test]
    fn test_arabic() {
        // Primary tag: available in all forms.
        assert!(round_trip::<u128>("محافظت").is_ok());
        assert!(round_trip::<u64>("العاصمة").is_ok());
        assert!(round_trip::<u32>("البعض").is_ok());
        assert!(round_trip::<u16>("از").is_ok());
        assert!(round_trip::<u8>("و").is_ok());
    }

    #[test]
    fn test_devanagari() {
        // Primary tag: available in all forms.
        assert!(round_trip::<u128>("किंकर्तव्यविमूढ़").is_ok());
        assert!(round_trip::<u64>("आवश्यकता").is_ok());
        assert!(round_trip::<u32>("सपना").is_ok());
        assert!(round_trip::<u16>("पल").is_ok());
        assert!(round_trip::<u8>("आ").is_ok());
    }

    #[test]
    fn test_chinese() {
        // Special-case 15-bit primary tag: only in forms of 32 bits or more.
        assert!(round_trip::<u128>("高速火车站").is_ok());
        assert!(round_trip::<u64>("合伙人").is_ok());
        assert!(round_trip::<u32>("同事").is_ok());
    }

    #[test]
    fn test_compatibility_hangul_jamo() {
        // Non-primary tag: only available in u64 and u16 forms.
        assert!(round_trip::<u64>("ㅇㅜㅁㅈㅣㄱㅇㅣㅁ").is_ok());
        assert!(round_trip::<u16>("ㅅㅜ").is_ok());
    }

    #[test]
    fn test_halfwidth_kana() {
        // Non-primary tag: only available in u64 and u16 forms.
        assert!(round_trip::<u64>("イクツカノ").is_ok());
        assert!(round_trip::<u16>("ヤル").is_ok());
    }

    #[test]
    fn test_arbitrary() {
        for _ in 0..64 {
            let mut bytes = [0u8; 1024];
            rand::thread_rng().fill_bytes(&mut bytes);
            let mut u = arbitrary::Unstructured::new(&bytes);
            let packed = <u128 as PackedValue>::arbitrary(&mut u).expect("arbitrary");
            dbg!(packed);
            let decoded: String = packed.decode_sixbit().collect();
            dbg!(decoded);
        }
    }
}
884}