Crate ib_romaji

A fast Japanese romanizer.

§Usage

use ib_romaji::HepburnRomanizer;

let romanizer = HepburnRomanizer::default();

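// Collect every (len, romaji) candidate that matches a prefix of the input.
let mut romajis = Vec::new();
romanizer.romanize_and_try_for_each("日本語", |len, romaji| {
    romajis.push((len, romaji));
    // Returning None here lets the iteration visit all candidates.
    None::<()>
});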
assert_eq!(romajis, vec![(9, "nippongo"), (3, "a"), (3, "aki"), (3, "bi"), (3, "chi"), (3, "he"), (3, "hi"), (3, "iru"), (3, "jitsu"), (3, "ka"), (3, "kou"), (3, "ku"), (3, "kusa"), (3, "nchi"), (3, "ni"), (3, "nichi"), (3, "nitsu"), (3, "su"), (3, "tachi")]);

assert_eq!(romanizer.romanize_vec("日本語"), vec![(9, "nippongo"), (3, "a"), (3, "aki"), (3, "bi"), (3, "chi"), (3, "he"), (3, "hi"), (3, "iru"), (3, "jitsu"), (3, "ka"), (3, "kou"), (3, "ku"), (3, "kusa"), (3, "nchi"), (3, "ni"), (3, "nichi"), (3, "nitsu"), (3, "su"), (3, "tachi")]);
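The pairs above can be read as (matched prefix length, romaji). The following is a minimal sketch of how a caller might post-process them, assuming `len` is the UTF-8 byte length of the matched prefix (consistent with 9 for the three kanji and 3 for a single one). The helper `longest_matches` is hypothetical and not part of ib_romaji, and the candidate list is a hard-coded subset of the output shown above so the sketch runs without the crate.

```rust
/// Hypothetical helper (not part of ib_romaji): keeps only the romaji
/// candidates that cover the longest prefix of the input.
fn longest_matches<'a>(candidates: &[(usize, &'a str)]) -> Vec<&'a str> {
    let max_len = candidates.iter().map(|&(len, _)| len).max().unwrap_or(0);
    candidates
        .iter()
        .filter(|&&(len, _)| len == max_len)
        .map(|&(_, romaji)| romaji)
        .collect()
}

fn main() {
    let text = "日本語";
    // A subset of the pairs from the assertions above, hard-coded here.
    let candidates = [(9, "nippongo"), (3, "hi"), (3, "nichi")];

    // Assumption: `len` is a byte offset, so it can slice the input directly.
    for &(len, romaji) in &candidates {
        println!("{} -> {}", &text[..len], romaji);
    }

    assert_eq!(longest_matches(&candidates), vec!["nippongo"]);
}
```

Slicing with `&text[..len]` only works if `len` falls on a character boundary, which holds for byte lengths of whole matched prefixes.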

§Binary size

The dictionary currently takes ~4.8 MiB of the binary (5.5 MiB without compression).

§Design

Storing the words as &[&str] makes each str occupy 16 extra bytes (on 64-bit targets) for the pointer and length, whereas a NUL-terminated CStr needs only 1 extra byte per str.

  • For words, this can save 3.14 MiB (actually 3.54 MiB).
    • Source file: 2.98 MiB -> \0+\: 2.80 MiB, \n: 2.54 MiB
    • build() time: split()/memchr +10%
  • This layout also lets the strings be compressed and then decompressed in a streaming fashion.
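The per-string overhead described above can be illustrated with a small, self-contained sketch. It is not the crate's actual storage code: the helpers `slice_overhead`, `pack`, and `unpack` are hypothetical names, and the NUL-separated blob is a stand-in for whatever layout the dictionary really uses.

```rust
use std::mem::size_of;

/// Overhead of a `&[&str]`: one fat pointer (pointer + length) per entry.
fn slice_overhead(words: &[&str]) -> usize {
    words.len() * size_of::<&str>()
}

/// Packs the words into one blob, terminating each with a NUL byte,
/// so the overhead is a single byte per word.
fn pack(words: &[&str]) -> String {
    let mut blob = String::new();
    for w in words {
        blob.push_str(w);
        blob.push('\0');
    }
    blob
}

/// Iterates over the words stored in a packed blob.
fn unpack<'a>(blob: &'a str) -> impl Iterator<Item = &'a str> + 'a {
    blob.split_terminator('\0')
}

fn main() {
    let words = ["nippongo", "nichi", "hi"];
    let blob = pack(&words);
    let payload: usize = words.iter().map(|w| w.len()).sum();

    println!("&[&str] overhead: {} bytes", slice_overhead(&words));
    println!("packed overhead:  {} bytes", blob.len() - payload);

    // The packed form round-trips to the same words.
    assert!(unpack(&blob).eq(words.iter().copied()));
}
```

The trade-off is that the packed form has to be split at build time, which matches the split()/memchr cost noted in the list above.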

§Features

  • compress-words (enabled by default) — Binary size (and memory usage) -696 KiB (771 KiB if zstd is already used), romanizer build time +1.1 ms.
  • cache — Enable serialization/deserialization of HepburnRomanizer for caching initialization state. When combined with std, also enables file-based caching via the builder API.
  • std (enabled by default) — Enable standard library support for file-based caching.

Modules§

cache (crate feature cache)
Serialization/deserialization of romanizers for caching initialization state.
data

Structs§

HepburnRomanizer
Hepburn romanization
HepburnRomanizerBuilder
Use builder syntax to set the inputs and finish with build().