Crate ib_romaji


A fast Japanese romanizer.

See Romanization of Japanese for an explanation of what romaji is.

§Features

§Usage

use ib_romaji::HepburnRomanizer;

let romanizer = HepburnRomanizer::default();

let mut romajis = Vec::new();
romanizer.romanize_and_try_for_each("日本語", |len, romaji| {
    romajis.push((len, romaji));
    None::<()>
});
assert_eq!(romajis, vec![(9, "nippongo"), (3, "a"), (3, "aki"), (3, "bi"), (3, "chi"), (3, "he"), (3, "hi"), (3, "iru"), (3, "jitsu"), (3, "ka"), (3, "kou"), (3, "ku"), (3, "kusa"), (3, "nchi"), (3, "ni"), (3, "nichi"), (3, "nitsu"), (3, "su"), (3, "tachi")]);

assert_eq!(romanizer.romanize_vec("日本語"), vec![(9, "nippongo"), (3, "a"), (3, "aki"), (3, "bi"), (3, "chi"), (3, "he"), (3, "hi"), (3, "iru"), (3, "jitsu"), (3, "ka"), (3, "kou"), (3, "ku"), (3, "kusa"), (3, "nchi"), (3, "ni"), (3, "nichi"), (3, "nitsu"), (3, "su"), (3, "tachi")]);
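The first element of each tuple above appears to be the length of the matched prefix in UTF-8 bytes: 9 covers all three kanji of 日本語 (3 bytes each) and 3 covers only 日. Assuming that reading is right, a match can be split off the haystack with plain standard-library calls. `split_match` below is a hypothetical helper for illustration, not part of ib_romaji.

```rust
/// Hypothetical helper (not part of ib_romaji): splits `haystack` into the
/// matched prefix and the remaining text, given a match length in bytes.
/// Returns `None` if `len` is past the end or not on a UTF-8 char boundary.
fn split_match(haystack: &str, len: usize) -> Option<(&str, &str)> {
    // `is_char_boundary` is false for indices beyond the end of the string,
    // so `split_at` cannot panic here.
    if haystack.is_char_boundary(len) {
        Some(haystack.split_at(len))
    } else {
        None
    }
}
```

With this helper, a length of 9 leaves nothing unmatched in "日本語", while a length of 3 leaves "本語" to be romanized next.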

§Binary size

The dictionary will take ~4.8 MiB (5.5 MiB without compression) in the binary at the moment.

§Design

Storing words as &[&str] costs 16 extra bytes per str (on 64-bit targets) for the pointer and length, whereas a NUL-terminated CStr needs only 1 extra byte per str.

  • For words, this can save 3.14 MiB (actually 3.54 MiB).
    • Source file: 2.98 MiB -> \0+\: 2.80 MiB, \n: 2.54 MiB
    • build() time: split()/memchr +10%
  • Stored this way, the strings can also be compressed and then decompressed as a stream.
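The trade-off above can be sketched as follows. This is a minimal illustration, not the crate's actual storage code: the `WORDS` blob, `words`, and `slice_overhead_per_entry` are hypothetical names, and the sketch uses a NUL-terminated `&str` blob with `split_terminator` rather than `CStr` or memchr.

```rust
use std::mem::size_of;

/// Hypothetical word list stored as a single blob, with each word
/// terminated by one NUL byte instead of carrying its own fat pointer.
const WORDS: &str = "nippon\0nihon\0go\0";

/// Lazily iterates over the words in a NUL-terminated blob.
/// The cost is a scan for the terminator at iteration time.
fn words<'a>(blob: &'a str) -> impl Iterator<Item = &'a str> {
    blob.split_terminator('\0')
}

/// Per-entry overhead of a `&[&str]` table: one fat pointer
/// (data pointer plus length), i.e. 16 bytes on 64-bit targets.
fn slice_overhead_per_entry() -> usize {
    size_of::<&str>()
}
```

The blob spends 1 byte per word on the terminator, while the slice table spends `slice_overhead_per_entry()` bytes per word on top of the string data itself.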

§Crate features

  • compress-words (enabled by default) — Binary size (and memory usage) -696 KiB (771 KiB if zstd is already used), romanizer build time +1.1 ms.
  • cache — Enable serialization/deserialization of HepburnRomanizer for caching initialization state. When combined with std, also enables file-based caching via the builder API.
  • std (enabled by default) — Enable standard library support for file-based caching.

Modules§

cache (cache feature)
Serialization/deserialization of romanizers for caching initialization state.
convert
data
kana
kanji
Kanji romanization

Structs§

HepburnRomanizer
Hepburn romanization
HepburnRomanizerBuilder
Use builder syntax to set the inputs and finish with build().
Input
Unfortunately, Japanese is highly contextual, so surrounding characters are needed for accurate romanization. This struct keeps the surrounding characters by storing the entire haystack and the start offset.
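The idea of keeping the whole haystack plus a start offset can be sketched as below. `ContextInput` and its methods are hypothetical stand-ins written for illustration; they are not the crate's actual `Input` definition or API.

```rust
/// Hypothetical stand-in for illustration only, not ib_romaji's `Input`.
/// Keeps the full text so that characters before `start` stay reachable.
#[derive(Clone, Copy, Debug)]
struct ContextInput<'a> {
    haystack: &'a str,
    start: usize,
}

impl<'a> ContextInput<'a> {
    /// Returns `None` if `start` is not on a UTF-8 char boundary
    /// (which includes offsets past the end of the haystack).
    fn new(haystack: &'a str, start: usize) -> Option<Self> {
        haystack
            .is_char_boundary(start)
            .then_some(Self { haystack, start })
    }

    /// The text to romanize, beginning at the start offset.
    fn rest(&self) -> &'a str {
        &self.haystack[self.start..]
    }

    /// The character immediately before the start offset, if any.
    fn prev_char(&self) -> Option<char> {
        self.haystack[..self.start].chars().next_back()
    }
}
```

Passing only `rest()` would discard the preceding context, while carrying the offset alongside the full haystack lets a romanizer look backwards as well as forwards.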