A fast Japanese romanizer.
See Romanization of Japanese for what romaji is.
§Features
- Support characters with multiple readings (i.e. heteronyms, 同形異音語).
- Support the following romanization systems:
  - Hepburn romanization system
  - Hepburn's convenient IME variant: n' and tch* can be alternatively written as nn and cch* respectively.
- Support handling of n' (n apostrophe, e.g. n'ya for んや).
- Support handling of 々 (noma).
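To illustrate the IME variant mentioned in the list above, here is a minimal sketch of the spelling substitution. The function to_ime_variant is a hypothetical helper written for this example. It is not part of ib_romaji, and the crate may handle the variant quite differently internally.

```rust
/// Hypothetical helper (not part of ib_romaji): rewrites strict Hepburn
/// spellings into the IME-friendly variant described in the feature list.
fn to_ime_variant(hepburn: &str) -> String {
    hepburn
        // n' (syllabic n before a vowel or y) is typed as nn
        .replace("n'", "nn")
        // tch* (geminated ch) is typed as cch*
        .replace("tch", "cch")
}
```

With this sketch, to_ime_variant("kon'ya") yields konnya and to_ime_variant("matcha") yields maccha, while input without either pattern is returned unchanged.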
§Usage
use ib_romaji::HepburnRomanizer;
let romanizer = HepburnRomanizer::default();
let mut romajis = Vec::new();
romanizer.romanize_and_try_for_each("日本語", |len, romaji| {
romajis.push((len, romaji));
None::<()>
});
assert_eq!(romajis, vec![(9, "nippongo"), (3, "a"), (3, "aki"), (3, "bi"), (3, "chi"), (3, "he"), (3, "hi"), (3, "iru"), (3, "jitsu"), (3, "ka"), (3, "kou"), (3, "ku"), (3, "kusa"), (3, "nchi"), (3, "ni"), (3, "nichi"), (3, "nitsu"), (3, "su"), (3, "tachi")]);
assert_eq!(romanizer.romanize_vec("日本語"), vec![(9, "nippongo"), (3, "a"), (3, "aki"), (3, "bi"), (3, "chi"), (3, "he"), (3, "hi"), (3, "iru"), (3, "jitsu"), (3, "ka"), (3, "kou"), (3, "ku"), (3, "kusa"), (3, "nchi"), (3, "ni"), (3, "nichi"), (3, "nitsu"), (3, "su"), (3, "tachi")]);
§Binary size
The dictionary currently takes about 4.8 MiB of the binary (5.5 MiB without compression).
§Design
With &[&str], each str costs 16 extra bytes (on 64-bit targets) to store its pointer and length, whereas a CStr-style NUL-terminated layout needs only 1 extra byte per str.
- For words, this can save 3.14 MiB (actually 3.54 MiB).
  - Source file: 2.98 MiB -> \0+\: 2.80 MiB, \n: 2.54 MiB
  - build() time: split()/memchr +10%
- This way the str data can also be compressed and then decompressed as a stream.
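The trade-off described above can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, the blob format ("word\0word\0...") is a guess at the general idea, and the crate's actual storage layout is not shown in this documentation.

```rust
use std::mem::size_of;

/// Extra bytes a `&[&str]` table spends on fat pointers (pointer + length),
/// i.e. 16 bytes per word on a 64-bit target.
fn slice_table_overhead(word_count: usize) -> usize {
    word_count * size_of::<&str>()
}

/// Extra bytes a NUL-terminated blob spends: one terminator byte per word.
fn packed_overhead(word_count: usize) -> usize {
    word_count
}

/// Iterates over words packed into a single blob as "word\0word\0...".
/// The words are located by scanning at build time instead of being
/// addressed through stored pointers.
fn packed_words<'a>(blob: &'a str) -> impl Iterator<Item = &'a str> + 'a {
    blob.split_terminator('\0')
}
```

The packed form gives up constant-time indexing in exchange for the smaller footprint, which is consistent with the build() time cost noted in the list. A single contiguous blob is also a natural unit to hand to a compressor.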
§Crate features
- compress-words (enabled by default): Binary size (and memory usage) -696 KiB (771 KiB if zstd is already used), romanizer build time +1.1 ms.
- cache: Enable serialization/deserialization of HepburnRomanizer for caching initialization state. When combined with std, also enables file-based caching via the builder API.
- std (enabled by default): Enable standard library support for file-based caching.
Modules§
- cache - Serialization/deserialization of romanizers for caching initialization state.
- convert
- data
- kana
- kanji - Kanji romanization
Structs§
- HepburnRomanizer - Hepburn romanization
- HepburnRomanizerBuilder - Use builder syntax to set the inputs and finish with build().
- Input - Unfortunately, Japanese is highly contextual, so surrounding characters are needed for accurate romanization. This struct can keep the surrounding characters by storing the entire haystack and the start offset.
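The idea of keeping context by storing the whole haystack plus a start offset can be sketched like this. ContextInput and its methods are hypothetical names chosen for this example. They are not the crate's Input type, whose fields and API are not shown on this page.

```rust
/// Hypothetical stand-in (not ib_romaji's `Input`): remembers the whole
/// haystack and where the text to romanize starts, so that the characters
/// before the start remain available as context.
#[derive(Debug, Clone, Copy)]
pub struct ContextInput<'a> {
    haystack: &'a str,
    start: usize,
}

impl<'a> ContextInput<'a> {
    /// Returns `None` if `start` is out of range or splits a UTF-8 character.
    pub fn new(haystack: &'a str, start: usize) -> Option<Self> {
        haystack
            .is_char_boundary(start)
            .then_some(Self { haystack, start })
    }

    /// The text to romanize, beginning at the start offset.
    pub fn text(&self) -> &'a str {
        &self.haystack[self.start..]
    }

    /// Everything before the start offset, kept only as context.
    pub fn preceding(&self) -> &'a str {
        &self.haystack[..self.start]
    }

    /// The character immediately before the start offset, if any.
    pub fn prev_char(&self) -> Option<char> {
        self.preceding().chars().next_back()
    }
}
```

Offsets here are byte offsets into the UTF-8 string, so for "今日は" an offset of 3 points at 日 and prev_char() returns 今. Storing an offset rather than a separate substring keeps both sides of the context reachable without copying.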