Expand description
§Yosina Rust
A Rust port of the Yosina Japanese text transliteration library.
§Overview
Yosina is a library for Japanese text transliteration that provides various text normalization and conversion features commonly needed when processing Japanese text.
§Installation
Add this to your Cargo.toml:
[dependencies]
yosina = "1.0.0"§Usage
§Using Recipes (Recommended)
use yosina::{make_transliterator, TransliterationRecipe};
use yosina::recipes::{ReplaceCircledOrSquaredCharactersOptions, ToFullWidthOptions};
// Create a recipe with desired transformations
let recipe = TransliterationRecipe {
kanji_old_new: true,
replace_spaces: true,
replace_suspicious_hyphens_to_prolonged_sound_marks: true,
replace_circled_or_squared_characters: ReplaceCircledOrSquaredCharactersOptions::Yes {
exclude_emojis: false,
},
replace_combined_characters: true,
to_fullwidth: ToFullWidthOptions::Yes {
u005c_as_yen_sign: false,
},
..Default::default()
};
// Create the transliterator
let transliterator = make_transliterator(&recipe).unwrap();
// Use it with various special characters
let input = "①②③ ⒶⒷⒸ ㍿㍑㌠㋿"; // circled numbers, letters, space, combined characters
let result = transliterator(input).unwrap();
println!("{}", result); // "(1)(2)(3) (A)(B)(C) 株式会社リットルサンチーム令和"
// Convert old kanji to new
let old_kanji = "舊字體";
let result = transliterator(old_kanji).unwrap();
println!("{}", result); // "旧字体"
// Convert half-width katakana to full-width
let half_width = "テストモジレツ";
let result = transliterator(half_width).unwrap();
println!("{}", result); // "テストモジレツ"§Using Direct Configuration
use yosina::{make_transliterator, TransliteratorConfig::*};
// Configure with direct transliterator configs
let configs = vec![
KanjiOldNew,
Spaces,
ProlongedSoundMarks(Default::default()),
CircledOrSquared(Default::default()),
Combined,
];
let transliterator = make_transliterator(&configs).unwrap();
let result = transliterator("some japanese text").unwrap();
println!("{}", result); // Processed text with transformations applied§Available Transliterators
§1. Circled or Squared (circled-or-squared)
Converts circled or squared characters to their plain equivalents.
- Options:
templates(custom rendering),includeEmojis(include emoji characters) - Example:
①②③→(1)(2)(3),㊙㊗→(秘)(祝)
§2. Combined (combined)
Expands combined characters into their individual character sequences.
- Example:
㍻(Heisei era) →平成,㈱→(株)
§3. Hiragana-Katakana Composition (hira-kata-composition)
Combines decomposed hiraganas and katakanas into composed equivalents.
- Options:
composeNonCombiningMarks(compose non-combining marks) - Example:
か + ゙→が,ヘ + ゜→ペ
§4. Hiragana-Katakana (hira-kata)
Converts between hiragana and katakana scripts bidirectionally.
- Options:
mode(“hira-to-kata” or “kata-to-hira”) - Example:
ひらがな→ヒラガナ(hira-to-kata)
§5. Hyphens (hyphens)
Replaces various dash/hyphen symbols with common ones used in Japanese.
- Options:
precedence(mapping priority order) - Available mappings: “ascii”, “jisx0201”, “jisx0208_90”, “jisx0208_90_windows”, “jisx0208_verbatim”
- Example:
2019—2020(em dash) →2019-2020
§6. Ideographic Annotations (ideographic-annotations)
Replaces ideographic annotations used in traditional Chinese-to-Japanese translation.
- Example:
㆖㆘→上下
§7. IVS-SVS Base (ivs-svs-base)
Handles Ideographic and Standardized Variation Selectors.
- Options:
charset,mode(“ivs-or-svs” or “base”),preferSVS,dropSelectorsAltogether - Example:
葛󠄀(葛 + IVS) →葛
§8. Japanese Iteration Marks (japanese-iteration-marks)
Expands iteration marks by repeating the preceding character.
- Example:
時々→時時,いすゞ→いすず
§9. JIS X 0201 and Alike (jisx0201-and-alike)
Handles half-width/full-width character conversion.
- Options:
fullwidthToHalfwidth,convertGL(alphanumerics/symbols),convertGR(katakana),u005cAsYenSign - Example:
ABC123→ABC123,カタカナ→カタカナ
§10. Kanji Old-New (kanji-old-new)
Converts old-style kanji (旧字体) to modern forms (新字体).
- Example:
舊字體の變換→旧字体の変換
§11. Mathematical Alphanumerics (mathematical-alphanumerics)
Normalizes mathematical alphanumeric symbols to plain ASCII.
- Example:
𝐀𝐁𝐂(mathematical bold) →ABC
§12. Prolonged Sound Marks (prolonged-sound-marks)
Handles contextual conversion between hyphens and prolonged sound marks.
- Options:
skipAlreadyTransliteratedChars,allowProlongedHatsuon,allowProlongedSokuon,replaceProlongedMarksFollowingAlnums - Example:
イ−ハト−ヴォ(with hyphen) →イーハトーヴォ(prolonged mark)
§13. Radicals (radicals)
Converts CJK radical characters to their corresponding ideographs.
- Example:
⾔⾨⾷(Kangxi radicals) →言門食
§14. Spaces (spaces)
Normalizes various Unicode space characters to standard ASCII space.
- Example:
A B(ideographic space) →A B
§15. Roman Numerals (roman-numerals)
Converts Unicode Roman numeral characters to their ASCII letter equivalents.
- Example:
Ⅰ Ⅱ Ⅲ→I II III,ⅰ ⅱ ⅲ→i ii iii
§Development
This project uses standard Rust tooling.
# Build the library
cargo build
# Run tests
cargo test
# Run linting
cargo clippy
# Run formatting
cargo fmt
# Build documentation
cargo doc --open§Requirements
- Rust 1.70+
§License
MIT
Re-exports§
pub use recipes::TransliterationRecipe;pub use transliterators::TransliteratorConfig;pub use char::*;pub use transliterator::*;
Modules§
Functions§
- make_
transliterator - Frontend convenience function to create a string-to-string transliterator from a recipe or a list of configs.