Crate golia_pinyin

golia-pinyin — self-developed Mandarin Pinyin input method engine.

Status: the engine surface is complete (segmenter, fuzzy, FST dict, encode, session). It ships with a 919k-entry corpus-derived dictionary (Unihan + jieba + Leipzig + SUBTLEX) and L0 user-learning ranking (3-pick auto-pin). The published crate version stays at 0.1.0, per the publish strategy in lab8-ime ROADMAP item 35; the internal milestone names (v0.2-data, v0.3-l0) refer to data and feature readiness, not to crate versions. See the workspace ROADMAP.
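The "3-pick auto-pin" behaviour can be pictured with a minimal sketch. This is not the crate's implementation: `L0Sketch`, `record_pick`, `rerank`, and `PROMOTE_THRESHOLD_SKETCH` are hypothetical names, and both the threshold value of 3 and the "pinned words move to the front" rule are assumptions read off the phrase "3-pick auto-pin".

```rust
use std::collections::HashMap;

/// Assumed value. The crate re-exports `PROMOTE_THRESHOLD`, but this page
/// does not state what it is.
const PROMOTE_THRESHOLD_SKETCH: u32 = 3;

/// Hypothetical user-learning layer: counts how often a word was picked
/// for a given pinyin string.
#[derive(Default)]
struct L0Sketch {
    picks: HashMap<(String, String), u32>,
}

impl L0Sketch {
    /// Record that the user committed `word` for `pinyin`.
    fn record_pick(&mut self, pinyin: &str, word: &str) {
        let key = (pinyin.to_string(), word.to_string());
        *self.picks.entry(key).or_insert(0) += 1;
    }

    /// A word counts as pinned once it has reached the threshold.
    fn is_pinned(&self, pinyin: &str, word: &str) -> bool {
        let key = (pinyin.to_string(), word.to_string());
        self.picks
            .get(&key)
            .map_or(false, |&n| n >= PROMOTE_THRESHOLD_SKETCH)
    }

    /// Stable re-rank: pinned words first, everything else in dictionary order.
    fn rerank(&self, pinyin: &str, base: &[&str]) -> Vec<String> {
        let mut pinned = Vec::new();
        let mut rest = Vec::new();
        for &word in base {
            if self.is_pinned(pinyin, word) {
                pinned.push(word.to_string());
            } else {
                rest.push(word.to_string());
            }
        }
        pinned.extend(rest);
        pinned
    }
}
```

Keeping the counts in a separate layer leaves the base dictionary immutable, which fits the description of L0 as a layer "on top of the immutable dict".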

Sibling library: wubi — same architectural pattern (PHF static tables, FST main dict, zero-alloc hot path).

Quickstart

use golia_pinyin::{PinyinEngine, Session};
let engine = PinyinEngine::new();
let mut session = Session::new(&engine);
for c in "zhongguo".chars() {
    session.input_char(c);
}
let cands = session.candidates();
assert_eq!(cands.first().map(String::as_str), Some("中国"));

Module map

  • syllable — inventory of 403 valid Mandarin syllables (PHF set)
  • fuzzy — toggleable fuzzy-pair expansion (z↔zh etc.)
  • segmenter — DP segmentation of continuous pinyin strings
  • dict — FST-backed pinyin → words lookup with L0 user-learning
  • encode — char → readings reverse lookup
  • engine — immutable PinyinEngine (dict + fuzzy)
  • session — mutable Session holding the user’s input buffer
  • ranking — L0 snapshot type for host-side persistence
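To illustrate what DP segmentation of a continuous pinyin string involves, here is a minimal standalone sketch. It does not use the crate's `segment` function, whose signature is not shown on this page. `segment_sketch` and `MAX_SYLLABLE_LEN` are hypothetical names, the caller supplies the syllable set, and the "fewest syllables wins" cost is an assumption rather than the crate's documented rule.

```rust
use std::collections::HashSet;

/// Longest toneless pinyin syllable, e.g. "zhuang" (6 letters).
const MAX_SYLLABLE_LEN: usize = 6;

/// Split `input` into valid syllables, preferring the fewest syllables.
/// Returns `None` if no full segmentation exists.
fn segment_sketch(input: &str, syllables: &HashSet<&str>) -> Option<Vec<String>> {
    if !input.is_ascii() {
        return None; // byte offsets below assume ASCII pinyin
    }
    let n = input.len();
    // best[i] = Some((syllable_count, start_of_last_syllable)) for input[..i]
    let mut best: Vec<Option<(usize, usize)>> = vec![None; n + 1];
    best[0] = Some((0, 0));

    for end in 1..=n {
        let lo = end.saturating_sub(MAX_SYLLABLE_LEN);
        for start in lo..end {
            if let Some((count, _)) = best[start] {
                if syllables.contains(&input[start..end]) {
                    let better = best[end].map_or(true, |(c, _)| count + 1 < c);
                    if better {
                        best[end] = Some((count + 1, start));
                    }
                }
            }
        }
    }

    // Walk the back-pointers from the end of the input to the start.
    let mut out = Vec::new();
    let mut end = n;
    while end > 0 {
        let (_, start) = best[end]?;
        out.push(input[start..end].to_string());
        end = start;
    }
    out.reverse();
    Some(out)
}
```

With the syllable set {zhong, guo, xi, an, xian}, the input "zhongguo" splits into zhong + guo, and the ambiguous "xian" resolves to the single syllable xian rather than xi + an. An engine would likely keep several segmentations instead of only the cheapest one.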

Re-exports

pub use dict::PinyinDict;
pub use encode::char_to_pinyin;
pub use encode::covered_char_count;
pub use engine::PinyinEngine;
pub use fuzzy::FuzzyConfig;
pub use ranking::L0Snapshot;
pub use ranking::PROMOTE_THRESHOLD;
pub use segmenter::Segmentation;
pub use segmenter::segment;
pub use session::Session;
pub use syllable::VALID_SYLLABLES;
pub use syllable::count as syllable_count;
pub use syllable::is_valid as is_valid_syllable;

Modules

dict
FST-backed pinyin dictionary with a two-tier ranking model.
encode
Reverse lookup — char → Vec<pinyin>.
engine
PinyinEngine — immutable assembly of dict + fuzzy config.
fuzzy
Fuzzy syllable expansion — z↔zh, c↔ch, s↔sh, n↔l, f↔h, r↔l, in↔ing, en↔eng, an↔ang. Toggleable per-pair so users can match their own dialect / typing habits. Expansion happens at lookup time; the dictionary stays canonical (no bloat).
ranking
L0 ranking — user-learning layer on top of the immutable dict.
segmenter
Pinyin syllable segmentation via dynamic programming.
session
Per-input mutable state — accumulates the user’s typing buffer and exposes candidates / commit semantics.
syllable
Canonical Mandarin syllable inventory.
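The fuzzy module's lookup-time expansion can be sketched as follows. This is an illustration only: `FuzzySketch` and `expand_sketch` are hypothetical names, the fields of the real `FuzzyConfig` are not shown on this page, and only three of the nine pairs are modelled.

```rust
/// Hypothetical per-pair toggles (the real `FuzzyConfig` layout is an unknown).
#[derive(Clone, Copy, Default)]
struct FuzzySketch {
    z_zh: bool,
    n_l: bool,
    in_ing: bool,
}

/// Expand one typed syllable into the spellings to look up.
/// The typed spelling always comes first.
fn expand_sketch(syllable: &str, cfg: FuzzySketch) -> Vec<String> {
    let mut variants = vec![syllable.to_string()];

    // Step 1: swap the initial for each enabled initial pair.
    if cfg.z_zh {
        if let Some(rest) = syllable.strip_prefix("zh") {
            variants.push(format!("z{}", rest));
        } else if let Some(rest) = syllable.strip_prefix('z') {
            variants.push(format!("zh{}", rest));
        }
    }
    if cfg.n_l {
        if let Some(rest) = syllable.strip_prefix('n') {
            variants.push(format!("l{}", rest));
        } else if let Some(rest) = syllable.strip_prefix('l') {
            variants.push(format!("n{}", rest));
        }
    }

    // Step 2: swap the final on every variant collected so far.
    if cfg.in_ing {
        let mut extra = Vec::new();
        for v in &variants {
            if let Some(stem) = v.strip_suffix("ing") {
                extra.push(format!("{}in", stem));
            } else if let Some(stem) = v.strip_suffix("in") {
                extra.push(format!("{}ing", stem));
            }
        }
        variants.extend(extra);
    }
    variants
}
```

With all three pairs enabled, "lin" expands to lin, nin, ling, ning. A real engine would presumably filter the variants against the valid-syllable set before querying, and because expansion happens on the query side, the dictionary only needs to store canonical spellings.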