inputx-pinyin-data-bigrams

Embedded bigram FSAs for the inputx-pinyin engine.

[dependencies]
inputx-pinyin-data-bigrams = "1.4"

Pure data crate: two pub const byte slices embedded via include_bytes!, with zero dependencies and #![no_std]-clean. It was split out of inputx-pinyin in v1.4.7 sub-phase B (Strategy C: 3 inputx-pinyin-data-* stones + facade umbrella) so that the facade publishes light, and so that consumers who only need exact-syllable lookup can opt out by disabling the facade's bigrams feature.

What's in the box

EMBEDDED_BIGRAMS (~4.5 MB) — inter-token word bigram FSA: keys <prev_word>\0<next_word> where both ends are distinct jieba tokens adjacent in the source corpus. Sole input to the facade's PinyinDict::bigram_boost and next-word prediction paths.
EMBEDDED_BIGRAMS_INTRA (~1.5 MB) — intra-token char bigram FSA: keys <a>\0<b> for adjacent characters inside one jieba token (e.g. (你, 好) captured from 你好). Helps Viterbi composition prefer known phrases; never used for next-word prediction.

Both are in the inputx-fsa::Fsa binary format.
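
To make the intra-token key layout concrete, here is a minimal sketch that enumerates the <a>\0<b> keys for one token. It assumes keys are raw UTF-8 with a single NUL separator, which matches the byte literal in the usage example. intra_keys is a hypothetical helper for illustration, not part of this crate, and the real build pipeline may differ.

```rust
/// Hypothetical helper (not part of this crate): builds one `<a>\0<b>` key
/// for every pair of adjacent characters inside `token`.
/// Assumes keys are raw UTF-8 bytes joined by a single NUL byte.
fn intra_keys(token: &str) -> Vec<Vec<u8>> {
    let chars: Vec<char> = token.chars().collect();
    chars
        .windows(2) // adjacent character pairs; empty for tokens shorter than 2 chars
        .map(|pair| {
            let mut key = Vec::new();
            let mut buf = [0u8; 4];
            key.extend_from_slice(pair[0].encode_utf8(&mut buf).as_bytes());
            key.push(0); // NUL separator between the two characters
            key.extend_from_slice(pair[1].encode_utf8(&mut buf).as_bytes());
            key
        })
        .collect()
}

fn main() {
    // A three-character token yields two adjacent pairs.
    for key in intra_keys("你好吗") {
        println!("{:02x?}", key);
    }
}
```

A single-character token produces no keys, since it has no adjacent pair.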

Usage

Usage is almost always indirect: inputx-pinyin's default-on bigrams feature pulls this crate in and wires it through PinyinDict::bigram_boost. Direct use is for custom runtimes:

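use inputx_pinyin_data_bigrams::{EMBEDDED_BIGRAMS, EMBEDDED_BIGRAMS_INTRA};
use inputx_fsa::Fsa;

// Parse the embedded inter-token FSA; EMBEDDED_BIGRAMS_INTRA loads the same way.
let bigrams = Fsa::new(EMBEDDED_BIGRAMS).expect("valid FSA");
// Key layout: UTF-8 bytes of 你好, a NUL separator, then UTF-8 bytes of 吗.
if let Some(count) = bigrams.get(b"\xe4\xbd\xa0\xe5\xa5\xbd\x00\xe5\x90\x97") {
    println!("(你好, 吗) bigram count = {count}");
}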
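
The hand-escaped key in the example above is easy to get wrong. A minimal sketch of building it from string slices instead follows. bigram_key is a hypothetical helper, not an API of this crate or of inputx-fsa, and it assumes the <prev_word>\0<next_word> layout described under "What's in the box".

```rust
/// Hypothetical helper (not part of this crate): joins two words into a
/// `<prev_word>\0<next_word>` lookup key of raw UTF-8 bytes.
fn bigram_key(prev: &str, next: &str) -> Vec<u8> {
    let mut key = Vec::with_capacity(prev.len() + 1 + next.len());
    key.extend_from_slice(prev.as_bytes());
    key.push(0); // NUL separator between the two words
    key.extend_from_slice(next.as_bytes());
    key
}

fn main() {
    let key = bigram_key("你好", "吗");
    // Same bytes as the hand-written literal in the usage example.
    assert_eq!(
        key.as_slice(),
        &b"\xe4\xbd\xa0\xe5\xa5\xbd\x00\xe5\x90\x97"[..]
    );
    println!("{} bytes", key.len());
}
```

The resulting bytes can stand in for the literal in the lookup call, assuming the lookup accepts a byte slice as the example suggests.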

API stability

EMBEDDED_BIGRAMS / EMBEDDED_BIGRAMS_INTRA: the module path is stable for the 1.x line. The underlying bytes are rebuilt with each release as the upstream corpus and weight pipeline are refreshed, so do not rely on specific counts across versions.
There is no public API beyond the two consts, by design.

License

Dual-licensed under MIT OR Apache-2.0. Bigram counts derive from permissively-licensed corpora (Leipzig Corpora / SUBTLEX-CH-WF); see inputx-pinyin for the attribution chain.
