inputx-pinyin-data-bigrams
Embedded bigram FSAs for the
inputx-pinyin engine.
[dependencies]
inputx-pinyin-data-bigrams = "1.4"
Pure data crate: two pub const byte slices embedded via include_bytes!,
zero dependencies, #![no_std] clean. Split out of inputx-pinyin
in v1.4.7 sub-phase B (Strategy C: three inputx-pinyin-data-*
stones plus a facade umbrella) so that the facade publishes light and
consumers who only need exact-syllable lookup can opt out by disabling
the facade's bigrams feature.
What's in the box
- EMBEDDED_BIGRAMS (~4.5 MB): inter-token word bigram FSA. Keys are <prev_word>\0<next_word>, where both ends are distinct jieba tokens adjacent in the source corpus. Sole input to the facade's PinyinDict::bigram_boost and next-word prediction paths.
- EMBEDDED_BIGRAMS_INTRA (~1.5 MB): intra-token char bigram FSA. Keys are <a>\0<b> for adjacent characters inside one jieba token (e.g. (你, 好) captured from 你好). Helps Viterbi composition prefer known phrases; never used for next-word prediction.
Both are in the
inputx-fsa::Fsa binary
format.
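The key layout described above can be sketched in a few lines. This is an illustration only: bigram_key, split_bigram_key and intra_token_pairs are hypothetical helpers, not part of this crate or of inputx-fsa, and the sketch assumes keys are the UTF-8 bytes of each half joined by a single NUL byte. It also uses std for brevity, although the crate itself is #![no_std].

```rust
/// Build a lookup key in the `<left>\0<right>` layout.
/// Hypothetical helper; assumes both halves are stored as UTF-8 bytes.
fn bigram_key(left: &str, right: &str) -> Vec<u8> {
    let mut key = Vec::with_capacity(left.len() + 1 + right.len());
    key.extend_from_slice(left.as_bytes());
    key.push(0); // NUL separator between the two halves
    key.extend_from_slice(right.as_bytes());
    key
}

/// Split a key back into its two halves at the first NUL byte.
/// Returns None if there is no separator or a half is not valid UTF-8.
fn split_bigram_key(key: &[u8]) -> Option<(&str, &str)> {
    let nul = key.iter().position(|&b| b == 0)?;
    let left = std::str::from_utf8(&key[..nul]).ok()?;
    let right = std::str::from_utf8(&key[nul + 1..]).ok()?;
    Some((left, right))
}

/// Enumerate adjacent character pairs inside one token, which is the
/// shape of the pairs the intra-token FSA is described as holding.
fn intra_token_pairs(token: &str) -> Vec<(char, char)> {
    let chars: Vec<char> = token.chars().collect();
    chars.windows(2).map(|w| (w[0], w[1])).collect()
}
```

For the example in the list above, intra_token_pairs("你好") yields the single pair (你, 好), and bigram_key("你", "好") gives the bytes that would be looked up under the assumed encoding.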
Usage
Almost always indirect: inputx-pinyin's default-on bigrams
feature pulls this crate in and wires it through
PinyinDict::bigram_boost. Direct use is for custom runtimes:
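use inputx_pinyin_data_bigrams::EMBEDDED_BIGRAMS;
use inputx_fsa::Fsa;
// Assumed signatures: check the inputx-fsa docs before relying on them.
let bigrams = Fsa::new(EMBEDDED_BIGRAMS).expect("embedded bigram FSA should parse");
// Keys are <prev_word>\0<next_word>; the value type is whatever Fsa::get returns.
if let Some(value) = bigrams.get("你好\0世界".as_bytes()) {
    let _ = value; // use the stored value, e.g. as a boost weight
}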
API stability
- EMBEDDED_BIGRAMS / EMBEDDED_BIGRAMS_INTRA: module path stable for the 1.x line. The underlying bytes are rebuilt with each release as the upstream corpus / weight pipeline refreshes.
- No public API beyond the two consts, by design.
License
Dual-licensed under MIT OR Apache-2.0. Bigram counts derive from
permissively-licensed corpora (Leipzig Corpora / SUBTLEX-CH-WF);
see inputx-pinyin for
the attribution chain.