This crate aims to emulate the Dart::DoubleArray struct and its Normalizer from https://github.com/google/sentencepiece. Its main intent is to be used with tokenizers, a Rust library that provides facilities to tokenize strings for use with HuggingFace’s transformers library.
This crate is highly specialized and not intended for general use.
The core of the algorithm is to read spm’s binary format.
This struct exists specifically for compatibility with SentencePiece.
SentencePiece models embed their Normalizer within a binary blob
that represents both a Trie and the embedded rewrite rules.
In order to be 100% compliant we need to interpret that binary format too.
The format is [u32 (length of trie), trie: u32, normalized: String].
The trie has u8 keys and u32 values; those u32 values
point to offsets within the String where the actual replacement value starts.
The normalized string contains ‘\0’ characters, each marking the end of an entry.
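
The layout described above can be sketched in a few lines of Rust. This is only an illustration, not this crate's actual code: the helper names `split_blob` and `replacement_at` are hypothetical, and the sketch assumes that the u32 values are little-endian and that the leading length counts the trie's size in bytes. Neither assumption is stated above, so check them against a real model before relying on this.

```rust
/// Hypothetical helper: split a blob into (trie units, normalized bytes).
/// Assumption: all u32 values are little-endian, and the leading u32
/// is the size of the trie section in bytes.
fn split_blob(blob: &[u8]) -> Option<(Vec<u32>, &[u8])> {
    // Read the 4-byte length header.
    let header = blob.get(..4)?;
    let trie_len =
        u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;
    // The trie section must be a whole number of u32 units.
    if trie_len % 4 != 0 {
        return None;
    }
    let end = 4usize.checked_add(trie_len)?;
    let trie_bytes = blob.get(4..end)?;
    let trie = trie_bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    // Everything after the trie is the normalized string.
    Some((trie, &blob[end..]))
}

/// Hypothetical helper: read the replacement that starts at `offset`
/// and runs up to (but not including) the next '\0'.
fn replacement_at(normalized: &[u8], offset: usize) -> Option<&str> {
    let tail = normalized.get(offset..)?;
    let end = tail.iter().position(|&b| b == 0).unwrap_or(tail.len());
    std::str::from_utf8(&tail[..end]).ok()
}
```

Given a trie value that points at offset 2 in a normalized string such as "a\0bc\0", `replacement_at` would return "bc". Walking the trie itself to obtain that offset is not covered by this sketch.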