Module tokenizers::normalizers

Re-exports§

  • pub use crate::normalizers::bert::BertNormalizer;
  • pub use crate::normalizers::prepend::Prepend;
  • pub use crate::normalizers::replace::Replace;
  • pub use crate::normalizers::strip::Strip;
  • pub use crate::normalizers::strip::StripAccents;
  • pub use crate::normalizers::unicode::Nmt;
  • pub use crate::normalizers::unicode::NFC;
  • pub use crate::normalizers::unicode::NFD;
  • pub use crate::normalizers::unicode::NFKC;
  • pub use crate::normalizers::unicode::NFKD;
  • pub use crate::normalizers::utils::Lowercase;
  • pub use crate::normalizers::utils::Sequence;

Modules§

Structs§

  • This struct exists specifically for compatibility with SentencePiece. SentencePiece models embed their normalizer in a precompiled_charsmap, a binary blob that encodes both a trie and the rewrite rules. To be 100% compliant, we need to interpret that binary format too. The format is [u32 (length of trie), trie: u32, normalized: String]. The trie has u8 entries and u32 values; those u32 values are offsets into the normalized string that point to the actual replacement text. Within the normalized string, ‘\0’ marks the end of an entry.
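
The layout described above can be illustrated with a minimal sketch that uses only the standard library. It is not the crate's implementation: the names `CharsMap`, `parse_charsmap`, and `replacement_at` are hypothetical, and the sketch assumes the leading u32 is little-endian and gives the trie length in bytes, which the description does not state.

```rust
/// Hypothetical container for the decoded parts of a precompiled_charsmap blob.
#[derive(Debug, PartialEq)]
struct CharsMap {
    /// Raw trie units, kept as-is (this sketch does not walk the trie).
    trie: Vec<u32>,
    /// Replacement strings, each terminated by '\0'.
    normalized: String,
}

/// Splits a blob laid out as [u32 trie length, trie units, normalized string].
/// Assumption: integers are little-endian and the length is counted in bytes.
fn parse_charsmap(blob: &[u8]) -> Option<CharsMap> {
    let header = blob.get(..4)?;
    let trie_len =
        u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;
    // The trie is a sequence of u32 units, so its byte length must be a multiple of 4.
    if trie_len % 4 != 0 {
        return None;
    }
    let end = 4usize.checked_add(trie_len)?;
    let trie = blob
        .get(4..end)?
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    // Everything after the trie is the normalized string.
    let normalized = String::from_utf8(blob.get(end..)?.to_vec()).ok()?;
    Some(CharsMap { trie, normalized })
}

/// Returns the replacement entry starting at `offset`, up to the next '\0'.
fn replacement_at(normalized: &str, offset: usize) -> Option<&str> {
    let tail = normalized.get(offset..)?;
    let end = tail.find('\0').unwrap_or(tail.len());
    Some(&tail[..end])
}
```

Given a trie value used as `offset`, `replacement_at` returns the replacement text for that entry. Both functions return `None` on truncated or malformed input instead of panicking, which seems a reasonable choice for a blob read from a model file.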

Enums§