§textprep
Text preprocessing primitives for the representational stack.
Provides Unicode normalization, case folding, diacritics stripping, tokenization, and fast keyword matching.
Re-exports§
pub use flash::FlashText;
pub use flash::KeywordMatch;
pub use fold::fold;
pub use fold::strip_diacritics;
pub use html::decode_entities;
pub use spans::clean_span_boundary;
pub use subword::BpeTokenizer;
pub use subword::SubwordTokenizer;
pub use tokenize::Token;
pub use tokenize::TokenRef;
pub use unicode::nfc;
pub use unicode::nfkc;
Modules§
- flash
- Fast keyword matching using Aho-Corasick.
- fold
- Case folding and diacritics stripping.
- html
- HTML entity decoding.
- ngram
- N-gram generation.
- similarity
- String similarity primitives.
- spans
- Span boundary cleanup for NER post-processing.
- stopwords
- Stopword lists.
- subword
- Subword tokenization traits (minimal).
- tokenize
- Text tokenization utilities.
- unicode
- Unicode normalization utilities.
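To make the kind of primitive listed above concrete, here is a minimal standard-library sketch of character n-gram generation, the task the ngram module is described as covering. The function name `char_ngrams` and its signature are assumptions made for illustration; this page does not show the module's actual API, so this is not its implementation.

```rust
/// Hypothetical helper (not the `ngram` module's API): returns all
/// contiguous character n-grams of `text`, counted in Unicode scalar values.
fn char_ngrams(text: &str, n: usize) -> Vec<String> {
    // `windows(0)` would panic, so treat n == 0 as "no n-grams".
    if n == 0 {
        return Vec::new();
    }
    // Collect chars first so windows never split a multi-byte character.
    let chars: Vec<char> = text.chars().collect();
    // If the text is shorter than n, `windows` yields nothing.
    chars.windows(n).map(|w| w.iter().collect()).collect()
}
```

Under these assumptions, `char_ngrams("text", 2)` yields `["te", "ex", "xt"]`. The sketch works on `char` values rather than grapheme clusters, so a base letter followed by a combining mark counts as two positions unless the input is normalized first.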
Structs§
- ScrubConfig
- Policy/config for constructing normalized keys / comparison forms.
Enums§
Functions§
- scrub
- A convenience function to perform a default “scrub” of text.
- scrub_with
- Scrub text using an explicit policy.
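The split between a default entry point and a policy-driven one can be sketched with the standard library alone. Everything below is hypothetical: `SketchPolicy`, its fields, and the `_sketch` function names are illustrative stand-ins, since this page does not show the fields of `ScrubConfig` or the signatures of `scrub` and `scrub_with`.

```rust
/// Hypothetical policy type; the field names are illustrative and are
/// not taken from `ScrubConfig`.
#[derive(Debug, Clone, Copy)]
struct SketchPolicy {
    lowercase: bool,
    collapse_whitespace: bool,
}

impl Default for SketchPolicy {
    fn default() -> Self {
        Self { lowercase: true, collapse_whitespace: true }
    }
}

/// Explicit-policy variant: applies only the steps the policy enables.
fn scrub_with_sketch(text: &str, policy: &SketchPolicy) -> String {
    // `to_lowercase` is Unicode-aware lowercasing, not full case folding.
    let cased = if policy.lowercase {
        text.to_lowercase()
    } else {
        text.to_owned()
    };
    if policy.collapse_whitespace {
        // Trims both ends and joins whitespace-separated runs with one space.
        cased.split_whitespace().collect::<Vec<_>>().join(" ")
    } else {
        cased
    }
}

/// Convenience variant: delegates to the explicit one with the default policy.
fn scrub_sketch(text: &str) -> String {
    scrub_with_sketch(text, &SketchPolicy::default())
}
```

With this sketch, `scrub_sketch("  Hello   WORLD \n")` returns `"hello world"`. Routing the convenience function through the explicit one keeps a single code path, so the default behaviour is whatever `Default` says and nothing more.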