Crate textprep


§textprep

Text preprocessing primitives for the representational stack.

Provides Unicode normalization, case folding, diacritics stripping, tokenization, and fast keyword matching.

Re-exports§

pub use flash::FlashText;
pub use flash::KeywordMatch;
pub use fold::fold;
pub use fold::strip_diacritics;
pub use subword::BpeTokenizer;
pub use subword::SubwordTokenizer;
pub use tokenize::Token;
pub use unicode::nfc;
pub use unicode::nfkc;

Modules§

flash
Fast keyword matching using Aho-Corasick.
fold
Case folding and diacritics stripping.
ngram
N-gram generation.
similarity
String similarity primitives.
stopwords
Stopword lists.
subword
Subword tokenization traits (minimal).
tokenize
Text tokenization utilities.
unicode
Unicode normalization utilities.
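
To make the `ngram` entry concrete, here is a minimal standalone sketch of character and word n-gram generation using only the standard library. The function names `char_ngrams` and `word_ngrams` are hypothetical and chosen for illustration; this page does not show the items the `ngram` module actually exports, so treat this as an outline of the technique rather than the crate's API.

```rust
/// Character n-grams over Unicode scalar values (not bytes).
/// Hypothetical helper; not taken from the `textprep` crate.
/// Returns an empty vector when `n` is 0 or the text has fewer than `n` chars.
fn char_ngrams(text: &str, n: usize) -> Vec<String> {
    if n == 0 {
        // `windows(0)` would panic, so handle this case explicitly.
        return Vec::new();
    }
    let chars: Vec<char> = text.chars().collect();
    chars
        .windows(n)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Word n-grams over whitespace-separated tokens, joined with a single space.
/// Hypothetical helper; not taken from the `textprep` crate.
fn word_ngrams(text: &str, n: usize) -> Vec<String> {
    if n == 0 {
        return Vec::new();
    }
    let words: Vec<&str> = text.split_whitespace().collect();
    words.windows(n).map(|w| w.join(" ")).collect()
}
```

Working on `char` values rather than bytes keeps multi-byte characters intact. It does not keep a base letter together with a following combining mark, which is one reason to normalize text (for example to NFC) before generating n-grams.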

Structs§

ScrubConfig
Policy configuration for constructing normalized keys and comparison forms.

Enums§

ScrubCase
ScrubNormalization

Functions§

scrub
Convenience function that performs a default “scrub” of text.
scrub_with
Scrub text using an explicit policy.
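
The relationship between `scrub` and `scrub_with` suggests a common pattern: a convenience function that delegates to a policy-driven one with default settings. The sketch below illustrates that pattern in standalone Rust. Every name in it (`DemoCase`, `DemoPolicy`, `demo_scrub`, `demo_scrub_with`) and every policy field is an assumption made for illustration, since this page does not show the fields of `ScrubConfig`, the variants of `ScrubCase` or `ScrubNormalization`, or the function signatures. It also uses simple lowercasing and whitespace collapsing as stand-ins; it does not perform Unicode normalization or true case folding.

```rust
/// Hypothetical case policy; not the crate's `ScrubCase`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum DemoCase {
    Keep,
    Lower,
}

/// Hypothetical policy struct; not the crate's `ScrubConfig`.
#[derive(Clone, Copy, Debug)]
struct DemoPolicy {
    case: DemoCase,
    collapse_whitespace: bool,
}

impl Default for DemoPolicy {
    // Assumed defaults for this sketch: lowercase and collapse whitespace.
    fn default() -> Self {
        DemoPolicy {
            case: DemoCase::Lower,
            collapse_whitespace: true,
        }
    }
}

/// Apply an explicit policy to produce a comparison form of `text`.
fn demo_scrub_with(text: &str, policy: &DemoPolicy) -> String {
    let cased = match policy.case {
        DemoCase::Keep => text.to_string(),
        // `to_lowercase` is a stand-in here, not full Unicode case folding.
        DemoCase::Lower => text.to_lowercase(),
    };
    if policy.collapse_whitespace {
        // Trim the ends and replace each run of whitespace with one space.
        cased.split_whitespace().collect::<Vec<_>>().join(" ")
    } else {
        cased
    }
}

/// Convenience wrapper that applies the default policy.
fn demo_scrub(text: &str) -> String {
    demo_scrub_with(text, &DemoPolicy::default())
}
```

With these assumed defaults, `demo_scrub("  Hello   World ")` yields `"hello world"`, while a policy of `DemoCase::Keep` with `collapse_whitespace: false` returns the input unchanged. Keeping the policy in a plain struct lets callers build different keys for different uses, such as display versus matching, without changing the scrubbing code.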