Crate textprep

textprep

Text preprocessing primitives for the representational stack.

Provides Unicode normalization, case folding, diacritics stripping, tokenization, and fast keyword matching.

Re-exports

pub use flash::FlashText;
pub use flash::KeywordMatch;
pub use fold::fold;
pub use fold::strip_diacritics;
pub use html::decode_entities;
pub use spans::clean_span_boundary;
pub use subword::BpeTokenizer;
pub use subword::SubwordTokenizer;
pub use tokenize::Token;
pub use tokenize::TokenRef;
pub use unicode::nfc;
pub use unicode::nfkc;

Modules

flash
    Fast keyword matching using Aho-Corasick.
fold
    Case folding and diacritics stripping.
html
    HTML entity decoding.
ngram
    N-gram generation.
similarity
    String similarity primitives.
spans
    Span boundary cleanup for NER post-processing.
stopwords
    Stopword lists.
subword
    Subword tokenization traits (minimal).
tokenize
    Text tokenization utilities.
unicode
    Unicode normalization utilities.
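
The signatures of the ngram and similarity modules are not listed on this page, so the following is only a standard-library sketch of the kind of primitives those modules describe: character n-gram generation and a set-based string similarity built on top of it. The names `char_ngrams` and `jaccard` are hypothetical and are not taken from textprep.

```rust
use std::collections::HashSet;

/// Hypothetical helper (not textprep's API): character n-grams over
/// Unicode scalar values rather than bytes.
/// Returns an empty vector when `n` is 0 or the text has fewer than `n` chars.
fn char_ngrams(text: &str, n: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    if n == 0 || chars.len() < n {
        return Vec::new();
    }
    chars
        .windows(n)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Hypothetical helper (not textprep's API): Jaccard similarity of the
/// two n-gram sets, |A ∩ B| / |A ∪ B|.
/// Two inputs that both yield no n-grams are treated as identical (1.0).
fn jaccard(a: &str, b: &str, n: usize) -> f64 {
    let sa: HashSet<String> = char_ngrams(a, n).into_iter().collect();
    let sb: HashSet<String> = char_ngrams(b, n).into_iter().collect();
    if sa.is_empty() && sb.is_empty() {
        return 1.0;
    }
    let inter = sa.intersection(&sb).count() as f64;
    let union = sa.union(&sb).count() as f64;
    inter / union
}

fn main() {
    // Bigrams of "text": ["te", "ex", "xt"]
    println!("{:?}", char_ngrams("text", 2));
    // "night" and "nacht" share one bigram ("ht") out of seven distinct ones.
    println!("{:.3}", jaccard("night", "nacht", 2));
}
```

Working on `char` values keeps multi-byte characters intact, but it does not group combining marks with their base characters; normalizing the input first (for example to NFC) is one way to reduce that mismatch.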

Structs

ScrubConfig
    Policy for constructing normalized keys and comparison forms.

Enums

ScrubCase
ScrubNormalization

Functions

scrub
    A convenience function to perform a default “scrub” of text.
scrub_with
    Scrub text using an explicit policy.
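
The fields of `ScrubConfig` and the variants of `ScrubCase` and `ScrubNormalization` are not shown on this page, so the following standard-library sketch only illustrates the general pattern of a default entry point that delegates to a policy-driven one. Every name in it (`CasePolicy`, `KeyPolicy`, `make_key`, `make_key_with`) is hypothetical, and the steps it applies are assumptions, not textprep's actual behavior.

```rust
/// Hypothetical case policy; this is NOT the crate's `ScrubCase`.
#[derive(Clone, Copy)]
enum CasePolicy {
    Keep,
    Lower,
}

/// Hypothetical policy struct; this is NOT the crate's `ScrubConfig`.
#[derive(Clone, Copy)]
struct KeyPolicy {
    case: CasePolicy,
    collapse_whitespace: bool,
}

impl Default for KeyPolicy {
    /// Assumed defaults for this sketch: lowercase and collapse whitespace.
    fn default() -> Self {
        KeyPolicy {
            case: CasePolicy::Lower,
            collapse_whitespace: true,
        }
    }
}

/// Build a comparison key from `text` according to an explicit policy.
fn make_key_with(text: &str, policy: KeyPolicy) -> String {
    let cased = match policy.case {
        CasePolicy::Keep => text.to_string(),
        // `to_lowercase` is Unicode-aware lowercasing, not full case folding.
        CasePolicy::Lower => text.to_lowercase(),
    };
    if policy.collapse_whitespace {
        // Trim both ends and join whitespace-separated runs with one space.
        cased.split_whitespace().collect::<Vec<_>>().join(" ")
    } else {
        cased
    }
}

/// Convenience wrapper that applies the default policy.
fn make_key(text: &str) -> String {
    make_key_with(text, KeyPolicy::default())
}

fn main() {
    println!("{:?}", make_key("  Hello   World "));
    let keep = KeyPolicy { case: CasePolicy::Keep, collapse_whitespace: true };
    println!("{:?}", make_key_with("  Hello   World ", keep));
}
```

Keeping the policy in a small `Copy` struct lets callers store one configuration and reuse it across many inputs, while the wrapper covers the common case without any setup.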